1. The Big Data Foundation: IBM Establishes Veracity (2012)
Data veracity entered the Big Data lexicon in 2012 when IBM's Institute for Business Value published "Analytics: The Real-World Use of Big Data," adding veracity as the 4th V to Doug Laney's original 2001 framework of volume, velocity, and variety.[1] The IBM report, authored by Schroeck, Shockley, Smart, and colleagues in collaboration with Oxford's Saïd Business School, defined veracity as "the level of reliability associated with certain types of data," emphasizing that even rigorous cleansing cannot eliminate inherent unpredictability.
IBM's formal definition encompasses multiple dimensions: accuracy (correctness and precision), trustworthiness (source reliability), consistency (uniform quality), credibility (believability), and provenance (data lineage).[2]
The 2014 academic paper by Lukoianova and Rubin in Advances in Classification Research Online further operationalized this through a "Big Data Veracity Index" with three theoretical dimensions—objectivity/subjectivity, truthfulness/deception, and credibility/implausibility.[3]
Value was subsequently added as the 5th V, creating the now-standard 5 V's model. IBM's 2018 white paper "The Path to Data Veracity" noted that roughly 80% of the effort in a Big Data project goes into finding, cleansing, and understanding data, and cited an estimated $600 billion annual cost of poor data quality to U.S. businesses.
2. Academic Literature Explicitly Defines Entity Veracity
The most significant contribution to entity veracity comes from researchers Stefano Marchesin and Gianmaria Silvello at the University of Padua, collaborating with Omar Alonso (Amazon). Their CIKM 2024 paper, "Veracity Estimation for Entity-Oriented Search with Knowledge Graphs," represents the first formal treatment of entity-level veracity as a distinct dimension in information retrieval.[4]
The framework introduces three key innovations:
| Innovation | Description |
|---|---|
| Utility Model | Defines fact utility based on entity popularity: u(t) = u(s) + u(o), where entity utility equals web search result count |
| Graph Partitioning | Uses stratification to divide knowledge graphs into subgraphs for scalable assessment |
| Partition Veracity | Employs sampling, active learning, and statistical estimators to compute veracity without exhaustive fact-checking |
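The utility model in the table above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' code: the popularity lookup stands in for the web search result counts the paper uses, and every name below is illustrative.

```python
# Minimal sketch of the triple utility model u(t) = u(s) + u(o).
# The popularity table stands in for web search result counts; all
# names here are assumptions, not taken from the authors' code.

def entity_utility(entity: str, popularity: dict[str, int]) -> float:
    """Utility of an entity, read from a precomputed popularity table."""
    return float(popularity.get(entity, 0))

def triple_utility(subject: str, obj: str, popularity: dict[str, int]) -> float:
    """Utility of a triple t = (s, p, o): the sum of its entities' utilities."""
    return entity_utility(subject, popularity) + entity_utility(obj, popularity)

popularity = {"Q42": 1_200_000, "Q350": 850_000}  # e.g. search result counts
print(triple_utility("Q42", "Q350", popularity))   # 2050000.0
```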
"Entity-level veracity is defined as the mean of the veracity estimates of its triples."
— Marchesin, Silvello, & Alonso, CIKM 2024[4]
Their methodology, implemented in computeEntityVeracity.py and published on GitHub, aggregates fact-level accuracy scores into entity-level trust metrics.[5] A critical empirical finding: veracity and popularity are essentially uncorrelated, so how prominent an entity is says little about how accurate its data are.
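As a minimal sketch of that aggregation step (the data layout and function signature are assumptions, loosely modeled on the published script rather than copied from it):

```python
# Entity-level veracity as the mean of the veracity estimates of the
# entity's triples (the CIKM 2024 definition). Whether to count triples
# where the entity appears as subject only, or as subject or object, is
# an assumption here; the paper and the GitHub code fix that detail.

from statistics import mean

def compute_entity_veracity(triple_scores: dict[tuple[str, str, str], float],
                            entity: str) -> float:
    """Average the veracity estimates of all scored triples mentioning the
    entity; returns 0.0 when the entity has no scored triples."""
    scores = [v for (s, _p, o), v in triple_scores.items() if entity in (s, o)]
    return mean(scores) if scores else 0.0

triple_scores = {
    ("Q42", "author_of", "Q25169"): 0.95,
    ("Q42", "born_in", "Q350"): 0.80,
}
print(compute_entity_veracity(triple_scores, "Q42"))  # 0.875
```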
The follow-up paper at CIKM 2025, "Scaling Trust: Veracity-Driven Defect Detection in Entity Search" by Irrera, Marchesin, Silvello, and Alonso, introduces ALADDIN (Active Learning-based VerAcity-Driven Defect IdentificatioN)—a lightweight framework applying entity veracity to trust-based ranking through the eRank strategy.[6]
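eRank's actual scoring is specified in that paper; the sketch below only shows the general shape of veracity-aware re-ranking, using an assumed linear blend of relevance and entity veracity with an assumed weight.

```python
# Hypothetical veracity-aware re-ranking: blend a retrieval relevance score
# with an entity veracity score and sort by the result. The linear mix and
# the alpha weight are assumptions, not the eRank formula.

def rerank(results: list[tuple[str, float]],
           veracity: dict[str, float],
           alpha: float = 0.3) -> list[tuple[str, float]]:
    """Combine relevance and veracity, then sort descending by the blend."""
    scored = [(eid, (1 - alpha) * rel + alpha * veracity.get(eid, 0.0))
              for eid, rel in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)

results = [("entity_a", 0.91), ("entity_b", 0.88)]   # relevance scores
veracity = {"entity_a": 0.40, "entity_b": 0.95}      # entity veracity scores
print(rerank(results, veracity))  # entity_b now ranks first
```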
Earlier foundational work includes Esteves, Rula, Reddy, and Lehmann's 2018 paper in ACM Journal of Data and Information Quality, "Toward Veracity Assessment in RDF Knowledge Bases," which explicitly frames knowledge base quality through Big Data's 4 V's: "Knowledge base quality assessment poses a number of big data challenges such as high volume, variety, velocity, and veracity."[7]
3. Google Patents Reveal Verification Infrastructure
Google's approach to entity verification is distributed across multiple patent families rather than concentrated in dedicated "entity veracity" patents. The most relevant recent filing is US11769017B1 ("Generative Summaries for Search Results," September 2023), which explicitly addresses veracity:[8]
"Providing such confidence annotation(s) can enable a user to quickly ascertain veracity of the NL based summary and/or portion(s) thereof."
— Google Patent US11769017B1
This patent describes embedding-based verification—processing content through encoder models, comparing embeddings to determine if information is "verifiable based on document content," and using distance measures as verification metrics. It establishes three trustworthiness measures:
| Measure Type | Factors Considered |
|---|---|
| Query-Independent | Author reputation, domain authority, inbound links |
| Query-Dependent | Relevance rankings, contextual alignment |
| User-Dependent | Profile relations, personalization factors |
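A toy sketch of the embedding-comparison idea follows (not Google's implementation): a summary statement counts as verifiable if its embedding lies close enough to some source-passage embedding. The encoder is stubbed out with toy vectors, and the similarity threshold is an assumed value.

```python
# Sketch of embedding-based verification: a summary sentence is treated as
# "verifiable" if its embedding is close enough to some source-passage
# embedding. The encoder stub and the 0.8 threshold are assumptions.

import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_verifiable(summary_emb: list[float],
                  passage_embs: list[list[float]],
                  threshold: float = 0.8) -> bool:
    """True if any source passage embedding is within the similarity threshold."""
    return any(cosine(summary_emb, p) >= threshold for p in passage_embs)

# In practice the vectors would come from an encoder model; these are toy values.
print(is_verifiable([0.9, 0.1, 0.0], [[0.88, 0.12, 0.01], [0.0, 1.0, 0.0]]))  # True
```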
US10331706B2 ("Automatic Discovery of New Entities Using Graph Reconciliation," 2019) directly addresses entity verification by "corroborating at least a quantity of determinative facts about the potential entity" before adding entities to knowledge graphs.[9]
DeepMind's US12094474 (September 2024) tackles provenance verification for neural network outputs, demonstrating watermarking-based authenticity verification.[10]
The foundational US7603350B1/US8818995B1 patent family on "Search Result Ranking Based on Trust" establishes entity trust ranks based on accumulated trust relationships between entities, indicating "whether (or the degree to which) one entity trusts another entity."[11]
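The patents do not disclose a concrete formula; purely as a hypothetical illustration, accumulated trust relationships could be rolled up into per-entity trust ranks along these lines, with the averaging scheme being an assumption.

```python
# Hypothetical illustration of accumulating pairwise trust relationships
# into a per-entity trust rank. The simple averaging is an assumption,
# not the method described in US7603350B1/US8818995B1.

from collections import defaultdict

def trust_ranks(trust_edges: list[tuple[str, str, float]]) -> dict[str, float]:
    """trust_edges: (truster, trusted, weight in [0, 1]).
    Each entity's rank is the mean trust it receives from other entities."""
    received = defaultdict(list)
    for _truster, trusted, weight in trust_edges:
        received[trusted].append(weight)
    return {entity: sum(w) / len(w) for entity, w in received.items()}

edges = [("site_a", "entity_x", 0.9), ("site_b", "entity_x", 0.7),
         ("site_a", "entity_y", 0.2)]
print(trust_ranks(edges))  # {'entity_x': 0.8..., 'entity_y': 0.2}
```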
4. Operational Systems Bridge Concept to Practice
Google has operationalized entity verification through multiple production systems:
Enterprise Knowledge Graph
The Entity Reconciliation API reconciles records of organizations, businesses, and persons with confidence scores indicating "how likely it is that these entities belong to this group." Documentation explicitly lists "Know your customer: anti-money laundering, identity verification" as core use cases—directly confirming entity verification as an operational concern.[12]
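The API's scoring model is not public. As a toy illustration of the underlying idea (not the Entity Reconciliation API itself), a reconciliation step assigns a confidence that two records describe the same real-world entity; the field weights and similarity measure below are assumptions.

```python
# Toy illustration of record reconciliation with a confidence score.
# The Jaccard name similarity and the 0.7/0.3 field weights are assumptions;
# the actual Entity Reconciliation API uses its own (unpublished) model.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_confidence(rec1: dict[str, str], rec2: dict[str, str]) -> float:
    """Weighted blend of name similarity and an exact match on the address."""
    name_sim = jaccard(rec1.get("name", ""), rec2.get("name", ""))
    addr_match = 1.0 if rec1.get("address") == rec2.get("address") else 0.0
    return 0.7 * name_sim + 0.3 * addr_match

a = {"name": "Acme Corporation", "address": "1 Main St"}
b = {"name": "ACME Corp", "address": "1 Main St"}
print(round(match_confidence(a, b), 2))  # 0.53
```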
Knowledge Panel Verification
Entities must prove authenticity by signing in to their official sites, submitting official documentation, or providing screenshots that demonstrate admin access. Verified entities receive "priority for review," creating a tiered trust system.[13]
Google Business Profile Verification
Verification options include live video calls in which business owners show Google representatives their physical location, signage, and proof of operations; verified businesses receive trust badges.[14]
Wikidata ProVe System
Wikidata's ProVe (Provenance Verification) system represents the clearest implementation of entity-level veracity scoring. It returns quality scores from -1 to 1 indicating "the overall level of support for a Wikidata entity's references," transforming claims into natural language and comparing against external sources.[15]
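ProVe's exact aggregation is defined in its own publications; the sketch below shows one plausible way per-reference verdicts could map onto a score in [-1, 1], with assumed verdict labels and simple averaging rather than ProVe's published method.

```python
# Illustrative mapping of per-reference verdicts onto a [-1, 1] support
# score, consistent with ProVe's reported score range. The verdict labels
# and the plain averaging are assumptions, not ProVe's actual formula.

VERDICT_VALUE = {"supports": 1.0, "not_enough_info": 0.0, "refutes": -1.0}

def support_score(verdicts: list[str]) -> float:
    """Average verdict value over an entity's checked references."""
    if not verdicts:
        return 0.0
    return sum(VERDICT_VALUE[v] for v in verdicts) / len(verdicts)

print(support_score(["supports", "supports", "refutes"]))  # 0.33...
```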
The Semantic Web Journal paper "Can You Trust Wikidata?" (Santos, Schwabe, Lifschitz, 2024) directly addresses this: "the user should, in principle, ensure that s/he trusts the veracity of the claim."[16]
5. The Conceptual Lineage
The evolution from data veracity to entity veracity follows a coherent intellectual path:
| Data Veracity Concept | Entity Verification Application |
|---|---|
| Accuracy of facts | Correctness of entity attributes |
| Source trustworthiness | Authority/authenticity of entity claims |
| Consistency | Entity identity resolution across sources |
| Provenance tracking | Entity verification audit trail |
| Confidence scores | Entity reconciliation confidence |
The IEEE Transactions on Knowledge and Data Engineering survey by Xue and Zou (2023), "Knowledge Graph Quality Management," defines quality dimensions including accuracy (semantic correctness of entity assertions), trustworthiness (how much entities/facts can be trusted), and security—which "evaluates the degree to which the KG uses a digital signature and verifies the identity of the publisher."[17]
6. Conclusion
Entity veracity has crossed from emergent concept to formalized research domain. The 2024 CIKM paper by Marchesin, Silvello, and Alonso provides the authoritative academic foundation, explicitly defining entity-level veracity as aggregated fact veracity with computational frameworks. This builds directly on IBM's 2012 data veracity work, applying the same principles—accuracy, trustworthiness, provenance—at the entity level rather than the dataset level.
Google's patent portfolio demonstrates operational commitment to entity verification through embedding-based validation, trust-based ranking, and entity reconciliation confidence scoring. Production systems including Knowledge Panel verification, Business Profile verification, and Enterprise Knowledge Graph operationalize these concepts at scale.
The term "entity veracity" is no longer speculative—it appears in peer-reviewed ACM proceedings with formal definitions, computational frameworks, and empirical validation. The conceptual bridge is complete: Big Data's veracity dimension has been systematically extended to address whether entities themselves are real, verified, and trustworthy.