Data

Datasets & Preprocessing

LIAR-PLUS

We used the LIAR-PLUS dataset to train our factuality factor predictive models. LIAR-PLUS contains over 12,800 short political statements that have been fact-checked by PolitiFact. It is an extension of the original LIAR dataset, released by Alhindi et al. (2018) in Where is Your Evidence: Improving Fact-Checking by Justification Modeling (FEVER Workshop, EMNLP 2018). The dataset and code are available on GitHub.

  • Labels: six categorical truth ratings from pants-on-fire to true.
  • Metadata: speaker name, occupation, political party, state, subject, and justification text.
  • Splits: predefined train / validation / test splits provided by the dataset authors. All of our models were trained strictly on the train split.

These metadata fields allow us to model factuality using both linguistic features (e.g., wording, sentiment, repetition) and socio-political context (e.g., speaker, party, topic).
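Loading the raw TSV files can be sketched as follows. The column layout below is an assumption based on the LIAR-PLUS release and should be verified against the actual files before use:

```python
import csv
import io

# Assumed column order for the LIAR-PLUS TSV files; verify against the
# release on GitHub before relying on these names.
COLUMNS = [
    "id", "label", "statement", "subject", "speaker", "job",
    "state", "party", "barely_true_counts", "false_counts",
    "half_true_counts", "mostly_true_counts", "pants_on_fire_counts",
    "context", "justification",
]

def load_liar_plus(tsv_text):
    """Parse LIAR-PLUS-style tab-separated text into a list of row dicts."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]
```

Rows with fewer fields simply yield shorter dicts, which is convenient when a split omits trailing columns.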

PolitiFact Scraping

To reduce temporal bias and better match the current misinformation landscape, we merged in more recent data scraped from PolitiFact.

These additional statements:

  • Follow the same short-claim format as LIAR-PLUS.
  • Reuse PolitiFact’s truth ratings and justification text.
  • Expand coverage to more recent political topics and narratives, making our models less anchored to old news.
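Extracting claims from scraped pages can be sketched with the standard-library HTML parser. PolitiFact's real markup differs and changes over time, so the tag and class names here are hypothetical placeholders:

```python
from html.parser import HTMLParser

class ClaimParser(HTMLParser):
    """Collect the text of <div class="statement"> elements.

    The "statement" class name is a placeholder; inspect the live page
    markup to find the real selector.
    """
    def __init__(self):
        super().__init__()
        self.claims = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "statement") in attrs:
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.claims.append(data.strip())
            self._capture = False

parser = ClaimParser()
parser.feed('<div class="statement">Example claim text.</div>')
```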

Preprocessing

Before training, we performed several preprocessing steps to create a clean, uniform dataset across original and scraped data:

  • Standardize column names across splits and sources.
  • Remove redundant index or ID columns that do not carry information.
  • Normalize text: strip whitespace, convert to lowercase during tokenization, and remove file suffixes such as “.json” from IDs.
  • Convert metadata fields into consistent formats (e.g., categorical encodings for party, binary indicators for subject domains).

The final result is a dataset that can be fed directly into our feature engineering pipelines for the four factuality factors.
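The cleaning steps above can be sketched as a single per-record function. The name of the redundant index column is an assumption; adjust it to the actual files:

```python
import re

def normalize_record(rec):
    """Apply the preprocessing steps to one row dict."""
    out = {}
    for key, value in rec.items():
        # standardize column names across splits and sources
        key = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            value = value.strip()                  # strip whitespace
            value = re.sub(r"\.json$", "", value)  # drop ".json" ID suffixes
        out[key] = value
    # drop a redundant index column ("index" is an assumed name)
    out.pop("index", None)
    return out
```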

Feature Engineering

From this dataset, we derived features tailored to each of the four factuality factor predictive models:

Frequency heuristic
Captures how strongly a statement implies truth by repetition, buzzwords, or appeals to common belief. Features include: TF-IDF mean, average token frequency across the corpus, a buzzword score based on generalized phrases (“always”, “everyone”, “experts agree”), and a repetition ratio.
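Two of these features can be sketched on a token list; the buzzword set is an illustrative subset of the full phrase list:

```python
from collections import Counter

BUZZWORDS = {"always", "everyone", "experts agree"}  # illustrative subset

def frequency_features(tokens):
    """Buzzword score and repetition ratio for a tokenized statement."""
    text = " ".join(tokens)
    buzz = sum(text.count(phrase) for phrase in BUZZWORDS)
    # repetition ratio: share of tokens that repeat an earlier token
    repetition = 1 - len(Counter(tokens)) / len(tokens) if tokens else 0.0
    return {"buzzword_score": buzz, "repetition_ratio": repetition}
```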
Sensationalism
Measures emotional exaggeration and dramatic framing. We engineered features like exclamation counts, all-caps tokens, and frequencies of sensational terms (“shocking”, “explosive”, “unprecedented”), plus sentiment polarity and subjectivity scores. These are combined with TF-IDF vectors.
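The surface-level counts can be sketched as below; sentiment polarity and subjectivity would come from a separate sentiment library and are omitted here:

```python
SENSATIONAL_TERMS = {"shocking", "explosive", "unprecedented"}  # illustrative

def sensationalism_features(text):
    """Count exclamations, all-caps tokens, and sensational terms."""
    tokens = text.split()
    return {
        "exclamations": text.count("!"),
        "all_caps_tokens": sum(t.isupper() and len(t) > 1 for t in tokens),
        "sensational_terms": sum(
            t.lower().strip("!.,") in SENSATIONAL_TERMS for t in tokens
        ),
    }
```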
Malicious account
Approximates language associated with inauthentic or spam-like behavior using features such as average token length, repetition score, link count, mention count, punctuation ratio, and uppercase ratio alongside TF-IDF mean.
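A minimal character-level sketch of these account-style features:

```python
import re
import string

def account_features(text):
    """Spam-style surface features computed over the raw text."""
    tokens = text.split()
    n = max(len(text), 1)
    return {
        "avg_token_len": sum(len(t) for t in tokens) / max(len(tokens), 1),
        "link_count": len(re.findall(r"https?://\S+", text)),
        "mention_count": text.count("@"),
        "punct_ratio": sum(c in string.punctuation for c in text) / n,
        "upper_ratio": sum(c.isupper() for c in text) / n,
    }
```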
Naive realism
Estimates how strongly a statement presents its viewpoint as the absolute truth. Features include ratios of absolutist terms (“always”, “never”, “clearly”), cautious terms (“maybe”, “might”, “possibly”), and dismissive language (“biased”, “brainwashed”), plus sentiment subjectivity and polarity.
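The lexicon ratios can be sketched as follows; the word lists are the illustrative subsets named above, not the full lexicons:

```python
ABSOLUTIST = {"always", "never", "clearly"}      # illustrative subsets
CAUTIOUS = {"maybe", "might", "possibly"}
DISMISSIVE = {"biased", "brainwashed"}

def realism_features(tokens):
    """Lexicon-ratio features for the naive realism factor."""
    n = max(len(tokens), 1)
    lowered = [t.lower() for t in tokens]
    return {
        "absolutist_ratio": sum(t in ABSOLUTIST for t in lowered) / n,
        "cautious_ratio": sum(t in CAUTIOUS for t in lowered) / n,
        "dismissive_ratio": sum(t in DISMISSIVE for t in lowered) / n,
    }
```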

Limitations & Validity Considerations

While LIAR-PLUS is a widely used dataset, we must consider that it focuses on short political statements, which may not capture the full complexity of the long-form articles found on news outlets.