Datasets & Preprocessing
LIAR-PLUS
We used the LIAR-PLUS dataset to train our factuality factor predictive models. LIAR-PLUS contains over 12,800 short political statements that have been fact-checked by PolitiFact. It extends the original LIAR dataset and was released by Alhindi et al. (2018) in Where is Your Evidence: Improving Fact-Checking by Justification Modeling (FEVER Workshop, EMNLP 2018). The dataset and code are available on GitHub.
- Labels: six categorical truth ratings from pants-on-fire to true.
- Metadata: speaker name, occupation, political party, state, subject, and justification text.
- Splits: predefined train / validation / test splits provided by the dataset authors. All of our models were trained strictly on the train split.
These metadata fields allow us to model factuality using both linguistic features (e.g., wording, sentiment, repetition) and socio-political context (e.g., speaker, party, topic).
PolitiFact Scraping
To reduce temporal bias and better match the current misinformation landscape, we merged in more recent data scraped from PolitiFact.
These additional statements:
- Follow the same short-claim format as LIAR-PLUS.
- Reuse PolitiFact’s truth ratings and justification text.
- Expand coverage to more recent political topics and narratives, making our models less anchored to old news.
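The scraping step amounts to parsing statement text and truth ratings out of PolitiFact listing pages. The sketch below shows the parsing half only; the CSS selectors are assumptions about the site's markup at the time of writing and would need to be re-checked against the live pages.

```python
from bs4 import BeautifulSoup

def parse_listing(html: str) -> list[dict]:
    """Extract (statement, rating) pairs from one fact-check listing page.

    The class names below are assumptions about PolitiFact's markup and
    may change; treat them as placeholders to verify against the site.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("li.o-listicle__item"):
        statement = card.select_one("div.m-statement__quote")
        rating = card.select_one("div.m-statement__meter img")
        if statement is None or rating is None:
            continue  # skip malformed or ad cards
        items.append({
            "statement": statement.get_text(strip=True),
            "rating": rating.get("alt", "").strip(),
        })
    return items
```

Fetching the pages themselves (with polite rate limiting) is a separate concern; keeping the parser as a pure function of an HTML string makes it easy to test offline.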
Preprocessing
Before training, we performed several preprocessing steps to create a clean, uniform dataset across original and scraped data:
- Standardize column names across splits and sources.
- Remove redundant index or ID columns that do not carry information.
- Normalize text: strip whitespace, convert to lowercase during tokenization, and remove file suffixes such as “.json” from IDs.
- Convert metadata fields into consistent formats (e.g., categorical encodings for party, binary indicators for subject domains).
The final result is a dataset that can be fed directly into our feature engineering pipelines for the four factuality factors.
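The cleaning steps above can be sketched as a single pandas transformation. Column names here are illustrative; the categorical and binary encodings for metadata fields are handled downstream in feature engineering.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the shared cleaning steps to one source (original or scraped)."""
    df = df.copy()
    # Standardize column names across splits and sources.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Remove redundant index/ID columns that carry no information.
    df = df.drop(columns=[c for c in ("index", "unnamed:_0") if c in df.columns])
    # Normalize text fields: strip surrounding whitespace.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    # Remove the ".json" suffix that LIAR-PLUS attaches to statement IDs.
    if "id" in df.columns:
        df["id"] = df["id"].str.replace(r"\.json$", "", regex=True)
    return df
```

Lowercasing is deliberately left to the tokenizer, as noted above, so cased metadata (e.g., state abbreviations) is preserved in the cleaned table.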
Feature Engineering
From this dataset, we derived features tailored to each of the four factuality factor predictive models.
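As one illustration of the linguistic side of this pipeline, the sketch below computes a few simple surface features of the kind mentioned earlier (wording and repetition). The exact feature set per factor is a project-specific choice; these names and formulas are ours, not part of the dataset.

```python
import re
from collections import Counter

def linguistic_features(statement: str) -> dict:
    """Toy surface features over a short claim; illustrative only."""
    tokens = re.findall(r"[a-z']+", statement.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "n_tokens": n,
        "avg_word_len": sum(map(len, tokens)) / n if n else 0.0,
        # Repetition: share of tokens belonging to a repeated word type.
        "repetition_ratio": sum(c for c in counts.values() if c > 1) / n if n else 0.0,
        "n_exclaims": statement.count("!"),
    }
```

Socio-political context features (speaker, party, topic) are categorical and are encoded separately, as described in the preprocessing step.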
Limitations & Validity Considerations
While LIAR-PLUS is a widely used dataset, it focuses on short political statements, which may not capture the full complexity of long-form articles published by news outlets.