Results

Key Results & Evaluation

Carefully structured prompting, particularly Fractal Chain-of-Thought reasoning, substantially improves accuracy on each factuality factor compared to naive baselines, while reducing hallucinations and misclassifications.

Our primary evaluation metric was accuracy, measured against our ground-truth dataset of 40 hand-labeled articles. To rigorously measure the impact of our prompt engineering, we established a zero-shot baseline for each factuality factor. This baseline asked the model to score the article using only the core definitions of the factor, without any examples or complex reasoning steps. As shown in the data below, the baseline performance was generally poor, with the Malicious Account agent scoring as low as 0.1 accuracy. By progressively layering techniques, moving from In-Context Learning (ICL) to Function Calling, and finally to Fractal Chain-of-Thought (FCoT), we were able to see which reasoning structures drove performance improvements, ultimately pushing our Malicious Account accuracy to 0.95.
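The accuracy metric described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual evaluation code; the function and variable names are hypothetical.

```python
# Sketch of per-factor accuracy against the hand-labeled ground truth.
# Labels here are illustrative placeholders, not our actual dataset.
def accuracy(predictions, ground_truth):
    """Fraction of articles where the agent's label matches the hand label."""
    assert len(predictions) == len(ground_truth)
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(predictions)

# e.g. 40 hand-labeled articles for one factuality factor
preds  = [1, 0, 1, 1] * 10
labels = [1, 0, 0, 1] * 10
print(accuracy(preds, labels))  # 0.75
```

The same computation is repeated once per factuality factor, which is how each cell in the table below is produced.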

Summary of accuracies

Prompt                            Frequency heuristic   Malicious account   Sensationalism   Naive realism
Base                              0.300                 0.100               0.425            0.375
Chain-of-Thought                  0.400                 0.050               0.575            0.450
In-Context Learning (ICL)         0.675                 0.550               0.575            0.450
Function Calling                  0.525                 0.125               0.500            0.400
FCoT 1 (Gemini 2.5 Flash-Lite)    0.200                 0.775               0.450            0.400
FCoT 1                            0.550                 0.225               0.525            0.475
FCoT 3                            0.525                 0.950               0.575            0.475
FCoT 4 (final)                    0.650                 0.950               0.675            0.400

Every prompt was evaluated using Gemini 3-Flash-Preview unless otherwise specified.

Factor-wise findings

Frequency heuristic
Accuracy jumps from 0.3 (Base) to 0.675 with ICL and stabilizes around 0.65 with FCoT 4. The big gain from ICL suggests that clearly labeled examples are especially effective for teaching the model the difference between legitimate consensus and manipulative "everyone knows" rhetoric.
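The ICL prompts that drove this gain can be sketched as a prompt-assembly step. The factor definition and labeled excerpts below are hypothetical stand-ins, not items from our actual example set.

```python
# Illustrative In-Context Learning prompt builder: a factor definition
# followed by labeled examples, then the unscored article excerpt.
FACTOR_DEFINITION = (
    "Frequency heuristic: claims framed as true because 'everyone' "
    "believes or repeats them, rather than because of evidence."
)

EXAMPLES = [  # (article excerpt, hand-assigned score) -- hypothetical
    ("Everyone knows the election was stolen, so it must be true.", 0.9),
    ("A peer-reviewed meta-analysis of 40 studies found no effect.", 0.1),
]

def build_icl_prompt(article_text: str) -> str:
    parts = [FACTOR_DEFINITION, ""]
    for excerpt, score in EXAMPLES:
        parts.append(f"Excerpt: {excerpt}\nScore: {score}\n")
    parts.append(f"Excerpt: {article_text}\nScore:")
    return "\n".join(parts)

print(build_icl_prompt("Millions of posts agree, so the claim stands."))
```

The clearly labeled contrast between manipulative and legitimate-consensus excerpts is what the model appears to learn from.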
Malicious account
The Malicious Account factor sees the most dramatic improvement, from 0.10 (base) to 0.95 (FCoT 4). Early prompts often over-predicted deception and hallucinated article sources. FCoT's hallucination constraints and multi-iteration analysis made the agent much more accurate.
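The multi-iteration FCoT structure can be sketched as a refinement loop. This is a hedged outline only: `call_model` is a placeholder for the actual Gemini request, and the prompt wording is illustrative rather than our production prompt.

```python
# Sketch of the Fractal Chain-of-Thought loop: each iteration re-reads
# the previous draft analysis and refines it, under an explicit
# hallucination constraint on cited sources.
def call_model(prompt: str) -> str:
    # Placeholder: in the real system this is a Gemini API call.
    return f"analysis of: {prompt[:40]}"

def fcot_score(article: str, iterations: int = 3) -> str:
    draft = ""
    for i in range(iterations):
        prompt = (
            f"Iteration {i + 1}/{iterations}.\n"
            "Only cite sources that appear verbatim in the article below; "
            "if none are named, say so.\n"  # hallucination constraint
            f"Previous draft: {draft or '(none)'}\n"
            f"Article: {article}"
        )
        draft = call_model(prompt)
    return draft
```

Forcing the agent to re-examine its own prior draft is what curbed the early over-prediction of deception.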
Sensationalism
Accuracy improved steadily from 0.425 (base) to 0.675 (FCoT 4). Early prompts struggled with penalizing neutral journalists who simply quoted a politician's highly exaggerated or sensational statements. By the final FCoT 4 prompt, we explicitly instructed the agent to differentiate between the author’s own narrative voice and direct quotes. This allowed the system to better detect when an article was weaponizing emotionally charged superlatives or prioritizing high-shock-value anecdotes over objective reporting.
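The voice-separation idea can be illustrated with a small preprocessing sketch. In the actual system this distinction is made through prompt instructions rather than code; the regex approach below is a hypothetical analogue.

```python
import re

# Illustrative separation of the author's narrative voice from direct
# quotes before scoring sensationalism.
def strip_quotes(text: str) -> str:
    """Replace double-quoted spans so only the author's own words remain."""
    return re.sub(r'"[^"]*"', '[QUOTE]', text)

article = ('The senator said this was "the worst disaster in history", '
           'a claim analysts dispute.')
print(strip_quotes(article))
# The senator said this was [QUOTE], a claim analysts dispute.
```

After this separation, exaggerated superlatives count against the article only when they appear in the author's own framing.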
Naive realism
Naive Realism is the hardest factor to optimize. Accuracy improves only slightly, from 0.375 (base) to the mid-0.40s under the more complex prompts. This suggests that dogmatic perspective-taking is more subtle and context-dependent than the other heuristics.

Limitations

While our multi-agent framework successfully improved factuality evaluation, we encountered several limitations during deployment:

  • Computational overhead & token consumption. The most successful prompting strategy, Fractal Chain-of-Thought (FCoT), requires three internal reasoning loops per agent, running in parallel across four different factuality factors. This multi-iterative approach drastically increases both token usage and processing time compared to a standard zero-shot prompt, introducing a trade-off between accuracy and computational efficiency.
  • Quotation vs. author voice. Early iterations of our system suffered from false positives by penalizing journalists who simply quoted a politician's exaggerated statements. While FCoT 4 largely corrected this by instructing the agent to isolate the author's narrative voice, highly blended articles can still occasionally confuse the agents.
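The computational overhead noted above can be made concrete with a back-of-the-envelope estimate, assuming (as stated) three reasoning loops per agent across four factuality factors. The per-call token figure is a hypothetical placeholder, not a measured value.

```python
# Rough cost of FCoT versus zero-shot, per article.
LOOPS_PER_AGENT = 3   # internal FCoT reasoning loops
AGENTS = 4            # one agent per factuality factor
TOKENS_PER_CALL = 2_000  # illustrative placeholder, not measured

zero_shot_calls = AGENTS               # one call per factor
fcot_calls = AGENTS * LOOPS_PER_AGENT  # every factor loops three times

print(fcot_calls / zero_shot_calls)   # 3.0x more model calls
print(fcot_calls * TOKENS_PER_CALL)   # 24000 tokens per article
```

Even under these rough assumptions, FCoT triples the number of model calls relative to zero-shot, which is the accuracy/efficiency trade-off described above.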

Article Results Example

[Figure: Multi-Agent Misinformation Analysis System interface showing article input, the agent performance table, and the factuality analysis output with synopsis and scores. Example input article results in the project UI.]