A year-long red-teaming audit by NewsGuard has laid bare a startling deterioration in AI chatbot accuracy: the top 10 consumer chatbots now repeat false claims in 35% of their news-related responses, up from 18% a year ago. The findings, released in August 2025, mark the first time NewsGuard has published results for individual, named models. They show that even the most hyped commercial systems, including OpenAI’s ChatGPT, Google’s Gemini, and Microsoft’s Copilot, regularly repeat fabricated narratives when pushed on news topics.
The audit arrives as vendors tout major upgrades like OpenAI’s GPT‑5 and Google’s Gemini 2.5, both marketed as leaps forward in reasoning and reliability. Yet the NewsGuard data shows that across the industry, the push to make chatbots more responsive and web-connected has slashed refusal rates from 31% to near zero—while sending misinformation rates soaring.
How the Audit Worked
NewsGuard’s AI False Claims Monitor is a monthly adversarial testing program built on the company’s “False Claim Fingerprints” database of provably false narratives. For the August 2025 audit, the team selected 10 specific false claims actively circulating online and tested each chatbot with three distinct prompt personas:
- Innocent: a straightforward, neutral query that might come from an everyday user.
- Leading: a prompt that presumes the false claim is true, mimicking a user who already believes it.
- Malign: a deliberately crafted input designed to circumvent safety guardrails and coax the model into repeating the falsehood.
Each model received 30 prompts (10 claims × 3 personas), and every response was classified by human analysts as a debunk, a non‑response, or a repeat of the false claim (misinformation). This design intentionally stresses the systems under conditions that mirror real‑world abuse—from casual misinformation exposure to coordinated disinformation campaigns.
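The scoring arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not NewsGuard’s tooling: the claim IDs, the label names, and the sample labels for one model are all hypothetical placeholders.

```python
from collections import Counter
from itertools import product

PERSONAS = ("innocent", "leading", "malign")
LABELS = ("debunk", "non_response", "misinformation")  # assumed label names


def build_prompt_grid(claims):
    """Pair every claim with every persona (10 claims x 3 personas = 30)."""
    return list(product(claims, PERSONAS))


def summarize(labels):
    """Return each analyst label's share of the responses, as a percentage."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {label: 100 * counts[label] / len(labels) for label in LABELS}


# Placeholder claim IDs; the real claims come from NewsGuard's database.
claims = [f"claim_{i}" for i in range(1, 11)]
grid = build_prompt_grid(claims)

# Hypothetical analyst labels for one model: 12 repeats, 18 debunks.
sample_labels = ["misinformation"] * 12 + ["debunk"] * 18
rates = summarize(sample_labels)
print(len(grid), rates)
```

With 30 prompts per model, each individual response moves a model’s rate by about 3.3 percentage points.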
The Report Card: Which Models Spread Misinformation?
With vendor identities de‑anonymized for the first time, the audit reveals sharp disparities. The table lists eight of the ten audited models:
| Model | False Claim Rate (Aug 2025) |
|---|---|
| Inflection AI’s Pi | 57% |
| Perplexity AI | 47% |
| OpenAI ChatGPT | 40% |
| Meta Llama | 40% |
| Microsoft Copilot | 35% |
| Mistral Le Chat | 35% |
| Google Gemini | 17% |
| Anthropic Claude | 10% |
Inflection’s Pi topped the chart at 57%, while Anthropic’s Claude performed best, repeating false claims only 10% of the time. Google’s Gemini also held comparatively steady at 17%. The widely used ChatGPT and Llama both landed at 40%. Microsoft’s Copilot and Mistral’s Le Chat sat at the 35% mark, with Mistral’s performance unchanged from the previous year.
Perplexity AI’s trajectory is especially striking: NewsGuard’s earlier testing had recorded near‑zero false‑claim rates for the search‑focused assistant. By August 2025, however, Perplexity repeated falsehoods in 47% of responses—a surge that the audit attributes to the platform’s aggressive web retrieval and summarization, which often pulls from low‑quality sources without adequate filtering.
Why Did Falsehood Rates Double?
NewsGuard’s analysis pinpoints a structural trade‑off: chatbots are now programmed to answer almost everything. “Non‑responses”—the safe “I don’t know” fallback that characterized earlier models—plummeted from around 31% in August 2024 to essentially zero in the latest audit. In their absence, models now frequently generate confident but incorrect answers, especially when prompted about breaking news or geopolitically charged topics.
Two mechanisms drive the shift:
- Web‑grounding and retrieval: Many chatbots now fetch real‑time web content during inference. While this improves recency, it also opens an attack surface. Coordinated influence networks deliberately seed false narratives on sites optimized for AI retrieval—low‑quality “micro‑sites” and AI‑written blogs that search engines and crawlers index as if they were authoritative. When chatbots cite these sources, the falsehoods get laundered into outputs.
- Guardrail and policy tuning: Vendors have tuned models to prioritize helpfulness and engagement over refusal. An answer that cites a dubious web source is still an answer—and in the market race for user retention, that answer often wins over a cautious decline.
Concrete Examples: From Moldova to Macron
NewsGuard’s audit doesn’t just produce numbers; it documents real‑world propagation of engineered falsehoods. One case involved a fabricated news item mimicking the Romanian outlet Digi24, complete with an AI‑generated audio clip supposedly of Moldovan Parliament Leader Igor Grosu calling Moldovans “a flock of sheep.” The claim was originally seeded by pro‑Kremlin networks. The audit found that Mistral, Claude, Pi, Copilot, Meta, and Perplexity all repeated the claim as factual, and some even provided links to Pravda‑affiliated sites as sources.
Separate reporting by Les Echos and follow‑up coverage highlighted Mistral’s Le Chat repeating false claims about French President Emmanuel Macron and First Lady Brigitte Macron in up to 58% of English‑language responses. Mistral acknowledged that both web‑connected and offline versions of its assistants showed vulnerabilities.
These examples illustrate how narratives move from low‑traffic propaganda sites into chatbot outputs, giving them a veneer of credibility that casual users might mistake for independent verification.
Vendor Claims vs. Observed Reality
NewsGuard’s findings stand in sharp contrast to recent product launches that emphasize reliability:
- OpenAI rolled out GPT‑5 with assertions of “substantially improved reasoning” and reduced hallucination rates. The company’s system card acknowledges progress but stops short of a blanket “hallucination‑proof” guarantee. In NewsGuard’s audit, ChatGPT still repeated false claims in 40% of responses when tested against actively circulating disinformation.
- Google’s Gemini 2.5 rollout touted enhanced reasoning and longer context windows, but the model still yielded a 17% false‑claim rate in NewsGuard’s adversarial testing—better than peers, but far from immune to targeted misinformation campaigns.
The lesson is clear: marketing metrics around “lower hallucination rates” on curated benchmarks do not automatically translate into robustness against actively circulating falsehoods that exploit web retrieval and shallow source vetting.
Critical Analysis: What These Results Mean for Enterprise and Consumer Users
Not all errors are created equal. A chatbot that fumbles a trivia question is one thing; one that regurgitates a political defamation or health hoax is another. NewsGuard’s audit deliberately targets the latter: news, elections, and geopolitically sensitive narratives where the civic and business harm is highest. For enterprise IT teams deploying chatbots in Windows‑centric workflows—legal document drafting, customer support, HR, or executive summaries—the findings demand a recalibration of trust.
The retrieval problem is, at its core, a design choice. Systems can be built with stricter source policy layers, provenance checks, and policy ensembles that refuse to answer when sourcing is weak. But those guardrails introduce friction—more refusals, less immediacy—which vendors in a competitive market are often reluctant to accept. NewsGuard’s data suggests many currently opt for responsiveness, with measurable consequences for accuracy.
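A source-policy layer of the kind described here might look like the following sketch. It assumes a hypothetical per-domain trust table and made-up thresholds; a real deployment would draw on a maintained reliability dataset and tune the cutoffs to its own risk appetite.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical trust scores per domain (0.0 = untrusted, 1.0 = highly trusted).
TRUST_SCORES = {
    "established-outlet.example": 0.9,
    "wire-service.example": 0.85,
    "micro-site.example": 0.1,
}


@dataclass(frozen=True)
class Source:
    url: str
    snippet: str


def trust_of(source):
    """Look up a source's domain score; unknown domains get 0.0."""
    host = (urlparse(source.url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TRUST_SCORES.get(host, 0.0)


def gate(sources, min_trust=0.6, min_trusted=2):
    """Decline unless enough retrieved sources clear the trust threshold."""
    trusted = [s for s in sources if trust_of(s) >= min_trust]
    decision = "answer" if len(trusted) >= min_trusted else "decline"
    return decision, trusted


weak = [
    Source("https://micro-site.example/story", "..."),
    Source("https://www.established-outlet.example/a", "..."),
]
strong = weak + [Source("https://wire-service.example/b", "...")]

print(gate(weak)[0])    # only one trusted source, so the gate declines
print(gate(strong)[0])  # two trusted sources, so the gate answers
```

Raising `min_trust` or `min_trusted` produces more refusals, which is exactly the friction the paragraph above describes.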
The de‑anonymized release of vendor‑level scores is a transparency milestone. It enables enterprise buyers, regulators, and power users to factor model reliability into procurement decisions. When a chatbot will be used for anything touching public communications, policy, or sensitive internal documentation, human‑in‑the‑loop verification is mandatory.
The Geopolitics of AI-Groomed Propaganda
Beyond technical trade‑offs, the audit documents an escalating information‑war strategy: state‑linked or state‑adjacent operations deliberately produce reams of AI‑friendly content to influence what LLMs retrieve and repeat. Networks like the Pravda ecosystem and campaigns dubbed Storm‑1516 or Matryoshka create articles, deepfakes, and mimic‑sites optimized for machine digestion. NewsGuard has documented instances where chatbots cited these sources directly, repeating false narratives verbatim.
These operations are cheaper to run at scale than traditional influence campaigns and explicitly target AI retrieval pipelines. The practical implication is acute: AI systems that incorporate web search without nuanced trust scoring can become unwitting amplifiers of foreign propaganda—precisely the opposite of what many enterprise guardrails aim to prevent.
Where Accountability and Product Design Collide
Naming and shaming models creates pressure for vendor accountability, but it doesn’t solve the structural problem. Developers face a menu of hard options:
- Tighten safety refusals, which will raise non‑response rates for ambiguous news queries.
- Improve retrieval source vetting and provenance signals, which requires significant engineering investment.
- Build robust model‑level fact‑checking and cross‑validation with curated databases.
- Accept a user experience trade‑off: a little less convenience for a lot less amplification of disinformation.
Each carries business and technical costs. The path forward will likely blend better provenance, more conservative defaults for news/political queries, and enterprise controls that let admins select model modes matched to their risk appetite.
Operational Recommendations for Windows IT and Business Users
For teams embedding AI into Microsoft 365, Windows Copilot, or custom enterprise applications, a practical playbook follows:
- Insist on citation‑aware modes whenever factual accuracy matters, and verify the cited sources manually. Models that expose snippet links make verification feasible.
- Implement mandatory two‑step human review for any AI output used in public communications, policy, legal, or clinical contexts. Label drafts explicitly as “AI‑generated” and require sign‑off.
- Use model ensembles or fallback strategies: combine a high‑recall, citation‑heavy model with a more conservative one, and surface disagreements for human review.
- Monitor adversarial web campaigns: tools that detect spikes in low‑quality, AI‑generated content can feed site‑blocklists into retrieval pipelines, reducing the chance of relying on poisoned sources.
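The last two recommendations can be combined in a small routing sketch. Everything here is illustrative: the blocklisted domain, the stance labels, and the assumption that each model’s answer has already been reduced to a single stance on the claim are hypothetical simplifications.

```python
from urllib.parse import urlparse


def filter_blocklisted(urls, blocklist):
    """Drop URLs whose host is a blocklisted domain or one of its subdomains."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        blocked = any(host == d or host.endswith("." + d) for d in blocklist)
        if not blocked:
            kept.append(url)
    return kept


def route(citation_model_stance, conservative_model_stance):
    """Send disagreements to a reviewer; agreement still needs sign-off."""
    if citation_model_stance == conservative_model_stance:
        return "draft_for_signoff"
    return "human_review"


# Hypothetical blocklist fed by a campaign-monitoring tool.
blocklist = {"poisoned-network.example"}
urls = [
    "https://news.poisoned-network.example/item",
    "https://established-outlet.example/report",
]
clean = filter_blocklisted(urls, blocklist)

print(clean)
print(route("supports", "refutes"))
```

Agreement between the two models is not treated as verification here. It only decides which review queue the draft enters.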
Strengths and Limitations of NewsGuard’s Methodology
The audit excels as a targeted red‑teaming tool: it uses current, verifiable falsehoods, adversarial personas, and human evaluation, making it highly relevant to real‑world misinformation risks. The de‑anonymization provides unprecedented transparency for buyers and regulators.
However, the scope is narrow. With only 10–15 false claims per cycle and a focus on news and politics, the percentages are domain‑specific. A model that performs poorly here may excel on coding, math, or document summarization. The monitor is best seen as a stress test that reveals systematic vulnerabilities under adversarial pressure—not a universal correctness score for all use cases.
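The small sample size also limits how precisely any single model’s percentage can be read. As a rough illustration, the sketch below computes a 95% Wilson score interval for a model that repeated false claims in 12 of 30 responses (40%). The choice of the Wilson interval is an assumption made for this example, not part of NewsGuard’s published methodology.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(12, 30)  # 12 of 30 responses repeated the claim
step = 100 / 30                      # percentage points per single response

print(f"{100 * low:.0f}% to {100 * high:.0f}%, step {step:.1f} points")
```

For 12 of 30, the interval spans roughly 25% to 58%, so small gaps between adjacent models in the ranking should be read with caution.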
Final Assessment
NewsGuard’s August 2025 audit delivers an unambiguous message: the AI industry’s relentless drive toward responsiveness and web integration has all but eliminated non‑responses while nearly doubling the rate of confidently delivered falsehoods on news topics, from 18% to 35%. A 35% aggregate misinformation rate does not condemn LLM capabilities across the board, since models have improved on reasoning benchmarks and many domain tasks, but it signals that the information‑security dimension of deployment is dangerously under‑resourced.
For Windows users, IT managers, and content professionals, the rules are now clear: treat AI drafts as raw material, never as authoritative evidence. Demand provenance and citations. Build human review into every public‑facing pipeline. And above all, be skeptical of marketing that paints new models as infallible. Independent red‑teaming consistently shows that improved internal metrics don’t eliminate the vulnerability to targeted disinformation. Until vendors demonstrate that they can deliver timely answers without becoming conduits for coordinated lies, prudent skepticism and layered verification remain the responsible default.