AI assistants including Microsoft Copilot and Google Gemini can fabricate analysis and cite non-existent evidence when left on their default model settings, according to a controlled experiment conducted in May 2026. The study, which processed identical datasets through multiple mainstream AI platforms, revealed that so-called “auto” modes—where the system chooses how to respond without explicit tool-based verification—frequently produced convincing but entirely fictional conclusions. For Windows users who increasingly rely on Copilot integrated into Microsoft 365 and the Edge browser, the findings underscore the critical need for manual oversight and the use of built-in verification tools.

Security researchers and enterprise IT teams have long warned about AI hallucination, but the experiment’s results quantify the gap between casual prompt-and-response interactions and robust, tool-assisted analysis. When the assistants were forced to execute code, query live databases, or run spreadsheet functions to derive their answers, the fabrication rate dropped to near zero. Without those checks, however, models generated plausible-sounding summaries that included invented data points, misattributed sources, and badly incorrect statistical results.

The Experiment: Identical Data, Divergent Falsehoods

In the May 2026 study, the same three structured datasets were fed to Microsoft Copilot, Google Gemini, and a leading open-weight model, all configured in their default interface modes. Each dataset contained financial, operational, or clinical records with clear ground-truth outcomes. Researchers asked each assistant to “analyze the data and provide key insights” without specifying a methodology or tool use.

Across 30 trials per platform, Copilot’s auto mode delivered at least one major factual error in 68% of responses. Gemini’s error rate was 71%. Both platforms frequently invented column names, misstated totals by orders of magnitude, and referred to “published studies” that did not exist. In several cases, the assistants even generated realistic DOI links that led to 404 pages when tested.

“It was the digital equivalent of an overconfident intern winging a presentation,” said Dr. Lena Park, a research scientist involved in the experiment. “The models weren’t just wrong—they were wrong with conviction, complete with fake footnotes.”

When researchers repeated the queries using explicit tool-use instructions—such as “perform this analysis using Python” or “calculate the mean with Excel functions”—the error rate fell below 4% across all platforms. Even simple guardrails like asking the model to “explain step by step and show your work” before drawing conclusions reduced hallucination by more than half, though not as effectively as enforced code execution.

Why “Auto” Mode Encourages Fabrication

Modern AI assistants often operate in a dual-mode architecture. In “auto” or “determine” mode, the model decides whether to answer purely from its training data or to invoke external tools like a code interpreter, web search, or plugin. This decision is itself made by the model, which may opt for the faster, lower-cost path of answering from pretrained knowledge—even when that knowledge is inaccurate for the specific dataset at hand.

The fundamental problem lies in the next-token prediction nature of large language models. When presented with a table of numbers and asked for analysis, the model does not “calculate.” Instead, it predicts the most statistically likely sequence of words that constitute a plausible analysis. For common data patterns, this often produces correct results. But for edge cases, ambiguous formatting, or datasets containing outliers, the model may generate an analysis that sounds right but is divorced from the actual computations.
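
To make the distinction between predicting and calculating concrete, here is a minimal Python sketch that computes summary statistics directly from a small dataset containing one outlier. The figures are invented for illustration and do not come from the experiment.

```python
from statistics import mean, median

# Hypothetical monthly revenue figures (illustrative only).
# The last value is a deliberate outlier.
revenue = [120, 125, 118, 122, 121, 119, 950]

# These values are calculated from the data, not predicted from patterns.
computed_mean = mean(revenue)
computed_median = median(revenue)

print(f"mean   = {computed_mean:.2f}")
print(f"median = {computed_median}")
```

Here the single outlier pulls the mean to roughly 239 while the median stays at 121. A summary written without running the numbers could easily describe a "typical" month of about 120 and quote that as the average, which is the kind of plausible but uncomputed claim described above.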

Microsoft’s own documentation acknowledges this limitation, stating that Copilot “may generate content that is inaccurate or inappropriate” and recommending that users “review all content generated by Copilot before using it.” Yet the default experience in Word, Excel, and the Copilot pane in Edge encourages users to accept generated summaries at face value.

The Tool-Check Divide: Code Execution as a Truth Anchor

When Copilot is explicitly given access to a Python or Excel environment, it transforms from a text generator into a computation orchestrator. The model translates the user’s query into code, executes it against the data, and then summarizes the actual output. This pipeline introduces a crucial deterministic step: the numbers that appear in the final summary are not predicted; they are calculated.
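
The compute-then-narrate pipeline can be sketched in a few lines of Python. This is not how Copilot is implemented internally; the dataset, the function names (`load_rows`, `compute_stats`, `narrate`), and the fixed analysis are all hypothetical stand-ins, chosen only to show where the deterministic step sits.

```python
import csv
import io
from statistics import mean

# Hypothetical dataset standing in for an uploaded spreadsheet (illustrative values).
CSV_DATA = """region,quarter,sales
North,Q1,100
North,Q2,140
South,Q1,90
South,Q2,70
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, converting sales to int."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["sales"] = int(row["sales"])
    return rows

def compute_stats(rows):
    """Deterministic step: every figure here is calculated from the data."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
    return {
        "total_by_region": totals,
        "mean_sales": mean(row["sales"] for row in rows),
    }

def narrate(stats):
    """Narration step: wording may vary, but numbers come only from `stats`."""
    parts = [
        f"{region}: {total}"
        for region, total in sorted(stats["total_by_region"].items())
    ]
    return (
        f"Total sales by region: {', '.join(parts)}. "
        f"Mean sale: {stats['mean_sales']}."
    )

stats = compute_stats(load_rows(CSV_DATA))
summary = narrate(stats)
print(summary)
```

The design point is the separation of concerns: `narrate` never sees the raw table and cannot introduce a figure that `compute_stats` did not produce. In a real assistant the narration is written by the model, so this guarantee is weaker, but the computed output still gives the reader something concrete to check against.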

In Excel, for example, Copilot can generate formulas, run them, and reference exact cell values. The same holds for Copilot in Power BI and for the Advanced Data Analysis (ADA) feature in ChatGPT Enterprise. In these tool-augmented modes, the model’s role shifts from fabricator to translator and narrator of computed results.

Google Gemini offers a comparable feature called “Code Execution,” which the user can enable explicitly or which Gemini can activate automatically when it detects a computational request. However, the experiment showed that Gemini’s automatic detection was inconsistent: it skipped code execution in 40% of cases where it was clearly warranted, leading to unsupported claims.

Real-World Consequences for Windows and Enterprise Users

For Windows enthusiasts who have embraced Copilot as a daily driver in Office apps, Edge, and even Windows 11’s system search, the implications are both unsettling and actionable. A financial analyst trusting an auto-generated trend report in Excel could present fabricated quarterly figures to stakeholders. A researcher using Copilot in Edge to summarize clinical trial PDFs might circulate conclusions based on misread statistics. A student answering a problem set with Copilot could submit solutions that are mathematically impossible.

Microsoft has positioned Copilot as a productivity revolution, embedding it deeply into Windows 11’s 24H2 update with a dedicated Copilot key on new keyboards and a persistent sidebar. The experiment’s findings suggest that, without user education and enforceable tool-check policies, the friction AI promises to remove may actually be a vital safety mechanism.

“We need to stop calling it ‘auto’ mode and start calling it ‘unverified mode’,” said Jamie Coleman, a senior solutions architect at a Fortune 500 firm. “The branding tricks users into trusting it.”

How to Protect Yourself: Enforce Tool Use and Verification

Based on the experiment’s results and current platform capabilities, users and administrators can take several practical steps to reduce hallucination risk:

  • Always prefer tool-augmented queries. When working with data in Copilot, begin your prompt with “Using Excel formulas…” or “Write and execute Python code to…” This forces the model into a path that relies on computed outputs rather than predictions.
  • Turn on explicit tool confirmation. In environments where it’s available, enable settings that require the assistant to confirm which tools it will use before responding. In Microsoft 365, sensitivity labels and DLP policies can limit which content Copilot is allowed to process, although they do not by themselves force tool-based verification.
  • Audit generated content religiously. Before sharing any Copilot-generated analysis, spot-check the underlying data. If a summary mentions a specific value, locate it in the source dataset. Verify any cited references by attempting to access them.
  • Deploy enterprise guardrails. IT administrators can use Microsoft Purview compliance controls to limit which connectors Copilot can access and to enforce review workflows. For Gemini, Workspace admins can disable automatic code execution and require user initiation.
  • Educate your team. Incorporate the experiment’s findings into internal training materials. Emphasize that the default mode is the riskiest mode, and that a few extra seconds of prompt engineering can prevent hours of damage control.
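
The spot-check step in the list above can be partly automated. The following is a minimal Python sketch, assuming you have the source values (and any aggregates you computed yourself) available as a list; the helper names `extract_numbers` and `unsupported_numbers` and the sample summary are hypothetical.

```python
import re

def extract_numbers(text):
    """Return the set of numeric values mentioned in the text, as floats."""
    # Matches integers and decimals, allowing thousands separators (e.g. 1,250 or 42.5).
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return {float(token.replace(",", "")) for token in tokens}

def unsupported_numbers(summary, known_values, tolerance=1e-9):
    """List numbers in `summary` that match no value in `known_values`."""
    known = [float(value) for value in known_values]
    missing = []
    for number in sorted(extract_numbers(summary)):
        if not any(abs(number - k) <= tolerance for k in known):
            missing.append(number)
    return missing

# Illustrative source data plus an aggregate computed independently.
source_values = [100, 140, 90, 70]
known = source_values + [sum(source_values)]

# Hypothetical AI-generated summary; the "low of 65" is not in the data.
summary = "Sales totaled 400, with a peak of 140 and a low of 65."

print(unsupported_numbers(summary, known))  # prints [65.0]
```

A check like this only flags figures that appear nowhere in the data or your own calculations. It cannot tell whether a number that does exist is being used in the right context, so it supplements manual review rather than replacing it.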

Microsoft and Google Respond

In statements following the experiment’s publication, both Microsoft and Google acknowledged the work and reiterated existing guidance. A Microsoft spokesperson pointed to the company’s Responsible AI dashboard and the “content credentials” feature that labels AI-generated material. Google highlighted Gemini’s “Gems” and custom instructions that let users specify when to use code execution.

Neither company announced changes to default modes or interface design. However, Microsoft did note that the Copilot experience in Excel and Power BI already prefers tool-based analysis when it detects a structured data request—though the experiment revealed that detection still fails silently in some scenarios.

The Broader AI Governance Picture

The hallucination problem in auto mode is not unique to Copilot or Gemini. It reflects a core challenge in AI governance: how to balance seamless user experience with the need for factual integrity. As AI becomes embedded in everything from Windows search to healthcare diagnostics, the cost of an undetected fabrication escalates.

Regulatory bodies in the EU and US have begun signaling that AI-generated content must be clearly labeled and, when used for consequential decisions, subject to human review. The EU AI Act, for instance, classifies AI systems used in areas such as employment and credit scoring as high-risk, requiring transparency, accuracy, and human-oversight measures. The data from the May 2026 study provides concrete evidence that tool-enforced verification is one effective technical safeguard.

For Windows watchers, the takeaway is clear: Copilot’s magic has limits. The same transformer architecture that crafts elegant emails in Outlook can turn a simple data query into a minefield of plausible falsehoods. By understanding the split between auto prediction and tool-based computation, users can harness Copilot’s true power—not as an oracle, but as a clever interface to the deterministic tools that actually deliver accurate analysis.

Until default behaviors change, the burden remains on the human in the loop. And as the experiment warns, when the tool is left to its own devices, that loop can break without anyone noticing.