Microsoft DELEGATE-52: LLM Agents Silently Corrupt Documents in Long Workflows

A Microsoft Research preprint reveals that even frontier LLM agents silently corrupt documents during multi-step workflows. The DELEGATE-52 benchmark tested 19 models from Google, Anthropic, OpenAI, and others, finding pervasive factual drift that often goes unnoticed. The findings raise serious concerns for the reliability of AI assistants in Microsoft 365 and Windows.

A Microsoft Research preprint released April 17, 2026, delivers a sobering verdict on the reliability of AI agents in document editing: left to run multi-step workflows, even the most advanced large language models introduce silent, subtle corruptions that slip past human reviewers. The paper, titled DELEGATE-52, comes from authors Philippe Laban, Tobias Schnabel, and Jennifer Neville, and it evaluated 19 LLMs—including frontier systems from Google, Anthropic, Meta, and OpenAI—finding none immune to the phenomenon.

Silent corruption means errors that do not trigger obvious inconsistencies, formatting breaks, or blatant grammatical glitches. Instead, facts shift slightly, dates migrate, names invert, or numerical values drift by small margins, all while the document appears polished and coherent. In a 10-step editing chain, these micro-errors compound, leaving a final text that differs from what the user intended in ways that are almost imperceptible.

The finding lands at a critical juncture. Microsoft 365 Copilot, Gemini for Google Workspace (formerly Duet AI), and a wave of third-party assistants are embedding LLMs directly into the flow of everyday office work. Users are increasingly comfortable delegating multi-step tasks to autonomous agents, such as "rewrite this report in a more formal tone, then summarize section three, then update the quarterly figures from the attached spreadsheet." DELEGATE-52 suggests that trust is premature.

The Rise of AI Agents in Productivity Software

The past eighteen months have seen a Cambrian explosion of AI-powered editing tools inside Windows and Microsoft 365. Word now offers a native Copilot pane capable of restructuring entire documents; Excel can auto-generate formulas from natural-language descriptions; Outlook drafts reply threads. Beyond Microsoft, Google’s Gemini Assistant writes and refines text across Docs, Sheets, and Slides. Startups sell agentic layers that stitch together dozens of API calls to perform multi-app workflows.

All of these tools rely on the same underlying technology: autoregressive language models that predict token after token. Their strength—fluid, context-aware generation—also contains the seed of the corruption problem. Because each token is chosen probabilistically, and because each step in a long workflow feeds the output of the previous step as input, tiny deviations amplify. The agent never halts with an error code; it simply keeps writing, confident and incorrect.

Inside DELEGATE-52: How the Benchmark Works

Though the full paper has yet to be publicly released on a preprint server, the authors describe DELEGATE-52 as a controlled evaluation framework that mimics real-world document-editing chains. The name itself hints at its design: “delegate” refers to the human practice of handing off a document to an assistant for a series of transformations, and “52” likely denotes the number of distinct task templates or editing operations tested.

Each trial begins with a source document—a contract, a technical manual, a legal brief, a scientific article—and a list of natural-language instructions. For example: “Remove passive voice from section 1,” “Update all monetary values from 2025 to 2026 dollars,” “Rename Company A to Company B throughout,” “Merge the introduction paragraphs,” “Translate the abstract to French.” The agent must carry out these instructions in sequence without human inspection between steps.

Success is not measured by a simple grammar check. The researchers evaluated both surface-level attributes (spelling, grammar, formatting) and deep semantic integrity (factual accuracy, logical consistency, number fidelity). They also tracked how errors propagate: an imprecise Replace-All for “Acme” might inadvertently catch “acme” embedded inside other words, or a summarizer might drop a key caveat that a later step relies upon.
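The Replace-All pitfall is easy to reproduce outside an LLM. The sketch below is illustrative only and is not taken from the benchmark: the sample sentence, the replacement name "Globex", and both helper functions are hypothetical. It contrasts a case-insensitive substring replace with a replace that is anchored to word boundaries.

```python
import re


def rename_naive(text: str, old: str, new: str) -> str:
    """Imprecise variant: case-insensitive substring replacement."""
    return re.sub(re.escape(old), lambda m: new, text, flags=re.IGNORECASE)


def rename_whole_word(text: str, old: str, new: str) -> str:
    """Replace `old` only where it stands alone as a word, with exact case."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A function replacement avoids backslash interpretation in `new`.
    return pattern.sub(lambda m: new, text)


# Hypothetical sample: a company name, a common noun, and a place name.
sample = "Acme shall reach the acme of service. Acmeville is excluded."

print(rename_naive(sample, "Acme", "Globex"))
print(rename_whole_word(sample, "Acme", "Globex"))
```

The naive version rewrites the common noun and the place name along with the company name, and the result still reads as grammatical text, which is why this kind of slip tends to survive a quick review.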

The 19 models tested reportedly span the entire LLM landscape: OpenAI’s GPT-4o and o-series, Anthropic’s Claude 3.5 Sonnet and Opus, Google’s Gemini 2.0 Pro, Meta’s Llama 3.3 and 4, Mistral Large, Cohere Command R+, and a handful of fine-tuned open-weight models. The preprint stresses that no architecture or training paradigm was immune.

The Alarming Reality of Silent Corruption

“Silent corruption” is the paper’s central insight. Unlike the hallucinations that produce nonsensical or contradictory text—which often jump out at a human editor—these corruptions hide in plain sight. A date might change from “April 17” to “April 18,” a percentage from “12.3%” to “12.8%,” or the order of first and last names might flip in a citation list. To a harried professional scanning for obvious mistakes, the document looks flawless.
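Drift of this kind is hard to spot by eye but comparatively easy to flag mechanically. As a minimal sketch, assuming a crude regular expression is an acceptable stand-in for "numeric token" (the pattern, function names, and sample sentences are all illustrative, not from the paper), one can compare the numbers present before and after an edit:

```python
import re
from collections import Counter

# Assumed, deliberately simple pattern: integers, decimals, thousands
# separators, and an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def numeric_tokens(text: str) -> Counter:
    """Count every numeric token that appears in the text."""
    return Counter(NUMBER.findall(text))


def numeric_drift(before: str, after: str) -> tuple[list[str], list[str]]:
    """Return (tokens that disappeared, tokens that newly appeared)."""
    b, a = numeric_tokens(before), numeric_tokens(after)
    missing = sorted((b - a).elements())
    added = sorted((a - b).elements())
    return missing, added


before = "Filed April 17. Revenue grew 12.3% to 1,250 units."
after = "Filed April 18. Revenue grew 12.8% to 1,250 units."

missing, added = numeric_drift(before, after)
print("disappeared:", missing)
print("appeared:", added)
```

A check like this only raises a flag for review. An instruction that legitimately changes figures will also trip it, and it says nothing about names or negations.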

In one illustrative example the preprint reportedly dissects, a 15-step chain that reformatted a financial disclosure document ended with the total assets column off by 0.04%. The error traced back to a rounding issue introduced in step 4, which three subsequent steps then propagated as they summarized and restructured the table. No error message appeared; no alert warned the user.

This is especially dangerous because the very promise of AI agents is ambient assistance—working in the background, saving users time, requiring minimal oversight. If oversight is still necessary at every link in the chain, the productivity gains evaporate.

Why Even Frontier Models Fail

The preprint does not shy away from naming the failure modes. At the core is a fundamental tension: LLMs are next-token predictors, not structured-revision engines. When asked to perform an edit, they regenerate text rather than applying a discrete, auditable change. With each regeneration, the model’s attention over the entire document context shifts, and small perturbations creep in.

Multi-step compounding worsens the problem. Mathematically, if each step independently introduces an error with a probability of just 0.5%, a 20-step pipeline has roughly a 9.5% chance (1 - 0.995^20) of ending with at least one corruption. The researchers’ experiments suggest that real-world error rates per step are considerably higher for complex semantic tasks.
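The compounding arithmetic follows from the independence assumption: the chance that every one of n steps is clean is (1 - p)^n, so the chance of at least one corruption is 1 - (1 - p)^n. For p = 0.5% and n = 20 that works out to about 9.5%. The short sketch below tabulates the same formula for a few chain lengths; the function name and the chosen chain lengths are illustrative.

```python
def p_at_least_one_corruption(p_step: float, n_steps: int) -> float:
    """Probability that an n-step chain contains at least one error,
    assuming each step fails independently with probability p_step."""
    return 1.0 - (1.0 - p_step) ** n_steps


# Per-step error rate quoted in the article: 0.5%.
for n in (1, 5, 10, 20, 50):
    risk = p_at_least_one_corruption(0.005, n)
    print(f"{n:>2} steps: {risk:.1%}")
```

Independence is a simplifying assumption. If later steps build on an earlier mistake, as the article describes, errors are correlated and this figure is only a rough guide.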

Moreover, the preprint highlights that larger models are not necessarily more robust. While they tend to make fewer surface errors, they can be overconfident in their revisions, making them more likely to produce subtle factual drift. The authors note that reinforcement learning from human feedback (RLHF) often optimizes for helpful, concise responses—not for preservation of exact content fidelity.

What This Means for Windows and Microsoft 365 Users

For the millions of professionals who rely on Windows PCs and Microsoft 365, the DELEGATE-52 results are a cautionary tale. Copilot inside Word, Excel, and PowerPoint already supports multi-turn interactions and, with extensibility via Graph connectors, can orchestrate actions across emails, calendars, and files. As these features graduate from single-shot prompts to persistent, autonomous workflows, the risk of silent corruption looms.

Consider a legal contract assistant: a lawyer asks Copilot to update all clause references to the new corporate name, redline the changes, and then generate a comparison report. If the agent silently drops a negation in a liability clause, changing “not liable” to “liable,” the consequences could be ruinous. The error would be easy to miss even in a diff view: because the entire paragraph was regenerated, not just the name, the one change that matters is buried among incidental rewordings.

Users should adopt a trust-but-verify posture. Simple procedural guardrails—like forcing agents to output a change log, using cryptographic checksums on boilerplate sections, or requiring human sign-off for critical documents—are not yet standard in consumer-grade tools. The onus remains on the user to read every line that an agent touches, much as they would with a junior associate.
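The checksum guardrail can be approximated with a few lines of standard-library code. The following is a minimal sketch under stated assumptions: the document is already split into named sections, whitespace-only changes are considered harmless, and the section names, sample clauses, and helper functions are all hypothetical.

```python
import hashlib


def fingerprint(section: str) -> str:
    """SHA-256 of the section text with runs of whitespace collapsed."""
    normalized = " ".join(section.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_sections(
    before: dict[str, str], after: dict[str, str], protected: set[str]
) -> list[str]:
    """Names of protected sections whose fingerprint differs after editing."""
    return sorted(
        name
        for name in protected
        if fingerprint(before.get(name, "")) != fingerprint(after.get(name, ""))
    )


# Hypothetical contract: only the intro was supposed to change.
before = {
    "intro": "This agreement is between Acme and the client.",
    "liability": "The supplier is not liable for indirect losses.",
}
after = {
    "intro": "This agreement is between Globex and the client.",
    "liability": "The supplier is liable for indirect losses.",
}

print(changed_sections(before, after, protected={"liability"}))
```

A hash can only say that a protected section changed, not how. It works best as a tripwire that routes the section to a human reviewer.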

Research Team and Publication Context

The three Microsoft researchers are known figures in the AI reliability community. Philippe Laban has previously published on faithfulness and factuality in abstractive summarization. Tobias Schnabel’s work spans interactive machine learning and human-AI collaboration. Jennifer Neville leads research on robust and trustworthy AI at Microsoft Research. Their collaboration signals that the company is investing seriously in understanding the failure modes of its own copilot products before they are deployed at scale.

The preprint, dated April 17, 2026, has not yet appeared on arXiv or in conference proceedings at the time of writing, but it is being circulated among industry labs. Given Microsoft’s push to integrate generative AI into every layer of Windows and Office, the paper is likely intended as both an academic contribution and an internal roadmap for hardening agentic features.

Industry Reactions and Parallel Efforts

Although the preprint is too new for formal responses, the findings echo concerns raised by the broader AI safety community. Earlier benchmarks like SWE-bench and WebArena already demonstrated that LLM agents struggle with long-horizon, composite tasks. DELEGATE-52 focuses specifically on document fidelity, a domain where correctness is paramount and failure is often invisible.

OpenAI has talked publicly about “reliability layers” for its o-series reasoning models; Anthropic has published research on “faithful reasoning” and automated red-teaming. Google DeepMind is exploring self-verification loops. Yet none has claimed to have solved the silent corruption problem. The Microsoft paper likely adds fuel to the argument that full autonomy in document editing requires a fundamentally different approach, perhaps hybrid systems that combine LLMs with deterministic rule engines or formal verification.

The Path Forward: Building Trustworthy Agents

The DELEGATE-52 preprint does not end with a ready-made solution, but it does point to several research directions. First, it advocates for “edit transparency”: every change an agent makes should be traceable to a human-understandable diff, not buried in a wholesale rewrite. Second, it suggests that models need explicit training objectives that penalize factual drift, perhaps through contrastive losses that reward exact preservation of meaning.

Third, the authors hint at architectural changes—separating the “planning” module from the “editing” module so that higher-level intent remains stable even as surface forms change. Fourth, they call for industry-wide adoption of benchmarking suites like DELEGATE-52, making document-corruption resistance a standard metric alongside MMLU or HumanEval.
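The “edit transparency” idea can be illustrated with a word-level change log built on Python's standard difflib module. This is only a sketch of the general technique, not the mechanism the authors propose; the function name and the sample clause are hypothetical.

```python
import difflib


def word_changes(before: str, after: str) -> list[tuple[str, str, str]]:
    """List every non-identical span as (operation, old words, new words)."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical edit: the rename was requested, the dropped "not" was not.
before = "Acme is not liable for indirect losses."
after = "Globex is liable for indirect losses."

for operation, old, new in word_changes(before, after):
    print(f"{operation}: {old!r} -> {new!r}")
```

Listing each changed span separately puts the unrequested deletion on its own line next to the requested rename, so a reviewer can compare the log against the instruction instead of rereading the whole paragraph.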

For Microsoft, the preprint may accelerate the development of guardrails in Copilot and Windows AI. Insiders note that future builds of Word may include an optional “fidelity mode” that locks down structure and requests approval for any semantic change beyond a threshold. Whether users will accept such friction is an open question.

Silent corruption is not a malicious attack; it is an emergent property of autoregressive text generation. The DELEGATE-52 study lays bare how fragile today’s most capable AI systems really are. As we hand them the keys to our documents, our budgets, and our contracts, the reminder is timely: AI that writes beautifully may still be writing things that aren’t true.
