DBT's Microsoft Copilot Trial Exposes Hallucination Risks and a Missing Business Case

The UK's Department for Business and Trade (DBT) has released findings from a three-month Microsoft 365 Copilot pilot that reveal a familiar gap between AI promise and performance: 71.7% of users were satisfied, yet time savings were modest, some tasks took longer, and the evaluation conspicuously avoided any financial cost-benefit analysis. The trial, which ran from October to December 2024, put Copilot in the hands of staff across routine office workflows—and while it earned high marks for meeting transcription and email drafting, the report warned that hallucinated outputs, environmental concerns, and the absence of measurable productivity gains demand serious governance before any wider rollout.

DBT's report lands as governments worldwide scramble to inject generative AI into public services, often without settled policies on data security, accuracy, or environmental cost. The pilot's raw numbers—80% of participants said the tool was useful to some degree, and neurodiverse users and non-native English speakers reported significant accessibility benefits—paint Copilot as a warmly received digital assistant. But the deeper analysis shows a product that shaves minutes off templated tasks while risking factual errors that could erode trust in government communications. For Windows enterprise administrators and IT decision-makers watching this space, the trial is a case study in what Copilot can actually do today—and what it cannot.

The Trial at a Glance: High-Volume, Low-Risk Workloads

DBT limited its pilot to standard office productivity scenarios. Participants used Copilot integrated into Word, Outlook, Teams, Excel, and PowerPoint, leaning heavily on the same few use cases that have emerged in other pilots:

Transcribing and summarising Teams meetings
Drafting and rewriting emails
Summarising long documents and written communications
Generating initial drafts for reports and templated content

The integration with Microsoft Graph—which lets Copilot pull context from emails, calendars, and files—lowered the barrier to adoption. Staff didn't need to switch tools; the assistant lived inside the apps they already used. That seamlessness, however, also meant that when Copilot stumbled, the friction hit directly inside core workflows.

What DBT Measured—and What It Skipped

The report tracked user satisfaction, perceived usefulness, and self-reported time savings. But DBT stopped short of the metrics that procurement teams really need: no total cost of ownership model, no comparison of license fees against hours saved, and no quantified environmental impact. The evaluation explicitly flagged these gaps, recommending further study before any department-wide commitment.

Time Savings: Real but Razor-Thin

Across the pilot, time savings were small and highly task-dependent. Written tasks—drafting emails, summarising documents—showed the biggest gains. But DBT also recorded instances where using Copilot took longer than doing the task manually. Scheduling and image generation were singled out as slower, likely because the back-and-forth of prompt refinement and verification ate up any theoretical efficiency.

This patchy performance aligns with findings from Australia's Treasury Copilot trial and several university pilots. Those studies also saw minutes, not hours, saved per week. The Australian Treasury's analysis suggested that even a saving of 13 minutes per week per user could justify licensing costs at scale—but that threshold is highly sensitive to local salary rates, the actual proportion of users who realise gains, and hidden governance and training costs. DBT's report didn't attempt a similar calculation, leaving value-for-money as an open question.
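The sensitivity of that threshold can be sketched with a short break-even calculation. Every input below (licence price, loaded hourly cost, share of users who realise gains, governance overhead) is a hypothetical placeholder rather than a figure from DBT or the Australian Treasury. The function only illustrates how the required minutes move as those inputs change.

```python
WEEKS_PER_MONTH = 52 / 12


def break_even_minutes_per_week(
    licence_cost_per_month: float,
    loaded_hourly_cost: float,
    share_with_gains: float = 1.0,
    overhead_per_month: float = 0.0,
) -> float:
    """Minutes per week each benefiting user must save to cover per-user costs.

    All parameters are illustrative assumptions:
    - licence_cost_per_month: licence fee per licensed user
    - loaded_hourly_cost: salary plus on-costs per hour of staff time
    - share_with_gains: fraction of licensed users who actually save time
    - overhead_per_month: per-user governance and training cost
    """
    if loaded_hourly_cost <= 0:
        raise ValueError("loaded_hourly_cost must be positive")
    if not 0 < share_with_gains <= 1:
        raise ValueError("share_with_gains must be in (0, 1]")

    weekly_cost = (licence_cost_per_month + overhead_per_month) / WEEKS_PER_MONTH
    minutes_if_everyone_benefits = weekly_cost / loaded_hourly_cost * 60
    # If only some users benefit, they must also cover the idle licences.
    return minutes_if_everyone_benefits / share_with_gains


if __name__ == "__main__":
    # Placeholder inputs: 26 per month licence, 30 per hour loaded cost.
    print(round(break_even_minutes_per_week(26, 30), 1))       # 12.0
    # Same inputs, but only half of licensed users realise any gain.
    print(round(break_even_minutes_per_week(26, 30, 0.5), 1))  # 24.0
```

With these placeholder numbers, halving the share of users who benefit doubles the minutes each of them must save, which is why role-based targeting matters to the business case.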

Hallucinations: The Trust Problem No One Can Ignore

Perhaps the most sobering part of DBT's evaluation is its frank acknowledgment of hallucinations: outputs that are fluent, confident, and wrong. Participants flagged this as a material risk, especially for any AI-generated text that might be forwarded without human review. In a government setting, where citizens rely on accurate official communications, a single fabricated figure or misattributed policy snippet could cause disproportionate reputational harm.

The report's recommendation is unequivocal: human review must be mandatory for any substantive output. That position is now standard across public-sector pilots. It also means that the net time savings from Copilot must be calculated after subtracting review overhead. When verification cycles are added, the productivity equation gets murkier, and some tasks that appeared to be accelerated may actually cost more effort end-to-end.
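As a minimal illustration of that arithmetic, the sketch below subtracts review time from the apparent saving. The function name and the minute values are hypothetical and are not taken from the DBT report.

```python
def net_minutes_saved(
    manual_minutes: float,
    copilot_minutes: float,
    review_minutes: float,
) -> float:
    """Net saving for one task once mandatory human review is counted.

    A negative result means the assisted route cost more effort end-to-end.
    """
    return manual_minutes - (copilot_minutes + review_minutes)


if __name__ == "__main__":
    # Hypothetical task: 20 minutes by hand, 8 minutes with the assistant.
    print(net_minutes_saved(20, 8, 0))   # 12: looks like a clear win
    print(net_minutes_saved(20, 8, 15))  # -3: review overhead erases the gain
```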

Environmental and Ethical Frictions

Staff raised ethical concerns about the carbon footprint of large language models, and DBT noted these without quantifying them. The pilot used Microsoft's cloud infrastructure, but the department did not measure energy consumption or emissions attributable to its Copilot usage. For public-sector IT leaders under pressure to meet net-zero targets, this is a critical blind spot. Without transparent data from Microsoft—and without lifecycle assessments that account for training and inference—environmental claims remain qualitative and impossible to cost.

Other pilots have faced similar hurdles. Some organisations have begun asking vendors for energy attribution per query or per user, but Microsoft does not publicly provide such granular metrics for Copilot. DBT's report recommended further evaluation, but the phrasing suggests the department knows it cannot scale AI services without hard numbers on both costs and carbon.

How DBT's Findings Fit the Bigger Picture

DBT's results are not an outlier. Multiple public-sector and enterprise trials have converged on the same set of conclusions:

Copilot excels at high-frequency, templated, and communication-heavy tasks.
It struggles with complex, high-context, or deeply analytical work.
User satisfaction is consistently high when the tool is applied to its sweet spots.
Measurable productivity uplift is modest and concentrated in specific roles.

Australia's Treasury trial, for instance, reported remarkably similar use-case patterns and recommended role-based licensing rather than blanket enablement. University trials have emphasised the accessibility benefits for neurodiverse and non-native English-speaking staff—a finding DBT echoed explicitly. Across the board, governance and training investments are the single biggest determinant of whether a pilot matures into a net-positive deployment.

Critical Analysis: The Three-Way Split of Strengths, Weaknesses, and Risks

Where Copilot Shines Today

Accessibility and inclusion. The ability to automatically generate meeting summaries and simplify language reduces cognitive load and helps staff who might otherwise struggle with rapid-fire written communication. For DBT, this was one of the most tangible benefits.
Seamless integration. Copilot doesn't require new logins or separate windows; it's baked into Word, Outlook, and Teams. That immediacy lowers the activation energy for adoption and lets IT teams avoid another round of app sprawl.
Rapid drafting. For first-draft email replies, report outlines, and slide decks, Copilot gets users past the blank-page problem and lets them iterate quickly.

Where Expectations Crumble

Hallucinations are not a corner case. Every pilot, including DBT's, reports plausible-sounding falsehoods. The risk isn't just the error itself—it's the erosion of trust in the tool and the added cognitive burden of constant fact-checking.
Uneven time gains. DBT found that some tasks took longer with Copilot, and other pilots reported the same. If users have to learn prompt engineering, verify every claim, and still sometimes lose time, the net ROI can turn negative for roles that don't do bulk templated writing.
Training and culture costs are real. Microsoft's marketing often glosses over the effort needed to build prompt libraries, train staff on verification workflows, and maintain governance guardrails. These costs are front-loaded and recurring.

Governance and Operational Risks

Data spillage. Copilot's access to Microsoft Graph means it can surface sensitive information from emails, SharePoint, and Teams if permissions are not locked down with least privilege and proper data loss prevention policies. DBT's report did not disclose specific data governance configurations, but other public-sector pilots have flagged this as a top concern.
Vendor lock-in. Deep embedding in the Microsoft 365 ecosystem makes it harder to switch to alternative AI assistants or to evaluate them side-by-side. Over time, the organisation's ability to work with other tools can atrophy, leaving staff reliant on a single vendor's evolving toolset.
Hidden long-term costs. Trial periods often mask full licensing fees, Azure consumption charges, and the cost of internal governance boards, audits, and environmental reporting. Organisations that don't model these from the start may find themselves trapped in an expensive commitment with poor exit ramps.

Practical Lessons for IT Leaders

DBT's experience—backed by parallel trials—offers a clear, if demanding, roadmap for Windows enterprise shops considering Copilot:

Define success metrics before launch. Don't settle for vague satisfaction scores. Measure minutes saved per task, reduction in review cycles, and actual user adoption across the pilot cohort.
Limit the pilot to two to four high-frequency, low-risk use cases. Email triage, meeting summaries, and slide first drafts are proven starting points. Avoid stretching into scheduling or creative image generation early on.
Adopt role-based licensing. Blanket deployment dilutes ROI. Target administrative, communications, and policy roles where repetitive drafting makes up a significant portion of weekly work.
Invest in training and a shared prompt library. Short workshops, validated prompt templates, and an internal repository reduce the trial-and-error that eats time and frustrates users.
Harden governance from day one. Configure DLP, enforce least-privilege access, enable audit logging, and mandate human-review gates for any output that will leave the drafts folder. Secure legal, privacy, and security sign-off before handling sensitive or citizen-facing material.
Demand transparent cost and environmental metrics. Before scaling, ask Microsoft for granular energy consumption data and commit to independently measuring your own pilot's footprint. DBT's report implicitly recognises that without this, accountability is impossible.
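The first lesson, measuring minutes saved per task and real adoption, can be sketched as a small aggregation over pilot logs. The record layout, field names, and sample values here are assumptions made for illustration; DBT's report does not describe its data in this form.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskRecord:
    """One logged task from a hypothetical pilot log."""
    user: str
    task: str
    baseline_minutes: float    # time the task takes without the assistant
    end_to_end_minutes: float  # assisted time, including human review

    @property
    def minutes_saved(self) -> float:
        return self.baseline_minutes - self.end_to_end_minutes


def mean_saving_by_task(records: list[TaskRecord]) -> dict[str, float]:
    """Average minutes saved per task type (negative means slower)."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for record in records:
        grouped[record.task].append(record.minutes_saved)
    return {task: mean(values) for task, values in grouped.items()}


def adoption_rate(records: list[TaskRecord], cohort_size: int) -> float:
    """Share of the pilot cohort that logged at least one assisted task."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return len({record.user for record in records}) / cohort_size


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        TaskRecord("a", "email_draft", 10, 6),
        TaskRecord("b", "email_draft", 12, 8),
        TaskRecord("a", "scheduling", 5, 7),
    ]
    print(mean_saving_by_task(sample))
    print(adoption_rate(sample, cohort_size=4))
```

Reporting savings per task type rather than as one pooled average keeps slower tasks, such as scheduling in this made-up sample, from being hidden behind the stronger drafting results.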

Policy Implications for Public-Sector AI

DBT's trial reinforces a growing consensus on how governments should adopt generative AI tools:

Stage everything. Start with a timeboxed, narrow-scope trial, measure relentlessly, and let the data drive expansion decisions.
Insist on contractual transparency. Procurement frameworks must require written confirmation that tenant data isn't used to train external models, explicit trial-duration terms, and environmental disclosures wherever measurable.
Stand up an AI governance board. Include legal, privacy, security, and mission-area leads. Define acceptable use cases, mandatory review workflows, audit trails, and attribution requirements for AI-assisted outputs.

These steps are not bureaucratic overhead—they are the difference between a Copilot deployment that quietly saves minutes and one that generates a newspaper headline for disseminating hallucinated policy advice.

Conclusion: A Measured Advance, Not a Leap

DBT's trial of Microsoft Copilot is neither a condemnation nor an endorsement. It is a precise, deliberately narrow measurement of where the technology stands in late 2024: useful for defined, templated tasks; genuinely beneficial for accessibility; but still burdened by factual unreliability, uneven efficiency gains, and opaque environmental costs. The 71.7% satisfaction rate reflects real user appreciation, yet the absence of a financial business case prevents any claim that Copilot delivers measurable value for money.

For Windows enterprise leaders, the takeaway is clear: Copilot can be a productivity lever, but only if you keep both hands on the governance controls. Target it at the right workflows, invest in your people, and measure what matters: minutes saved, errors caught, carbon emitted. Until Microsoft closes the gaps in transparency and reliability, the decision to scale remains a calculated risk, not a certain win. DBT's report has done the public sector a service by documenting that risk in unsparing detail.