The UK government’s largest-ever test of Microsoft 365 Copilot has delivered a striking paradox: staff overwhelmingly love the AI assistant, reporting an average of 26 minutes saved per day, yet rigorous departmental measurement found no clear evidence that overall productivity improved. The gulf between glowing self-reports and conservative analytical reality challenges the notion that generative AI can be simply dropped into an enterprise to produce instant, organisation-wide efficiency gains.
The most dramatic data point came from a cross-government experiment coordinated by the Government Digital Service (GDS). Over 20,000 employees across a dozen departments used Copilot between September 30 and December 31, 2024. The headline finding—an average daily time saving of 26 minutes per user—was seized upon by AI advocates. If extrapolated across a working year, that translates to roughly two weeks per employee. Meanwhile, a separate, more methodologically stringent evaluation by the Department for Business and Trade (DBT), which licensed 1,000 staff over the same period, warned that “there was not robust evidence that measured productivity improved at the departmental level.”
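The extrapolation from minutes per day to weeks per year is simple arithmetic, and a minimal sketch shows how much it depends on what is assumed about the working year. Only the 26-minute figure comes from the GDS report; the 220 working days and the 7.4-hour working day are illustrative assumptions, not values stated in either evaluation.

```python
# Annualise a self-reported daily time saving.
# ASSUMPTIONS (not from the reports): 220 working days per year
# and a 7.4-hour working day. Only the 26 minutes is a reported figure.

def annualised_saving(minutes_per_day: float,
                      working_days_per_year: int = 220,
                      hours_per_working_day: float = 7.4) -> dict:
    """Return the yearly saving in hours and in equivalent working days."""
    hours = minutes_per_day * working_days_per_year / 60
    return {
        "hours_per_year": hours,
        "equivalent_working_days": hours / hours_per_working_day,
    }

result = annualised_saving(26)
print(f"{result['hours_per_year']:.1f} hours per year, "
      f"{result['equivalent_working_days']:.1f} equivalent working days")
```

Under these assumptions the result is about 95 hours, or roughly 13 working-day equivalents, which is the same order as the "two weeks" headline. Changing the assumed working year or day length moves the figure noticeably, which is one reason to treat the extrapolation as indicative rather than exact.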
Together, the reports provide the most balanced public-sector assessment yet of Microsoft’s flagship AI tool. They show where Copilot delivers real value today—and where enterprise IT leaders must recalibrate expectations before signing multi-year licensing deals.
Two Pilot Methodologies, Two Very Different Narratives
The cross-government trial cast a wide net. GDS collected quantitative adoption data from 14,500 users and 7,115 survey responses, making it the largest study of Copilot in any organisation at the time. Its conclusions were unambiguous: 82% of participants said they did not want to return to pre-Copilot working conditions, and satisfaction and recommendation scores hit 7.7 and 8.2 out of 10, respectively. The most common benefits cited were “improves productivity” and “reduces time spent on mundane tasks.”
DBT’s evaluation, however, took a deliberately sceptical approach. It combined telemetry from Microsoft’s dashboard, diary studies, observed timed tasks comparing Copilot users with a control group, and qualitative interviews. Crucially, the analysis adjusted self-reported time savings to exclude outputs that users discarded and subtracted time spent on “novel” tasks that only existed because Copilot made them possible. After these conservative corrections, the department could not demonstrate that the pilot made the organisation measurably more productive. Satisfaction remained high—about 72% of DBT respondents were satisfied or very satisfied—but that enthusiasm did not translate into hard, verifiable output gains during the three-month window.
The contrast is instructive. Self-reported surveys capture perception, convenience, and emotional relief from drudgery, which are genuine benefits but do not always mirror bottom-line productivity. “Self-reporting measures perception as much as performance,” the DBT report noted. A user who receives a helpful draft feels they’ve saved time, but if that draft requires substantial rewriting or triggers additional review steps, the net effect may be nil.
Where Copilot Excels—And Where It Stumbles
Across both studies, Copilot’s strengths are consistent and narrowly defined. It shines on high-frequency, templated tasks that involve summarising, drafting, or retrieving information. The most popular applications were Teams (71% adoption among active users) and Outlook, where staff used it daily to recap meetings, polish emails, and schedule appointments. In Word, a quarter of users turned to Copilot daily, and another 43% weekly, mostly for document drafting. Survey data pegged average time savings at 24 minutes per document creation session and 19 minutes for presentation building.
Accessibility gains emerged as an unexpected but powerful benefit. Neurodiverse staff and non-native English speakers repeatedly praised Copilot for refining tone, catching errors, and transcribing conversations. One dyspraxic employee in the cross-government trial said, “Emails are so much easier when using Copilot. It has saved me so much time and effort in creating work.”
But the same evaluations reveal hard limits. Copilot’s performance deteriorates when faced with data-heavy tasks, nuanced policy analysis, or strategic judgment. In Excel, adoption hovered at just 23%, and timed task comparisons showed that Copilot could slow users down or produce outputs that required heavy rework. Hallucinations—plausible but fabricated information—were flagged as a persistent risk across all departments. A policy participant in the GDS focus groups captured the frustration: “M365 Copilot’s ability to extract key themes and insights from documents is strong, but it struggles with nuanced or context-heavy data requiring human judgement.”
The verification overhead is the crucial variable that transforms perceived savings into real costs. Both reports stress that human oversight is mandatory, especially for legal, financial, or reputational outputs. DBT documented inconsistent quality assurance across teams and recorded multiple hallucination incidents during the pilot. In sensitive areas like grievance handling or performance evaluations, an HR participant warned that “any errors could lead to reputational risks.”
The Productivity Paradox: When 26 Minutes Isn’t 26 Minutes
The GDS average of 26 minutes saved per day is a compelling number, but it obscures wide variance. Eight of 15 professions tracked in the cross-government trial saved at least that much, while others saved far less. Policy teams, for instance, saved less time and reported lower satisfaction precisely because their work demands contextual depth that Copilot cannot yet provide. Digital and data professionals were among the heaviest users, but DBT’s more controlled experiment suggests that even heavy use may not lift departmental output.
One reason is task substitution. Copilot’s output often creates new work—fact-checking, refining, or integrating suggestions—that can cancel out initial time gains. DBT’s conservative adjustments are a model for how enterprises should evaluate AI: count only outputs that are actually used, and subtract any novel or corrective tasks that the tool introduces. Without such adjustments, productivity claims risk becoming a mirage.
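The adjustment described above can be sketched in a few lines. This is one interpretation of the stated rules, not DBT's actual method, which the reports summarised here do not spell out as a formula. The `TaskRecord` schema, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """One logged Copilot-assisted task (hypothetical diary-study schema)."""
    minutes_saved_reported: float     # the user's own estimate of time saved
    output_used: bool                 # False if the output was discarded
    is_novel_task: bool               # True if the task only existed because of the tool
    minutes_spent: float = 0.0        # time spent on the task (counted for novel tasks)
    minutes_correcting: float = 0.0   # fact-checking or rework caused by the output

def net_minutes_saved(records):
    """Apply conservative adjustments to self-reported savings."""
    net = 0.0
    for r in records:
        if r.is_novel_task:
            # Novel work is treated as a cost, not a saving.
            net -= r.minutes_spent
            continue
        if r.output_used:
            # Only outputs that were actually used count as savings.
            net += r.minutes_saved_reported
        # Corrective effort is subtracted whether or not the output was kept.
        net -= r.minutes_correcting
    return net

# Illustrative records only; these are not data from either pilot.
records = [
    TaskRecord(20, output_used=True, is_novel_task=False, minutes_correcting=5),
    TaskRecord(15, output_used=False, is_novel_task=False),
    TaskRecord(0, output_used=True, is_novel_task=True, minutes_spent=10),
]

raw = sum(r.minutes_saved_reported for r in records)
net = net_minutes_saved(records)
print(f"self-reported: {raw} min, adjusted: {net} min")
```

In this made-up example, 35 self-reported minutes shrink to 5 once the discarded draft, the rework, and the novel task are accounted for. The point is the shape of the calculation, not the size of the gap.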
Scale compounds the illusion. A department with many roles centred on drafting and summarising will show large aggregate savings; a department heavy with analysts and decision-makers may see none. The GDS report acknowledges that it could not track how saved time was spent, only that users felt they were redirecting effort to more strategic work. That’s a promising signal but not yet a provable outcome.
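A small sketch shows how role mix alone can move a departmental average. The role names, headcounts, and per-role minutes are hypothetical and chosen only to make the effect visible; they are not figures from the GDS or DBT reports.

```python
def average_daily_saving(headcount_by_role, minutes_by_role):
    """Headcount-weighted average of per-role daily minutes saved."""
    total_staff = sum(headcount_by_role.values())
    total_minutes = sum(
        headcount_by_role[role] * minutes_by_role[role]
        for role in headcount_by_role
    )
    return total_minutes / total_staff

# HYPOTHETICAL per-role savings in minutes per day.
minutes_by_role = {"drafting": 40.0, "analysis": 5.0}

# Two hypothetical departments with the same tool but a different role mix.
drafting_heavy = {"drafting": 80, "analysis": 20}
analysis_heavy = {"drafting": 20, "analysis": 80}

avg_drafting_heavy = average_daily_saving(drafting_heavy, minutes_by_role)
avg_analysis_heavy = average_daily_saving(analysis_heavy, minutes_by_role)
print(avg_drafting_heavy, avg_analysis_heavy)
```

With identical per-role savings, the drafting-heavy department averages 33 minutes per person and the analysis-heavy one averages 12. A single cross-government average therefore says little about what any one department should expect.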
Governance, Security, and the Hidden Costs
Beyond productivity, the trials surfaced pressing concerns about data security, permissions, and environmental impact. Copilot integrates with Microsoft Graph and OneDrive, meaning it respects a user’s existing access rights when searching internal documents. That convenience can also surface data governance problems: if an employee has access to files they shouldn’t, Copilot will happily surface them. Most participating organisations disabled internet access for Copilot during the trial, relying solely on internal data sources to limit exposure.
Hallucinations further complicate trust. Both evaluations flagged them as a persistent risk, and the consequence is mandatory human-in-the-loop review for substantive outputs. For procurement teams, this means factoring in the cost of verification workflows, not just licence fees.
Environmental sustainability also registered. DBT participants raised ethical concerns about the energy consumption of large language models and called for lifecycle assessments before scaling. These considerations rarely appear in vendor demos but are now part of the public-sector purchasing calculus.
What This Means for IT Leaders and Procurement
For Windows and Microsoft 365 administrators, the UK pilots provide a practical playbook. Blanket enablement is a mistake. Instead, run tightly scoped pilots targeting teams with high volumes of templated writing, meeting summaries, or routine communication. The DBT report recommends “role-targeted pilots, not blanket enablement.”
Measurement must be rigorous. Combine telemetry from the Viva dashboard with diary studies and observed timed tasks. Critically, apply the same conservative adjustments DBT used: discount unused outputs and account for new tasks generated by the tool. Only then can you produce a defensible return-on-investment case.
Governance is non-negotiable. Enforce human-in-the-loop checks on any output with legal, financial, or reputational weight. Audit data access rights before rollout to ensure Copilot does not inadvertently expose sensitive material. The GDS report notes that “organisations should ensure that current information and knowledge management practices are up to date.”
Training cannot be an afterthought. The DBT evaluation found that self-directed learning boosted satisfaction more than formal sessions, but both are essential. Users need to know when to trust a draft and when to spend time verifying. The correlation between AI familiarity and time savings is strong: professions with the lowest confidence in AI tools saw the smallest gains.
Procurement decisions should demand vendor transparency on energy consumption and data practices. The licensing cost—per user, on top of existing Microsoft 365 subscriptions—must be weighed against net productivity gains in the specific roles being targeted, not headline averages.
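One way to frame that comparison is as a break-even threshold: how many net minutes per day a licensed user would need to save for the licence to pay for itself. The sketch below is a rough framing, not a costing method from either report. The licence price, staff cost, and working days are placeholder values, not real Microsoft pricing or civil service pay figures.

```python
def break_even_minutes_per_day(licence_cost_per_user_year: float,
                               staff_cost_per_hour: float,
                               working_days_per_year: int = 220) -> float:
    """Net minutes per day a user must save to cover the licence cost."""
    hours_needed = licence_cost_per_user_year / staff_cost_per_hour
    return hours_needed * 60 / working_days_per_year

def covers_licence(net_minutes_saved_per_day: float,
                   licence_cost_per_user_year: float,
                   staff_cost_per_hour: float,
                   working_days_per_year: int = 220) -> bool:
    """True if the adjusted (net) daily saving meets the break-even threshold."""
    threshold = break_even_minutes_per_day(
        licence_cost_per_user_year, staff_cost_per_hour, working_days_per_year
    )
    return net_minutes_saved_per_day >= threshold

# PLACEHOLDER inputs for illustration only.
threshold = break_even_minutes_per_day(
    licence_cost_per_user_year=330.0,
    staff_cost_per_hour=30.0,
)
print(f"break-even: {threshold:.1f} net minutes per day")
```

The input should be the adjusted, role-specific saving rather than a self-reported headline average. Converting minutes into money also assumes the saved time is redirected to work of equal value, which is something the GDS report says it could not track.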
Beyond the Pilot: The Road Ahead for Government AI
The UK findings mirror patterns seen in other public-sector experiments worldwide. Australia and other nations have reported similar dynamics: Copilot and similar tools deliver clear wins on high-frequency, low-context tasks and offer vital accessibility benefits, but they are not yet reliable substitutes for skilled labour on complex analytical work.
Microsoft is already pushing more advanced “agentic” features and tighter integration, but the UK government’s evidence-first posture is likely to become the norm. Policymakers are tempted by the narrative that 26 minutes per day equals two weeks a year of reclaimed capacity, but the DBT counter-narrative insists that hard, observed gains must precede any claim of budgetary savings.
The real value of these trials may be their honesty. They demonstrate that AI assistants can delight users and streamline specific workflows, while simultaneously proving that satisfaction alone is not productivity. For enterprise IT, the lesson is clear: treat Copilot as a targeted accelerator, invest in measurement and governance, and resist the urge to extrapolate self-reported minutes into org-chart magic. When those conditions are met, Copilot can shift time from routine chores to higher-value work—but the shift is incremental, conditional, and measurable, not miraculous.