Nearly 2,000 public-sector engineers saved almost an hour per working day during a three-month trial of AI coding assistants across more than 50 UK government organisations. But the data reveals a stark gap between raw productivity gains and the hidden burden of manual remediation, security vulnerabilities, and rising technical debt.

The Government Digital Service (GDS) ran the pilot from November 2024 to February 2025, making 2,500 licences available for GitHub Copilot and Google’s Gemini Code Assist. A total of 1,900 licences were assigned, and 1,100 Copilot licences and 173 Gemini Code Assist licences were actually redeemed. Telemetry logged thousands of engineer interactions, while surveys captured user sentiment. The headline figure of 56 to 60 minutes saved per developer per day, equal to roughly 28 working days a year, quickly became political ammunition for a government eager to brandish AI-driven efficiency under its “Plan for Change” agenda.
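
The conversion from minutes per day to working days per year can be checked with simple arithmetic. The sketch below assumes 225 working days a year and a 7.5-hour working day. Neither value is stated in the trial figures quoted here, and different assumptions shift the result.

```python
# Assumptions (not taken from the trial): working days per year and hours per day.
WORKING_DAYS_PER_YEAR = 225
HOURS_PER_WORKING_DAY = 7.5


def minutes_per_day_to_working_days(minutes_saved_per_day: float) -> float:
    """Convert a daily time saving into equivalent working days per year."""
    hours_per_year = minutes_saved_per_day * WORKING_DAYS_PER_YEAR / 60
    return hours_per_year / HOURS_PER_WORKING_DAY


low = minutes_per_day_to_working_days(56)
high = minutes_per_day_to_working_days(60)
print(f"{low:.1f} to {high:.1f} working days a year")  # 28.0 to 30.0 working days a year
```

Under these assumptions, the lower bound of 56 minutes a day corresponds to the quoted 28 working days.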

But underneath those glossy numbers lies a messier truth. Acceptance rates were low. GitHub Copilot’s line-level suggestions were accepted just 15.8% of the time, and only 15% of AI-generated code was used completely unchanged. Engineers committed AI-suggested code less than half the time. User satisfaction was high: 72% said the tools offered good value, 65% said they completed tasks faster, and 58% would rather not give them up. But the same trial exposed a dangerous illusion: speed in drafting does not equal speed in delivery.

The Trial by Numbers

GDS made 2,500 licences available to over 50 central government organisations. The participant pool spanned thousands of engineers, though actual usage concentrated on the two principal tools: GitHub Copilot and Google Gemini Code Assist. These mature, off-the-shelf assistants were deployed for drafting initial code, reviewing and refactoring existing code, generating tests and small utility functions, and searching for examples.

The trial’s primary metrics were time savings, user satisfaction, telemetry (suggestion acceptance rates and usage patterns), and qualitative feedback. The headline time saving of ~1 hour per day emerged from self-reported surveys and usage patterns. Yet the telemetry told a more sobering story: only a minority of AI outputs survived first contact with a human engineer without edit. GitHub Copilot’s 15.8% acceptance rate was consistent with other independent studies showing that AI code assistants rarely produce production-ready output on their own.
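
To make the telemetry metrics concrete, here is a minimal sketch of how an acceptance rate and an "accepted unchanged" rate might be computed from suggestion events. The `SuggestionEvent` record and its fields are hypothetical, the sample data is made up, and the trial's exact metric definitions are not given here, so the denominators below are an assumption.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    """One suggestion shown to an engineer (hypothetical telemetry record)."""
    accepted: bool      # the engineer accepted the suggestion in the editor
    edited_after: bool  # the engineer changed the accepted code afterwards


def acceptance_rate(events: list[SuggestionEvent]) -> float:
    """Share of shown suggestions that were accepted at all."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def unchanged_rate(events: list[SuggestionEvent]) -> float:
    """Share of shown suggestions that were accepted and then left unedited."""
    if not events:
        return 0.0
    return sum(e.accepted and not e.edited_after for e in events) / len(events)


# Made-up sample: 20 suggestions shown, 3 accepted, 1 of those left unchanged.
events = (
    [SuggestionEvent(accepted=True, edited_after=False)]
    + [SuggestionEvent(accepted=True, edited_after=True)] * 2
    + [SuggestionEvent(accepted=False, edited_after=False)] * 17
)
print(f"accepted: {acceptance_rate(events):.1%}")
print(f"unchanged: {unchanged_rate(events):.1%}")
```

Keeping the two rates separate matters because an accepted suggestion that is later rewritten still costs editing time.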

Where the Time Savings Actually Come From

Most of the reclaimed minutes came from two sources: first‑draft generation and code review assistance. Boilerplate code, helper functions, scaffolding, and small components that are tedious to write from scratch accounted for the bulk of speed gains. Engineers also used the tools to speed up code review—pre‑reviewing their own changes or generating review comments for pull requests.

These activities represent low‑risk, high‑volume work that has long been a bottleneck. By handling the grunt work, AI assistants let developers shift their attention to higher‑order tasks like architecture, security, and integration. The trial found that even with heavy editing, the net effect was positive: the generated starting point was still faster than starting from a blank file.

The Remediation Reality Check

Despite the glowing user testimonials, independent experts and the trial’s own telemetry expose a costly after‑phase: remediation. Martin Reynolds, Field CTO at Harness, pointed out that around 85% of AI‑generated code still needed manual editing. Moreover, that editing happens before the code even enters the “downstream” delivery pipeline—testing, security scanning, deployment, and continuous verification. Each of those phases consumes time that the raw drafting metric ignores.

When an engineer spends, say, 39 minutes debugging a flawed AI suggestion or refactoring code that doesn’t align with existing architecture, the net productivity gain shrinks. The government’s hour‑a‑day figure captures only the first chapter of a multi‑chapter story. Widespread adoption without accounting for downstream costs could erase much of the claimed benefit.
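
The gap between drafting speed and delivery speed can be written as a simple subtraction. The sketch below is an illustrative model, not the trial's methodology: the function name, its parameters, and the 10 minutes of extra review are all assumptions, and only the hour saved and the 39 minutes of debugging echo figures from the text.

```python
def net_minutes_saved(gross_drafting_minutes: float,
                      review_minutes: float,
                      remediation_minutes: float,
                      pipeline_rework_minutes: float = 0.0) -> float:
    """Drafting time saved minus the downstream time spent on AI-sourced changes."""
    downstream = review_minutes + remediation_minutes + pipeline_rework_minutes
    return gross_drafting_minutes - downstream


# Illustrative day: 60 minutes saved drafting, 10 minutes of extra review
# (assumed), and 39 minutes debugging one flawed suggestion.
print(net_minutes_saved(60, review_minutes=10, remediation_minutes=39))  # 11.0
```

In this made-up day, most of the headline hour is gone before the change reaches testing or deployment.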

Security Risks Lurking Under the Hood

The most urgent alarm comes from security research. AI‑generated code frequently contains vulnerabilities. Veracode’s 2025 research found high rates of OWASP‑class weaknesses in AI outputs, especially when prompts lacked explicit security constraints. Syntactically correct code is not secure code. AI models tend to produce short, working examples that omit input validation, proper error handling, and least‑privilege principles—exactly the kinds of flaws that attackers exploit.

In government systems, where sensitive citizen data and critical infrastructure are at stake, a single security regression could be catastrophic. Nigel Douglas, head of developer relations at Cloudsmith, warned that without security‑aware tooling and policy enforcement, “over‑enthusiastic use of AI coding assistants” might unknowingly inject vulnerabilities into the country’s most critical software ecosystems.

Beyond injection flaws, AI‑generated code can introduce bloated dependencies, duplicate logic, and hallucinations—confidently incorrect code that references non‑existent APIs or behaviours. In safety‑critical contexts, these failures are not just expensive; they are dangerous.

Expert Warnings: Velocity Isn’t Everything

Martin Reynolds cautioned that the trial’s productivity figures represent a “velocity boost” only at the very start of the delivery cycle. The real work—testing, security scanning, deployment, verification—remains stubbornly manual. Nigel Douglas urged the government to demand provenance verification, secure‑by‑design prompts, and supply‑chain controls. Both experts echoed a growing industry consensus: AI coding assistants are powerful drafting tools, but they are not replacements for engineering judgement.

Industry studies corroborate these warnings. Fastly’s recent research found that developers often spend so much time remediating faulty AI output that the net time savings vanish. Other analyses have shown that AI‑generated code often increases technical debt over time because it disregards existing codebase patterns, leading to refactoring that erodes initial gains.

A Blueprint for Safe AI Adoption in Government

So how can the UK government—and any large public‑sector organisation—reap the rewards without multiplying the risks? The trial’s lessons offer a practical playbook:

  • Start with trusted use cases. Deploy AI assistants first in low‑risk, high‑volume scenarios: internal utilities, test generation, boilerplate code for non‑safety‑critical services.
  • Keep a human in the loop. No AI‑generated code should merge to production without explicit human review and automated security checks. This must be a hard policy step in CI/CD pipelines.
  • Embed security in prompts. Provide standardised prompt templates that enforce security constraints—for example, “generate code following least privilege, with input validation and without hard‑coded secrets.” Measure compliance systematically.
  • Track model provenance. Maintain observability over which model version produced which output, in response to which prompt. This supports incident triage and audit trails.
  • Isolate sensitive workloads. For critical systems, use private model hosting or on‑premises inference to keep proprietary code off third‑party clouds. Contractual clauses must forbid vendor‑side model training on government inputs.
  • Automate security testing. Add static analysis, dependency scanning, and policy gates in CI that run automatically on AI‑sourced changes. Treat all AI output as suspect until proven otherwise.
  • Invest in developer training. Senior engineers extract the most value and spot AI‑introduced weaknesses fastest. Upskilling junior staff in security‑conscious code review is essential.
  • Measure downstream costs. Track not only hours saved in drafting but also review time, remediation time, security findings per AI‑sourced pull request, and production incidents. Only with those metrics can leaders calculate true ROI.
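
Several of these steps (mandatory human review, automated security checks, and provenance tracking) can be expressed as a merge gate. The following is a minimal sketch, not any CI product's real API: `PullRequest`, `Provenance`, and `merge_blockers` are hypothetical names, and a real pipeline would populate these fields from its own review and scanning tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provenance:
    """Which model produced a change, and from which prompt (hypothetical schema)."""
    model_name: str
    model_version: str
    prompt_id: str


@dataclass
class PullRequest:
    """Minimal stand-in for a CI system's view of a change (hypothetical)."""
    ai_sourced: bool
    human_approvals: int
    static_analysis_passed: bool
    dependency_scan_passed: bool
    provenance: Optional[Provenance] = None


def merge_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons a change may not merge; an empty list means it may."""
    blockers = []
    if pr.human_approvals < 1:
        blockers.append("no human review")
    if not pr.static_analysis_passed:
        blockers.append("static analysis failed")
    if not pr.dependency_scan_passed:
        blockers.append("dependency scan failed")
    # AI-sourced changes must record model version and prompt for audit trails.
    if pr.ai_sourced and pr.provenance is None:
        blockers.append("missing model provenance")
    return blockers


pr = PullRequest(ai_sourced=True, human_approvals=1,
                 static_analysis_passed=True, dependency_scan_passed=False)
print(merge_blockers(pr))  # ['dependency scan failed', 'missing model provenance']
```

Returning every blocker rather than stopping at the first gives reviewers the full list of what to fix, and the same records can feed the downstream cost metrics.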

Procurement Must Pivot to Governance

Government IT buyers must move beyond feature checklists when procuring AI coding assistants. Contracts should mandate:

  • Explicit non‑training clauses and strict data‑use terms that forbid vendors from using government inputs to improve their models.
  • Access to model versioning and logs for auditing purposes.
  • Service‑level agreements for detecting and mitigating model‑linked security incidents.
  • Exit and portability clauses to avoid vendor lock‑in.

Public bodies should demand technical runbooks demonstrating how vendors will handle sensitive data, provide isolated hosting, or offer dedicated on‑tenant models. Rushed procurement without these safeguards risks embedding long‑term operational and security liabilities into critical national infrastructure.

Culture Shift: Trust but Verify

AI coding assistants will reshape engineering culture. Teams must learn to treat AI outputs as drafts, not deliverables. The trial showed that experienced engineers already do this instinctively, but less experienced staff may over‑trust the tool. Activities that catch AI errors, such as thorough code reviews, security testing, and architecture work, must become as prestigious as feature delivery.

Hiring and training priorities may need to shift. Senior engineering judgement, security expertise, and systems thinking will become more valuable than speed‑coding skills. Continuous measurement and transparent reporting across departments will be vital to maintaining public trust as AI scales.

Hard Limits and Unanswered Questions

The government’s trial is a landmark first step, but it leaves crucial questions unanswered:

  • Will time savings persist? As usage expands into complex legacy systems, remediation costs could rise non‑linearly, wiping out the hour‑a‑day gains.
  • Can non‑training clauses be enforced reliably? Across multiple vendors and over years, verifying that data isn’t being retained or used for training remains technically and legally fraught.
  • Will skill atrophy set in? Over‑reliance on AI for routine tasks could dull the very skills engineers need to spot and fix AI errors.
  • What is the environmental cost? Scaling inference workloads across government carries a carbon footprint. Some departmental pilots flagged this as an area for further study, yet no concrete data emerged from the trial.

These open points mean scaling must remain conditional on measurable safety gates, not political expediency.

The Bottom Line

The GDS trial indicates that mainstream AI coding assistants can deliver material productivity gains for routine engineering tasks across government. The tools work, but capturing the upside without amplifying security regressions, maintenance costs, and supply‑chain risk demands disciplined governance.

Public‑sector IT leaders should expand pilots into high‑value, low‑risk areas immediately, but they must also harden their pipelines, tie procurement to strict non‑training and provenance terms, and track the metrics that truly matter. The seductive narrative that AI will automatically free up millions of hours and pay for itself is only half the story. Managed well, AI coding assistants can be a force multiplier; managed poorly, they will multiply technical debt and security exposure across the services citizens depend on every day.