GitHub CEO Calls Copilot Use in Performance Reviews 'Totally Fair Game' — But Warns Against 'Easily Gamed' Metrics

GitHub CEO Thomas Dohmke has publicly defended an internal Microsoft memo urging managers to factor employees’ use of AI tools such as Copilot into performance evaluations, calling the approach “totally fair game.” He also warned against reducing the practice to simplistic, gameable metrics.

During an August 7 appearance on the Decoder podcast, Dohmke addressed the leaked guidance from Microsoft executive Julia Liuson, which told managers that “AI is now a fundamental part of how we work” and that AI fluency should be part of holistic performance reflections. Dohmke reframed the directive as a developmental tool, not a surveillance mechanism. “It’s about learning and mindset,” he said, adding that asking employees “What did you learn from using Copilot?” is a legitimate growth-oriented conversation. However, he explicitly cautioned against counting lines of AI-generated code or similar raw numbers, labeling such metrics “easily gamed.”

The controversy erupted in mid-2025 when the memo circulated among Microsoft managers, coinciding with a broader corporate push to embed Copilot across all products and teams. Microsoft views internal adoption as a critical feedback loop to accelerate product improvements and strengthen the commercial case for its AI tools. But for a workforce already jittery from rounds of restructuring and layoffs, the memo landed as a threatening signal — an implication that failure to embrace AI could jeopardize careers.

The memo’s core message and the disconnects that followed

The most widely quoted line from the memo — that “using AI is no longer optional” and that AI fluency is comparable to collaboration or data-driven thinking — has been cited across multiple outlets and internal summaries. Microsoft leadership stressed that managers should treat AI usage as part of reflective performance conversations, asking questions like “Did you use Copilot to summarize a meeting? If not, why not? What did you learn?” rather than enforcing a usage quota.

In practice, however, that nuance proved fragile. Critics flagged three predictable translation problems:
- Tactical managers under performance pressure often default to measurable proxies, such as counts of AI sessions or tokens consumed.
- Employees perceive any requirement tied to tools as coercive or a pretext for surveillance, especially in a climate of recent layoffs.
- Without robust guardrails — security configurations, allowed models, IP protections — internal adoption can increase operational risk.

Why Microsoft is pushing AI adoption so aggressively

Microsoft’s push is driven by a tight product-level feedback loop: more internal usage yields faster, higher-quality feedback on Copilot integrations, and more employee usage helps demonstrate product value to customers. GitHub, which develops GitHub Copilot, sits at the center of that strategy. Dohmke and other leaders have long argued that internal dogfooding is a legitimate way to discover real-world issues and accelerate improvement.

There is also a straightforward business incentive: embedding Copilot across Microsoft teams strengthens the narrative that Copilot is mission-critical and worth enterprise spend. When employees become product evangelists, the company benefits from better metrics, case studies, and organic advocacy. The memo is as much about commercial alignment as it is about capability building.

The productivity and quality evidence behind Copilot

A central argument for encouraging Copilot adoption is its measurable impact on developer productivity. GitHub’s own research, published on its blog, reported that developers completed a coding task up to 55% faster when using Copilot. A more recent GitHub study looks beyond speed to code quality.

In a controlled trial with 202 experienced Python developers, half of whom were randomly assigned Copilot access, researchers found:
- 53.2% greater likelihood of passing all 10 unit tests in the coding task, indicating significantly more functional code.
- Copilot-assisted code averaged 13.6% more lines of code per error flagged in blind review, meaning reviewers found fewer readability errors for the same amount of code.
- Blind reviewers rated Copilot-assisted code as 3.62% more readable, 2.94% more reliable, 2.47% more maintainable, and 4.16% more concise (all statistically significant).
- Developers were 5% more likely to approve code authored with Copilot, speeding up merge times.

The study concluded that because developers spent less time making code functional, they could focus more on refining quality — a finding that aligns with earlier research showing 85% of developers felt more confident in their code when using Copilot.

Academic research reinforces these gains. A randomized controlled experiment posted to arXiv found that participants completed a coding task approximately 55.8% faster with an AI pair programmer. These findings provide an evidence base for promoting Copilot adoption. But as Dohmke’s warning implies, raw output metrics don’t capture responsible use or quality of work, and optimizing for them can reward the wrong behavior.
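
Headline figures such as “55.8% faster” can be read two ways, so the arithmetic is worth pinning down. The short Python sketch below assumes the figure means a reduction in completion time, which is one common reading. The 100-minute baseline is a made-up number for illustration and does not come from any of the studies cited here.

```python
def time_reduction(control_minutes: float, treated_minutes: float) -> float:
    """Fraction of completion time saved relative to the control group."""
    return 1.0 - treated_minutes / control_minutes


def throughput_multiplier(reduction: float) -> float:
    """Tasks finished per unit of time, relative to the control group."""
    return 1.0 / (1.0 - reduction)


# Hypothetical baseline: a task that takes 100 minutes without assistance.
control = 100.0
treated = control * (1.0 - 0.558)  # assumed reading: 55.8% less time

print(f"time saved: {time_reduction(control, treated):.1%}")
print(f"throughput: {throughput_multiplier(0.558):.2f}x")
```

Under that reading, a 55.8% cut in completion time works out to roughly 2.26 times as many tasks in the same period, a noticeably larger number than “55% more output.” Which framing a report uses changes how the result sounds.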

The real risks: surveillance, metrics, IP, and inequity

Even with strong productivity data, tying AI usage to performance reviews introduces serious risks:

1. Perception of surveillance and eroded trust. Asking about tool usage during reviews creates a perception of monitoring, even if framed as “learning.” In an environment already scarred by layoffs, this can signal job insecurity and harm psychological safety.

2. Gameable metrics and perverse incentives. Raw counters such as lines of AI-generated code, session counts, or tokens consumed are trivially gamed. If managers use these as proxies for performance, teams will optimize the metric rather than outcomes, potentially lowering quality and bypassing safety checks.

3. Privacy, IP leakage, and compliance. Copilot-style tools can surface proprietary code or secrets if used without proper guardrails. Mandating usage before proper configuration and training increases the risk of accidental leakage or licensing conflicts.

4. Equity and accommodation. AI fluency varies across roles and individuals. Tying reviews to AI competency without offering training and reasonable timelines disadvantages non-technical staff or those needing accommodations, risking disparate impact.

5. Legal and labor implications. When usage becomes evaluative, it can trigger labor-law considerations — from reasonable accommodation to discrimination claims — and may affect collective bargaining. Performance criteria must be defensible, transparent, and consistent with employment law.

A practical playbook for fair AI-inclusive reviews

For organizations that choose to incorporate AI usage into performance conversations, governance and process design are critical. Experts recommend:

- Define the competency, not the raw number. Describe what proficiency looks like, e.g., “uses Copilot to prototype and document code, verifies and tests generated output, and documents changes.”
- Prioritize outcomes over activity. Evaluate whether the work delivered met quality, security, and timeliness goals, and whether AI was used responsibly to aid those outcomes.
- Build mandatory, role-appropriate training. Require training sessions and allow ramp-up time before any evaluation.
- Set strict data governance rules. Provide enterprise-grade guardrails: private tenant models, disabled telemetry where necessary, and clear instructions on what not to paste into an assistant.
- Use human review and appeals. Ensure human oversight on any evaluative decisions tied to AI use and provide an appeal route for employees.
- Monitor for disparate impact. Collect anonymized metrics on the policy’s effect across demographics and adjust if gaps appear.
- Communicate transparently. Publish the rationale, evaluation rubric, and examples of acceptable evidence of learning, such as logs of experiments, annotated outputs, and test results.
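
The “monitor for disparate impact” step can be made concrete with a simple rate comparison. The Python sketch below is one possible approach, not a method described in the memo or by Dohmke. It compares the share of favorable ratings across groups and flags any group whose rate falls below 80% of the highest group’s rate. That cutoff is borrowed from the “four-fifths” rule of thumb used in US adverse-impact analysis and is an assumption here. The group labels and records are hypothetical.

```python
from collections import defaultdict

# Assumed threshold: the "four-fifths" rule of thumb. Whether it is the
# right test for a given policy is a legal question, not a coding one.
FOUR_FIFTHS = 0.8


def favorable_rates(records):
    """records: iterable of (group_label, got_favorable_rating) pairs."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, is_favorable in records:
        totals[group] += 1
        favorable[group] += int(is_favorable)
    return {g: favorable[g] / totals[g] for g in totals}


def impact_ratios(rates):
    """Each group's favorable rate divided by the highest group's rate."""
    if not rates:
        return {}
    best = max(rates.values())
    if best == 0:
        # Nobody was rated favorably, so there is no gap to report.
        return {g: 1.0 for g in rates}
    return {g: r / best for g, r in rates.items()}


def flagged_groups(records, threshold=FOUR_FIFTHS):
    """Groups whose impact ratio falls below the threshold, sorted by name."""
    ratios = impact_ratios(favorable_rates(records))
    return sorted(g for g, ratio in ratios.items() if ratio < threshold)


# Hypothetical, anonymized review outcomes (not real data).
sample = (
    [("group_a", True)] * 8 + [("group_a", False)] * 2
    + [("group_b", True)] * 5 + [("group_b", False)] * 5
)
print(flagged_groups(sample))
```

A flag from a check like this is a prompt for human investigation rather than a verdict. Small groups produce noisy rates, and a real review would also need a statistical significance test and legal input.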

Individual contributors can protect themselves by keeping concise documentation of AI experiments, adhering strictly to approved tool configurations, and requesting explicit success criteria before being assessed.

Strategic implications for Microsoft, GitHub, and developers

For Microsoft, encouraging internal Copilot use helps iterate and sell the product, but it also heightens scrutiny about whether platform owners can remain impartial stewards of open-source communities when product incentives align with internal adoption. Dohmke’s defense attempts to balance those tensions, but the optics remain difficult.

For GitHub, the message that “everyone at GitHub uses GitHub” is both logical and fraught. Internal dogfooding is a best practice, but mandating internal product use raises trust questions among a developer ecosystem that often values neutrality and openness.

For the broader developer community, this episode is a test case in how large platform providers will socialize rapid tool adoption without undermining goodwill. If Microsoft and GitHub manage the rollout with strong governance, transparent metrics, and training, the payoff could be real productivity gains. If not, the backlash could deepen developer distrust.

The evidence suggests that Copilot can improve both speed and quality: task completion up to 55% faster, along with modest but statistically significant quality gains, is not trivial. But Dohmke’s own warning underscores the central tension: even the best-intentioned policies can be twisted by poor implementation. The difference between a learning culture and a surveillance culture hinges on how organizations translate “holistic reflections” into daily management practice, and whether they truly invest in the training, guardrails, and transparency that make AI adoption safe, equitable, and genuinely productive.