OpenAI's Open-Weight Model Flunks a 10-Year-Old's Exam Despite Solid Reasoning

OpenAI's latest open-weight model, gpt-oss-20b, demonstrated strong chain-of-thought reasoning when faced with a real UK 11+ exam—but its final answers were so wrong that an actual 10-year-old outperformed it. That’s the headline finding from Windows Central’s hands-on test, which exposed a jarring disconnect between the model’s internal logic and the user-facing output.

Earlier this month, OpenAI released two open-weight reasoning models under a permissive Apache 2.0 license: gpt-oss-120b and gpt-oss-20b. The smaller variant, with 21 billion total parameters but only 3.6 billion active per token thanks to a mixture-of-experts (MoE) architecture, is explicitly designed to run within 16 GB of memory—putting capable local inference on consumer hardware within reach. OpenAI's model card notes native support for configurable reasoning effort, a full chain-of-thought (CoT) channel for developer debugging, and agentic tool use like web browsing and function calling. But as the Windows Central reporter discovered, turning those impressive specs into reliable, user-facing results is far from plug-and-play.

A School Exam Meets a Local LLM

For a real-world stress test, the reporter fed a sample UK 11+ practice paper—a mix of numerical sequences, word problems, and logic puzzles aimed at 10- and 11-year-olds—into gpt-oss-20b running via Ollama on a consumer gaming PC. The hardware? An Nvidia GeForce RTX 5080 with 16 GB of VRAM, a card whose memory footprint is typical of this generation but tight for large-context inference. The prompt was straightforward: “Read the test and answer all questions.”

The first run took about 15 minutes of internal thinking. The model returned 80 answers for 80 questions, but only around nine were correct. Many outputs were irrelevant, nonsensical, or transformed the exam questions into something entirely different. What made the failure startling was that the model’s internal chain-of-thought—visible in the reasoning buffer—often showed perfectly coherent, step-by-step solutions. The model could reason through a number sequence correctly, then spit out a final answer that had nothing to do with that reasoning.

On a second attempt with a larger context window (32k tokens) and more memory allocated, some improvements emerged: numerical sequence performance improved, but the model still frequently produced gibberish as its final output. In one run, it abandoned the exam entirely and generated its own quiz. The conclusion: the model’s internal competence was real, but it wasn’t translating reliably into the final channel the user sees.

The Technical Roots of the Disconnect

OpenAI’s documentation provides a clear explanation for this behavior. Both gpt-oss models were trained on a proprietary “harmony” response format that separates outputs into distinct channels: an internal analysis channel (the full CoT) and a final, user-facing channel. The Hugging Face model card explicitly warns: “Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise.”

If the inference backend—in this case, Ollama—does not properly handle these channels, or if the prompt doesn’t instruct the model to route final answers into the correct channel, the streamed output can leak internal musings while burying the actual answer. The Windows Central reporter used a basic “answer all questions” prompt with no explicit instruction to separate workings from final responses. Given that the harmony format is designed for explicit channel management, this almost certainly caused the model to output analysis when a final answer was expected.

Context length and memory pressure compounded the problem. OpenAI’s model card boasts native support for up to 128k tokens, but on a 16 GB card, expanding the context window forces the system to swap data between GPU and system RAM. The reporter recorded token-per-second rates that plummeted from 82 at 4k context to 42 at 8k, and just 9 at 128k. Such slow, staggered inference can cause partial outputs, dropped tokens, or malformed channel routing—exactly the kind of artifacts that yield irrelevant “final” answers.

The MoE architecture and MXFP4 quantization—the secret sauce that lets a 21B-parameter model run in 16 GB—add further complexity. Only a fraction of the model’s weights are active per token, and quantized weights require specific handling by the inference engine. If the backend misinterprets the quantized layers or doesn’t fully support the harmony rendering, output corruption is almost inevitable.

What gpt-oss-20b Gets Right

Despite the on-screen flub, the model’s capabilities are undisputed. Under the hood, its chain-of-thought reasoning is crisp and human-like—useful for developers who need to inspect logic during debugging. The permissive Apache 2.0 license opens the door to fine-tuning and commercial deployment without legal headaches. And the fact that it can even load on a 16 GB gaming GPU is a milestone for on-device, privacy-friendly AI.

OpenAI’s benchmarks position gpt-oss-20b as comparable to its o3-mini model on common reasoning tasks, with the added benefit of offline local execution. The model also supports three reasoning effort levels (low/medium/high) and native tool use, making it a flexible foundation for agentic applications—provided you wire it up correctly.

How to Avoid the Pitfalls

The Windows Central experiment is a cautionary tale, not a dismissal. Hobbyists and developers can get clean, reliable outputs from gpt-oss-20b if they follow a few critical steps:

Use an inference backend that fully supports the harmony format. Ollama, vLLM, and the official Transformers integration all technically support it, but you must verify that the renderer correctly maps analysis vs. final channels. Testing with a simple structured prompt can confirm.
Explicitly instruct the model in your system prompt. For example: “Render outputs with channels: analysis (internal, not shown), final (user-facing). Only return clear final answers in the final channel.” Then in the user prompt, “Attached is an exam PDF. For each question, provide the final answer only in the final channel; use the analysis channel for workings if needed.”
Match context length to your hardware. On a 16 GB card, stick to 4k–8k context for interactive tasks; reserve 32k+ for batch processing where latency can be tolerated. If you regularly need large contexts, a GPU with 32 GB or more—like the RTX 5090 or a cloud H100—is a far better fit.
Monitor the reasoning trace. If the model’s internal CoT shows a correct answer but the final output is wrong, assume your renderer is misconfigured. Fix the channel mapping and rerun.

Following these guidelines transforms gpt-oss-20b from a flaky demo into a robust local reasoning engine.

The Bigger Picture for Windows Enthusiasts

For parents and teachers, the takeaway is reassuring: a 10-year-old with exam practice still outperforms the model in practical answer quality. The AI can be a useful study aid—explaining concepts, generating practice questions—but isn’t yet a trustworthy autonomous test-taker.

For the Windows community of tinkerers and developers, gpt-oss-20b is genuinely exciting. It’s the first time OpenAI has handed over a reasoning-capable, modifiable model to the public under such a permissive license. The freedom to run it locally, fine-tune it on niche datasets, and inspect every step of its thought process opens up use cases from privacy-sensitive enterprise workflows to personalized coding assistants. But it also means you must own your safety stack: open weights lower the barrier for misuse, and without OpenAI’s API-level content filters, the responsibility for safe deployment rests entirely on the user.

Local model enthusiasts might wonder how gpt-oss-20b stacks up against alternatives like Google’s Gemma 3 or Meta’s Llama family. On paper, it offers a compelling middle ground: stronger reasoning than tiny 4B models, but far less memory-hungry than 100B+ behemoths. In practice, highly optimized quantized versions of smaller models may deliver snappier performance on a 16 GB card. If your priority is raw speed for simple tasks, a 4B Gemma variant will win. If you need deep, chain-of-thought reasoning and can accept longer runtimes—and you’re willing to invest in prompt engineering—gpt-oss-20b carves out a unique niche.

Conclusion

The gpt-oss-20b model is a paradox wrapped in a 16-GB footprint. Its internal reasoning engine is sharp enough to solve elementary exam problems, but as the Windows Central test proves, raw intelligence is wasted without a proper delivery mechanism. The model’s failure wasn’t due to a lack of smarts; it was a deployment and integration failure. For anyone ready to respect the harmony format, size their context windows sensibly, and validate their inference pipeline, OpenAI’s open-weight release is a landmark achievement. For those expecting an out-of-the-box replacement for cloud GPT, the lesson is clear: local AI still demands a hands-on, engineer’s mindset.