Windows 11 users running local LLMs via Ollama are discovering that a single parameter, the model’s context length, can make the difference between a sluggish 9 tokens per second and a blazing 86 tokens per second. This finding, highlighted by recent community benchmarks and troubleshooting on Windows Central, shows that however powerful your GPU is, the size of Ollama’s context window helps determine whether the model runs entirely on the GPU or spills over to the CPU. For enthusiasts chasing on-device privacy and low-latency responses, the message is clear: tune your context length, or leave massive performance on the table.

Why Context Length Dictates Speed

To understand why, you need to grasp a fundamental property of transformer architectures. As explained in detailed technical analyses, the self-attention mechanism at the heart of modern LLMs computes relationships between every pair of tokens in a sequence. That work scales quadratically with sequence length: doubling the number of tokens roughly quadruples the attention compute. The key-value (KV) cache, which stores intermediate attention states during generation, grows linearly with the context window and is sized for the full window you request, eating into precious VRAM. That’s why a model that hums along with a 2,048-token context can grind to a halt when the window is expanded to 8,192 or 32,768 tokens.
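To make the VRAM cost concrete, here is a rough back-of-the-envelope sketch of KV cache size. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token, in 16-bit precision. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical and chosen only for illustration; real models, and features such as sliding-window attention or a quantized cache, will differ.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 covers keys and values; bytes_per_value=2
    assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Hypothetical model shape, used only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for n_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(n_ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n_ctx:>6} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so the three windows come to 0.25 GiB, 1.00 GiB, and 4.00 GiB. The cache itself grows in direct proportion to the window, on top of whatever the model weights already occupy.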

The quadratic nature stems from the dot-product attention computation:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

For n tokens, a naive implementation computes and stores an n×n attention matrix, consuming O(n²) memory and requiring O(n² d) operations, where d is the embedding dimension. Optimizations like FlashAttention avoid materializing the full matrix, but the compute remains quadratic and the KV cache still grows with n, so the memory footprint remains substantial.
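The formula can be sketched in a few lines of plain Python. This is a deliberately naive, single-head, non-causal version written for readability rather than speed; it is not how Ollama or any production runtime implements attention. The helper score_entries is a hypothetical name that simply counts the n×n scores.

```python
import math


def attention(Q, K, V):
    """Naive scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per (query, key) pair: this is the n x n part.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Numerically stable softmax over the scores for this query.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


def score_entries(n_tokens):
    """Number of pairwise scores for a sequence of n_tokens."""
    return n_tokens * n_tokens
```

Going from 2,048 to 8,192 tokens multiplies score_entries by 16, which is the quadratic growth described above.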

Real-World Impact on Windows 11

The impact is not theoretical. Benchmarks on a Windows 11 machine with an RTX 5080 (16 GB VRAM) running the gpt-oss:20b model revealed stark differences:

  • Large context (likely 32k+): ~9 tokens/sec, 0% GPU utilization (entirely CPU-bound)
  • Reduced to 8,192 tokens: ~43 tokens/sec, partial GPU usage
  • Cut further to 4,096 tokens: ~86 tokens/sec, near-full GPU saturation

That’s nearly a tenfold improvement from a single parameter change. When Ollama cannot fit the model and its KV cache into VRAM, it silently offloads some or all of the work to system RAM and the CPU, cratering throughput. The ollama ps command reveals the truth: look for “100% GPU” in the Processor column. Anything less means the model plus its context does not fit in VRAM, and a narrower context window is the first thing to try.

Adjusting Context Length in Ollama

GUI: Quick and Casual

Ollama’s latest Windows desktop app sports a straightforward slider in Settings that lets you pick from preset context sizes—typically ranging from 4k to 128k tokens. Dragging it to a lower setting is the quickest way to speed up responses for short interactions.

CLI: Precision and Persistence

For scripts, repeatable setups, or exact values, the command line is indispensable. Launch Ollama interactively:

ollama run <modelname>

Then, inside the REPL, set the context:

/set parameter num_ctx 8192

Save a variant for later:

/save mymodel-8k

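Alternatively, set a default for every model through the OLLAMA_CONTEXT_LENGTH environment variable before starting the server (ollama run itself has no flag for num_ctx). In PowerShell, after quitting the desktop app if it is already running the server:

$env:OLLAMA_CONTEXT_LENGTH = "4096"
ollama serve

Then, in a second terminal:

ollama run gemma3:12b --verbose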

Caveat: The /save command has had quirks in earlier versions—if you encounter errors, upgrade Ollama or create a Modelfile instead (see Advanced section).
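For scripted setups, the context window can also be passed per request through Ollama’s local REST API, using the num_ctx field in options. The sketch below assumes the server is listening on the default http://localhost:11434 and uses the eval_count and eval_duration fields of a non-streaming /api/generate response to compute throughput. The function names are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def build_request(model, prompt, num_ctx):
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model, prompt, num_ctx, url=OLLAMA_URL):
    """Send the request to a running Ollama server and return the parsed JSON."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, tokens_per_second(generate("gemma3:12b", "Why is the sky blue?", 4096)) gives a figure comparable to the eval rate printed by --verbose.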

Benchmarking and Verifying GPU Usage

Always measure before and after tuning:

  1. Run with verbose output to see tokens-per-second:
    ollama run <model> --verbose
    After each response, look for the “eval rate” line, which is reported in tokens per second.

  2. Check GPU/CPU placement:
    ollama ps
    The Processor field shows if the model is loaded on GPU, CPU, or split.

  3. Cross-validate with system tools:
    - NVIDIA: nvidia-smi to monitor VRAM usage and GPU load.
    - Task Manager: GPU Performance tab (though less precise).

If ollama ps reports 100% CPU or low GPU usage, reduce num_ctx until the GPU saturates.
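The same placement check can be scripted. The sketch below assumes the server’s /api/ps endpoint on the default port and relies on the size and size_vram fields it reports for each loaded model. The placement labels are formatted to resemble the ollama ps Processor column, but they are produced by this script, not by Ollama.

```python
import json
import urllib.request


def gpu_fraction(entry):
    """Share of a loaded model that resides in VRAM, from 0.0 to 1.0."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return entry.get("size_vram", 0) / size


def placement(entry):
    """Human-readable placement label for one loaded model."""
    frac = gpu_fraction(entry)
    if frac >= 1.0:
        return "100% GPU"
    if frac <= 0.0:
        return "100% CPU"
    return f"{round((1 - frac) * 100)}%/{round(frac * 100)}% CPU/GPU"


def running_models(url="http://localhost:11434/api/ps"):
    """Fetch the list of currently loaded models from a running server."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("models", [])
```

A loop such as for m in running_models(): print(m["name"], placement(m)) flags any model that is not fully resident on the GPU.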

Practical Tuning Guide by Hardware Class

VRAM Size | Recommended Context | Notes
Integrated / 0–4 GB | 2k–4k tokens | Keep models small (3B–7B, quantized) and context minimal.
6–8 GB (e.g., RTX 3060 mobile) | 2k–8k tokens | Use Q4/Q8 quantized 8B–13B models; test with --verbose.
12–16 GB (e.g., RTX 4080, 5080) | 4k–16k tokens | For 13B–20B models, 4k often saturates GPU; 8k may work for smaller quants.
24–48 GB (e.g., RTX 4090) | 8k–32k tokens | 20B–70B models, quantized; larger contexts possible but benchmark first.
80 GB+ (A100/H100 class) | 32k–128k tokens | Can handle largest open models at full context, but still validate.

Remember: Model quantization dramatically shrinks VRAM needs. A Q4 build needs roughly half the weight memory of a Q8 build, which leaves room for wider contexts. Always start low and increase gradually, benchmarking each step.
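The quantization arithmetic can be sketched as follows. This is a simplified estimate that treats a quantization level as a flat number of bits per weight; real formats store extra scaling data, so actual files are somewhat larger. The 1 GB overhead_gb allowance for runtime buffers is a guess, not a measured value.

```python
def weight_gb(n_params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return n_params_billion * bits_per_weight / 8


def fits(vram_gb, weights_gb, kv_cache_gb, overhead_gb=1.0):
    """True if weights, KV cache, and an assumed overhead fit in VRAM."""
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb


# A 20B-parameter model at 4 and 8 bits per weight.
q4 = weight_gb(20, 4)  # 10.0 GB
q8 = weight_gb(20, 8)  # 20.0 GB
print(fits(16, q4, kv_cache_gb=1.0), fits(16, q8, kv_cache_gb=1.0))
```

On a 16 GB card the 4-bit estimate leaves a few gigabytes for context, while the 8-bit estimate does not fit at all, which is why the table pairs mid-range cards with Q4-class models.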

Advanced: Creating Multiple Presets with Modelfiles

For users who frequently switch between short Q&A and long-document analysis, saved model variants prevent constant reconfiguration.

Using /save Inside REPL

Interactively set num_ctx and save:

/set parameter num_ctx 4096
/save mymodel-fast

Then launch with ollama run mymodel-fast.

Using a Modelfile

Create a text file (e.g., fast.modelfile):

FROM <base-model>
PARAMETER num_ctx 4096

Then:

ollama create mymodel-fast -f ./fast.modelfile

This approach is explicit, reproducible, and avoids the /save quirks that have surfaced in GitHub issues.
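When several presets are needed, the Modelfiles can be generated rather than typed by hand. The sketch below writes one file per preset and returns the matching ollama create commands for you to run. The preset names and the out_dir layout are hypothetical; only the FROM and PARAMETER num_ctx lines come from the Modelfile shown above.

```python
from pathlib import Path


def modelfile_text(base_model, num_ctx):
    """Contents of a minimal Modelfile that pins the context length."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def create_command(name, path):
    """The ollama create invocation for one preset."""
    return f"ollama create {name} -f {path}"


def write_presets(base_model, presets, out_dir="."):
    """Write one Modelfile per preset; return the commands to register them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = []
    for name, num_ctx in presets.items():
        path = out / f"{name}.modelfile"
        path.write_text(modelfile_text(base_model, num_ctx), encoding="utf-8")
        commands.append(create_command(name, path))
    return commands
```

For example, write_presets("gemma3:12b", {"mymodel-fast": 4096, "mymodel-long": 16384}) produces two files and two commands, one for short Q&A and one for long documents.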

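Storage Note: Variants created from the same base model reuse its weight files, so a preset that only changes num_ctx adds very little on disk. Pulling a different base model or quantization, however, can cost tens of gigabytes. Balance convenience against disk space.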

Troubleshooting Common Pitfalls

  • Model stays on CPU despite smaller context: Update your NVIDIA driver, confirm that Ollama supports your GPU generation, and verify placement with ollama ps. If the GPU is still unused, check the Ollama server log for GPU detection errors.
  • /save fails or creates odd names: This is a known issue in older Ollama builds. Upgrade to the latest version, or fall back to Modelfiles.
  • Performance still poor after lowering context: Check for background GPU tasks, confirm the quantization level (higher-precision quants such as Q8 need more VRAM than Q4), and try an even smaller context.
  • Model claims 128k max context but Ollama loads it with a much smaller window: Ollama sets a conservative default (2,048 tokens in older releases; newer releases use a larger default) to avoid OOM errors. You must explicitly set num_ctx to unlock higher limits; just ensure your VRAM can handle it.

Risks, Limitations, and Best Practices

  • Accuracy trade-off: Shrinking context may cause the model to “forget” earlier parts of a long conversation or document. Use chunking and retrieval-augmented generation (RAG) for large inputs rather than cramming everything into one prompt.
  • Thermal and power: Sustained high-GPU usage generates heat—monitor temps and ensure adequate cooling, especially during long high-context runs.
  • Storage growth: Presets built on the same base model share its weights, but each additional base model or quantization can add tens of gigabytes. Delete unused models regularly with ollama rm <name>.
  • License compliance: Open-weight models come with licenses—check them before production use.
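The chunking half of that advice can be sketched simply. The function below splits a document into overlapping windows of words so that each piece fits comfortably inside a small context. It counts words, not model tokens, and leaves out the retrieval step entirely; the default sizes are arbitrary assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of a little duplicated text.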

Quick Tuning Checklist for Windows 11

  1. Update Ollama and GPU drivers.
  2. Start with a modest num_ctx (2,048 or 4,096) and benchmark: ollama run <model> --verbose.
  3. Raise the context only as far as you need. If ollama ps then shows a CPU split or throughput drops, halve it (e.g., 16k → 8k → 4k) and retest.
  4. Once the model sits fully on the GPU, save the sweet-spot variant via /save or a Modelfile.
  5. For long documents, use retrieval + chunking instead of a huge context.
  6. Keep an eye on VRAM usage and thermals.
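The halving step in the checklist can be sketched as a small search loop. The is_fully_on_gpu callback is a placeholder for whatever check you use in practice, such as reloading the model at that context and reading ollama ps; it is an assumption of this example, not something Ollama provides.

```python
def tune_context(start_ctx, min_ctx, is_fully_on_gpu):
    """Halve the context until the model fits on the GPU.

    Returns (chosen_ctx, tried). chosen_ctx is None if no tested value
    at or above min_ctx passed the check.
    """
    ctx = start_ctx
    tried = []
    while ctx >= min_ctx:
        tried.append(ctx)
        if is_fully_on_gpu(ctx):
            return ctx, tried
        ctx //= 2
    return None, tried


# Stand-in check for illustration: pretend anything up to 4,096 tokens fits.
chosen, tried = tune_context(16384, 2048, lambda ctx: ctx <= 4096)
print(chosen, tried)
```

If the loop returns None, the model does not fit even at the floor you set, and a smaller model or a lower-precision quantization is the next thing to try.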

The Bottom Line

Mastering context length is a rite of passage for Windows 11 users serious about local AI. With Ollama’s maturing toolset—a friendly GUI slider for quick experiments and a powerful CLI for fine-grained control—you can toggle between snappy performance and deep recall on demand. As open-weight models push boundaries toward 128k and beyond, the discipline to dial in context only as wide as necessary will separate a frustrating experience from a fluid one. Measure, tune, and save your presets; your GPU will thank you.