The speed you experience with ApexSpriteAI depends primarily on three factors: the number of parameters in the model you load, the size of the context window you configure, and the latency of the network path between your machine and the LM Studio server. A 120B model on 128 GB of GPU memory can produce thoughtful, accurate output but may take 30 seconds or more to begin a response. Dropping to a 32B model on the same hardware delivers near-instant replies with only a modest reduction in capability. This page explains why each factor matters and gives you concrete steps to tune the system for the workload you care about.
Why model size determines speed
Every token the model generates requires a forward pass through all of its parameter layers. A 120B model has roughly four times the parameters of a 32B model, which means each generated token takes approximately four times as long to compute, regardless of how much memory you have available. This is token evaluation time — it scales linearly with parameter count and cannot be parallelized away by adding more memory alone. On a 128 GB NVIDIA GPU, the practical effect is:

| Model size | Quantization | Approx. tokens/sec | Best use case |
|---|---|---|---|
| 120B | Q4 | ~5–8 | Deep research tasks where quality outweighs speed |
| 70B | Q4 | ~15–20 | Complex multi-step planning and architectural design |
| 32B | Q4 | ~40–55 | Interactive coding sessions, code review, refactoring |
| 16B | Q4 | ~80–100 | Fast code completion and simple Q&A |
These figures are approximate and vary with context length, batch size, and system load. Measure your actual throughput with a representative prompt before committing to a model for a long session.
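As a rough sanity check, the linear-scaling rule above can be turned into a first-order estimator. This is a sketch, not a benchmark: the function names and the 48 tok/s baseline are illustrative, and real throughput for larger tiers (see the table) comes in below the first-order prediction because of memory bandwidth and quantization effects, so treat the estimate as an upper bound.

```python
# First-order estimate: generation speed scales roughly inversely with
# parameter count, since each token is one forward pass through every layer.
# Baseline throughput (48 tok/s on a 32B model) is an illustrative number.

def estimated_tps(base_params_b: float, base_tps: float, target_params_b: float) -> float:
    """Estimate tokens/sec for a target model size from a measured baseline."""
    return base_tps * (base_params_b / target_params_b)

def response_seconds(n_tokens: int, tps: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time for a response: time-to-first-token plus generation."""
    return ttft_s + n_tokens / tps

# Example: measured ~48 tok/s on a 32B model; estimate a 120B model.
tps_120b = estimated_tps(32, 48.0, 120)    # 12.8 tok/s, first-order only
t_500 = response_seconds(500, tps_120b)    # ~39 s for a 500-token reply
```

In practice the 120B tier lands closer to 5–8 tok/s than the 12.8 this predicts, which is exactly why the page recommends measuring with a representative prompt rather than extrapolating.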
Recommended model tiers
Interactive coding: Qwen2.5-Coder-32B-Instruct
For day-to-day coding work — writing functions, reviewing pull requests, debugging, and calling MCP tools — use Qwen2.5-Coder-32B-Instruct. It delivers response-start latency under two seconds on 128 GB hardware, handles 32k–64k context without a noticeable slowdown, and matches Claude 3.5 Sonnet on most code generation benchmarks.
Complex planning: Llama-3.3-70B-Instruct
When you need the model to reason through a large architectural problem, plan a multi-service refactor, or work through deep logic chains, switch to Llama-3.3-70B-Instruct. Expect roughly 2–3× slower token generation than Qwen 32B. Limit use of the 70B model to planning phases and return to 32B for implementation.
Optimizing context window size
The context window has two effects on performance. First, a larger window increases the memory footprint of the key-value cache, which reduces the headroom available for storing model weights in high-speed memory. Second, attention computation scales quadratically with sequence length, so very long contexts slow down generation even when memory is not a constraint. Practical guidance:
- Start with 32,000 tokens. This is sufficient for most coding sessions and gives the fastest responses.
- Increase to 64,000 tokens only when you are working with files or histories that cause truncation errors.
- Avoid context windows above 64k on 32B models unless you have verified that your workload genuinely requires it. The speed penalty at 128k context is significant even on 128 GB hardware.
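The KV-cache cost mentioned above is easy to estimate. The sketch below assumes an architecture resembling a typical 32B model with grouped-query attention (64 layers, 8 KV heads, head dimension 128, fp16 cache) — these are assumed values, not the actual config of any specific model, so check your model's configuration before relying on the numbers.

```python
# Rough KV-cache footprint: 2 (key + value) x layers x kv_heads x head_dim
# x bytes per element, per token of context. The defaults (64 layers,
# 8 KV heads, head dim 128, fp16) are ASSUMED values resembling a 32B
# GQA model; substitute your model's real config.

def kv_bytes_per_token(layers: int = 64, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gib(context_tokens: int, **kw) -> float:
    return kv_bytes_per_token(**kw) * context_tokens / 2**30

# With these assumptions: 256 KiB per token of context, so
#   ~7.8 GiB of cache at a 32k window
#   ~31 GiB of cache at a 128k window
# memory that is no longer available for model weights.
```

This is why quadrupling the window from 32k to 128k hurts twice: the attention math gets slower, and tens of gigabytes of headroom disappear.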
Set context window in LM Studio
Open LM Studio, load your model, and navigate to Model Settings → Context Length. Enter your target value and reload the model for the change to take effect.
Monitor memory headroom
After loading the model at your chosen context length, check GPU memory usage. If memory utilization is above 90%, reduce the context window by 8,000–16,000 tokens and reload. Running at the memory limit causes swap-induced slowdowns that eliminate any benefit from a larger context.
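One way to script the 90% check is to parse the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, which prints the two values in MiB. The sketch below works on a captured output line (the sample numbers are hypothetical); in a live script you would feed it the result of running that command via `subprocess`.

```python
# Parse one line of `nvidia-smi --query-gpu=memory.used,memory.total
# --format=csv,noheader,nounits` output (MiB values) and flag
# utilization above the 90% guideline.

def memory_headroom(smi_line: str, limit: float = 0.90):
    used_mib, total_mib = (float(x) for x in smi_line.split(","))
    utilization = used_mib / total_mib
    return utilization, utilization > limit

# Hypothetical sample: 121000 MiB used of 131072 MiB total (~92%).
util, over = memory_headroom("121000, 131072")
# `over` is True here, so per the guidance above you would shrink the
# context window by 8,000-16,000 tokens and reload the model.
```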
Keeping the model loaded between sessions
LM Studio unloads the model from GPU memory when you close the application or when the server times out an idle connection. Reloading a 32B model takes 15–30 seconds. To avoid this penalty during active work:
- Leave LM Studio open with the server running throughout your session.
- If you use the GPU server as a headless machine, configure LM Studio to start at login so the model is ready when you connect via Tailscale.
- Avoid switching between large models frequently. Each swap incurs a full unload and reload cycle.
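The cost of frequent swaps is simple to quantify from the reload figures above. This is back-of-the-envelope arithmetic with illustrative swap counts, not measured data.

```python
# Time lost to model reload cycles, using the 15-30 s reload figures
# quoted above (25 s taken as a midpoint). Swap counts are illustrative.

def reload_overhead_minutes(swaps: int, reload_s: float = 25.0) -> float:
    return swaps * reload_s / 60.0

# Swapping between a 32B and a 70B model ten times in a day costs about
# 4.2 minutes of pure waiting, before counting the focus each pause breaks.
```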
Network latency: Tailscale vs. localhost
When LM Studio runs on the same machine as Claude Code, network round-trip time is negligible (sub-millisecond). Over Tailscale, the round trip adds roughly 1–5 ms for machines on the same physical LAN, or 10–30 ms over the internet. At 40 tokens/sec, a 30 ms round trip adds well under one percent to total response time, so network latency is almost never the bottleneck when your model is in the 32B–70B range. The exception is the very first token — the time-to-first-token (TTFT) — which is dominated by prompt processing rather than generation. For a 32k-token context, TTFT is typically 2–5 seconds on 32B and 8–15 seconds on 70B, making network latency insignificant by comparison.
Quick reference: speed optimization checklist
- Use Qwen2.5-Coder-32B-Instruct for interactive coding sessions.
- Use Llama-3.3-70B only for complex planning tasks when you can tolerate 2–3× slower generation.
- Set context window to 32k by default; increase to 64k only when needed.
- Keep LM Studio running with the model loaded throughout your session.
- Confirm GPU memory utilization is below 90% after loading your model.
- If switching to a larger model, close other GPU-intensive processes first.