The model you load in LM Studio has the biggest impact on how your ApexSpriteAI agent feels to use. A model that is too large makes every response feel sluggish; one that is too small may struggle with complex tool chains or multi-step reasoning. This guide compares the three recommended models — covering their strengths, tradeoffs, and the hardware they require — so you can make an informed choice before downloading several gigabytes.

Model comparison

| Model | Parameters | Speed | Best for | Context window |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder-32B-Instruct | 32B | Extremely fast | Coding, tool use, everyday tasks | 32k–64k tokens |
| Llama-3.3-70B-Instruct | 70B | Fast | Complex reasoning, architecture planning | 128k tokens |
| DeepSeek-Coder-V2-Lite-Instruct | 16B | Blazing fast | Standard coding, code completion | 128k tokens |

Model profiles

Qwen2.5-Coder-32B-Instruct

Recommended for most users. This model delivers state-of-the-art coding and tool-use performance at 32B parameters. It nearly matches Claude 3.5 Sonnet on coding benchmarks and handles 32k–64k context without a noticeable slowdown on hardware with 64 GB or more of memory.

Llama-3.3-70B-Instruct

Best for complex reasoning. At 70B parameters, Llama 3.3 offers deeper general reasoning and is well-suited to tasks like system design, architectural planning, and multi-turn problem-solving where raw coding speed is less important than thoroughness.

DeepSeek-Coder-V2-Lite-Instruct

Best when speed is the priority. The 16B Lite variant responds almost instantly, making it ideal for rapid iteration, code completion, and lightweight scripting tasks. It may struggle with highly complex, multi-step MCP tool chains compared to the larger models.

120B+ models

Not recommended for interactive use. Models above 100B parameters incur significant per-token latency: even on 128 GB of unified RAM, you can expect 30+ seconds of “thinking” time per response. If you have loaded a 120B model and find responses slow, switch to Qwen2.5-Coder-32B for an immediate improvement, and reserve the largest models for batch or offline workloads.
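
To see why, a back-of-envelope estimate is enough: autoregressive decoding is typically memory-bandwidth-bound, so generation speed is roughly memory bandwidth divided by the bytes of weights read per token. The numbers in the sketch below (a ~70 GB 4-bit quantization of a 120B model, ~400 GB/s of unified-memory bandwidth, a 500-token response) are illustrative assumptions, not measurements.

```python
# Back-of-envelope decode-speed estimate for a 100B+ model on unified memory.
# Assumption: decoding is memory-bandwidth-bound, so each generated token
# streams roughly all model weights through memory once.

weights_gb = 70.0       # assumed size of a 120B model at ~4-bit quantization
bandwidth_gbps = 400.0  # assumed unified-memory bandwidth in GB/s
response_tokens = 500   # assumed typical agent response length

tokens_per_second = bandwidth_gbps / weights_gb
total_seconds = response_tokens / tokens_per_second

print(f"~{tokens_per_second:.1f} tokens/s, ~{total_seconds:.0f} s per response")
# With these assumptions: roughly 5.7 tokens/s, so well over a minute
# for a full response, consistent with the latency described above.
```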

When to pick each model

Qwen2.5-Coder-32B-Instruct

Choose this model when:
  • You want the fastest interactive coding assistant
  • Your workflows rely heavily on MCP tools (the model excels at deciding when and how to invoke tools)
  • You are working on code generation, refactoring, debugging, or code review
  • You want a single model that handles the widest range of daily tasks reliably

Llama-3.3-70B-Instruct

Choose this model when:
  • You need the agent to reason through complex architectural decisions before writing code
  • You are planning a large feature, breaking down a system design, or analyzing competing approaches
  • Response latency of a few extra seconds per message is acceptable
  • You have 128 GB of RAM available and want the best reasoning quality in the 32B–70B range

DeepSeek-Coder-V2-Lite-Instruct

Choose this model when:
  • You need the absolute fastest response times and are working on standard coding tasks
  • Your hardware has less available memory and cannot comfortably run a 32B model
  • You are doing rapid prototyping, autocomplete-style assistance, or simple scripts
  • You do not rely on multi-step MCP tool chains in your workflow

Hardware requirements

| Model | Minimum RAM | Recommended RAM | Quantization |
| --- | --- | --- | --- |
| DeepSeek-Coder-V2-Lite (16B) | 16 GB | 24 GB | Q4_K_M or higher |
| Qwen2.5-Coder-32B | 32 GB | 48–64 GB | Q4_K_M or Q5_K_M |
| Llama-3.3-70B | 64 GB | 80–96 GB | Q4_K_M |

These figures apply to GPU VRAM or unified memory (Apple Silicon / NVIDIA with NVLink). If your server runs inference from system RAM, add 20–30% headroom to avoid memory pressure when working with long contexts.
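
If you want a rough sanity check before downloading, you can estimate a model's weight footprint from its parameter count and the quantization's approximate bits per weight. The sketch below uses rough community bits-per-weight figures for GGUF quantizations and the 20–30% headroom suggested above; it covers weights only, so the table's minimums, which also account for KV cache and runtime overhead, run higher.

```python
# Rough weight-footprint estimate: parameters * bits-per-weight / 8 bytes,
# plus headroom per the note above. Bits-per-weight values are approximate
# community figures for GGUF quantizations, not exact specifications.

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
}

def estimated_ram_gb(params_billions: float, quant: str, headroom: float = 0.25) -> float:
    # 1B params at 1 bit/weight = 0.125 GB, so billions * bpw / 8 gives GB.
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb * (1 + headroom)

for name, params, quant in [
    ("DeepSeek-Coder-V2-Lite", 16, "Q4_K_M"),
    ("Qwen2.5-Coder-32B", 32, "Q5_K_M"),
    ("Llama-3.3-70B", 70, "Q4_K_M"),
]:
    print(f"{name}: ~{estimated_ram_gb(params, quant):.0f} GB incl. 25% headroom")
```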

Quick decision guide

Do you primarily write and debug code?
  └─ Yes → Qwen2.5-Coder-32B-Instruct

Do you need deep reasoning for system design or architecture?
  └─ Yes → Llama-3.3-70B-Instruct

Is raw speed your top priority, and are your tasks straightforward?
  └─ Yes → DeepSeek-Coder-V2-Lite-Instruct

Do you have less than 32 GB of memory available?
  └─ Yes → DeepSeek-Coder-V2-Lite-Instruct
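
If you script your environment setup, the same logic is easy to encode. The sketch below simply restates the guide above; the function name, argument names, and thresholds are illustrative, and memory_gb means available VRAM or unified memory.

```python
# The decision guide above, restated as a function. Names and thresholds
# are illustrative; memory_gb is available VRAM or unified memory in GB.

def pick_model(memory_gb: int,
               needs_deep_reasoning: bool = False,
               speed_is_top_priority: bool = False) -> str:
    if memory_gb < 32:
        return "DeepSeek-Coder-V2-Lite-Instruct"  # only comfortable fit under 32 GB
    if needs_deep_reasoning and memory_gb >= 64:
        return "Llama-3.3-70B-Instruct"           # system design and architecture
    if speed_is_top_priority:
        return "DeepSeek-Coder-V2-Lite-Instruct"  # fastest responses
    return "Qwen2.5-Coder-32B-Instruct"           # default for coding work

print(pick_model(memory_gb=48))  # Qwen2.5-Coder-32B-Instruct
```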

Switching models

You can switch models at any time in LM Studio without restarting the server. Open the Developer tab, select a different model from the dropdown in the Local Server panel, and wait for it to load into memory. The Claude Code CLI will automatically use the newly loaded model on the next request.
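
To confirm which model is active before sending real traffic, you can query the local server's OpenAI-compatible /v1/models endpoint. The sketch below assumes the default port 1234 and uses only the Python standard library.

```python
# List the models served by LM Studio's OpenAI-compatible local server.
# Assumes the server is running on the default port 1234.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    models = json.load(resp)

for m in models.get("data", []):
    print(m["id"])  # identifier of each loaded model
```
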
If you switch models frequently, consider keeping two LM Studio windows open on separate ports (e.g., 1234 and 1235) with different models loaded. You can then swap between them by changing ANTHROPIC_BASE_URL in ~/.claude/config.json.
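
A small script can make that swap less error-prone. The exact layout of ~/.claude/config.json varies between Claude Code versions, so treat the following as a sketch: it assumes the value lives under a top-level "env" map and simply toggles between the two ports; adjust the key path if your file differs.

```python
# Toggle ANTHROPIC_BASE_URL between two LM Studio ports.
# Assumption: the value lives under a top-level "env" map in
# ~/.claude/config.json; adjust the key path if your file differs.
import json
from pathlib import Path

CONFIG = Path.home() / ".claude" / "config.json"
PORTS = ("1234", "1235")

config = json.loads(CONFIG.read_text())
env = config.setdefault("env", {})
current = env.get("ANTHROPIC_BASE_URL", f"http://localhost:{PORTS[0]}")

# Switch to whichever port is not currently configured.
new_port = PORTS[1] if PORTS[0] in current else PORTS[0]
env["ANTHROPIC_BASE_URL"] = f"http://localhost:{new_port}"

CONFIG.write_text(json.dumps(config, indent=2))
print(f"ANTHROPIC_BASE_URL -> http://localhost:{new_port}")
```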