ApexSpriteAI runs AI inference entirely on your own hardware using LM Studio, an application that hosts open-source large language models and exposes them through a local API server. Your prompts never leave your network, you pay nothing per token, and you can swap models in seconds to trade speed for capability depending on the task at hand. Choosing the right model for your workload is the single most effective way to make your agent feel fast and reliable.

## Documentation Index
Fetch the complete documentation index at: https://reliatrack.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
## How LM Studio fits into ApexSpriteAI
LM Studio runs on your GPU server and acts as a drop-in replacement for the Anthropic API. It listens on port 1234 and accepts requests in the same /v1/messages format that the Claude Code CLI uses by default. When you set the ANTHROPIC_BASE_URL environment variable to point at your LM Studio server, the CLI routes all model requests there transparently — no code changes required.
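Assuming the LM Studio server is reachable at 192.168.1.50 (a placeholder address — substitute your own GPU server's IP or hostname), the redirect is a single environment variable exported before launching the CLI:

```shell
# Route the Claude Code CLI to the local LM Studio server instead of the
# Anthropic API. The host below is a placeholder for your GPU server's
# address; 1234 is LM Studio's default server port.
export ANTHROPIC_BASE_URL="http://192.168.1.50:1234"
```

Put the export in your shell profile if you want every CLI session to use the local server by default.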
LM Studio must be running with server mode enabled and bound to 0.0.0.0 (all interfaces) so that your local machine can reach it over the network. The default port is 1234.

## Recommended models
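As a quick smoke test, a short script can post a minimal request to the /v1/messages endpoint described above. This is a sketch, not part of ApexSpriteAI itself: the server address and model identifier are placeholders you should replace with your own values.

```python
import json
import urllib.request

SERVER = "http://192.168.1.50:1234"  # placeholder -- your LM Studio host


def build_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a minimal payload in the /v1/messages format the CLI uses."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def ping_server(model: str = "qwen2.5-coder-32b-instruct") -> dict:
    """Send a one-line prompt to confirm the server answers over the network."""
    payload = json.dumps(build_request(model, "Reply with OK.")).encode()
    req = urllib.request.Request(
        f"{SERVER}/v1/messages",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(ping_server())
```

If the call times out, check that server mode is enabled and that LM Studio is bound to 0.0.0.0 rather than localhost.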
Three models cover the full range of use cases in ApexSpriteAI. All three run well on a 128 GB unified RAM NVIDIA GPU system.

| Model | Size | Speed | Best for |
|---|---|---|---|
| Qwen2.5-Coder-32B-Instruct | 32B | Extremely fast | Coding tasks, tool use, everyday agent work |
| Llama-3.3-70B-Instruct | 70B | Fast | Complex reasoning, architectural planning |
| DeepSeek-Coder-V2-Lite-Instruct | 16B | Blazing fast | Code completion, lightweight or high-volume tasks |
### Qwen2.5-Coder-32B-Instruct
This is the recommended starting point for most users. Qwen2.5-Coder-32B delivers state-of-the-art performance for coding and tool use at the 32B parameter scale, matching Claude 3.5 Sonnet on several coding benchmarks. Its smaller footprint means you get near-instant responses — the snappy feel you want from an interactive coding assistant — while still handling multi-step MCP tool chains reliably. It supports 32k–64k context windows on standard hardware without a significant latency penalty, which is more than enough for most real-world codebases.

### Llama-3.3-70B-Instruct
Choose Llama 3.3 70B when you need deeper reasoning. It is slightly slower than Qwen 32B due to its larger parameter count, but it produces more thorough analysis on complex architectural decisions, cross-file refactors, and tasks that require the model to hold many dependencies in mind simultaneously. It is a good model for planning sessions where response latency matters less than response quality.

### DeepSeek-Coder-V2-Lite-Instruct (16B)
DeepSeek-Coder V2 Lite is the fastest option in the lineup. At 16B parameters it fits comfortably in memory with headroom to spare, which translates into very low latency even for longer prompts. It handles standard coding tasks and code completion well. For highly complex, multi-step MCP chains — where the model must plan several tool calls in sequence — the larger models are more reliable.

## Choosing the right model
### Prioritize speed
Use Qwen2.5-Coder-32B or DeepSeek-Coder-V2-Lite when you want fast, interactive responses. Both handle tool calling well. Qwen 32B is the better choice if tool reliability matters.
### Prioritize reasoning depth
Use Llama-3.3-70B for tasks that require complex multi-step planning, deep logical inference, or working through large amounts of context before producing an answer.
### Running many tasks
Use DeepSeek-Coder-V2-Lite when running high volumes of smaller tasks in quick succession. Its low memory footprint leaves room for other processes and keeps latency consistent.
### Balanced everyday use
Qwen2.5-Coder-32B is the best all-rounder. It covers 90% of agent use cases well and keeps response times short enough that waiting for the model never becomes a bottleneck.
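The selection rules above can be condensed into a small helper. This is an illustrative sketch: the model identifiers are placeholders, so match them to the exact names your LM Studio instance reports for the models you have downloaded.

```python
def choose_model(needs_deep_reasoning: bool = False,
                 high_volume: bool = False,
                 needs_reliable_tools: bool = True) -> str:
    """Pick a model identifier following the guidance above.

    Identifiers are illustrative placeholders; match them to the names
    shown in your LM Studio model list.
    """
    if needs_deep_reasoning:
        # Planning sessions and cross-file refactors: depth beats latency.
        return "llama-3.3-70b-instruct"
    if high_volume and not needs_reliable_tools:
        # Many small tasks in quick succession: lowest latency wins.
        return "deepseek-coder-v2-lite-instruct"
    # Balanced default: fast, and dependable at multi-step tool calling.
    return "qwen2.5-coder-32b-instruct"
```

For example, `choose_model(needs_deep_reasoning=True)` selects the 70B model for a planning session, while the no-argument default lands on the Qwen all-rounder.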
## Hardware requirements
To run the recommended models comfortably, your LM Studio server should meet these minimums:

| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA GPU (any CUDA-capable) | NVIDIA with 128 GB unified RAM |
| System RAM | 64 GB | 128 GB DDR5 |
| Storage | 100 GB free NVMe | 2 TB NVMe |
| LM Studio version | v0.4.1 | Latest stable |