
ApexSpriteAI runs AI inference entirely on your own hardware using LM Studio, an application that hosts open-source large language models and exposes them through a local API server. Your prompts never leave your network, you pay nothing per token, and you can swap models in seconds to trade speed for capability depending on the task at hand. Choosing the right model for your workload is the single most effective way to make your agent feel fast and reliable.

How LM Studio fits into ApexSpriteAI

LM Studio runs on your GPU server and acts as a drop-in replacement for the Anthropic API. It listens on port 1234 and accepts requests in the same /v1/messages format that the Claude Code CLI uses by default. When you set the ANTHROPIC_BASE_URL environment variable to point at your LM Studio server, the CLI routes all model requests there transparently — no code changes required.
export ANTHROPIC_BASE_URL=http://<your-server-ip>:1234
LM Studio must be running with server mode enabled and bound to 0.0.0.0 (all interfaces) so that your local machine can reach it over the network. The default port is 1234.
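Before pointing the CLI at the server, it is worth confirming the endpoint is reachable from your development machine. The commands below are a minimal sketch: lms server start assumes LM Studio's lms command-line tool is installed on the GPU server (you can also enable the server from the app's Developer tab), and /v1/models is LM Studio's model-listing endpoint.

# On the GPU server: start the API server on the default port
lms server start --port 1234

# From your development machine: confirm the server responds and list loaded models
curl http://<your-server-ip>:1234/v1/models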
Three models cover the full range of use cases in ApexSpriteAI. All three run well on a system with an NVIDIA GPU and 128 GB of unified memory.
| Model | Size | Speed | Best for |
| --- | --- | --- | --- |
| Qwen2.5-Coder-32B-Instruct | 32B | Extremely fast | Coding tasks, tool use, everyday agent work |
| Llama-3.3-70B-Instruct | 70B | Fast | Complex reasoning, architectural planning |
| DeepSeek-Coder-V2-Lite-Instruct | 16B | Blazing fast | Code completion, lightweight or high-volume tasks |
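
If you prefer working from the command line, the models can be downloaded ahead of time so they are ready to load when needed. A rough sketch, assuming LM Studio's lms tool is installed; the exact model identifiers in LM Studio's catalog may differ from the names shown here.

# Download the three recommended models (identifiers are illustrative)
lms get qwen2.5-coder-32b-instruct
lms get llama-3.3-70b-instruct
lms get deepseek-coder-v2-lite-instruct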

Qwen2.5-Coder-32B-Instruct

This is the recommended starting point for most users. Qwen2.5-Coder-32B delivers state-of-the-art performance for coding and tool use at the 32B parameter scale, matching Claude 3.5 Sonnet on several coding benchmarks. Its smaller footprint means near-instant responses (the snappy feel you want from an interactive coding assistant) while still handling multi-step MCP tool chains reliably. It supports 32k–64k context windows on standard hardware without a significant latency penalty, which is more than enough for most real-world codebases.
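
To take advantage of the larger context window, set it when the model is loaded. A minimal sketch, again assuming the lms CLI; the --context-length flag and the model identifier shown here are assumptions to check against your installed version.

# Load Qwen2.5-Coder-32B with a 32k-token context window
lms load qwen2.5-coder-32b-instruct --context-length 32768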

Llama-3.3-70B-Instruct

Choose Llama 3.3 70B when you need deeper reasoning. It is slightly slower than Qwen 32B due to its larger parameter count, but it produces more thorough analysis on complex architectural decisions, cross-file refactors, and tasks that require the model to hold many dependencies in mind simultaneously. It is a good model for planning sessions where response latency matters less than response quality.

DeepSeek-Coder-V2-Lite-Instruct (16B)

DeepSeek-Coder V2 Lite is the fastest option in the lineup. At 16B parameters it fits comfortably in memory with headroom to spare, which translates into very low latency even for longer prompts. It handles standard coding tasks and code completion well. For highly complex, multi-step MCP chains — where the model must plan several tool calls in sequence — the larger models are more reliable.

Choosing the right model

Prioritize speed

Use Qwen2.5-Coder-32B or DeepSeek-Coder-V2-Lite when you want fast, interactive responses. Both handle tool calling well, but Qwen 32B is the better choice when tool-call reliability matters most.

Prioritize reasoning depth

Use Llama-3.3-70B for tasks that require complex multi-step planning, deep logical inference, or working through large amounts of context before producing an answer.

Running many tasks

Use DeepSeek-Coder-V2-Lite when running high volumes of smaller tasks in quick succession. Its low memory footprint leaves room for other processes and keeps latency consistent.

Balanced everyday use

Qwen2.5-Coder-32B is the best all-rounder. It covers 90% of agent use cases well and keeps response times short enough that waiting for the model never becomes a bottleneck.

Hardware requirements

You do not need to run LM Studio on the same machine as your CLI. ApexSpriteAI is designed to connect to a remote GPU server over a secure network connection (such as Tailscale), so your development laptop can stay lightweight while the heavy inference happens elsewhere.
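In that setup the only client-side change is the base URL, which points at the GPU server's address on the private network. A sketch assuming a Tailscale MagicDNS hostname of gpu-server; substitute your own hostname or Tailscale IP.

# Route the CLI's requests to the remote LM Studio server over Tailscale
export ANTHROPIC_BASE_URL=http://gpu-server:1234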
To run the recommended models comfortably, your LM Studio server should meet these minimums:
| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA GPU (any CUDA-capable) | NVIDIA with 128 GB unified RAM |
| System RAM | 64 GB | 128 GB DDR5 |
| Storage | 100 GB free NVMe | 2 TB NVMe |
| LM Studio version | v0.4.1 | Latest stable |
The 70B model in particular benefits significantly from 128 GB of unified RAM. With less memory available, the model may be partially offloaded to CPU, which dramatically increases response latency. If your hardware has less than 128 GB, the 32B or 16B models will deliver a much better experience.