> ## Documentation Index
> Fetch the complete documentation index at: https://docs-apexspriteai.reliatrack.org/llms.txt
> Use this file to discover all available pages before exploring further.

# Local AI models in ApexSpriteAI: LM Studio and model tiers

> ApexSpriteAI uses LM Studio to run open-source LLMs on your own GPU hardware. Learn which models work best and how they compare in speed and capability.

ApexSpriteAI runs AI inference entirely on your own hardware using LM Studio, an application that hosts open-source large language models and exposes them through a local API server. Your prompts never leave your network, you pay nothing per token, and you can swap models in seconds to trade speed for capability depending on the task at hand. Choosing the right model for your workload is the single most effective way to make your agent feel fast and reliable.

## How LM Studio fits into ApexSpriteAI

LM Studio runs on your GPU server and acts as a drop-in replacement for the Anthropic API. It listens on port `1234` and accepts requests in the same `/v1/messages` format that the Claude Code CLI uses by default. When you set the `ANTHROPIC_BASE_URL` environment variable to point at your LM Studio server, the CLI routes all model requests there transparently — no code changes required.

```bash theme={null}
export ANTHROPIC_BASE_URL=http://<your-server-ip>:1234
```

<Note>
  LM Studio must be running with **server mode enabled** and bound to `0.0.0.0` (all interfaces) so that your local machine can reach it over the network. The default port is `1234`.
</Note>

## Recommended models

Three models cover the full range of use cases in ApexSpriteAI. All three run well on a 128 GB unified RAM NVIDIA GPU system.

| Model                               | Size | Speed          | Best for                                          |
| ----------------------------------- | ---- | -------------- | ------------------------------------------------- |
| **Qwen2.5-Coder-32B-Instruct**      | 32B  | Extremely fast | Coding tasks, tool use, everyday agent work       |
| **Llama-3.3-70B-Instruct**          | 70B  | Fast           | Complex reasoning, architectural planning         |
| **DeepSeek-Coder-V2-Lite-Instruct** | 16B  | Blazing fast   | Code completion, lightweight or high-volume tasks |

### Qwen2.5-Coder-32B-Instruct

This is the recommended starting point for most users. Qwen2.5-Coder-32B delivers state-of-the-art performance for coding and tool use at the 32B parameter scale, matching Claude 3.5 Sonnet on several coding benchmarks. Its smaller footprint means you get near-instant responses — the "snippy" feel you want from an interactive coding assistant — while still handling multi-step MCP tool chains reliably.

It supports 32k–64k context windows on standard hardware without a significant latency penalty, which is more than enough for most real-world codebases.

### Llama-3.3-70B-Instruct

Choose Llama 3.3 70B when you need deeper reasoning. It is slightly slower than Qwen 32B due to its larger parameter count, but it produces more thorough analysis on complex architectural decisions, cross-file refactors, and tasks that require the model to hold many dependencies in mind simultaneously. It is a good model for planning sessions where response latency matters less than response quality.

### DeepSeek-Coder-V2-Lite-Instruct (16B)

DeepSeek-Coder V2 Lite is the fastest option in the lineup. At 16B parameters it fits comfortably in memory with headroom to spare, which translates into very low latency even for longer prompts. It handles standard coding tasks and code completion well. For highly complex, multi-step MCP chains — where the model must plan several tool calls in sequence — the larger models are more reliable.

## Choosing the right model

<CardGroup cols={2}>
  <Card title="Prioritize speed" icon="zap">
    Use **Qwen2.5-Coder-32B** or **DeepSeek-Coder-V2-Lite** when you want fast, interactive responses. Both handle tool calling well. Qwen 32B is the better choice if tool reliability matters.
  </Card>

  <Card title="Prioritize reasoning depth" icon="brain">
    Use **Llama-3.3-70B** for tasks that require complex multi-step planning, deep logical inference, or working through large amounts of context before producing an answer.
  </Card>

  <Card title="Running many tasks" icon="layers">
    Use **DeepSeek-Coder-V2-Lite** when running high volumes of smaller tasks in quick succession. Its low memory footprint leaves room for other processes and keeps latency consistent.
  </Card>

  <Card title="Balanced everyday use" icon="sliders">
    **Qwen2.5-Coder-32B** is the best all-rounder. It covers 90% of agent use cases well and keeps response times short enough that waiting for the model never becomes a bottleneck.
  </Card>
</CardGroup>

## Hardware requirements

<Tip>
  You do not need to run LM Studio on the same machine as your CLI. ApexSpriteAI is designed to connect to a remote GPU server over a secure network connection (such as Tailscale), so your development laptop can stay lightweight while the heavy inference happens elsewhere.
</Tip>

To run the recommended models comfortably, your LM Studio server should meet these minimums:

| Component         | Minimum                       | Recommended                    |
| ----------------- | ----------------------------- | ------------------------------ |
| GPU               | NVIDIA GPU (any CUDA-capable) | NVIDIA with 128 GB unified RAM |
| System RAM        | 64 GB                         | 128 GB DDR5                    |
| Storage           | 100 GB free NVMe              | 2 TB NVMe                      |
| LM Studio version | v0.4.1                        | Latest stable                  |

The 70B model in particular benefits significantly from 128 GB of unified RAM. With less memory available, the model may be partially offloaded to CPU, which dramatically increases response latency. If your hardware has less than 128 GB, the 32B or 16B models will deliver a much better experience.