
ApexSpriteAI runs AI inference entirely on your own hardware using LM Studio, an application that hosts open-source large language models and exposes them through a local API server. Your prompts never leave your network, you pay nothing per token, and you can swap models in seconds to trade speed for capability depending on the task at hand. Choosing the right model for your workload is the single most effective way to make your agent feel fast and reliable.

How LM Studio fits into ApexSpriteAI

LM Studio runs on your GPU server and acts as a drop-in replacement for the Anthropic API. It listens on port 1234 and accepts requests in the same /v1/messages format that the Claude Code CLI uses by default. When you set the ANTHROPIC_BASE_URL environment variable to point at your LM Studio server, the CLI routes all model requests there transparently — no code changes required.
export ANTHROPIC_BASE_URL=http://<your-server-ip>:1234
LM Studio must be running with server mode enabled and bound to 0.0.0.0 (all interfaces) so that your local machine can reach it over the network. The default port is 1234.
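Before pointing the CLI at the server, it is worth confirming the endpoint is reachable from your development machine. The commands below are a minimal sketch: lms server start assumes LM Studio's lms command-line tool is installed on the GPU server (you can also enable the server from the app's Developer tab), and /v1/models is LM Studio's model-listing endpoint.

# On the GPU server: start the API server on the default port
lms server start --port 1234

# From your development machine: confirm the server responds and list loaded models
curl http://<your-server-ip>:1234/v1/models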
Three models cover the full range of use cases in ApexSpriteAI. All three run well on a system with an NVIDIA GPU and 128 GB of unified memory.
| Model | Size | Speed | Best for |
| --- | --- | --- | --- |
| Qwen2.5-Coder-32B-Instruct | 32B | Extremely fast | Coding tasks, tool use, everyday agent work |
| Llama-3.3-70B-Instruct | 70B | Fast | Complex reasoning, architectural planning |
| DeepSeek-Coder-V2-Lite-Instruct | 16B | Blazing fast | Code completion, lightweight or high-volume tasks |
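
If you prefer working from the command line, the models can be downloaded ahead of time so they are ready to load when needed. A rough sketch, assuming LM Studio's lms tool is installed; the exact model identifiers in LM Studio's catalog may differ from the names shown here.

# Download the three recommended models (identifiers are illustrative)
lms get qwen2.5-coder-32b-instruct
lms get llama-3.3-70b-instruct
lms get deepseek-coder-v2-lite-instruct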

Qwen2.5-Coder-32B-Instruct

This is the recommended starting point for most users. Qwen2.5-Coder-32B delivers state-of-the-art performance for coding and tool use at the 32B parameter scale, matching Claude 3.5 Sonnet on several coding benchmarks. Its smaller footprint means near-instant responses (the snappy feel you want from an interactive coding assistant) while still handling multi-step MCP tool chains reliably. It supports 32k–64k context windows on standard hardware without a significant latency penalty, which is more than enough for most real-world codebases.
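
To take advantage of the larger context window, set it when the model is loaded. A minimal sketch, again assuming the lms CLI; the --context-length flag and the model identifier shown here are assumptions to check against your installed version.

# Load Qwen2.5-Coder-32B with a 32k-token context window
lms load qwen2.5-coder-32b-instruct --context-length 32768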

Llama-3.3-70B-Instruct

Choose Llama 3.3 70B when you need deeper reasoning. It is slightly slower than Qwen 32B due to its larger parameter count, but it produces more thorough analysis on complex architectural decisions, cross-file refactors, and tasks that require the model to hold many dependencies in mind simultaneously. It is a good model for planning sessions where response latency matters less than response quality.

DeepSeek-Coder-V2-Lite-Instruct (16B)

DeepSeek-Coder V2 Lite is the fastest option in the lineup. At 16B parameters it fits comfortably in memory with headroom to spare, which translates into very low latency even for longer prompts. It handles standard coding tasks and code completion well. For highly complex, multi-step MCP chains — where the model must plan several tool calls in sequence — the larger models are more reliable.

Choosing the right model

Prioritize speed

Use Qwen2.5-Coder-32B or DeepSeek-Coder-V2-Lite when you want fast, interactive responses. Both handle tool calling well, but Qwen 32B is the better choice when tool-call reliability matters most.

Prioritize reasoning depth

Use Llama-3.3-70B for tasks that require complex multi-step planning, deep logical inference, or working through large amounts of context before producing an answer.

Running many tasks

Use DeepSeek-Coder-V2-Lite when running high volumes of smaller tasks in quick succession. Its low memory footprint leaves room for other processes and keeps latency consistent.

Balanced everyday use

Qwen2.5-Coder-32B is the best all-rounder. It covers 90% of agent use cases well and keeps response times short enough that waiting for the model never becomes a bottleneck.

Hardware requirements

You do not need to run LM Studio on the same machine as your CLI. ApexSpriteAI is designed to connect to a remote GPU server over a secure network connection (such as Tailscale), so your development laptop can stay lightweight while the heavy inference happens elsewhere.
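In that setup the only client-side change is the base URL, which points at the GPU server's address on the private network. A sketch assuming a Tailscale MagicDNS hostname of gpu-server; substitute your own hostname or Tailscale IP.

# Route the CLI's requests to the remote LM Studio server over Tailscale
export ANTHROPIC_BASE_URL=http://gpu-server:1234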
To run the recommended models comfortably, your LM Studio server should meet these minimums:
| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA GPU (any CUDA-capable) | NVIDIA with 128 GB unified RAM |
| System RAM | 64 GB | 128 GB DDR5 |
| Storage | 100 GB free NVMe | 2 TB NVMe |
| LM Studio version | v0.4.1 | Latest stable |
The 70B model in particular benefits significantly from 128 GB of unified RAM. With less memory available, the model may be partially offloaded to CPU, which dramatically increases response latency. If your hardware has less than 128 GB, the 32B or 16B models will deliver a much better experience.