# Install and configure LM Studio for local AI inference

> Install and configure LM Studio on your GPU server to serve AI inference requests. This guide covers installation, model loading, and server configuration.

LM Studio can serve an Anthropic-compatible API from your own GPU server. Once it's running, any tool that speaks the `/v1/messages` protocol, including the Claude Code CLI, can send requests to that server instead of the cloud. This guide walks you through installing LM Studio, loading a model, and confirming the server is reachable.

## Prerequisites

* A GPU server with at least 32 GB of VRAM or unified RAM (128 GB recommended for 32B+ models)
* A supported operating system: macOS, Windows, or Linux
* Network access to the server (direct or via Tailscale)

<Steps>
  <Step title="Download LM Studio">
    Go to [lmstudio.ai](https://lmstudio.ai) and download the installer for your server's operating system. You need **version 0.4.1 or later**. Earlier releases do not include the `/v1/messages` endpoint used in this guide.

    <Note>
      LM Studio v0.4.1+ ships with a built-in local server that exposes the Anthropic-compatible `/v1/messages` endpoint. If your installed version is older, update it before continuing.
    </Note>
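
    If you script your server setup, the minimum-version requirement can be checked with a small comparison helper. This is a minimal sketch: `version_at_least` is a hypothetical function name, it assumes your `sort` supports the `-V` (version sort) flag, and you need to supply the installed version string yourself (for example, from the LM Studio about screen).

    ```bash
    # Hypothetical helper: succeeds when version $1 is at least version $2.
    # Relies on `sort -V` (version sort), available in GNU coreutils.
    version_at_least() {
      [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    }

    # Example: replace the first argument with the version you have installed.
    if version_at_least "0.4.1" "0.4.1"; then
      echo "OK: version meets the 0.4.1 minimum"
    else
      echo "Update LM Studio before continuing"
    fi
    ```

    Version sort compares each numeric component separately, so `0.10.0` is correctly treated as newer than `0.4.1`, which a plain string comparison would get wrong.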
  </Step>

  <Step title="Install LM Studio on your server">
    Run the downloaded installer and follow the on-screen prompts. On Linux, the package is distributed as an AppImage:

    ```bash theme={null}
    chmod +x LM_Studio-*.AppImage
    ./LM_Studio-*.AppImage
    ```

    On macOS, drag LM Studio into your Applications folder and open it. On Windows, run the `.exe` installer directly.
  </Step>

  <Step title="Load a model">
    After LM Studio opens, navigate to the **Discover** tab and search for a model to download. The recommended starting point is **Qwen2.5-Coder-32B-Instruct**, a 32B-parameter model tuned for coding and tool use that runs with low latency on hardware with 64 GB or more of memory.

    <Tip>
      If your server has 128 GB of unified RAM, Qwen2.5-Coder-32B-Instruct loads comfortably and responds quickly. For complex reasoning tasks, Llama-3.3-70B-Instruct is a solid alternative at the cost of slightly higher latency. See [Choose the right AI model](/guides/model-selection) for a full comparison.
    </Tip>

    Select a quantized variant (Q4\_K\_M or Q5\_K\_M) to balance quality and speed, then click **Download**. Wait for the download to complete before moving to the next step.
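
    To get a rough sense of whether a quantized variant fits in memory, you can estimate the size of the weights as parameters times bits per weight divided by 8. The sketch below is only an approximation: `estimate_weights_gb` is a hypothetical helper, the bits-per-weight figures (about 4.8 for Q4\_K\_M and 5.7 for Q5\_K\_M) are assumed averages rather than official numbers, and the result excludes the KV cache and runtime overhead.

    ```bash
    # Rough weight-memory estimate in GB: parameters (billions) x bits per weight / 8.
    # Excludes KV cache and runtime overhead, so treat the result as a lower bound.
    estimate_weights_gb() {
      awk -v params="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params * bpw / 8 }'
    }

    # Bits-per-weight values below are approximations, not official figures.
    echo "32B at Q4_K_M (~4.8 bpw): $(estimate_weights_gb 32 4.8) GB"
    echo "32B at Q5_K_M (~5.7 bpw): $(estimate_weights_gb 32 5.7) GB"
    ```

    Leave headroom beyond these figures for the context window, which grows with the number of tokens you keep in a conversation.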
  </Step>

  <Step title="Enable the local server">
    Switch to the **Developer** tab (the `</>` icon in the left sidebar). You will see a **Local Server** panel.

    Configure the server with these settings:

    | Setting      | Value     |
    | ------------ | --------- |
    | Port         | `1234`    |
    | Bind address | `0.0.0.0` |
    | CORS         | Enabled   |

    Setting the bind address to `0.0.0.0` allows connections from other machines on the same network or VPN. This is required if you are connecting from a separate Mac or workstation over Tailscale.

    <Warning>
      Binding to `0.0.0.0` exposes the server on all network interfaces. Use a VPN such as Tailscale or a firewall rule to restrict access to trusted clients only. Do not expose port 1234 directly to the public internet.
    </Warning>
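
    As one possible way to restrict access on a Linux server, the sketch below prints firewall rules that allow port 1234 only on the Tailscale interface. It assumes the `ufw` firewall is in use and that Tailscale uses the interface name `tailscale0`; `print_firewall_rules` is a hypothetical helper, and it only prints the commands so you can review them before running them yourself with `sudo`.

    ```bash
    # Dry run: prints ufw rules instead of applying them, so nothing changes
    # until you review the output and run the commands yourself with sudo.
    # Assumptions: ufw is the firewall and Tailscale uses the interface tailscale0.
    print_firewall_rules() {
      port="$1"
      iface="$2"
      echo "ufw allow in on ${iface} to any port ${port} proto tcp"
      echo "ufw deny ${port}/tcp"
    }

    print_firewall_rules 1234 tailscale0
    ```

    The allow rule is listed first because `ufw` evaluates rules in the order they were added, so traffic arriving over Tailscale matches before the general deny rule does.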

    Click **Start Server**. The status indicator turns green when the server is accepting connections.
  </Step>

  <Step title="Select the loaded model in the server">
    In the **Local Server** panel, open the model selector dropdown and choose the model you downloaded in the **Load a model** step (e.g., `Qwen2.5-Coder-32B-Instruct`). LM Studio loads it into memory and makes it available at the `/v1/messages` endpoint.
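
    To confirm from the command line which models the server currently exposes, you can query the `/v1/models` endpoint that LM Studio's server also provides. This is a minimal sketch: `models_url` and `list_models` are hypothetical helper names, and the call assumes the server is already started on the host and port you pass in.

    ```bash
    # Builds the model-listing URL for a given host and port.
    models_url() {
      echo "http://$1:$2/v1/models"
    }

    # Queries the server for the models it currently exposes.
    # Assumes the server is running and reachable; returns non-zero otherwise.
    list_models() {
      curl --silent --fail --max-time 5 "$(models_url "$1" "$2")"
    }

    # Usage (run on the server itself): list_models localhost 1234
    ```

    The identifiers in the returned JSON are the names the server knows your loaded models by, which is useful if a request is rejected because of an unrecognized `model` value.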
  </Step>

  <Step title="Verify the server with a curl test">
    From any machine that can reach your server, run the following command to send a test message. Replace `<SERVER_IP>` with your server's IP address (or `localhost` if you are testing from the same machine).

    <CodeGroup>
      ```bash Remote (via Tailscale or LAN) theme={null}
      curl http://<SERVER_IP>:1234/v1/messages \
        -H "Content-Type: application/json" \
        -H "x-api-key: local" \
        -d '{
          "model": "local-model",
          "max_tokens": 64,
          "messages": [
            { "role": "user", "content": "Reply with: Server is running." }
          ]
        }'
      ```

      ```bash Local (same machine) theme={null}
      curl http://localhost:1234/v1/messages \
        -H "Content-Type: application/json" \
        -H "x-api-key: local" \
        -d '{
          "model": "local-model",
          "max_tokens": 64,
          "messages": [
            { "role": "user", "content": "Reply with: Server is running." }
          ]
        }'
      ```
    </CodeGroup>

    A successful response follows the Anthropic Messages format and looks similar to this (fields abbreviated):

    ```json theme={null}
    {
      "id": "msg_...",
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Server is running."
        }
      ]
    }
    ```

    If you receive a connection refused error, confirm that the server is started in LM Studio, that the bind address is `0.0.0.0` when you are connecting from another machine, and that your firewall allows traffic on port 1234.
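
    To separate network problems from API problems, it can help to test whether the port accepts TCP connections at all before sending a full request. The sketch below is one way to do that, assuming `bash` and the coreutils `timeout` command are available on the client; `port_open` is a hypothetical helper name.

    ```bash
    # Succeeds when a TCP connection to host:port can be opened within 3 seconds.
    # Uses bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
    port_open() {
      timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
    }

    # Hypothetical usage: substitute your server's address for 127.0.0.1.
    if port_open 127.0.0.1 1234; then
      echo "Port 1234 is reachable"
    else
      echo "Port 1234 is not reachable: check the server status, bind address, and firewall"
    fi
    ```

    If the port is reachable but the curl test still fails, the problem is more likely in the request itself or in the model selection than in the network path.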
  </Step>
</Steps>

## Next steps

With your LM Studio server running, you can connect the Claude Code CLI to it so all inference requests are handled locally. Follow [Connect Claude Code CLI to your local model](/guides/connecting-claude-code) to complete the setup.
