Weekend Project: Building a Local Load Balancer for LLM API Keys

Lately, because I’ve been using various LLM services (OpenAI, Gemini, DeepSeek, etc.) intensively, I’ve run into a very real pain point: being broke.

To save money, I applied for multiple free API keys (like Google Gemini’s Free Tier or DeepSeek’s complimentary credits), but these free keys often come with strict rate limits (RPM/TPM). Just when I’m in the flow writing code, a 429 Too Many Requests error pops up, completely breaking my train of thought. It’s really frustrating.

Scenario & Requirements

My needs are simple:

  1. Multi-Key Round-Robin: I have several keys and want them to be used automatically in rotation. When one is rate-limited, it should automatically switch to the next.
  2. Unified Entry Point: I don’t want to fill in a bunch of keys in each client (Chatbox, Cursor, VSCode plugin). I want to provide just one unified URL, and the backend handles the complex authentication and routing automatically.
  3. Compatibility: It must be fully compatible with the OpenAI format, as almost all tools now support the OpenAI protocol.
  4. Visualization: I want to see which key is used the most, which one frequently reports errors, and which one is still in a cooldown period.

There are many powerful gateways on the market (like OneAPI, NewAPI), but they are too heavy. I don’t need a user system, recharge channels, or complex databases. I just need a small tool that runs locally, preferably a single executable file, or even a macOS App.

So, over the weekend, I wrote a small tool: llm-api-lb.

[Screenshot: the dark-mode API key management interface, titled “llm-key-lb”, showing a form to add new API keys and a list of managed keys with fields for Name, Vendor, Base URL, Model, Weight, Key, Status, and Actions.]

Inspiration & Design

The core idea is essentially a Reverse Proxy.

  1. Intercept: Intercept all requests going to /v1/*.
  2. Schedule: Maintain a list of keys in memory, including the status of each key (enabled, in cooldown, failure count, etc.).
  3. Forward: Pick an available key, replace the Authorization header in the request, and forward it to the upstream (OpenAI/Google/DeepSeek).
  4. Fault Tolerance: If the upstream returns a 429 or 5xx error, mark the key for a “cooldown period” and automatically retry with the next key.
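The four steps above can be sketched as a small retry loop. This is an illustration under assumptions, not the code from llm-api-lb: `pool`, `send`, `forwardWithRetry`, and `MAX_ATTEMPTS` are hypothetical names, header names are assumed to be lowercased (as Node's HTTP server delivers them), and each key's base URL is assumed not to include the `/v1` suffix.

```javascript
// Sketch only: pool, send and MAX_ATTEMPTS are hypothetical names,
// not taken from the llm-api-lb source.
const MAX_ATTEMPTS = 3;

// Statuses that put a key into cooldown, per the steps above (429 or 5xx).
function shouldCoolDown(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Swap in the chosen key, leaving the caller's other headers untouched.
// Assumes header names are already lowercased.
function withKey(headers, apiKey) {
  return { ...headers, authorization: `Bearer ${apiKey}` };
}

// pool.pick() returns { apiKey, baseUrl } or null when no key is available.
// send(url, init) resolves to an object with a numeric `status`;
// the global fetch in Node 18+ has this shape.
async function forwardWithRetry(pool, path, init, send) {
  let last = null;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const entry = pool.pick();
    if (!entry) break; // nothing available right now
    last = await send(entry.baseUrl + path, {
      ...init,
      headers: withKey(init.headers || {}, entry.apiKey),
    });
    if (!shouldCoolDown(last.status)) return last;
    pool.cool(entry); // bench this key and try the next one
  }
  // Either every attempt failed or no key was free.
  return last ?? { status: 503 };
}
```

Passing `send` in as a parameter keeps the loop independent of any particular HTTP client, which also makes it easy to exercise with a fake upstream.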

The tech stack chosen was the simplest: Node.js + Express. Why not Go or Rust? Because I also wanted to write a simple web management interface. Node.js is just so convenient for handling HTTP and JSON, and combining it with pkg to package it into a single file is very easy.

Implementation Process

1. Core Logic

The core logic is fewer than 1,000 lines of code. The most critical parts are “key selection” and “error handling”.

I implemented a simple Round-Robin algorithm, but with a passive cooldown mechanism. Once a key fails a request (429 rate limit or 401 authentication failure), it gets temporarily “sent to the corner” for a period of time (e.g., 1 minute). During this minute, traffic automatically bypasses it.
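A minimal sketch of round-robin with passive cooldown might look like the following. The names `KeyPool`, `pick`, `cool`, and `COOLDOWN_MS` are made up for illustration and are not necessarily what llm-api-lb uses; the one-minute default comes from the example above, and the clock is injectable so the cooldown can be exercised without waiting.

```javascript
// Sketch only: KeyPool, pick, cool and COOLDOWN_MS are hypothetical names.
const COOLDOWN_MS = 60_000; // the "1 minute" example from the text

class KeyPool {
  // `now` is injectable so the cooldown can be tested without real delays.
  constructor(keys, now = Date.now) {
    this.entries = keys.map((key) => ({ key, coolUntil: 0, failures: 0 }));
    this.cursor = 0;
    this.now = now;
  }

  // Walk the ring once, starting after the last key served,
  // and return the first entry that is not cooling down.
  pick() {
    const n = this.entries.length;
    for (let step = 0; step < n; step++) {
      const idx = (this.cursor + step) % n;
      const entry = this.entries[idx];
      if (entry.coolUntil <= this.now()) {
        this.cursor = (idx + 1) % n;
        return entry;
      }
    }
    return null; // every key is currently in the corner
  }

  // Passive cooldown: called only after a request has failed.
  cool(entry, ms = COOLDOWN_MS) {
    entry.failures += 1;
    entry.coolUntil = this.now() + ms;
  }
}
```

Because the cooldown is just a timestamp checked at pick time, no timers are needed: a benched key rejoins the rotation on the first `pick()` after its deadline passes.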

2. Building the macOS App

I wanted it to be more than just a black command-line tool; I wanted a somewhat elegant Menu Bar App.

Using Node.js scripting capabilities combined with macOS system commands, I implemented a “pseudo-packaging” process:

  1. Used pkg to package the Node.js code into a binary executable.
  2. Wrote a minimal Launcher in Swift responsible for calling this binary and managing the tray icon and menu.
  3. Packed them into the standard .app directory structure.

One pitfall I encountered was port conflicts. What if port 8787 on the user’s computer was already taken? I added logic in the Swift launcher: before starting, it probes the port. If it’s occupied, it shows a popup notification or automatically finds a new port.

For a better experience, I also made it persist in the menu bar: clicking the red close button just hides the window, but the program continues running in the background, ready to be woken up from the top menu bar anytime.

[Image: taskbar icon]
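The port check lives in the Swift launcher, but the same idea can be illustrated in Node.js. The sketch below is not the launcher's code: `isPortFree` and `findFreePort` are hypothetical helpers, and a bind test like this is only a heuristic, since another process could still grab the port between the probe and the real start.

```javascript
const net = require('net');

// Resolve true if `port` can be bound on localhost, false if it is taken.
function isPortFree(port, host = '127.0.0.1') {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', (err) => {
      if (err.code === 'EADDRINUSE') resolve(false);
      else reject(err); // unexpected failure, e.g. permission denied
    });
    probe.once('listening', () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}

// Try `preferred`, then the next few ports in order.
// `probe` is injectable so the search logic can be tested without sockets.
async function findFreePort(preferred, tries = 10, probe = isPortFree) {
  for (let port = preferred; port < preferred + tries; port++) {
    if (await probe(port)) return port;
  }
  return null; // caller can fall back to showing a notification
}
```

Calling `findFreePort(8787)` would return 8787 when it is free, or the next free port within the following nine.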

3. Icons & Details

To make it look like a legitimate app, I even drew an icon (my aesthetic sense is high, but ChatGPT’s is limited). A small hiccup was that the icon had white edges, which looked terrible in Dark Mode. So I wrote a small Python script using the PIL (Pillow) library to make the edge pixels transparent. Finally, it looked clean.

4. Monitoring & Visualization

I added a simple monitoring dashboard to the frontend. Using chart.js, I plotted the request count and latency trends for each key. Watching the different colored lines move gives a strange sense of reassurance: I know my keys are working hard, and the load is being evenly distributed.

[Screenshot: a dark-themed monitoring interface. The top table shows data for two keys, g1 and g2, including total requests, successes, failures, and average latency. The bottom section shows a bar chart and a line chart illustrating the trends for g1, g2, and average latency over time.]
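The per-key numbers behind a dashboard like this (total requests, successes, failures, average latency) could be accumulated with a small in-memory counter. This is a sketch under assumptions: `KeyStats` and its field names are hypothetical, and the real tool may track more or store it differently.

```javascript
// Sketch only: KeyStats and its field names are hypothetical.
class KeyStats {
  constructor() {
    this.byKey = new Map();
  }

  // Record one finished request for the key named `name`.
  record(name, ok, latencyMs) {
    const s = this.byKey.get(name) ||
      { total: 0, success: 0, failure: 0, latencySum: 0 };
    s.total += 1;
    if (ok) s.success += 1;
    else s.failure += 1;
    s.latencySum += latencyMs;
    this.byKey.set(name, s);
  }

  // One row per key, in a shape a table or chart could consume.
  snapshot() {
    return [...this.byKey].map(([name, s]) => ({
      name,
      total: s.total,
      success: s.success,
      failure: s.failure,
      avgLatencyMs: s.total ? s.latencySum / s.total : 0,
    }));
  }
}
```

The proxy would call `record` once per upstream response, and the frontend could poll `snapshot()` through a small JSON endpoint to feed the charts.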

Conclusion

This project isn’t technically sophisticated, but it solved my own pain point. Now when I write code, I set the Base URL to http://localhost:8787/v1 and fill in any random key. The backend automatically bounces between Gemini’s free tier and DeepSeek, and I see far fewer 429 errors.
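For clients that are not off-the-shelf tools, pointing a script at the proxy is just an ordinary OpenAI-style request. The helper below is a hypothetical sketch (`buildChatRequest` is not part of llm-api-lb); it assumes the standard `/chat/completions` path under the `/v1` base URL, and the model name is left to the caller.

```javascript
// Sketch only: buildChatRequest is a hypothetical helper.
function buildChatRequest(prompt, model, baseUrl = 'http://localhost:8787/v1') {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        // Any placeholder works here; the proxy swaps in a real key.
        authorization: 'Bearer any-random-key',
      },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}
```

With Node 18+ the result can be passed straight to the built-in client: `const { url, init } = buildChatRequest('Hello', model); const res = await fetch(url, init);`.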

If you have similar troubles, or are interested in packaging Node.js into a desktop application, feel free to check out the source code on GitHub.

GitHub: https://github.com/weidussx/llm-api-lb

Happy Coding! 🚀

