Weekend Project: Building a Local Load Balancer for LLM API Keys
Lately, because I’ve been using various LLM services (OpenAI, Gemini, DeepSeek, etc.) intensively, I’ve run into a very real pain point: being broke.
To save money, I applied for multiple free API keys (like Google Gemini’s Free Tier or DeepSeek’s complimentary credits), but these free keys often come with strict rate limits (RPM/TPM). Just when I’m in the flow writing code, a 429 Too Many Requests error pops up, completely breaking my train of thought. It’s really frustrating.
Scenario & Requirements
My needs are simple:
- Multi-Key Round-Robin: I have several keys and want them to be used automatically in rotation. When one is rate-limited, it should automatically switch to the next.
- Unified Entry Point: I don’t want to fill in a bunch of keys in each client (Chatbox, Cursor, VSCode plugin). I want to provide just one unified URL, and the backend handles the complex authentication and routing automatically.
- Compatibility: It must be fully compatible with the OpenAI format, as almost all tools now support the OpenAI protocol.
- Visualization: I want to see which key is used the most, which one frequently reports errors, and which one is still in a cooldown period.
There are many powerful gateways on the market (like OneAPI, NewAPI), but they are too heavy. I don’t need a user system, payment and top-up channels, or a complex database. I just need a small tool that runs locally, preferably as a single executable file, or even a macOS App.
So, over the weekend, I wrote a small tool: llm-api-lb.

Inspiration & Design
The core idea is essentially a Reverse Proxy.
- Intercept: Intercept all requests going to /v1/*.
- Schedule: Maintain a list of keys in memory, including the status of each key (enabled, in cooldown, failure count, etc.).
- Forward: Pick an available key, replace the Authorization header in the request, and forward it to the upstream (OpenAI/Google/DeepSeek).
- Fault Tolerance: If the upstream returns a 429 or 5xx error, mark the key for a “cooldown period” and automatically retry with the next key.
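The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from llm-api-lb: the names forwardWithRetry, send, secret, and cooldownUntil are hypothetical, the 60-second cooldown is an assumed value, and the Bearer scheme is assumed because the upstreams speak the OpenAI protocol. The send function stands in for whatever actually performs the upstream HTTP call.

```javascript
// Hypothetical sketch: none of these names come from the llm-api-lb source.
const COOLDOWN_MS = 60_000; // assumed cooldown length

// 429 and 5xx are the statuses that trigger a cooldown and a retry.
function isRetryable(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// keys:    [{ secret, enabled, cooldownUntil, failures }]
// request: { headers, ... } as received from the client (header names lowercased)
// send:    async function that performs the upstream call and resolves to { status, ... }
async function forwardWithRetry(keys, request, send, now = Date.now) {
  let lastResponse = null;
  for (const key of keys) {
    // Schedule: skip keys that are disabled or still cooling down.
    if (!key.enabled || key.cooldownUntil > now()) continue;

    // Forward: replace whatever Authorization header the client sent.
    const headers = { ...request.headers, authorization: `Bearer ${key.secret}` };
    const response = await send({ ...request, headers });
    if (!isRetryable(response.status)) return response;

    // Fault tolerance: put the key in cooldown and try the next one.
    key.failures += 1;
    key.cooldownUntil = now() + COOLDOWN_MS;
    lastResponse = response;
  }
  // Every key was unavailable or failed: return the last upstream error, if any.
  return lastResponse ?? { status: 503, body: "no API key available" };
}
```

Passing send and now as parameters keeps the sketch testable without a network; in an Express handler, send would wrap the real upstream request.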
The tech stack chosen was the simplest: Node.js + Express.
Why not Go or Rust? Because I also wanted to write a simple web management interface. Node.js is just so convenient for handling HTTP and JSON, and combining it with pkg to package it into a single file is very easy.
Implementation Process
1. Core Logic
The core logic is fewer than 1,000 lines of code. The most critical parts are “key selection” and “error handling”.
I implemented a simple Round-Robin algorithm, but with a passive cooldown mechanism. Once a key fails a request (429 rate limit or 401 authentication failure), it gets temporarily “sent to the corner” for a period of time (e.g., 1 minute). During this minute, traffic automatically bypasses it.
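To make the “round-robin plus passive cooldown” idea concrete, here is a small sketch of how such a picker could look. It is an illustration under assumptions, not the project’s actual code: the KeyPool class, its method names, and the one-minute default are all hypothetical.

```javascript
// Hypothetical sketch of round-robin selection with a passive cooldown.
class KeyPool {
  constructor(secrets, cooldownMs = 60_000) {
    this.keys = secrets.map((secret) => ({ secret, cooldownUntil: 0, failures: 0 }));
    this.cooldownMs = cooldownMs;
    this.cursor = 0; // index of the key to try first on the next pick
  }

  // Return the next key that is not cooling down, or null if all are.
  pick(now = Date.now()) {
    for (let i = 0; i < this.keys.length; i++) {
      const index = (this.cursor + i) % this.keys.length;
      const key = this.keys[index];
      if (key.cooldownUntil <= now) {
        this.cursor = (index + 1) % this.keys.length; // rotate past the chosen key
        return key;
      }
    }
    return null;
  }

  // "Passive" cooldown: a key is only benched after a request with it fails.
  reportFailure(key, now = Date.now()) {
    key.failures += 1;
    key.cooldownUntil = now + this.cooldownMs;
  }
}
```

The cooldown is passive in the sense that nothing probes the keys in the background. A benched key simply becomes eligible again once its timestamp has passed, so no timers are needed.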
2. Building the macOS App
I wanted it to be more than just a bare terminal window; I wanted a somewhat elegant Menu Bar App.
Using Node.js scripting capabilities combined with macOS system commands, I implemented a “pseudo-packaging” process:
- Used pkg to package the Node.js code into a binary executable.
- Wrote a minimal Launcher in Swift responsible for calling this binary and managing the tray icon and menu.
- Packed them into the standard .app directory structure.
One pitfall I encountered was port conflicts. What if port 8787 on the user’s computer was already taken?
I added logic in the Swift launcher: before starting, it probes the port. If it’s occupied, it shows a popup notification or automatically finds a new port.
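The launcher itself is written in Swift, but the probe-then-fall-back idea is easy to show in Node.js as well. The sketch below is an assumed equivalent, not a port of the launcher: the function names, the 127.0.0.1 host, and the limit of 20 attempts are hypothetical choices.

```javascript
const net = require("net");

// Resolve to true if we can bind the port ourselves, i.e. nobody else holds it.
function isPortFree(port, host = "127.0.0.1") {
  return new Promise((resolve) => {
    const probe = net.createServer();
    probe.once("error", () => resolve(false)); // e.g. EADDRINUSE
    probe.once("listening", () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}

// Try the preferred port first, then walk upward until a free one is found.
async function findFreePort(preferred, attempts = 20) {
  for (let port = preferred; port < preferred + attempts; port++) {
    if (await isPortFree(port)) return port;
  }
  throw new Error(`no free port in ${preferred}..${preferred + attempts - 1}`);
}
```

Calling findFreePort(8787) would return 8787 when it is free and the next free port above it otherwise. Note that a probe like this is inherently racy: another process can grab the port between the check and the real bind, so the server should still handle a bind error.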
For a better experience, I also made it persist in the menu bar: clicking the red close button just hides the window, but the program continues running in the background, ready to be woken up from the top menu bar anytime.

3. Icons & Details
To make it look like a legitimate app, I even drew an icon (my aesthetic sense is high, but ChatGPT’s is limited). A small hiccup was that the icon had white edges, which looked terrible in Dark Mode. So I wrote another Python script using the PIL library to process the edge pixels for transparency. Finally, it looked clean.
4. Monitoring & Visualization
I added a simple monitoring dashboard to the frontend.
Using Chart.js, I plotted the request count and latency trends for each key. Watching the different colored lines move gives a strange sense of reassurance: I know my keys are working hard, and the load is being evenly distributed.
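As a rough idea of how per-key statistics can be turned into one line per key, the sketch below builds a Chart.js line-chart configuration from a flat list of samples. The sample shape (key, minute, requests) and the function name are assumptions made for illustration; the real dashboard may collect and shape its data differently.

```javascript
// Hypothetical input: [{ key: "gemini-1", minute: "12:00", requests: 4 }, ...]
function toLineChartConfig(samples) {
  // X axis: the distinct time buckets, in order of first appearance.
  const labels = [...new Set(samples.map((s) => s.minute))];

  // One series per key, with zeros for buckets where the key saw no traffic.
  const byKey = new Map();
  for (const s of samples) {
    if (!byKey.has(s.key)) byKey.set(s.key, new Array(labels.length).fill(0));
    byKey.get(s.key)[labels.indexOf(s.minute)] += s.requests;
  }

  const datasets = [...byKey].map(([label, data]) => ({ label, data }));
  return { type: "line", data: { labels, datasets } };
}

// In the browser, with Chart.js loaded:
//   new Chart(canvasElement, toLineChartConfig(samples));
```

The same shape works for latency by swapping the requests field for an average latency per bucket.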

Conclusion
This project isn’t technically sophisticated, but it solved my own pain point.
Now when I write code, I set the Base URL to http://localhost:8787/v1 and fill in any random key. The backend automatically bounces between Gemini’s free tier and DeepSeek, and I see far fewer 429 errors.
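For a client that lets you script requests, the setup amounts to something like the sketch below. It assumes the standard OpenAI-style chat/completions path under /v1; the model name and the placeholder key are made up, since the proxy swaps in a real key before forwarding.

```javascript
const BASE_URL = "http://localhost:8787/v1";

// Build an OpenAI-style chat request aimed at the local proxy.
// "your-model-name" is a placeholder; use whatever model your upstream serves.
function buildChatRequest(prompt, model = "your-model-name") {
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Any value works here: the proxy replaces this header with a real key.
        Authorization: "Bearer any-placeholder-key",
      },
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    },
  };
}

// With the proxy running (Node.js 18+ ships a global fetch):
//   const { url, init } = buildChatRequest("Hello!");
//   const response = await fetch(url, init);
```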
If you have similar troubles, or are interested in packaging Node.js into a desktop application, feel free to check out the source code on GitHub.
GitHub: https://github.com/weidussx/llm-api-lb
Happy Coding! 🚀