Multi-Model Orchestration on a 2017 iMac

Running a competent AI agent locally is a balancing act. You need high reasoning capability to route tasks and make decisions, which requires a massive model. But you also need lightning-fast inference to handle actual chat interactions and tool-use, which requires a smaller model. On a datacenter, you just spin up more A100s. On a 2017 iMac with 40GB of system RAM but a tiny 4GB Radeon Pro 575 GPU, you have to get creative.

The Split-Compute Architecture

To bypass the VRAM limitations, we built a dual-node llama-server pipeline running concurrently on the same machine, orchestrated by the OpenClaw framework.

Node 1: The Orchestrator

Model: 26B Orchestrator
Hardware: Locked to CPU (-ngl 0)
Context: 32,768 tokens
Role: The brain. It handles complex routing, tool decisions, and state management. Runs slowly but only outputs a few tokens.

Node 2: The Subagent

Model: 4B Concierge (Fine-Tuned)
Hardware: Full GPU Offload (-ngl 99)
Context: 8,192 tokens
Role: The concierge. It holds the domain knowledge and talks to the user rapidly via Vulkan/MoltenVK.

Overcoming Hardware Hurdles

Building this pipeline exposed several deep technical issues that you don't encounter when using cloud APIs:

1. Context Window Exhaustion

Initially, our Orchestrator was crashing due to context window exhaustion (hitting the default 16K limit). Because the Orchestrator has to read the entire chat history and the outputs of the Subagent, the context grew massively fast. We successfully bumped the Orchestrator's context window to 32K without triggering system OOM (Out of Memory) errors, stabilizing the pipeline for long-running sessions.

2. The Tokenizer Bug

During our testing of the 26B Orchestrator, we hit a critical wall. The model would start reasoning correctly, and then suddenly start outputting endless strings of <unused50> tokens. We tracked this down to a broken tokenizer inside the initial model exports for the 26B model. We had to pivot our architecture to use alternative quantized versions instead to restore stability.

3. The Telegram Gateway & Swarm Reality

With the dual-node setup running stably, we connected the OpenClaw Orchestrator to a live Telegram bot. The Orchestrator intercepts the user's message, decides if it needs to use its DuckDuckGo web search tool, and then delegates the final response generation to the fast, GPU-accelerated E4B Subagent.

"We achieved an autonomous, multi-model agentic swarm running overnight without a single API key or cloud instance."