Server cost math: what $15,000 per month actually buys in AI infrastructure

April 26, 2026 · Claudiu · 5 min read

When I tell people NEXUS runs on $15,000 a month of infrastructure, I get two reactions. Technical people nod: that sounds about right for the work. Non-technical people wince: why commit to that before launch?

This post is for both. I want to be straight about what $15k a month actually buys, why we chose to own the hardware instead of renting it, and how the numbers hold up against a $19.99 subscription.

What we run

Five dedicated servers, each built for local LLM inference. Roughly, per box:

GPU: high-memory cards with 24-48GB of VRAM each, depending on the workload tier, with enough on the top tier to host 70B-parameter models in quantized form
RAM: 256-512GB of DDR5 (serving LLMs is memory-hungry)
Storage: 4TB of NVMe for weights and cache, plus 16TB of bulk for logs and telemetry
Networking: 10-40 Gbps NICs for inter-node traffic during council debate and parallel multi-agent calls

These aren't gaming rigs. They're purpose-built inference servers, and each one can serve roughly 10-30 concurrent Ollama calls depending on model size and context length.
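As a rough sanity check on the VRAM figures above, weight memory is approximately parameter count times bits per weight, divided by eight. The sketch below covers weights only; the KV cache and runtime overhead come on top. The 4-bit and 8-bit settings are illustrative assumptions, not a statement of the quantization NEXUS actually uses.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Illustrative model sizes and quantization levels (assumptions, not NEXUS config).
for params in (7, 13, 70):
    for bits in (4, 8):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B at {bits}-bit: ~{gb:.1f} GB of weights")
```

By this estimate a 70B model at 4-bit needs about 35 GB for weights alone, so it fits on a 48GB card with some room for cache but would have to be split across two 24GB cards.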

Why five, not one

A single very large server could host the whole workload in theory. We chose five for three reasons.

Redundancy. If one box dies, the other four absorb the load while it gets repaired or replaced, so users shouldn't see an outage. On a single-box setup, one hardware failure takes the whole site down.

Council parallelism. The council mechanism runs three to five specialists at once on hard judgment calls. Putting them on separate physical boxes gives you true concurrency instead of time-sliced pretend-concurrency, and latency drops.

Model tiering. Different boxes host different model sizes: a 7B model for fast lookups, a 13B for mid-complexity work, a 70B for high-stakes reasoning. Routing sends each subtask to the right box.
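To make the tiering idea concrete, here is a minimal routing sketch. The three model sizes come from the paragraph above; the host names, the complexity labels, and the fall-back-to-mid rule are hypothetical and only there for illustration, not a description of the real router.

```python
from dataclasses import dataclass

# Hypothetical tier table: model sizes are from the post,
# host names and complexity labels are invented for this sketch.
TIERS = {
    "lookup": ("7B", "box-small"),
    "mid": ("13B", "box-mid"),
    "high_stakes": ("70B", "box-large"),
}


@dataclass(frozen=True)
class Route:
    model_size: str
    host: str


def route_subtask(complexity: str) -> Route:
    """Pick a model tier for a subtask; unknown labels fall back to the mid tier."""
    model_size, host = TIERS.get(complexity, TIERS["mid"])
    return Route(model_size, host)


print(route_subtask("lookup"))
print(route_subtask("high_stakes"))
```

A lookup table keeps the policy easy to read and change; a real router would presumably also weigh queue depth and context length.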

Buy vs rent: the math we ran

The obvious question is why not just rent GPUs from AWS, Runpod, or Lambda. We took it seriously. Here's why we landed on owning.

Rental at our scale. An equivalent five-server GPU setup from a cloud provider runs roughly $22k-35k a month, depending on reservation terms and model family. The headline hourly rate looks cheap right up until you multiply it by 720 hours, a full 30-day month, across five servers.

Ownership. Buy the hardware once (~$180k-220k capex). Host it in a colocation facility with business-grade power and connectivity (~$4k-6k a month). Add blended staff time for maintenance (~$3k-5k a month). All in, with the hardware depreciated over three years, that's about $13-16k a month. Break-even against rental lands around month 14-18, and we plan to run these servers for four to five years minimum.
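The ownership arithmetic can be sketched in a few lines. The input ranges are the ones quoted above. Straight-line depreciation over 36 months and a break-even measured against the cheapest rental quote with mid-range running costs are my assumptions about how the figures fit together. Run at the extremes, the all-in number comes out slightly wider ($12k to about $17.1k) than the rounded $13-16k.

```python
def owned_monthly_cost(capex: float, colo: float, staff: float,
                       depreciation_months: int = 36) -> float:
    """Monthly cost of owning: straight-line depreciation plus colo and staff."""
    return capex / depreciation_months + colo + staff


def break_even_months(capex: float, rental: float, colo: float, staff: float) -> float:
    """Months until the cash saved versus renting has paid back the hardware."""
    monthly_saving = rental - (colo + staff)
    if monthly_saving <= 0:
        raise ValueError("owning never pays back if running costs exceed rental")
    return capex / monthly_saving


# Low and high ends of the ranges quoted above.
low = owned_monthly_cost(180_000, 4_000, 3_000)
high = owned_monthly_cost(220_000, 6_000, 5_000)
print(f"owned, all in: ${low:,.0f} to ${high:,.0f} per month")

# Assumption: compare against the cheapest rental quote ($22k) with
# mid-range colo ($5k) and staff ($4k) costs.
for capex in (180_000, 220_000):
    months = break_even_months(capex, rental=22_000, colo=5_000, staff=4_000)
    print(f"capex ${capex:,}: break-even after ~{months:.1f} months")
```

Against the low-end rental quote this gives roughly 14 to 17 months. Against pricier rental the payback is faster.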

Determinism. Cloud GPU availability is seasonal. When demand spikes (launch windows, research crunches, enterprise migrations), spot prices climb and capacity sometimes vanishes. Owned hardware stays available at the same cost.

Data sovereignty. We control where the data lives, which matters for EU customers, for enterprise prospects, and for whatever compliance posture comes next. Owned hardware in a known location keeps that story simple.

No surprise bills. Rental exposes you to runaway cost if a bug loops and burns 10x the expected tokens. Owned hardware has a fixed monthly ceiling, and when you're a small team, predictable burn beats marginal cost optimization.

What's NOT on our servers

An important clarification: we don't try to host every model ourselves. The owned fleet handles:

Free-tier model access for Eco users (open-source Llama-family models)
Council-debate infrastructure (the coordination plane)
The quantum cloning memory layer (databases, not inference)
Routing, orchestration, and PM agents (lightweight models, fast inference)

What does not run on our servers: GPT-4, Claude Opus, Claude Sonnet, Gemini Pro, or any other closed premium model. Those run on OpenAI, Anthropic, and Google infrastructure, and users bring their own keys to call them. Our job is to route, coordinate, and remember. Theirs is to run the giant models.

That's the core BYOK split, and it's why our server bill stays at $15k no matter how many premium calls our users make. The cost of premium inference doesn't grow on our side as we add customers; it lands on each customer's own API invoice.
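In code terms, the split amounts to a dispatch decision on every call. This is a minimal sketch under stated assumptions: the model names, the backend labels, and the choose_backend helper are all hypothetical, and only the owned-versus-BYOK rule comes from the text above.

```python
from typing import Optional

# Hypothetical names for the open models served from the owned fleet.
SELF_HOSTED_MODELS = {"llama-7b", "llama-13b", "llama-70b"}


def choose_backend(model: str, user_api_key: Optional[str] = None) -> str:
    """Return where a call runs: on the owned fleet or on the user's own key."""
    if model in SELF_HOSTED_MODELS:
        return "owned-fleet"  # fixed cost, covered by the server bill
    if not user_api_key:
        raise PermissionError(f"{model} is a premium model and needs a user-supplied key")
    return "provider-via-user-key"  # billed to the user's own provider account


print(choose_backend("llama-70b"))
print(choose_backend("premium-model", user_api_key="user-key-placeholder"))
```

Because premium calls never fall through to the owned fleet, their volume has no path onto the server bill.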

Unit economics at a few scales

500 active subscribers. Revenue: 500 × $20 = $10k a month (Power-tier average). Server cost: $15k. We're losing $5k a month. Early, and normal.

1,500 active subscribers. Revenue: $30k a month on Power alone. Server cost: $15k. Margin: $15k, or 50%. Mix in the God Mode tier ($299) and it climbs much higher: a strict 80/20 Power/God split works out to about $114k a month at this user count (1,500 × (0.8 × $20 + 0.2 × $299)), and a more conservative mix still lands around $80-100k.

5,000 active subscribers. Blended revenue: $300-400k a month, depending on tier mix. Server cost still $15k (maybe $18k with a sixth box). Margin: over 90%. This is the good zone.

The key shape: BYOK keeps our cost curve flat while revenue scales linearly with users. Most SaaS is the opposite: revenue scales, but so does cost, because they're absorbing the API fees. Our margin improves with scale instead of getting squeezed.
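The scenarios above reduce to a small model: a flat server bill against revenue that grows with subscriber count and tier mix. In this sketch the Power price is rounded to $20 as in the examples, and I assume the $299 God Mode price is monthly; the God Mode share is a free parameter, not a forecast.

```python
POWER_PRICE = 20      # rounded from $19.99, as in the scenarios above
GOD_PRICE = 299       # assumed to be a monthly price
SERVER_COST = 15_000  # flat monthly server bill


def monthly_revenue(subscribers: int, god_share: float = 0.0) -> float:
    """Blended monthly revenue for a given share of God Mode subscribers."""
    blended_price = (1 - god_share) * POWER_PRICE + god_share * GOD_PRICE
    return subscribers * blended_price


def margin(revenue: float, cost: float = SERVER_COST) -> float:
    """Margin against the server bill, as a fraction of revenue."""
    return (revenue - cost) / revenue


for subs, share in [(500, 0.0), (1_500, 0.0), (1_500, 0.2), (5_000, 0.2)]:
    rev = monthly_revenue(subs, share)
    print(f"{subs:>5} subs, {share:.0%} God Mode: ${rev:,.0f}/mo, margin {margin(rev):.0%}")
```

At a strict 80/20 split, 1,500 subscribers come to about $113.7k a month and 5,000 to about $379k. Smaller God Mode shares pull the 1,500-subscriber figure down toward the range quoted above.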

Where the pre-order money goes

When I talk about the waitlist and the pre-order offer, here's what the money is actually for: the server bill. That's basically it.

At 500 pre-orders (50% off the $240 annual plan, plus a God Mode mix) we cover maybe six to eight months of servers. At 1,500 pre-orders, twelve to sixteen months. At 2,500-plus, eighteen to twenty-four months, with runway to hire.

There's no version of this where we raise $5M to subsidize the product. There's a version where pre-orders fund the servers for a year or two while we onboard enough subscribers to cover the bill from recurring revenue alone. It's a narrow path. It's also the honest one.

Next: "Why AI workflows fail: the 5 quiet killers of most automation projects", and why orchestration addresses every one of them. If you've ever watched an AI project die quietly in production, this one's for you.