Server cost math: what $15,000 per month actually buys in AI infrastructure
When I tell people NEXUS PRIME's infrastructure costs $15,000 per month, I get two reactions.
Technical people go "okay, that sounds about right for what you're doing." Non-technical people go "that's insane, why would you commit to that before launch."
This post is for both groups. I want to be transparent about what $15k a month actually buys, why we chose to own the hardware instead of renting it, and how the economics shake out when the subscription is $19.99/month.
What we run
Five dedicated servers, each built for local LLM inference. The rough spec per box:
- GPU: high-memory cards capable of hosting 70B-parameter models in quantized form, with 24-48GB of VRAM per card depending on the workload tier
- RAM: 256-512GB DDR5 per box (LLM serving is memory-heavy)
- Storage: 4TB NVMe for model weights + cache, plus 16TB of bulk for logs and training telemetry
- Networking: 10-40 Gbps NICs for inter-node traffic during council debate and multi-agent parallel calls
These are not consumer gaming rigs. They are purpose-built inference servers. Each one can serve 10-30 concurrent Ollama model calls depending on the model size and context length.
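To sanity-check the "quantized 70B fits" claim in the spec list, here's the rough arithmetic. A minimal sketch with round numbers; real provisioning depends on context length, batch size, and quantization format.

```python
# Back-of-envelope VRAM estimate for a quantized model. Round numbers only;
# real usage depends on context length, batch size, and quantization format.
def model_vram_gb(params_billions: float, bits_per_weight: int,
                  overhead_gb: float = 4.0) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers."""
    # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB: the 1e9s cancel.
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb + overhead_gb

print(model_vram_gb(70, 4))  # ~39 GB: a 4-bit 70B model fits a 48GB card, not a 24GB one
print(model_vram_gb(7, 4))   # ~7.5 GB: a 4-bit 7B model fits almost anywhere
```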
Why 5 and not 1
A single very large server could in theory host the full workload. We chose five for three reasons.
Redundancy. If one box fails, the other four absorb load while the fifth gets repaired or replaced. NEXUS users don't see an outage. On a single-box setup, one hardware failure is a site-wide outage.
Council debate parallelism. The council mechanism runs 3-5 specialist agents simultaneously on hard judgment calls. Running those on separate physical boxes means true concurrency, not time-sliced pseudo-concurrency. Latency drops.
Model tiering. Different boxes host different model sizes. A 7B-param fast model for quick lookups. A 13B general-purpose box for mid-complexity work. A 70B-param reasoning box for high-stakes inference. Routing picks the right box for the right subtask.
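A toy version of that routing decision looks something like the sketch below. The complexity score, tier thresholds, model tags, and host names are hypothetical placeholders; the real router presumably weighs more signals than a single score.

```python
# Minimal tier-routing sketch. Tier names, model tags, and host addresses are
# illustrative placeholders, not the production router or its real endpoints.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str  # Ollama model tag served on that box (placeholder values)
    host: str   # which physical box handles this tier (placeholder values)

TIERS = [
    Tier("fast",      "fast-7b",       "http://box-1:11434"),
    Tier("general",   "general-13b",   "http://box-2:11434"),
    Tier("reasoning", "reasoning-70b", "http://box-3:11434"),
]

def route(complexity: float) -> Tier:
    """Pick a tier from a 0-1 complexity score produced by upstream task analysis."""
    if complexity < 0.3:
        return TIERS[0]  # quick lookups -> 7B box
    if complexity < 0.7:
        return TIERS[1]  # mid-complexity work -> 13B box
    return TIERS[2]      # high-stakes inference -> 70B box

print(route(0.85).host)  # -> http://box-3:11434
```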
Buy vs rent: the math we ran
The natural question is "why not just rent GPUs from AWS / Runpod / Lambda?" We considered it seriously. Here is why we landed on owned.
Rental cost at our scale. An equivalent 5-server GPU rental setup on a hyperscaler clocks in at roughly $22k-35k per month depending on reservation terms and GPU family. The headline hourly rate looks cheap until you multiply it by 720 hours a month across five machines.
Ownership cost. Buy the hardware once (~$180k-220k capex). Host it in a colocation facility with business-grade power and connectivity (~$4k-6k/month opex). Add staff time for maintenance (~$3k-5k/month blended). All-in, we land around $13-16k monthly, depreciating the hardware over three years.
Break-even vs rental: roughly month 14-18. We plan to run these for 4-5 years minimum.
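If you want to reproduce the break-even number, the shape of the calculation is below. The specific inputs are one illustrative pick from the ranges quoted above, not the actual purchase orders.

```python
# Break-even sketch using the figures quoted above (capex ~$180k-220k,
# owned opex ~$7k-11k/month, rental ~$22k-35k/month). One illustrative pick:
capex          = 200_000  # one-time hardware purchase
owned_monthly  = 9_000    # colo ($4k-6k) + maintenance time ($3k-5k), midpoints
rental_monthly = 22_000   # low end of the equivalent rental range, to be conservative

# Owned = capex up front plus opex every month; rental = flat monthly fee.
# Break-even is the first month where cumulative owned cost drops below rental.
month = 0
while capex + owned_monthly * month >= rental_monthly * month:
    month += 1
print(month)  # 16 with these inputs, consistent with the rough 14-18 month figure above
```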
Predictable capacity. Cloud GPU availability fluctuates with demand. When demand spikes (launch windows, research crunches, enterprise migrations), spot prices climb and sometimes capacity disappears entirely. Owned hardware doesn't.
Data sovereignty. We control where the data lives. That matters for EU customers, for enterprise prospects, and for any future compliance posture. On owned hardware in a known location, the story is simple.
No surprise bills. Rental exposes you to runaway cost if a bug causes a loop that burns 10x expected tokens. Owned hardware has a fixed monthly ceiling. When you're a small team, predictable burn is more valuable than marginal cost optimization.
What's NOT on our servers
Important clarification. NEXUS PRIME does not try to host every model ourselves. The owned fleet handles:
- Free-tier model access for Eco users (open-source Llama-family models)
- Council debate infrastructure (the coordination plane)
- The quantum cloning memory layer (databases, not inference)
- Routing, orchestration, and PM agents (lightweight models, fast inference)
What does NOT run on our servers: GPT-4, Claude Opus, Claude Sonnet, Gemini Pro, or any other closed-source premium model. Those run on OpenAI/Anthropic/Google infrastructure, and users bring their own API keys to call them. Our job is to route, coordinate, and remember. Their job is to run the giant models.
This is the core BYOK split. It's also why our server bill stays at $15k regardless of how many premium model calls our users make. Premium inference cost doesn't scale with our customer count — it scales with the customer's OWN API invoice.
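In code, the BYOK split looks roughly like the sketch below. The function and field names are hypothetical stand-ins; the point is which bill each call lands on.

```python
# Hypothetical sketch of the BYOK split. Function and field names are
# illustrative; the point is which bill each call lands on.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    prompt: str
    needs_premium_model: bool

@dataclass
class User:
    byok_key: Optional[str]  # the user's own OpenAI/Anthropic/Google API key

def call_vendor_api(task: Task, api_key: str) -> str:
    # Stand-in: in reality this hits the vendor's API, metered against api_key,
    # i.e. the user's own invoice, not ours.
    return f"[premium answer, billed to the user's key] {task.prompt}"

def call_local_ollama(task: Task) -> str:
    # Stand-in: in reality this hits an Ollama endpoint on the owned fleet,
    # inside the fixed ~$15k/month server bill.
    return f"[open-source answer, on our own boxes] {task.prompt}"

def run_task(task: Task, user: User) -> str:
    if task.needs_premium_model and user.byok_key:
        return call_vendor_api(task, api_key=user.byok_key)
    return call_local_ollama(task)

print(run_task(Task("summarize this contract", True), User(byok_key="sk-user-owned")))
```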
Unit economics at different scales
Let's do the math at a few scales.
Scale 1: 500 active subscribers. Revenue: 500 × $20 = $10k/month (Power tier average). Server cost: $15k. We're losing $5k/month. Early. Normal.
Scale 2: 1,500 active subscribers. Revenue: $30k/month on the Power tier alone. Server cost: $15k. Margin: $15k, or 50%. Blend in the God Mode tier ($299) at an 80/20 Power/God mix and revenue at this user count is closer to $110k/month (1,200 × $20 + 300 × $299), with margin climbing well past 80%.
Scale 3: 5,000 active subscribers. Revenue (blended): $300-400k/month depending on tier mix. Server cost still $15k (maybe we add a 6th box: $18k). Margin: over 90%. This is the good zone.
The key insight: BYOK means our cost curve is flat, while our revenue curve scales linearly with users. Most SaaS has the opposite shape — revenue scales linearly but so does cost (because they're absorbing API fees). Our margin improves dramatically with scale instead of getting squeezed.
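Here's the same scaling math as a script you can poke at. Prices and the 80/20 mix are the assumptions already stated above, and "margin" here only nets out the server bill, nothing else.

```python
# Reproduce the scaling scenarios above. Tier prices and the 80/20 mix are the
# same assumptions used in the prose; "margin" is gross of the server bill only.
def monthly_numbers(subscribers: int, god_share: float,
                    power_price: float = 19.99, god_price: float = 299.0,
                    server_cost: float = 15_000) -> tuple[float, float]:
    revenue = subscribers * ((1 - god_share) * power_price + god_share * god_price)
    return revenue, (revenue - server_cost) / revenue

scenarios = [(500, 0.0),     # scale 1: Power tier only
             (1_500, 0.20),  # scale 2: 80/20 Power/God mix
             (5_000, 0.20)]  # scale 3: same mix
for subs, god_share in scenarios:
    revenue, margin = monthly_numbers(subs, god_share)
    print(f"{subs:>5} subs: ${revenue:>9,.0f}/mo revenue, {margin:>5.0%} server-cost margin")
# ->   500 subs: $    9,995/mo revenue,  -50% server-cost margin
# -> 1,500 subs: $  113,688/mo revenue,   87% server-cost margin
# -> 5,000 subs: $  378,960/mo revenue,   96% server-cost margin
```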
Where the pre-order money goes
When I talk about the 45-day waitlist and the pre-order offer, here is what the money is actually for:
The server bill. That's basically it.
At 500 pre-orders of the discounted annual plan (50% off the $240 annual, plus some God Mode mix) we cover maybe 6-8 months of servers. At 1,500 pre-orders we cover 12-16 months. At 2,500+ pre-orders we cover 18-24 months and have runway to hire.
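For completeness, here's the floor version of that runway math, counting only discounted Power annual pre-orders; the God Mode mix is what lifts the numbers into the ranges above.

```python
# Floor version of the runway math: discounted Power annual only ($120 = 50% off
# $240), ignoring the God Mode mix that lifts the average order value.
MONTHLY_SERVER_BILL = 15_000

def months_covered(pre_orders: int, avg_order_value: float = 120.0) -> float:
    return pre_orders * avg_order_value / MONTHLY_SERVER_BILL

for n in (500, 1_500, 2_500):
    print(f"{n:>5} pre-orders -> {months_covered(n):.0f} months of servers (floor)")
# -> 4, 12, and 20 months as a Power-only floor; the 6-8 / 12-16 / 18-24 ranges
#    above assume some share of God Mode pre-orders raising the average.
```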
That's the math. There is no plan where we raise $5M to subsidize the product. There is a plan where pre-orders fund the servers for 12-24 months, during which we onboard enough subscribers to cover the bill from recurring revenue alone.
It is a narrow path. It is also the honest one.
Next post: "Why AI workflows fail: the 5 quiet killers of most automation projects" — and why orchestration addresses every one of them. If you have ever watched an AI project silently die in production, this post is for you.