2 months closed-source for speed, then 12 months self-hosted optimization. Every assumption below is live.
Grand total (14 mo)
$14,742
Compute + tooling only
$14,742
excl. salaries
Cost per file
$2.95
5,000 files
Cost per page
$0.0295
100 pg/file
01Workload
02OCR — Gemini (always-on)
03Per-step model — Phase 1 (closed)
04Phase 2 — hosting per step
Classification
JSON extraction
Orchestration
05Pricing & team
Where the money goes
Phase 1 API (closed, 2 mo)
$321
Phase 2 API (OCR + closed steps)
$1,326
Phase 2 GPU hosting (12 mo)
$4,643
One-time (BERT data + training)
$2,053
Claude dev subscriptions
$6,400
Grand total$14,742
Reality check — does self-hosting pay for itself here?
This plan: self-host in Phase 2
$8,342
compute incl. GPU, excl. dev subs
Hypothetical: 100% closed APIs, 14 mo
$2,242
no GPU, no BERT investment
At 5,000 files, the closed-API steps you replace in Phase 2 would have cost only $596 over 12 months, while the always-on GPUs cost $4,643. Self-hosting is not cost-driven at this volume. The defensible reasons to do it anyway: data residency(Aadhaar / PAN / CIBIL under RBI & DPDP rules are hard to send to third-party LLM APIs), scaling well beyond 5,000 files, and building proprietary IP. Frame the GPU line to investors as a compliance + capability investment, not a savings play.
Phase 2 strategy — same workload, three ways to host
Closed APIs (Gemini/Claude)
no infra, top reasoning, vendor-locked
$1,922
Open models via API
open weights, pay-per-token, portable
$1,352cheapest
Self-hosted GPUs
in-house, idle cost, full control
$39,837
12-month Phase-2 cost only (API + GPU + one-time), same documents and token assumptions. OCR and Phase-1 are identical across all three, so this isolates the hosting decision. Open-model APIs give you most of the cost win of self-hosting without renting idle GPUs — and let you later lift the same weights in-house when compliance or scale demands it.
Scenario matrix — total by avg pages/file
50 pg
100 pg
200 pg
500 pg
Phase 1 API (2 mo)
$225
$321
$511
$1,083
Phase 2 API (12 mo)
$830
$1,326
$2,318
$5,293
Phase 2 GPU (12 mo)
$4,643
$4,643
$4,643
$4,643
One-time BERT
$2,053
$2,053
$2,053
$2,053
Dev subscriptions
$6,400
$6,400
$6,400
$6,400
Grand total
$14,151
$14,742
$15,925
$19,472
Note: GPU, BERT one-time, and dev subscriptions are fixed — they don't move with pages/file. Only the API lines scale with document volume.
Per-step unit economics (per file)
Step
Phase 1 (closed)
Phase 2
Mode
OCR (Gemini)
$0.2315
$0.2315
closed (fixed)
Classification
$0.0350
GPU (no token $)
selfhost
JSON extract (30 calls)
$0.0282
$0.0364
openapi
Orchestration
$0.1538
$0.0414
openapi
Total / file
$0.4484
$0.3093 API
Biggest API lever: JSON extraction runs 30 calls/file because of the per-attribute design. Set "attributes/call" to 30 in advanced settings to collapse it to 1 call — often a 10–20× cut on that step with minor accuracy trade-off.
BERT training; 7–8B open LLM (Llama-3.1-8B, Qwen-2.5-7B) quantized
Good for ≤8B; ~20–40 tok/s
A100 80GB
80 GB
$3.67
13–34B full, or 70B 4-bit. JSON extraction + mid orchestration
Strong; ~40–80 tok/s on 8–34B
H100 80GB
80 GB
$6.98
70B+ low-latency, high throughput. Heavy orchestration reasoning
Lowest latency / highest throughput
Rule of thumb: T4 for BERT classification, A10 for an 8B extraction model, A100 for a 30–70B reasoner, H100 only if orchestration latency or throughput becomes the bottleneck.
List prices verified mid-2026 (Gemini, Anthropic, Azure GPU). Excludes — per your scope — database, Azure VMs, Kubernetes, Blob, Redis, vector DB, monitoring, logging, and CI/CD. Treat outputs as planning estimates; real spend depends on negotiated rates, GPU utilization, caching, and accuracy-driven retries.