Tech budget model · 14-month build

Home-loan vendor matching — compute & tooling estimate

2 months closed-source for speed, then 12 months self-hosted optimization. Every assumption below is live.
Grand total (14 mo): $14,742
Compute + tooling only: $14,742 (excl. salaries)
Cost per file: $2.95 (5,000 files)
Cost per page: $0.0295 (100 pg/file)
01 Workload
02 OCR — Gemini (always-on)
03 Per-step model — Phase 1 (closed)
04 Phase 2 — hosting per step
Classification
JSON extraction
Orchestration
05 Pricing & team
Where the money goes
| Line item | Cost |
|---|---|
| Phase 1 API (closed, 2 mo) | $321 |
| Phase 2 API (OCR + closed steps) | $1,326 |
| Phase 2 GPU hosting (12 mo) | $4,643 |
| One-time (BERT data + training) | $2,053 |
| Claude dev subscriptions | $6,400 |
| Grand total | $14,742 |
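The headline unit costs follow directly from the line items above. The sketch below is a minimal check, assuming the 5,000 files and 100 pages/file quoted in the headline figures; the variable names are illustrative. Because the displayed line items are already rounded, their sum comes out one dollar above the displayed grand total.

```python
# Line items copied from the breakdown above (USD over 14 months).
line_items = {
    "Phase 1 API (closed, 2 mo)": 321,
    "Phase 2 API (OCR + closed steps)": 1326,
    "Phase 2 GPU hosting (12 mo)": 4643,
    "One-time (BERT data + training)": 2053,
    "Claude dev subscriptions": 6400,
}

FILES = 5000          # from the headline figures
PAGES_PER_FILE = 100  # from the headline figures

# Sum of the rounded items; may differ by $1 from the displayed total.
total = sum(line_items.values())
cost_per_file = total / FILES
cost_per_page = cost_per_file / PAGES_PER_FILE

# Share of the total carried by each line item.
shares = {name: amount / total for name, amount in line_items.items()}

for name, share in shares.items():
    print(f"{name:<36} {share:6.1%}")
print(f"total ${total:,}  per file ${cost_per_file:.2f}  per page ${cost_per_page:.4f}")
```

Dev subscriptions and GPU hosting together account for roughly three quarters of the total, which is why the per-file figure is far higher than the per-file API cost alone.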
Reality check — does self-hosting pay for itself here?
This plan (self-host in Phase 2): $8,342 (compute incl. GPU, excl. dev subs)
Hypothetical (100% closed APIs, 14 mo): $2,242 (no GPU, no BERT investment)

At 5,000 files, the closed-API steps you replace in Phase 2 would have cost only $596 over 12 months, while the always-on GPUs cost $4,643. Self-hosting is not cost-driven at this volume. The defensible reasons to do it anyway: data residency (Aadhaar / PAN / CIBIL data under RBI and DPDP rules is hard to send to third-party LLM APIs), scaling well beyond 5,000 files, and building proprietary IP. Frame the GPU line to investors as a compliance + capability investment, not a savings play.
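To put "well beyond 5,000 files" in rough numbers, the sketch below estimates the Phase-2 volume at which the avoided API spend would cover the GPU bill. It rests on assumptions not stated in the source: files are spread evenly over the 14 months, the per-file saving stays constant, and the same always-on GPU capacity can absorb the higher volume. The one-time BERT cost is left out.

```python
import math

GPU_COST_12MO = 4643.0          # always-on GPU hosting, from the text
AVOIDED_API_COST_12MO = 596.0   # closed-API spend replaced in Phase 2, from the text

# ASSUMPTION: 5,000 files spread evenly over 14 months, so 12/14 land in Phase 2.
PHASE2_FILES = 5000 * 12 / 14

# Closed-API dollars avoided per file by self-hosting.
saving_per_file = AVOIDED_API_COST_12MO / PHASE2_FILES

# Phase-2 files per year needed before the savings cover the GPU cost.
break_even_files = math.ceil(GPU_COST_12MO / saving_per_file)
volume_multiple = break_even_files / PHASE2_FILES

print(f"saving per file:  ${saving_per_file:.4f}")
print(f"break-even files: {break_even_files:,} per 12 months")
print(f"volume multiple:  {volume_multiple:.1f}x the planned Phase-2 volume")
```

Under these assumptions the break-even sits at several times the planned volume, which is consistent with framing the GPU line as a compliance and capability investment.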

Phase 2 strategy — same workload, three ways to host
| Option | Profile | 12-month Phase-2 cost |
|---|---|---|
| Closed APIs (Gemini/Claude) | no infra, top reasoning, vendor-locked | $1,922 |
| Open models via API | open weights, pay-per-token, portable | $1,352 (cheapest) |
| Self-hosted GPUs | in-house, idle cost, full control | $39,837 |
12-month Phase-2 cost only (API + GPU + one-time), with the same documents and token assumptions. OCR and Phase 1 are identical across all three, so this isolates the hosting decision. Open-model APIs capture the per-token savings of open weights without renting idle GPUs, and let you later move the same weights in-house when compliance or scale demands it.
Scenario matrix — total by avg pages/file
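| Line item | 50 pg | 100 pg | 200 pg | 500 pg |
|---|---|---|---|---|
| Phase 1 API (2 mo) | $225 | $321 | $511 | $1,083 |
| Phase 2 API (12 mo) | $830 | $1,326 | $2,318 | $5,293 |
| Phase 2 GPU (12 mo) | $4,643 | $4,643 | $4,643 | $4,643 |
| One-time BERT | $2,053 | $2,053 | $2,053 | $2,053 |
| Dev subscriptions | $6,400 | $6,400 | $6,400 | $6,400 |
| Grand total | $14,151 | $14,742 | $15,925 | $19,472 |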
Note: GPU, BERT one-time, and dev subscriptions are fixed: they don't move with pages/file. Only the API lines scale with page count.
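The fixed-versus-variable split in the note can be checked by summing the matrix rows. The sketch below uses only the figures from the matrix; the dictionary layout is illustrative. The 100 pg column sums to $14,743 from the rounded rows, one dollar above the displayed $14,742.

```python
PAGES = [50, 100, 200, 500]

# Rows that vary with pages/file (USD), copied from the matrix.
variable = {
    "Phase 1 API (2 mo)": [225, 321, 511, 1083],
    "Phase 2 API (12 mo)": [830, 1326, 2318, 5293],
}

# Rows that stay constant in every scenario (USD).
fixed = {
    "Phase 2 GPU (12 mo)": 4643,
    "One-time BERT": 2053,
    "Dev subscriptions": 6400,
}

fixed_total = sum(fixed.values())

totals = []
for i, pages in enumerate(PAGES):
    api = sum(row[i] for row in variable.values())
    total = fixed_total + api
    totals.append(total)
    print(f"{pages:>3} pg: API ${api:>5,}  fixed ${fixed_total:,}  "
          f"total ${total:,}  fixed share {fixed_total / total:.0%}")
```

Going from 50 to 500 pages per file multiplies the API spend roughly sixfold, yet the grand total rises by well under half because the fixed lines dominate.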
Per-step unit economics (per file)
| Step | Phase 1 (closed) | Phase 2 | Mode |
|---|---|---|---|
| OCR (Gemini) | $0.2315 | $0.2315 | closed (fixed) |
| Classification | $0.0350 | GPU (no token $) | selfhost |
| JSON extract (30 calls) | $0.0282 | $0.0364 | openapi |
| Orchestration | $0.1538 | $0.0414 | openapi |
| Total / file | $0.4484 | $0.3093 API | |
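The per-file figures tie back to the 12-month API totals once they are multiplied by the Phase-2 file count. The sketch below assumes the 5,000 files are spread evenly over 14 months, which the source does not state explicitly, and treats self-hosted classification as zero token cost. Small differences from the displayed totals come from rounding in the per-step figures.

```python
# Per-file API cost by step (USD), copied from the per-step figures above.
phase1 = {"OCR": 0.2315, "Classification": 0.0350,
          "JSON extract": 0.0282, "Orchestration": 0.1538}
# Classification runs on the GPU in Phase 2, so it carries no token cost here.
phase2 = {"OCR": 0.2315, "Classification": 0.0,
          "JSON extract": 0.0364, "Orchestration": 0.0414}

# ASSUMPTION: files arrive evenly, so 12 of 14 months' worth fall in Phase 2.
PHASE2_FILES = 5000 * 12 / 14

per_file_closed = sum(phase1.values())
per_file_phase2 = sum(phase2.values())

phase2_api_total = per_file_phase2 * PHASE2_FILES   # this plan's Phase-2 API line
all_closed_total = per_file_closed * PHASE2_FILES   # same 12 months, closed APIs only
avoided = all_closed_total - phase2_api_total       # API spend replaced in Phase 2

print(f"per file: closed ${per_file_closed:.4f}, Phase 2 ${per_file_phase2:.4f}")
print(f"Phase 2 API, 12 mo: ${phase2_api_total:,.0f}")
print(f"All closed, 12 mo:  ${all_closed_total:,.0f}")
print(f"Avoided API spend:  ${avoided:,.0f}")
```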
Biggest API lever: JSON extraction runs 30 calls/file because of the per-attribute design. Set "attributes/call" to 30 in advanced settings to collapse it to 1 call, often a 10–20× cut on that step with a minor accuracy trade-off.
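The effect of the "attributes/call" setting on call count can be sketched as a simple ceiling division. The function name and the figure of 30 attributes per file (implied by 30 calls at one attribute each) are assumptions for illustration. Note that this counts calls only; the cost cut quoted above is 10–20×, not 30×, so cost does not fall in strict proportion to call count.

```python
import math

def extraction_calls(num_attributes: int, attributes_per_call: int) -> int:
    """Number of JSON-extraction calls needed per file (hypothetical helper)."""
    if attributes_per_call < 1:
        raise ValueError("attributes_per_call must be at least 1")
    return math.ceil(num_attributes / attributes_per_call)

# ASSUMPTION: 30 attributes per file, implied by 30 calls in the per-attribute design.
NUM_ATTRIBUTES = 30

for batch in (1, 5, 10, 30):
    calls = extraction_calls(NUM_ATTRIBUTES, batch)
    print(f"attributes/call = {batch:>2} -> {calls:>2} call(s) per file")
```

Intermediate batch sizes such as 5 or 10 may be worth testing if a single 30-attribute call costs more accuracy than is acceptable.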
GPU tier reference — which for which job
| GPU | VRAM | $/hr (Azure) | Best for | Latency |
|---|---|---|---|---|
| T4 16GB | 16 GB | $0.53 | DistilBERT / ModernBERT inference + training; tiny (≤3B) LLM | Low for BERT; not for big LLMs |
| A10 24GB | 24 GB | $0.95 | BERT training; 7–8B open LLM (Llama-3.1-8B, Qwen-2.5-7B) quantized | Good for ≤8B; ~20–40 tok/s |
| A100 80GB | 80 GB | $3.67 | 13–34B full, or 70B 4-bit. JSON extraction + mid orchestration | Strong; ~40–80 tok/s on 8–34B |
| H100 80GB | 80 GB | $6.98 | 70B+ low-latency, high throughput. Heavy orchestration reasoning | Lowest latency / highest throughput |
Rule of thumb: T4 for BERT classification, A10 for an 8B extraction model, A100 for a 30–70B reasoner, H100 only if orchestration latency or throughput becomes the bottleneck.
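Hourly rates are easier to compare as always-on annual costs. The sketch below multiplies each listed rate by 8,760 hours (24 × 365), assuming no reserved-instance discounts or scheduled shutdowns. The $4,643 GPU hosting line matches a single T4 running continuously for 12 months; that is an inference from the numbers, not something the source states.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, always-on

# Hourly list prices (USD) from the GPU tier reference.
hourly_rates = {
    "T4 16GB": 0.53,
    "A10 24GB": 0.95,
    "A100 80GB": 3.67,
    "H100 80GB": 6.98,
}

def always_on_cost(rate_per_hour: float, months: int = 12) -> float:
    """Cost of running one instance continuously for the given number of months."""
    return rate_per_hour * HOURS_PER_YEAR * months / 12

annual = {gpu: always_on_cost(rate) for gpu, rate in hourly_rates.items()}

for gpu, cost in annual.items():
    print(f"{gpu:<10} ${cost:>9,.0f} per 12 months")
```

Each step up the tier list multiplies the idle cost, so the rule of thumb above amounts to picking the smallest GPU that fits the model.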
List prices verified mid-2026 (Gemini, Anthropic, Azure GPU). Excludes — per your scope — database, Azure VMs, Kubernetes, Blob, Redis, vector DB, monitoring, logging, and CI/CD. Treat outputs as planning estimates; real spend depends on negotiated rates, GPU utilization, caching, and accuracy-driven retries.