Tech budget model · 14-month build

Home-loan vendor matching — compute & tooling estimate

The first 2 months run on closed-source APIs for speed, followed by 12 months of self-hosted optimization. Every assumption below is adjustable.

Grand total (14 mo)

$14,742

Compute + tooling only

$14,742

excl. salaries

Cost per file

$2.95

5,000 files

Cost per page

$0.0295

100 pg/file
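The two unit costs are the grand total divided down by the workload. A minimal sketch of that arithmetic, using only the figures on the cards above (variable names are illustrative):

```python
# Figures copied from the summary cards above.
GRAND_TOTAL_USD = 14_742
TOTAL_FILES = 5_000
PAGES_PER_FILE = 100

cost_per_file = GRAND_TOTAL_USD / TOTAL_FILES
cost_per_page = cost_per_file / PAGES_PER_FILE

print(f"cost per file: ${cost_per_file:.2f}")
print(f"cost per page: ${cost_per_page:.4f}")
```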

01 Workload

Total files (customers): 5,000

Avg pages per file: 100

Headline scenarios computed for 50 / 100 / 200 / 500 below.

Share of files in launch phase: 14%

≈ 715 files (14.3% of 5,000, displayed rounded as 14%) in months 1–2; the remaining ≈ 4,285 run in the optimized phase.

02 OCR — Gemini (always-on)

OCR model

OCR is fixed to Gemini for quality. Flash is ~4× cheaper than Pro; reserve Pro for hard scans.

03 Per-step model — Phase 1 (closed)

Classification model

JSON extraction model

Orchestration model

Needs the strongest reasoning of the four steps.

04 Phase 2 — hosting per step

Classification: self-hosted (GPU)

JSON extraction: open-model API

Orchestration: open-model API

Open-model API (JSON + orchestration)

Classification via Open API uses a cheap ~8B tier automatically.

BERT compute (classification)

GPU hours / month: 730

730 = always-on. Lower it if you batch jobs instead of 24/7 serving.
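GPU hosting cost is hours per month times the hourly rate times the number of months. The sketch below assumes the T4 rate of $0.53/hr from the GPU tier table on this page; that the T4 is the selected GPU is an inference, made because it reproduces the $4,643 hosting line. The 200-hour batched case is a hypothetical value for illustration only.

```python
def gpu_hosting_cost(hours_per_month: float, usd_per_hour: float, months: int) -> float:
    """Total GPU rental cost for a fixed monthly usage."""
    return hours_per_month * usd_per_hour * months

T4_USD_PER_HOUR = 0.53  # from the GPU tier table; assumed to be the selected GPU

always_on = gpu_hosting_cost(730, T4_USD_PER_HOUR, 12)  # 24/7 serving
batched = gpu_hosting_cost(200, T4_USD_PER_HOUR, 12)    # hypothetical batch schedule

print(f"always-on: ${always_on:,.0f}")
print(f"batched:   ${batched:,.0f}")
```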

05 Pricing & team

GPU pricing

Junior devs (×3, 12 mo) plan

Senior devs (×2, 14 mo) plan
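The two team lines add up to a number of seat-months, and dividing the $6,400 dev-subscription line by it gives an implied average price per seat. This is a back-calculation from the figures on this page, not a stated plan price; the juniors and seniors may well be on different plans.

```python
# Seat counts and durations from the two team lines above.
junior_seat_months = 3 * 12
senior_seat_months = 2 * 14
seat_months = junior_seat_months + senior_seat_months

DEV_SUBSCRIPTIONS_USD = 6_400  # "Claude dev subscriptions" line in the breakdown

implied_usd_per_seat_month = DEV_SUBSCRIPTIONS_USD / seat_months
print(seat_months, implied_usd_per_seat_month)
```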

Where the money goes

Phase 1 API (closed, 2 mo)

$321

Phase 2 API (OCR + closed steps)

$1,326

Phase 2 GPU hosting (12 mo)

$4,643

One-time (BERT data + training)

$2,053

Claude dev subscriptions

$6,400

Grand total

$14,742
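A quick check of the breakdown, with each line's share of the total. The rounded line items sum to $14,743, one dollar above the displayed grand total, which is presumably because the page rounds each line separately.

```python
# Line items copied from the breakdown above (rounded dollars).
breakdown = {
    "Phase 1 API (closed, 2 mo)": 321,
    "Phase 2 API (OCR + closed steps)": 1_326,
    "Phase 2 GPU hosting (12 mo)": 4_643,
    "One-time (BERT data + training)": 2_053,
    "Claude dev subscriptions": 6_400,
}

total = sum(breakdown.values())
for name, usd in breakdown.items():
    print(f"{name:<34} ${usd:>6,}  {usd / total:6.1%}")
print(f"{'Sum of rounded lines':<34} ${total:>6,}")
```

Dev subscriptions and GPU hosting together account for roughly three quarters of the spend; the API lines are the smallest part.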

Reality check — does self-hosting pay for itself here?

This plan: self-host in Phase 2

$8,342

compute incl. GPU, excl. dev subs

Hypothetical: 100% closed APIs, 14 mo

$2,242

no GPU, no BERT investment

At 5,000 files, the closed-API steps you replace in Phase 2 would have cost only $596 over 12 months, while the always-on GPUs cost $4,643. Self-hosting is not cost-driven at this volume. The defensible reasons to do it anyway are data residency (Aadhaar / PAN / CIBIL data under RBI and DPDP rules is hard to send to third-party LLM APIs), scaling well beyond 5,000 files, and building proprietary IP. Frame the GPU line to investors as a compliance and capability investment, not a savings play.
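To put a rough number on "scaling well beyond 5,000 files", here is a break-even sketch. It rests on two assumptions the page does not state: the $596 of replaced closed-API spend scales linearly with Phase 2 file count, and the same always-on GPU setup can absorb the higher volume at no extra cost. The Phase 2 file count of 4,285 is 5,000 minus the ≈ 715 launch-phase files.

```python
import math

PHASE2_FILES = 5_000 - 715   # files processed in the 12 optimized months
REPLACED_API_USD = 596       # closed-API cost avoided in Phase 2
GPU_USD = 4_643              # always-on GPU hosting, 12 months

# Assumption: avoided API spend grows linearly with file count.
saving_per_file = REPLACED_API_USD / PHASE2_FILES

# Assumption: GPU cost stays flat as volume grows.
break_even_files = math.ceil(GPU_USD / saving_per_file)

print(f"saving per file:  ${saving_per_file:.3f}")
print(f"break-even files: {break_even_files:,} per 12 months")
```

Under those assumptions the GPU line only pays for itself at several times the current volume, before counting the one-time BERT investment.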

Phase 2 strategy — same workload, three ways to host

Closed APIs (Gemini/Claude)

no infra, top reasoning, vendor-locked

$1,922

Open models via API

open weights, pay-per-token, portable

$1,352

cheapest

Self-hosted GPUs

in-house, idle cost, full control

$39,837

12-month Phase 2 cost only (API + GPU + one-time), with the same documents and token assumptions. OCR and Phase 1 are identical across all three, so this isolates the hosting decision. Open-model APIs are the cheapest option here: they keep the open-weights portability of self-hosting without renting idle GPUs, and they let you move the same weights in-house later when compliance or scale demands it.
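Two of the three figures can be approximately rebuilt from the tables on this page. The sketch below is a reconstruction, not the calculator's own formula: it assumes 4,285 Phase 2 files (5,000 minus ≈ 715), and it assumes the all-self-hosted case runs one T4 and one A100 always-on, a combination chosen only because it reproduces the displayed total. The open-model API figure is not rebuilt because its classification price is not shown.

```python
HOURS_12_MONTHS = 730 * 12
PHASE2_FILES = 5_000 - 715

# Closed APIs: every step billed at the Phase 1 closed per-file total.
closed_usd = 0.4484 * PHASE2_FILES

# Self-hosted (assumed T4 + A100, always-on), plus BERT one-time, plus Gemini OCR.
T4_USD_PER_HOUR, A100_USD_PER_HOUR = 0.53, 3.67
self_hosted_usd = (
    (T4_USD_PER_HOUR + A100_USD_PER_HOUR) * HOURS_12_MONTHS
    + 2_053                    # one-time BERT data + training
    + 0.2315 * PHASE2_FILES    # OCR stays on Gemini in every option
)

print(f"closed APIs: ${closed_usd:,.0f}")
print(f"self-hosted: ${self_hosted_usd:,.0f}")
```

Both results land within a dollar of the displayed $1,922 and $39,837, which suggests that almost all of the self-hosted cost is idle GPU time rather than per-document work.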

Scenario matrix — total by avg pages/file

Line item	50 pg	100 pg	200 pg	500 pg
Phase 1 API (2 mo)	$225	$321	$511	$1,083
Phase 2 API (12 mo)	$830	$1,326	$2,318	$5,293
Phase 2 GPU (12 mo)	$4,643	$4,643	$4,643	$4,643
One-time BERT	$2,053	$2,053	$2,053	$2,053
Dev subscriptions	$6,400	$6,400	$6,400	$6,400
Grand total	$14,151	$14,742	$15,925	$19,472

Note: GPU, BERT one-time, and dev subscriptions are fixed — they don't move with pages/file. Only the API lines scale with document volume.
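Because only the API lines move, the matrix can be approximated for other page counts with a straight line through two of its columns. The sketch fits each API line through the 100 pg and 200 pg values; linearity is an assumption, and the fit lands within a few dollars of the 50 pg and 500 pg columns rather than matching them exactly.

```python
FIXED_USD = 4_643 + 2_053 + 6_400  # GPU + one-time BERT + dev subscriptions

def line_through(p1, p2):
    """Return f(x) for the straight line through two (pages, usd) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)

# Fitted on the 100 pg and 200 pg columns of the matrix.
phase1_api = line_through((100, 321), (200, 511))
phase2_api = line_through((100, 1_326), (200, 2_318))

def grand_total(pages_per_file: float) -> float:
    return FIXED_USD + phase1_api(pages_per_file) + phase2_api(pages_per_file)

for pages in (50, 100, 200, 300, 500):
    print(f"{pages:>3} pg  ${grand_total(pages):,.0f}")
```

The non-zero intercepts of the fitted lines indicate that part of the API cost does not depend on page count at all.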

Per-step unit economics (per file)

Step	Phase 1 (closed)	Phase 2	Mode
OCR (Gemini)	$0.2315	$0.2315	closed (fixed)
Classification	$0.0350	GPU (no token $)	selfhost
JSON extract (30 calls)	$0.0282	$0.0364	openapi
Orchestration	$0.1538	$0.0414	openapi
Total / file	$0.4484	$0.3093	API only (excl. GPU)
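Multiplying the per-file figures by the file counts ties this table back to the API lines in the breakdown, and the shares show which step dominates. The file counts (715 and 4,285) come from the workload split; the results match the displayed $321 and $1,326 to within a dollar of rounding.

```python
# Per-file costs copied from the table above.
PHASE1 = {"OCR": 0.2315, "Classification": 0.0350,
          "JSON extract": 0.0282, "Orchestration": 0.1538}
PHASE2_API = {"OCR": 0.2315, "JSON extract": 0.0364, "Orchestration": 0.0414}

PHASE1_FILES = 715
PHASE2_FILES = 5_000 - 715

def summarize(per_file: dict, files: int):
    """Return (total USD for the phase, share of per-file cost by step)."""
    total = sum(per_file.values())
    shares = {step: usd / total for step, usd in per_file.items()}
    return total * files, shares

phase1_usd, phase1_shares = summarize(PHASE1, PHASE1_FILES)
phase2_usd, phase2_shares = summarize(PHASE2_API, PHASE2_FILES)

print(f"Phase 1 API: ${phase1_usd:,.0f}")
print(f"Phase 2 API: ${phase2_usd:,.0f}")
for step, share in phase2_shares.items():
    print(f"  {step:<14} {share:.1%}")
```

OCR is the largest per-file API cost in both phases, and about three quarters of the Phase 2 API cost.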

API lever: JSON extraction runs 30 calls/file because of the per-attribute design. Set "attributes/call" to 30 in advanced settings to collapse it to 1 call, often a 10–20× cut on that step with a minor accuracy trade-off. The absolute saving is small, since this step costs about $0.03–$0.04 per file against $0.2315 for OCR.
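The call count is the number of attributes divided by the attributes sent per call, rounded up. A minimal sketch, assuming 30 attributes per file (implied by 30 calls at one attribute each); the function name is illustrative, and the cost range simply applies the quoted 10–20× cut to the Phase 2 per-file figure.

```python
import math

def calls_per_file(n_attributes: int, attributes_per_call: int) -> int:
    """Number of extraction calls needed to cover every attribute once."""
    return math.ceil(n_attributes / attributes_per_call)

N_ATTRIBUTES = 30  # implied by 30 calls/file at 1 attribute per call

for batch in (1, 5, 10, 30):
    print(f"attributes/call={batch:>2} -> {calls_per_file(N_ATTRIBUTES, batch)} calls")

# Applying the quoted 10-20x cut to the Phase 2 JSON extraction cost.
JSON_EXTRACT_USD = 0.0364
low, high = JSON_EXTRACT_USD / 20, JSON_EXTRACT_USD / 10
print(f"per-file cost after batching: ${low:.4f} to ${high:.4f}")
```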

GPU tier reference — which for which job

GPU	VRAM	$/hr (Azure)	Best for	Latency
T4 16GB	16 GB	$0.53	DistilBERT / ModernBERT inference + training; tiny (≤3B) LLM	Low for BERT; not for big LLMs
A10 24GB	24 GB	$0.95	BERT training; 7–8B open LLM (Llama-3.1-8B, Qwen-2.5-7B) quantized	Good for ≤8B; ~20–40 tok/s
A100 80GB	80 GB	$3.67	13–34B full, or 70B 4-bit. JSON extraction + mid orchestration	Strong; ~40–80 tok/s on 8–34B
H100 80GB	80 GB	$6.98	70B+ low-latency, high throughput. Heavy orchestration reasoning	Lowest latency / highest throughput

Rule of thumb: T4 for BERT classification, A10 for an 8B extraction model, A100 for a 30–70B reasoner, H100 only if orchestration latency or throughput becomes the bottleneck.
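The rule of thumb can be written as a small lookup, with the always-on monthly cost of each tier at 730 hours. The job labels and function names are hypothetical; the hourly rates are the ones in the table above.

```python
# Hourly rates from the GPU tier table above.
GPU_USD_PER_HOUR = {"T4": 0.53, "A10": 0.95, "A100": 3.67, "H100": 6.98}

# Rule of thumb, keyed by hypothetical job labels.
RECOMMENDED_GPU = {
    "bert_classification": "T4",
    "8b_extraction": "A10",
    "30_70b_reasoner": "A100",
    "latency_bound_orchestration": "H100",
}

def monthly_cost(gpu: str, hours: float = 730) -> float:
    """Rental cost for one GPU at the given hours per month (730 = always-on)."""
    return GPU_USD_PER_HOUR[gpu] * hours

for job, gpu in RECOMMENDED_GPU.items():
    print(f"{job:<28} {gpu:<5} ${monthly_cost(gpu):,.0f}/mo always-on")
```

An always-on H100 costs about 13 times as much as a T4, which is why it is reserved for cases where latency or throughput is the actual bottleneck.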

List prices verified mid-2026 (Gemini, Anthropic, Azure GPU). Excludes — per your scope — database, Azure VMs, Kubernetes, Blob, Redis, vector DB, monitoring, logging, and CI/CD. Treat outputs as planning estimates; real spend depends on negotiated rates, GPU utilization, caching, and accuracy-driven retries.