Typhoon Model Inference Benchmarks & Hardware Recommendations

This guide covers the minimum hardware required to run inference with Typhoon models, along with benchmarks and hardware recommendations.

Last Update: 14 November 2025

Minimum Requirements for Running Typhoon Models

  • Typhoon ASR Real-Time requires only a CPU and 8 GB of RAM; more CPU cores allow more concurrent requests

When deploying larger Typhoon models such as Typhoon OCR (7B) and Typhoon 2.1 Gemma (12B) in the cloud, the choice of GPU becomes critical because of their higher VRAM and compute requirements. Each cloud provider offers different GPU families, and availability may also vary by region.

  • AWS → Commonly provides L4 instances, suitable for high-throughput inference. For larger workloads or lower latency, A100 and H100 instances are also available in select regions.

  • GCP → Offers L4 GPUs as the most accessible option for inference, with A100 and H100 available for enterprise-scale workloads.

  • Azure → Typically provides A100 GPUs as the standard option for running models of this size, with H100 also available in specific regions for heavier workloads.

In practice, this means that:

  • If you’re on AWS or GCP, the L4 is the go-to choice for production inference.

  • If you’re on Azure, you’ll likely need to provision an A100 instance.

  • For enterprise-grade inference at scale, all providers support A100 or H100 instances, though these come at a higher cost.
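
As a concrete starting point, the sketch below shows how a model of this size might be served with vLLM on one of these GPUs. The model ID, precision, and memory settings are illustrative assumptions; check the model card for the exact repository name and your GPU's VRAM fit.

```python
# Hedged sketch: serving a 12B-class Typhoon model with vLLM on an
# A100-class GPU. The model ID below is a placeholder; in bf16 a 12B
# model will not fit on a 24 GB L4, so use a quantized variant there.
from vllm import LLM, SamplingParams

llm = LLM(
    model="scb10x/typhoon2.1-gemma3-12b",  # placeholder Hugging Face repo ID
    dtype="bfloat16",                      # half-precision weights
    gpu_memory_utilization=0.90,           # leave headroom for the CUDA context
    max_model_len=16000,                   # matches the context lengths benchmarked below
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Summarize the following paragraph: ..."], params)
print(outputs[0].outputs[0].text)
```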

Typhoon now offers three OCR model sizes optimized for different deployment environments. The newest Typhoon OCR 1.5 (2B) is significantly lighter and more efficient, making it the default recommendation for most users.

Below is the updated guidance:

Typhoon OCR 1.5 (2B) — Recommended Default

Runs on:

  • CPU-only servers (slow)

  • Mac M1/M2 (8–16 GB RAM)

  • Consumer GPUs (RTX 3060/4060 and up)

  • Cloud L4 (best price–performance)

VRAM required: 8–12 GB

Best for:

  • High-throughput workloads

  • Cost-sensitive deployments

  • Real-time OCR pipelines

  • On-premise deployments without large GPUs

Key benefits:

  • Up to 2–3× higher throughput than OCR 3B

  • Much lower running cost on L4, A100, H100
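
To make this concrete, here is a minimal sketch of calling a self-hosted Typhoon OCR 1.5 endpoint, assuming it is served behind an OpenAI-compatible API (for example via vLLM). The base URL and served-model name are placeholder assumptions.

```python
# Hedged sketch: sending one page image to a self-hosted Typhoon OCR 1.5
# endpoint through an OpenAI-compatible API. URL and model name are
# placeholders; adjust to your deployment.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="typhoon-ocr-1.5",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this document."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=512,  # matches the ~512-token output assumed in the benchmarks below
)
print(response.choices[0].message.content)
```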

| Model | Parameters | VRAM Needed | Hardware Tier | Notes |
|---|---|---|---|---|
| Typhoon OCR 1.5 (2B) | 2B | 8–12 GB | CPU / Mac / L4 / mid-range GPUs | Best cost-performance. New default. |
| Typhoon OCR 3B | 3B | 12–16 GB | Mac 16 GB / RTX 30xx+ / L4 | Mid-tier model. |
| Typhoon OCR 7B | 7B | ≥24 GB | RTX 4090 / A100 / H100 | Solid accuracy, highest compute. |
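
The VRAM tiers above follow from a standard back-of-the-envelope estimate: weights in 16-bit precision take about 2 bytes per parameter, and serving adds KV-cache and runtime overhead on top. The sketch below reproduces that arithmetic; the 1.5× overhead factor is an assumption, not a measured value.

```python
# Rough VRAM estimate: ~2 bytes/parameter for 16-bit weights, scaled by an
# assumed 1.5x factor for KV cache and runtime overhead.
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,
                     overhead: float = 1.5) -> float:
    return params_billion * bytes_per_param * overhead

for name, size_b in [("Typhoon OCR 1.5", 2), ("Typhoon OCR 3B", 3), ("Typhoon OCR 7B", 7)]:
    print(f"{name}: ~{estimate_vram_gb(size_b):.0f} GB VRAM")
# ~6 GB, ~9 GB, ~21 GB -- consistent with the 8-12 GB, 12-16 GB, and >=24 GB tiers.
```
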
| Model | Size | Local Dev (Laptop / Consumer GPU) | Recommended Hardware (Server/Enterprise) | Cloud GPU Equivalent | Notes |
|---|---|---|---|---|---|
| Typhoon ASR Real-Time | ~1B | ✅ Runs on CPU-only laptops with ≥8 GB RAM | Multi-core CPU servers (more cores = more concurrency) | N/A (GPU not required) | Lightweight, real-time speech recognition. Optimized for CPU. |
| Typhoon OCR 1.5 (2B) | 2B | ✅ 8–12 GB | CPU / Mac / mid-range GPUs | L4 | Best cost-performance. New OCR default. |
| Typhoon Text (Gemma 2.1) | 12B | ⚠️ Runs on RTX 3090/4090 (≥24 GB VRAM), or on laptops using the quantized version | A100 40 GB, L4 | AWS L4, GCP L4, Azure A100 | Ideal for production inference with medium latency. |
| Typhoon Text (Typhoon 2.5) | 30B | ⚠️ Runs on high-RAM laptops (≥32 GB RAM) via the Ollama quantized version (CPU-only inference) | A100 80 GB, H100 80 GB | AWS/GCP/Azure A100 or H100 | Large 30B model; production on A100/H100; best on H100. |

Running Typhoon Models — Test Results on Popular GPUs

We benchmarked Typhoon models on four popular NVIDIA GPUs in cloud environments. These GPUs are not the only ones compatible with Typhoon. Other GPUs with similar specs should deliver comparable results.

  • RTX 2000 Ada (16 GB VRAM)
  • L4 (24 GB VRAM)
  • A100 (80 GB VRAM)
  • H100 (80 GB VRAM)

Metrics

  • Throughput Metrics:

    • Requests / sec
    • Tokens / sec
  • Latency Metrics:

    • Avg Latency (sec)
    • Avg TTFT (time to first token) (sec)
  • Cost Metrics:

    • Cost/million tokens (dollars)
    • Cost/request
  • Resource Metrics:

    • Peak Memory (MB)
    • Avg CPU (%)
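
The two cost metrics follow directly from throughput and the GPU's hourly price. The sketch below reconstructs that arithmetic; note that the published tables apply their own token-accounting assumptions (prompt plus response), so their figures will not line up exactly with a naive output-token calculation.

```python
# Cost metrics derived from throughput: dollars per hour divided by the
# work done in that hour. Token accounting here is a simplification.
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000

def cost_per_request(hourly_usd: float, requests_per_sec: float) -> float:
    return hourly_usd / (requests_per_sec * 3600)

# Example: a $2.50/hour GPU generating 2,117.7 tokens/sec at 4.60 req/sec
print(f"${cost_per_million_tokens(2.50, 2117.7):.2f} per 1M output tokens")  # ~$0.33
print(f"${cost_per_request(2.50, 4.60):.5f} per request")                    # ~$0.00015
```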

The results below reflect our test setup and assumptions about model usage. Your actual performance may vary depending on workload and configuration.
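
A load test of this shape can be approximated with a small async client against an OpenAI-compatible streaming endpoint. The sketch below is illustrative only; the endpoint, model name, and prompt are assumptions, not the harness used to produce these tables.

```python
# Illustrative async load test: fire N concurrent streaming requests and
# record TTFT and end-to-end latency for each. All identifiers are placeholders.
import asyncio
import time
from statistics import mean

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> tuple[float, float]:
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="typhoon-model",  # placeholder served-model name
        messages=[{"role": "user", "content": "Write a short story."}],
        max_tokens=512,
        stream=True,
    )
    async for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    latency = time.perf_counter() - start       # end-to-end latency
    return (ttft if ttft is not None else latency), latency

async def main(concurrency: int = 8) -> None:
    results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    print(f"avg TTFT: {mean(t for t, _ in results):.2f}s")
    print(f"avg latency: {mean(l for _, l in results):.2f}s")

asyncio.run(main())
```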

RTX 2000 Ada (16 GB VRAM)

💵 Cost per hour (RunPod): $0.25

Summary:

Best for ASR/OCR on a budget and local/dev work. Ultra-cheap to run; OK throughput for OCR; LLM latency is high, so not ideal for large text models.

  • Max context length: 8,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 14.5 | 28.1 | 0.5 | $0.0020 | $2.06 | 1047.0 | 7.3 |
| 4 | 0.11 | 50.0 | 34.5 | 0.4 | $0.0006 | $0.63 | 894.5 | 23.1 |
| 8 | 0.12 | 56.1 | 63.7 | 18.9 | $0.0006 | $0.58 | 897.5 | 13.8 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.20 | 45.22 | 4.898 | 0.164 | $0.0003 | $1.4953 | 812.4 | 16.5 |
| 17 | 1.96 | 436.22 | 8.161 | 0.882 | $0.0000 | $0.1550 | 1150.2 | 17.7 |
| 32 | 2.46 | 548.15 | 11.490 | 1.871 | $0.0000 | $0.1234 | 1122.3 | 17.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.06 | 30.8 | 16.5 | 0.18 | $0.0012 | $2.23 | 858.4 | 8.7 |
| 17 | 0.86 | 382.9 | 17.3 | 0.44 | $0.0001 | $0.18 | 1248.3 | 16.2 |
| 32 | 1.34 | 678.9 | 21.7 | 0.84 | $0.00004 | $0.10 | 1656.3 | 23.4 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 402.4 | 402.4 | $0.0006 |
| 64 | 981.1 | 981.1 | $0.0003 |
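
iRTF (inverse real-time factor) is the hours of audio processed per hour of wall-clock time, so the estimated cost of one audio hour is simply the GPU's hourly price divided by iRTF. A quick check against the table:

```python
# Estimated cost of transcribing one hour of audio: hourly GPU price
# divided by iRTF (audio hours processed per wall-clock hour).
def cost_per_audio_hour(hourly_usd: float, irtf: float) -> float:
    return hourly_usd / irtf

print(round(cost_per_audio_hour(0.25, 981.1), 4))  # 0.0003 -- matches the row above
```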

L4 (24 GB VRAM)

💵 Cost per hour: $0.71 (GCP, used for cost calculation) | $0.42 (RunPod, test environment)

Summary:

A great production sweet spot. Strong value for LLM (12B) at 16–32 concurrency and very good for OCR. Cheapest ASR at scale among cloud GPUs tested.

  • Max context length: 16,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.03 | 16.4 | 28.5 | 0.51 | $0.0057 | $5.62 | 918.8 | 13.7 |
| 16 | 0.36 | 168.3 | 41.2 | 0.51 | $0.0005 | $0.54 | 900.4 | 12.6 |
| 32 | 0.47 | 218.9 | 63.8 | 6.49 | $0.0004 | $0.41 | 900.3 | 14.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.22 | 49.16 | 4.497 | 0.267 | $0.0005 | $2.2008 | 797.8 | 10.4 |
| 17 | 2.17 | 484.49 | 7.234 | 0.976 | $0.0001 | $0.2233 | 1194.6 | 8.7 |
| 32 | 2.96 | 660.44 | 9.898 | 2.185 | $0.0000 | $0.1638 | 1251.1 | 7.8 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 16.4 | 27.5 | 0.81 | $0.0054 | $11.88 | 858.5 | 11.5 |
| 17 | 0.53 | 211.4 | 30.2 | 0.46 | $0.0004 | $0.92 | 1270.3 | 13.3 |
| 32 | 0.84 | 391.7 | 35.4 | 1.53 | $0.0002 | $0.50 | 1490.0 | 13.1 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 312.8 | 312.8 | $0.0023 |
| 64 | 1096.0 | 1096.0 | $0.0006 |

A100 (80 GB VRAM)

💵 Cost per hour: $3.67 (Azure, used for cost calculation) | $1.19 (RunPod, test environment)

Summary:

Enterprise workhorse. Scales well for both LLM and OCR, with solid latency and high throughput. Costs more per hour, so shines when you can keep it busy.

  • Max context length: 50,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.13 | 64.3 | 7.7 | 0.36 | $0.0079 | $7.62 | 902.8 | 10.7 |
| 16 | 1.32 | 625.8 | 11.3 | 0.31 | $0.0008 | $0.76 | 902.4 | 9.2 |
| 32 | 1.89 | 879.5 | 16.1 | 0.42 | $0.0005 | $0.53 | 903.5 | 9.9 |
| 64 | 2.21 | 1033.4 | 27.8 | 0.77 | $0.0005 | $0.45 | 904.6 | 13.1 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.69 | 154.06 | 1.404 | 0.173 | $0.0007 | $3.0726 | 785.1 | 4.6 |
| 17 | 4.63 | 1032.01 | 3.225 | 1.199 | $0.0001 | $0.4587 | 1112.6 | 5.2 |
| 32 | 5.53 | 1232.65 | 5.043 | 2.440 | $0.0001 | $0.3840 | 1109.9 | 4.8 |

  • Max context length: 32,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.14 | 66.7 | 6.9 | 1.09 | $0.0071 | $15.08 | 722.9 | 12.0 |
| 16 | 1.98 | 917.9 | 7.4 | 0.49 | $0.0005 | $1.10 | 1080.3 | 5.7 |
| 32 | 3.82 | 1327.5 | 7.6 | 0.90 | $0.0003 | $0.75 | 1406.1 | 12.8 |
| 64 | 4.31 | 1848.0 | 12.3 | 3.14 | $0.0002 | $0.54 | 1926.9 | 12.4 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 57.8 | 57.8 | $0.0635 |
| 64 | 117.4 | 117.4 | $0.0313 |

H100 (80 GB VRAM)

💵 Cost per hour: $2.50 (Together.ai, used for cost calculation)

Summary:

Top performance per token. Best overall for LLM and OCR (fastest + lowest cost/1M tokens). ASR is still cheap but not as cost-efficient as L4 due to higher hourly price.

  • Max context length: 50,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.16 | 75.95 | 6.28 | 0.05 | $0.0037 | $4.30 | 1110.9 | 13.9 |
| 16 | 1.72 | 1016.1 | 8.48 | 0.14 | $0.0004 | $0.39 | 1112.6 | 14.2 |
| 32 | 3.05 | 1428.3 | 9.70 | 0.20 | $0.0002 | $0.22 | 1112.9 | 14.2 |
| 64 | 4.60 | 2117.7 | 13.09 | 0.76 | $0.0002 | $0.15 | 1113.6 | 13.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.92 | 206.26 | 1.026 | 0.225 | $0.0008 | $3.5409 | 797.8 | 6.4 |
| 17 | 7.68 | 1713.36 | 1.970 | 0.808 | $0.0001 | $0.4263 | 1084.0 | 5.7 |
| 32 | 9.42 | 2099.99 | 2.905 | 1.528 | $0.0001 | $0.3478 | 1227.9 | 6.4 |

  • Max context length: 32,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 109.7 | 4.3 | 1.06 | $0.0030 | $6.25 | 924.9 | 15.0 |
| 16 | 3.32 | 1571.4 | 4.5 | 0.45 | $0.0002 | $0.44 | 1403.2 | 15.0 |
| 32 | 5.92 | 2702.1 | 4.9 | 0.69 | $0.0001 | $0.25 | 1683.2 | 15.7 |
| 64 | 7.24 | 3370.1 | 7.4 | 2.74 | $0.0001 | $0.20 | 2016.4 | 16.6 |
| 128 | 6.81 | 3104.9 | 14.2 | 7.55 | $0.0001 | $0.22 | 2545.1 | 27.0 |

  • Max context length: 32,000

  • Assumption: prompt 534 tokens + response ~435 tokens

BF16 Precision:

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.31 | 149.4 | 3.2 | 0.12 | $0.0022 | $2.19 | 919.3 | 30.9 |
| 16 | 2.40 | 1044.8 | 6.2 | 0.30 | $0.0003 | $0.30 | 921.2 | 21.8 |
| 32 | 3.96 | 1718.3 | 7.4 | 0.23 | $0.0002 | $0.18 | 921.4 | 21.5 |
| 64 | 5.92 | 2616.1 | 10.0 | 0.51 | $0.0001 | $0.12 | 923.5 | 19.7 |
| 128 | 8.12 | 3574.4 | 14.6 | 1.44 | $0.0001 | $0.09 | 930.8 | 41.8 |
| 256 | 7.91 | 3442.4 | 29.0 | 13.0 | $0.0001 | $0.09 | 932.9 | 47.1 |

FP8 Precision (Higher Throughput):

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 32 | 4.41 | 1966.4 | 6.7 | 0.25 | $0.0002 | $0.16 | 904.6 | 62.5 |
| 64 | 6.97 | 2959.6 | 8.5 | 0.42 | $0.0001 | $0.10 | 906.2 | 65.9 |
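
In these tables, FP8 trades a small amount of numerical precision for roughly 13–14% more throughput at the same concurrency. In vLLM, FP8 weight quantization can be enabled with a single option, as in the hedged sketch below; Hopper-class hardware such as the H100 is required, and the model ID is a placeholder.

```python
# Hedged sketch: enabling on-the-fly FP8 weight quantization in vLLM on an
# H100. The model ID is a placeholder for the Typhoon checkpoint you deploy.
from vllm import LLM

llm = LLM(
    model="scb10x/typhoon-30b",   # placeholder Hugging Face repo ID
    quantization="fp8",           # requires Hopper-class (e.g., H100) hardware
    gpu_memory_utilization=0.90,
)
```
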
| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 416.5 | 416.5 | $0.0060 |
| 64 | 1416.0 | 1416.0 | $0.0018 |

GPU Comparison Overview (Best-Case Results)

| GPU (VRAM) | Hourly Cost | LLM (Gemma 12B) Best Conc. | LLM Req/sec | LLM Tokens/sec | LLM Cost / 1M Tokens | OCR 1.5 (2B) Best Conc. | OCR Req/sec | OCR Tokens/sec | OCR Cost / 1M Tokens | ASR Best Conc. | ASR Throughput (audio sec/sec) | ASR Est. $ / 1h Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTX 2000 Ada (16 GB) | $0.25 | 8 | 0.12 | 56.4 | $0.57 | 32 | 2.46 | 548.15 | $0.1234 | 64 | 981.1 | $0.0003 |
| L4 (24 GB) | $0.71 | 32 | 0.35 | 160.0 | $0.57 | 32 | 2.96 | 660.44 | $0.1638 | 64 | 1096.0 | $0.0006 |
| A100 (80 GB) | $3.67 | 32 | 1.46 | 725.6 | $0.67 | 32 | 5.53 | 1232.65 | $0.3840 | 64 | 117.4 | $0.0313 |
| H100 (80 GB) | $2.50 | 64 | 2.84 | 1340.5 | $0.24 | 32 | 9.42 | 2099.99 | $0.3478 | 64 | 1416.0 | $0.0018 |

Quick Insights:

  • Best value for LLMs: H100 remains the top choice — fastest throughput and lowest cost per token for Typhoon 2.1 Gemma 12B.

  • Best value for OCR (Typhoon OCR 1.5, 2B):

    • L4 provides the best price-performance balance, delivering strong throughput at a low GPU cost.
    • RTX 2000 Ada is surprisingly competitive and extremely cheap per million tokens, making it a great fit for smaller workloads.
    • H100 and A100 achieve the highest raw throughput, ideal for large enterprise pipelines.
  • Best value for ASR: RTX 2000 Ada and L4 deliver the lowest cost per audio hour by a large margin, making them ideal for real-time or batch transcription.
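
Encoded as data, the best-case figures above make the trade-off easy to query. The small helper below (names are illustrative) picks the cheapest GPU per workload and reproduces the insights above.

```python
# Best-case unit costs from the comparison table, keyed by GPU.
BEST_CASE = {
    # GPU: (hourly $, LLM $/1M tokens, OCR $/1M tokens, ASR $/1h audio)
    "RTX 2000 Ada": (0.25, 0.57, 0.1234, 0.0003),
    "L4":           (0.71, 0.57, 0.1638, 0.0006),
    "A100":         (3.67, 0.67, 0.3840, 0.0313),
    "H100":         (2.50, 0.24, 0.3478, 0.0018),
}
COLUMN = {"llm": 1, "ocr": 2, "asr": 3}

def cheapest_gpu(workload: str) -> str:
    """Return the GPU with the lowest unit cost for the given workload."""
    return min(BEST_CASE, key=lambda gpu: BEST_CASE[gpu][COLUMN[workload]])

print(cheapest_gpu("llm"))  # H100
print(cheapest_gpu("ocr"))  # RTX 2000 Ada
print(cheapest_gpu("asr"))  # RTX 2000 Ada
```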

For consistency, all benchmarks were run with the following setup: