Typhoon Model Inference Benchmarks & Hardware Recommendations
This guide covers the minimum hardware requirements for running inference with Typhoon models, along with benchmarks and hardware recommendations.
Last Update: September 2025
Minimum Requirements for Running Typhoon Models
For ASR Real-Time
- This model requires only a CPU and 8 GB of RAM; more CPU cores support higher concurrency.
For Typhoon OCR & Typhoon 2.1 Gemma 12B
When deploying larger Typhoon models such as Typhoon OCR (7B) and Typhoon 2.1 Gemma (12B) in the cloud, the choice of GPU becomes critical because of their higher VRAM and compute requirements. Each cloud provider offers different GPU families, and availability may also vary by region.
- AWS → Commonly provides L4 instances, suitable for high-throughput inference. For larger workloads or lower latency, A100 and H100 instances are also available in select regions.
- GCP → Offers L4 GPUs as the most accessible option for inference, with A100 and H100 available for enterprise-scale workloads.
- Azure → Typically provides A100 GPUs as the standard option for running models of this size, with H100 also available in specific regions for heavier workloads.

In practice, this means that:

- If you’re on AWS or GCP, the L4 is the go-to choice for production inference.
- If you’re on Azure, you’ll likely need to provision an A100 instance.
- For enterprise-grade inference at scale, all providers support A100 or H100 instances, though these come at a higher cost.
Summary: Typhoon Inference Hardware Guide
Model | Size | Local Dev (Laptop / Consumer GPU) | Recommended Hardware (Server/Enterprise) | Cloud GPU Equivalent | Notes |
---|---|---|---|---|---|
Typhoon ASR Real-Time | ~1B | ✅ Runs on CPU-only laptops with ≥8 GB RAM | Multi-core CPU servers (more cores = more concurrency) | N/A (GPU not required) | Lightweight, real-time speech recognition. Optimized for CPU. |
Typhoon OCR | 3B | ✅ Runs on Mac M1/M2 (16 GB RAM) or RTX 3060+ | 16 GB RAM CPU server or mid-tier GPU (≥16 GB VRAM) | Small GPU instances (e.g., AWS T4, L4 low config) | GPU accelerates throughput, but CPU is usable. |
Typhoon OCR | 7B | ⚠️ Needs high-VRAM GPU (RTX 4090, ≥24 GB VRAM) | A100 40GB, L4, enterprise-grade GPUs | AWS L4, GCP L4, Azure A100 | Large OCR workloads; not suitable for low-end laptops. |
Typhoon 2.1 Gemma | 12B | ⚠️ Runs on RTX 3090/4090 (≥24 GB VRAM), or on laptops using a quantized version | A100 40GB, L4 | AWS L4, GCP L4, Azure A100 | Ideal for production inference with medium latency. |
Running Typhoon Models — Test Results on Popular GPUs
We benchmarked Typhoon models on four popular NVIDIA GPUs in cloud environments. These GPUs are not the only ones compatible with Typhoon; other GPUs with similar specs should deliver comparable results.
- RTX 2000 Ada (16 GB VRAM)
- L4 (24 GB VRAM)
- A100 (80 GB VRAM)
- H100 (80 GB VRAM)
Metrics
- **Throughput Metrics:**
  - Requests / sec
  - Tokens / sec
- **Latency Metrics:**
  - Avg Latency (sec)
  - Avg TTFT (time to first token) (sec)
- **Cost Metrics:**
  - Cost / 1M tokens (USD)
  - Cost / request (USD)
- **Resource Metrics:**
  - Peak Memory (MB)
  - Avg CPU (%)
The results below reflect our test setup and assumptions about model usage. Your actual performance may vary depending on workload and configuration.
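As a sanity check, the cost columns in the tables below can be derived from throughput and the hourly GPU price. The following sketch assumes (our reading, not stated explicitly in the tables) that cost per million tokens is amortized over the total tokens per request, i.e. 512 prompt + 512 response = 1,024 tokens:

```python
# Reconstruct the cost columns from hourly GPU price and measured throughput.
# Assumption: cost per 1M tokens counts TOTAL tokens per request (1,024).

def cost_per_request(hourly_usd: float, requests_per_sec: float) -> float:
    """GPU-seconds consumed per request, priced at the per-second rate."""
    return hourly_usd / 3600.0 / requests_per_sec

def cost_per_million_tokens(hourly_usd: float, requests_per_sec: float,
                            tokens_per_request: int = 1024) -> float:
    """Per-request cost spread over the request's tokens, scaled to 1M tokens."""
    return cost_per_request(hourly_usd, requests_per_sec) / tokens_per_request * 1e6

# Example: H100 at $2.50/h serving Gemma3 12B at concurrency 64 (2.84 req/s)
print(round(cost_per_request(2.50, 2.84), 4))         # ≈ 0.0002
print(round(cost_per_million_tokens(2.50, 2.84), 2))  # ≈ 0.24
```

These reproduce the $0.0002 / $0.24 entries in the H100 table below, so the same arithmetic can be used to project costs for your own hourly rates.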
RTX 2000 Ada (16 GB VRAM)
💵 Cost per hour (RunPod): $0.25
Summary:
Best for ASR/OCR on a budget and local/dev work. Ultra-cheap to run; OK throughput for OCR; LLM latency is high, so not ideal for large text models.
Typhoon 2.1 Gemma3 12B
- Max context length: 8,000
- Assumption: prompt 512 tokens + response 512 tokens
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.03 | 13.8 | 30.8 | 2.2 | $0.0021 | $2.22 | 1047.0 | 7.3 |
8 | 0.12 | 56.4 | 62.0 | 21.8 | $0.0006 | $0.57 | 897.5 | 13.8 |
16 | 0.11 | 52.5 | 131.6 | 90.9 | $0.0006 | $0.61 | 897.3 | 13.0 |
Typhoon OCR 3B
- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.06 | 30.8 | 16.5 | 0.18 | $0.0012 | $2.23 | 858.4 | 8.7 |
17 | 0.86 | 382.9 | 17.3 | 0.44 | $0.0001 | $0.18 | 1248.3 | 16.2 |
32 | 1.34 | 678.9 | 21.7 | 0.84 | $0.00004 | $0.10 | 1656.3 | 23.4 |
Typhoon ASR Real-Time
Concurrency | Throughput (audio sec / sec) | iRTF (inverse real-time factor) | Est. Cost / 1h audio |
---|---|---|---|
1 | 402.4 | 402.4 | $0.0006 |
64 | 981.1 | 981.1 | $0.0003 |
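The ASR cost column follows directly from throughput: processing one hour of audio takes 3600 / throughput GPU-seconds, so the cost per audio-hour is simply the hourly price divided by throughput. A minimal sketch:

```python
# Estimated cost to transcribe one hour of audio.
# One hour of audio takes (3600 / throughput) GPU-seconds, so:
#   cost = hourly_price * (3600 / throughput) / 3600 = hourly_price / throughput

def cost_per_audio_hour(hourly_usd: float, audio_sec_per_sec: float) -> float:
    return hourly_usd / audio_sec_per_sec

# RTX 2000 Ada at $0.25/h, concurrency 64: 981.1 audio-seconds processed per second
print(round(cost_per_audio_hour(0.25, 981.1), 4))  # ≈ 0.0003
```

The same formula reproduces the ASR cost entries for the other GPUs in this guide.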
L4 (24 GB VRAM)
💵 Cost per hour: $0.71 (GCP, used for cost calculation) | $0.42 (RunPod, test environment)
Summary:
A great production sweet spot. Strong value for LLM (12B) at 16–32 concurrency and very good for OCR. Cheapest ASR at scale among cloud GPUs tested.
Typhoon 2.1 Gemma3 12B
- Max context length: 16,000
- Assumption: prompt 512 tokens + response 512 tokens
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.03 | 16.3 | 28.5 | 0.63 | $0.0056 | $5.62 | 918.8 | 13.7 |
16 | 0.30 | 142.2 | 51.7 | 8.7 | $0.0007 | $0.65 | 900.4 | 12.6 |
32 | 0.35 | 160.0 | 86.0 | 17.1 | $0.0006 | $0.57 | 900.3 | 6.1 |
Typhoon OCR 7B
- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.04 | 16.4 | 27.5 | 0.81 | $0.0054 | $11.88 | 858.5 | 11.5 |
17 | 0.53 | 211.4 | 30.2 | 0.46 | $0.0004 | $0.92 | 1270.3 | 13.3 |
32 | 0.84 | 391.7 | 35.4 | 1.53 | $0.0002 | $0.50 | 1490.0 | 13.1 |
Typhoon ASR Real-Time
Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
---|---|---|---|
1 | 312.8 | 312.8 | $0.0023 |
64 | 1096.0 | 1096.0 | $0.0006 |
A100 (80 GB VRAM)
💵 Cost per hour: $3.67 (Azure, used for cost calculation) | $1.19 (RunPod, test environment)
Summary:
Enterprise workhorse. Scales well for both LLM and OCR, with solid latency and high throughput. Costs more per hour, so shines when you can keep it busy.
Typhoon 2.1 Gemma3 12B
- Max context length: 50,000
- Assumption: prompt 512 tokens + response 512 tokens
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.10 | 43.2 | 10.1 | 0.62 | $0.0103 | $10.61 | 902.8 | 10.7 |
16 (run 1) | 0.35 | 162.3 | 43.7 | 12.1 | $0.0029 | $2.89 | 903.0 | 10.1 |
16 (run 2) | 0.96 | 477.1 | 15.6 | 0.81 | $0.0011 | $1.03 | 902.4 | 9.2 |
32 | 1.46 | 725.6 | 20.4 | 0.44 | $0.0007 | $0.67 | 903.5 | 9.9 |
64 | 1.80 | 900.5 | 32.0 | 1.14 | $0.0006 | $0.55 | 904.6 | 13.1 |
Typhoon OCR 7B
- Max context length: 32,000
- Assumption: 1 input image → ~512 tokens output
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.14 | 66.7 | 6.9 | 1.09 | $0.0071 | $15.08 | 722.9 | 12.0 |
16 | 1.98 | 917.9 | 7.4 | 0.49 | $0.0005 | $1.10 | 1080.3 | 5.7 |
32 | 3.82 | 1327.5 | 7.6 | 0.90 | $0.0003 | $0.75 | 1406.1 | 12.8 |
64 | 4.31 | 1848.0 | 12.3 | 3.14 | $0.0002 | $0.54 | 1926.9 | 12.4 |
Typhoon ASR Real-Time
Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
---|---|---|---|
1 | 57.8 | 57.8 | $0.0635 |
64 | 117.4 | 117.4 | $0.0313 |
H100 (80 GB VRAM)
💵 Cost per hour: $2.50 (Together.ai, used for cost calculation)
Summary:
Top performance per token. Best overall for LLM and OCR (fastest + lowest cost/1M tokens). ASR is still cheap but not as cost-efficient as L4 due to higher hourly price.
Typhoon 2.1 Gemma3 12B
- Max context length: 50,000
- Assumption: prompt 512 tokens + response 512 tokens
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.19 | 90.5 | 5.3 | 1.01 | $0.0037 | $3.61 | 1110.9 | 13.9 |
16 | 1.47 | 708.8 | 10.3 | 3.08 | $0.0005 | $0.46 | 1112.6 | 14.2 |
32 | 2.42 | 1131.7 | 12.5 | 4.62 | $0.0003 | $0.29 | 1112.9 | 14.2 |
64 | 2.84 | 1340.5 | 19.9 | 10.4 | $0.0002 | $0.24 | 1113.6 | 13.4 |
Typhoon OCR 7B
- Max context length: 32,000
- Assumption: 1 input image → ~512 tokens output
Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
---|---|---|---|---|---|---|---|---|
1 | 0.23 | 109.7 | 4.3 | 1.06 | $0.0030 | $6.25 | 924.9 | 15.0 |
16 | 3.32 | 1571.4 | 4.5 | 0.45 | $0.0002 | $0.44 | 1403.2 | 15.0 |
32 | 5.92 | 2702.1 | 4.9 | 0.69 | $0.0001 | $0.25 | 1683.2 | 15.7 |
64 | 7.24 | 3370.1 | 7.4 | 2.74 | $0.0001 | $0.20 | 2016.4 | 16.6 |
128 | 6.81 | 3104.9 | 14.2 | 7.55 | $0.0001 | $0.22 | 2545.1 | 27.0 |
Typhoon ASR Real-Time
Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
---|---|---|---|
1 | 416.5 | 416.5 | $0.0060 |
64 | 1416.0 | 1416.0 | $0.0018 |
GPU Comparison Overview (Best-Case Results)
GPU (VRAM) | Hourly Cost | LLM (Gemma 12B) – Best Concurrency | Req/sec | Tokens/sec | Cost / 1M Tokens | OCR (7B/3B) – Best Concurrency | Req/sec | Tokens/sec | Cost / 1M Tokens | ASR – Best Concurrency | Throughput (audio sec/sec) | Est. $ / 1h Audio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
RTX 2000 Ada (16 GB) | $0.25 | 8 | 0.12 | 56.4 | $0.57 | 32 (OCR 3B) | 1.34 | 678.9 | $0.10 | 64 | 981.1 | $0.0003 |
L4 (24 GB) | $0.71 | 32 | 0.35 | 160.0 | $0.57 | 32 (OCR 7B) | 0.84 | 391.7 | $0.50 | 64 | 1096.0 | $0.0006 |
A100 (80 GB) | $3.67 | 32 | 1.46 | 725.6 | $0.67 | 64 (OCR 7B) | 4.31 | 1848.0 | $0.54 | 64 | 117.4 | $0.0313 |
H100 (80 GB) | $2.50 | 64 | 2.84 | 1340.5 | $0.24 | 64 (OCR 7B) | 7.24 | 3370.1 | $0.20 | 64 | 1416.0 | $0.0018 |
Quick Insights:
- Best value for LLMs: H100 (fastest, lowest cost per token).
- Best value for OCR: H100 (massive throughput), with A100 also strong at scale.
- Best value for ASR: RTX 2000 Ada and L4 (super cheap per audio hour).
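The "best value" picks above can be reproduced mechanically from the comparison table. A minimal sketch, with the per-GPU costs copied from the table:

```python
# Pick the cheapest GPU per workload, using costs from the comparison table above.

llm_cost_per_1m_tokens = {          # USD / 1M tokens, at best concurrency
    "RTX 2000 Ada (16 GB)": 0.57,
    "L4 (24 GB)": 0.57,
    "A100 (80 GB)": 0.67,
    "H100 (80 GB)": 0.24,
}
asr_cost_per_audio_hour = {         # USD / 1h of audio, at best concurrency
    "RTX 2000 Ada (16 GB)": 0.0003,
    "L4 (24 GB)": 0.0006,
    "A100 (80 GB)": 0.0313,
    "H100 (80 GB)": 0.0018,
}

best_llm = min(llm_cost_per_1m_tokens, key=llm_cost_per_1m_tokens.get)
best_asr = min(asr_cost_per_audio_hour, key=asr_cost_per_audio_hour.get)
print(best_llm)  # H100 (80 GB)
print(best_asr)  # RTX 2000 Ada (16 GB)
```

Note that cost per token is only half the picture: if your traffic cannot keep a large GPU busy, a cheaper card with worse per-token cost may still be the better choice.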
Other Setup Details
For consistency, all benchmarks were run with the following setup:

- Inference engine: vLLM, version v0.10.1.1
- Benchmarking repo & scripts: scb-10x/all-in-one-pref-benchmark
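As an illustration of the serving side of this setup, a vLLM launch matching the OCR 7B benchmark configuration (16,000-token context) might look like the following. The Hugging Face model ID is an assumption here; substitute the exact ID from the model card:

```shell
# Illustrative vLLM launch (config sketch, not the exact benchmark command).
# Model ID is an assumption; replace with the actual Hugging Face repo ID.
vllm serve scb10x/typhoon-ocr-7b \
  --max-model-len 16000 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

This starts an OpenAI-compatible server that benchmark scripts can send requests to; concurrency in the tables above corresponds to the number of simultaneous requests issued against it.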