# Typhoon Model Inference Benchmarks & Hardware Recommendations

This guide covers minimum hardware requirements, benchmarks, and recommendations for running inference with Typhoon models.
Last Update: 14 November 2025
## Minimum Requirements for Running Typhoon Models

### For ASR Real-Time

- This model requires only a CPU and 8 GB of RAM; more CPU cores allow higher concurrency.
### Typhoon 2.1 Gemma 12B

When deploying larger Typhoon models such as Typhoon OCR (7B) and Typhoon 2.1 Gemma (12B) in the cloud, the choice of GPU becomes critical because of their higher VRAM and compute requirements. Each cloud provider offers different GPU families, and availability may also vary by region.
- AWS → Commonly provides L4 instances, suitable for high-throughput inference. For larger workloads or lower latency, A100 and H100 instances are also available in select regions.
- GCP → Offers L4 GPUs as the most accessible option for inference, with A100 and H100 available for enterprise-scale workloads.
- Azure → Typically provides A100 GPUs as the standard option for running models of this size, with H100 also available in specific regions for heavier workloads.
In practice, this means that:
- If you’re on AWS or GCP, the L4 is the go-to choice for production inference.
- If you’re on Azure, you’ll likely need to provision an A100 instance.
- For enterprise-grade inference at scale, all providers support A100 or H100 instances, though these come at a higher cost.
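The VRAM pressure behind these GPU choices can be estimated with a standard rule of thumb: weight memory ≈ parameter count × bytes per parameter, with KV cache and runtime overhead on top. A minimal sketch (the helper name is ours, not part of any Typhoon tooling):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM needed just to hold the model weights.

    bytes_per_param: 2.0 for BF16/FP16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    KV cache and framework overhead are NOT included, so real usage is higher.
    """
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

print(weight_vram_gb(12))       # 24.0 GB of weights alone -> why 12B targets 24 GB+ GPUs
print(weight_vram_gb(12, 0.5))  # 6.0 GB -> why a 4-bit quantized build fits a laptop
```

This is why the 12B model lands on L4/A100-class hardware in BF16, while quantized builds run on consumer machines.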
### For Typhoon OCR (2B, 3B, 7B)

Typhoon now offers three OCR model sizes optimized for different deployment environments. The newest Typhoon OCR 1.5 (2B) is significantly lighter and more efficient, making it the default recommendation for most users.
Below is the updated guidance:
#### Typhoon OCR 1.5 (2B) — Recommended Default

Runs on:

- CPU-only servers (slow)
- Mac M1/M2 (8–16 GB RAM)
- Consumer GPUs (RTX 3060/4060 and up)
- Cloud L4 (best price–performance)
VRAM required: 8–12 GB
Best for:
- High-throughput workloads
- Cost-sensitive deployments
- Real-time OCR pipelines
- On-premise deployments without large GPUs
Key benefits:

- Up to 2–3× higher throughput than OCR 3B
- Much lower running cost on L4, A100, and H100
| Model | Parameters | VRAM Needed | Hardware Tier | Notes |
|---|---|---|---|---|
| Typhoon OCR 1.5 (2B) | 2B | 8–12 GB | CPU / Mac / L4 / Mid-range GPUs | Best cost-performance. New default. |
| Typhoon OCR 3B | 3B | 12–16 GB | Mac 16GB / RTX 30xx+ / L4 | Mid-tier model. |
| Typhoon OCR 7B | 7B | ≥24 GB | RTX 4090 / A100 / H100 | Solid accuracy, highest compute. |
## Summary: Typhoon Inference Hardware Guide

| Model | Size | Local Dev (Laptop / Consumer GPU) | Recommended Hardware (Server/Enterprise) | Cloud GPU Equivalent | Notes |
|---|---|---|---|---|---|
| Typhoon ASR Real-Time | ~1B | ✅ Runs on CPU-only laptops with ≥8 GB RAM | Multi-core CPU servers (more cores = more concurrency) | N/A (GPU not required) | Lightweight, real-time speech recognition. Optimized for CPU. |
| Typhoon OCR 1.5 (2B) | 2B | ✅ 8–12 GB | CPU / Mac / Mid-range GPUs | L4 | Best cost-performance. New OCR default. |
| Typhoon Text (Gemma 2.1) | 12B | ⚠️ Runs on RTX 3090/4090 (≥24 GB VRAM); or on laptops using the quantized version | A100 40GB, L4 | AWS L4, GCP L4, Azure A100 | Ideal for production inference with medium latency. |
| Typhoon Text (Typhoon 2.5) | 30B | ⚠️ Runs on high-RAM laptops (≥32 GB RAM) via Ollama quantized version (CPU-only inference) | A100 80GB, H100 80GB | AWS/GCP/Azure A100 or H100 | Large 30B model; production on A100/H100; best on H100. |
## Running Typhoon Models — Test Results on Popular GPUs

We benchmarked Typhoon models on four popular NVIDIA GPUs in cloud environments. These GPUs are not the only ones compatible with Typhoon; other GPUs with similar specs should deliver comparable results.
- RTX 2000 Ada (16 GB VRAM)
- L4 (24 GB VRAM)
- A100 (80 GB VRAM)
- H100 (80 GB VRAM)
### Metrics

- **Throughput Metrics:**
  - Requests / sec
  - Tokens / sec
- **Latency Metrics:**
  - Avg Latency (sec)
  - Avg TTFT (time to first token) (sec)
- **Cost Metrics:**
  - Cost / million tokens (USD)
  - Cost / request
- **Resource Metrics:**
  - Peak Memory (MB)
  - Avg CPU (%)
The results below reflect our test setup and assumptions about model usage. Your actual performance may vary depending on workload and configuration.
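The cost figures in the tables below appear to follow directly from the GPU hourly price and the measured throughput. A sketch of the arithmetic (the function names are ours; "tokens per request" counts prompt + response):

```python
def cost_per_request(gpu_hourly_usd: float, requests_per_sec: float) -> float:
    """Hourly GPU price spread over the requests completed in one hour."""
    return gpu_hourly_usd / (requests_per_sec * 3600)

def cost_per_million_tokens(gpu_hourly_usd: float, requests_per_sec: float,
                            tokens_per_request: int) -> float:
    """Cost per request divided by tokens per request, scaled to 1M tokens."""
    return cost_per_request(gpu_hourly_usd, requests_per_sec) / tokens_per_request * 1e6

# A100 at $3.67/h running Gemma 12B at concurrency 16: 1.32 req/s,
# 512 prompt + 512 response = 1024 tokens per request.
print(round(cost_per_request(3.67, 1.32), 4))               # 0.0008 $/request
print(round(cost_per_million_tokens(3.67, 1.32, 1024), 2))  # 0.75 $/1M tokens
```

These reproduce, to within rounding of the reported req/s, the $0.0008 per request and $0.76 per 1M tokens in the A100 table below.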
### RTX 2000 Ada (16 GB VRAM)

💵 Cost per hour (RunPod): $0.25

Summary: Best for ASR/OCR on a budget and local/dev work. Ultra-cheap to run; OK throughput for OCR; LLM latency is high, so not ideal for large text models.

#### Typhoon 2.1 Gemma3 12B

- Max context length: 8,000
- Assumption: prompt 512 tokens + response 512 tokens
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 14.5 | 28.1 | 0.5 | $0.0020 | $2.06 | 1047.0 | 7.3 |
| 4 | 0.11 | 50.0 | 34.5 | 0.4 | $0.0006 | $0.63 | 894.5 | 23.1 |
| 8 | 0.12 | 56.1 | 63.7 | 18.9 | $0.0006 | $0.58 | 897.5 | 13.8 |
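As a sanity check on these numbers, throughput × average latency should approximately recover the concurrency level (Little's law). Applying it to the rows above:

```python
# (concurrency, requests/sec, avg latency s) from the RTX 2000 Ada / Gemma 12B table
rows = [(1, 0.04, 28.1), (4, 0.11, 34.5), (8, 0.12, 63.7)]
for concurrency, rps, latency_s in rows:
    in_flight = rps * latency_s  # Little's law: L = lambda * W
    print(concurrency, round(in_flight, 1))
```

The concurrency-8 row also shows the saturation signature: throughput barely improves over concurrency 4 while TTFT jumps to ~19 s, meaning requests are queueing rather than being served faster.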
#### Typhoon OCR 1.5 (2B)

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.20 | 45.22 | 4.898 | 0.164 | $0.00030 | $1.495300 | 812.4 | 16.5 |
| 17 | 1.96 | 436.22 | 8.161 | 0.882 | $0.00000 | $0.155000 | 1150.2 | 17.7 |
| 32 | 2.46 | 548.15 | 11.490 | 1.871 | $0.00000 | $0.123400 | 1122.3 | 17.4 |
#### Typhoon OCR 3B

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.06 | 30.8 | 16.5 | 0.18 | $0.0012 | $2.23 | 858.4 | 8.7 |
| 17 | 0.86 | 382.9 | 17.3 | 0.44 | $0.0001 | $0.18 | 1248.3 | 16.2 |
| 32 | 1.34 | 678.9 | 21.7 | 0.84 | $0.00004 | $0.10 | 1656.3 | 23.4 |
#### Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 402.4 | 402.4 | $0.0006 |
| 64 | 981.1 | 981.1 | $0.0003 |
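iRTF (inverse real-time factor) is how many seconds of audio are transcribed per wall-clock second, so one hour of audio takes 1/iRTF GPU-hours and costs the GPU's hourly price divided by iRTF. A quick check against the table above (function name is ours):

```python
def cost_per_audio_hour(gpu_hourly_usd: float, irtf: float) -> float:
    """One hour of audio takes 1/irtf GPU-hours to transcribe."""
    return gpu_hourly_usd / irtf

# RTX 2000 Ada at $0.25/h, concurrency 64 (iRTF 981.1)
print(round(cost_per_audio_hour(0.25, 981.1), 4))  # 0.0003, matching the table
```

The same formula reproduces the ASR cost columns for the other GPUs in this guide.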
### L4 (24 GB VRAM)

💵 Cost per hour: $0.71 (GCP, used for cost calculation) | $0.42 (RunPod, test environment)

Summary: A great production sweet spot. Strong value for LLM (12B) at 16–32 concurrency and very good for OCR. Cheapest ASR at scale among the cloud GPUs tested.

#### Typhoon 2.1 Gemma3 12B

- Max context length: 16,000
- Assumption: prompt 512 tokens + response 512 tokens
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.03 | 16.4 | 28.5 | 0.51 | $0.0057 | $5.62 | 918.8 | 13.7 |
| 16 | 0.36 | 168.3 | 41.2 | 0.51 | $0.0005 | $0.54 | 900.4 | 12.6 |
| 32 | 0.47 | 218.9 | 63.8 | 6.49 | $0.0004 | $0.41 | 900.3 | 14.4 |
#### Typhoon OCR 1.5 (2B)

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.22 | 49.16 | 4.497 | 0.267 | $0.00050 | $2.200800 | 797.8 | 10.4 |
| 17 | 2.17 | 484.49 | 7.234 | 0.976 | $0.00010 | $0.223300 | 1194.6 | 8.7 |
| 32 | 2.96 | 660.44 | 9.898 | 2.185 | $0.00000 | $0.163800 | 1251.1 | 7.8 |
#### Typhoon OCR 7B

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 16.4 | 27.5 | 0.81 | $0.0054 | $11.88 | 858.5 | 11.5 |
| 17 | 0.53 | 211.4 | 30.2 | 0.46 | $0.0004 | $0.92 | 1270.3 | 13.3 |
| 32 | 0.84 | 391.7 | 35.4 | 1.53 | $0.0002 | $0.50 | 1490.0 | 13.1 |
#### Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 312.8 | 312.8 | $0.0023 |
| 64 | 1096.0 | 1096.0 | $0.0006 |
### A100 (80 GB VRAM)

💵 Cost per hour: $3.67 (Azure, used for cost calculation) | $1.19 (RunPod, test environment)

Summary: Enterprise workhorse. Scales well for both LLM and OCR, with solid latency and high throughput. Costs more per hour, so it shines when you can keep it busy.

#### Typhoon 2.1 Gemma3 12B

- Max context length: 50,000
- Assumption: prompt 512 tokens + response 512 tokens
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.13 | 64.3 | 7.7 | 0.36 | $0.0079 | $7.62 | 902.8 | 10.7 |
| 16 | 1.32 | 625.8 | 11.3 | 0.31 | $0.0008 | $0.76 | 902.4 | 9.2 |
| 32 | 1.89 | 879.5 | 16.1 | 0.42 | $0.0005 | $0.53 | 903.5 | 9.9 |
| 64 | 2.21 | 1033.4 | 27.8 | 0.77 | $0.0005 | $0.45 | 904.6 | 13.1 |
#### Typhoon OCR 1.5 (2B)

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.69 | 154.06 | 1.404 | 0.173 | $0.00070 | $3.072600 | 785.1 | 4.6 |
| 17 | 4.63 | 1032.01 | 3.225 | 1.199 | $0.00010 | $0.458700 | 1112.6 | 5.2 |
| 32 | 5.53 | 1232.65 | 5.043 | 2.440 | $0.00010 | $0.384000 | 1109.9 | 4.8 |
#### Typhoon OCR 7B

- Max context length: 32,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.14 | 66.7 | 6.9 | 1.09 | $0.0071 | $15.08 | 722.9 | 12.0 |
| 16 | 1.98 | 917.9 | 7.4 | 0.49 | $0.0005 | $1.10 | 1080.3 | 5.7 |
| 32 | 3.82 | 1327.5 | 7.6 | 0.90 | $0.0003 | $0.75 | 1406.1 | 12.8 |
| 64 | 4.31 | 1848.0 | 12.3 | 3.14 | $0.0002 | $0.54 | 1926.9 | 12.4 |
#### Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 57.8 | 57.8 | $0.0635 |
| 64 | 117.4 | 117.4 | $0.0313 |
### H100 (80 GB VRAM)

💵 Cost per hour: $2.50 (Together.ai, used for cost calculation)

Summary: Top performance per token. Best overall for LLM and OCR (fastest, with the lowest cost per 1M tokens). ASR is still cheap, but not as cost-efficient as on the L4 due to the higher hourly price.

#### Typhoon 2.1 Gemma3 12B

- Max context length: 50,000
- Assumption: prompt 512 tokens + response 512 tokens
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.16 | 75.95 | 6.28 | 0.05 | $0.0037 | $4.30 | 1110.9 | 13.9 |
| 16 | 1.72 | 1016.1 | 8.48 | 0.14 | $0.0004 | $0.39 | 1112.6 | 14.2 |
| 32 | 3.05 | 1428.3 | 9.70 | 0.20 | $0.0002 | $0.22 | 1112.9 | 14.2 |
| 64 | 4.60 | 2117.7 | 13.09 | 0.76 | $0.0002 | $0.15 | 1113.6 | 13.4 |
#### Typhoon OCR 1.5 (2B)

- Max context length: 16,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.92 | 206.26 | 1.026 | 0.225 | $0.00080 | $3.540900 | 797.8 | 6.4 |
| 17 | 7.68 | 1713.36 | 1.970 | 0.808 | $0.00010 | $0.426300 | 1084.0 | 5.7 |
| 32 | 9.42 | 2099.99 | 2.905 | 1.528 | $0.00010 | $0.347800 | 1227.9 | 6.4 |
#### Typhoon OCR 7B

- Max context length: 32,000
- Assumption: 1 input image → ~512 tokens output
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 109.7 | 4.3 | 1.06 | $0.0030 | $6.25 | 924.9 | 15.0 |
| 16 | 3.32 | 1571.4 | 4.5 | 0.45 | $0.0002 | $0.44 | 1403.2 | 15.0 |
| 32 | 5.92 | 2702.1 | 4.9 | 0.69 | $0.0001 | $0.25 | 1683.2 | 15.7 |
| 64 | 7.24 | 3370.1 | 7.4 | 2.74 | $0.0001 | $0.20 | 2016.4 | 16.6 |
| 128 | 6.81 | 3104.9 | 14.2 | 7.55 | $0.0001 | $0.22 | 2545.1 | 27.0 |
#### Typhoon 2.5 30B A3B

- Max context length: 32,000
- Assumption: prompt 534 tokens + response ~435 tokens

BF16 Precision:
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.31 | 149.4 | 3.2 | 0.12 | $0.0022 | $2.19 | 919.3 | 30.9 |
| 16 | 2.40 | 1044.8 | 6.2 | 0.30 | $0.0003 | $0.30 | 921.2 | 21.8 |
| 32 | 3.96 | 1718.3 | 7.4 | 0.23 | $0.0002 | $0.18 | 921.4 | 21.5 |
| 64 | 5.92 | 2616.1 | 10.0 | 0.51 | $0.0001 | $0.12 | 923.5 | 19.7 |
| 128 | 8.12 | 3574.4 | 14.6 | 1.44 | $0.0001 | $0.09 | 930.8 | 41.8 |
| 256 | 7.91 | 3442.4 | 29.0 | 13.0 | $0.0001 | $0.09 | 932.9 | 47.1 |
FP8 Precision (Higher Throughput):
| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 32 | 4.41 | 1966.4 | 6.7 | 0.25 | $0.0002 | $0.16 | 904.6 | 62.5 |
| 64 | 6.97 | 2959.6 | 8.5 | 0.42 | $0.0001 | $0.10 | 906.2 | 65.9 |
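From the two tables above, the FP8 build delivers roughly a 13–14% token-throughput gain over BF16 at the same concurrency. A quick computation (values copied from the tables):

```python
bf16_tps = {32: 1718.3, 64: 2616.1}  # tokens/sec at each concurrency, BF16 table
fp8_tps  = {32: 1966.4, 64: 2959.6}  # tokens/sec at each concurrency, FP8 table
for c in (32, 64):
    print(c, round(fp8_tps[c] / bf16_tps[c], 2))  # FP8-over-BF16 speedup factor
```

The gain comes at noticeably higher average CPU utilization, which is worth budgeting for on shared hosts.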
#### Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 416.5 | 416.5 | $0.0060 |
| 64 | 1416.0 | 1416.0 | $0.0018 |
## GPU Comparison Overview (Best-Case Results)

| GPU (VRAM) | Hourly Cost | LLM (Gemma 12B) – Best Concurrency | Req/sec | Tokens/sec | Cost / 1M Tokens | OCR 1.5 (2B) – Best Concurrency | Req/sec | Tokens/sec | Cost / 1M Tokens | ASR – Best Concurrency | Throughput (audio sec/sec) | Est. $ / 1h Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTX 2000 Ada (16 GB) | $0.25 | 8 | 0.12 | 56.4 | $0.57 | 32 | 2.46 | 548.15 | $0.1234 | 64 | 981.1 | $0.0003 |
| L4 (24 GB) | $0.71 | 32 | 0.35 | 160.0 | $0.57 | 32 | 2.96 | 660.44 | $0.1638 | 64 | 1096.0 | $0.0006 |
| A100 (80 GB) | $3.67 | 32 | 1.46 | 725.6 | $0.67 | 32 | 5.53 | 1232.65 | $0.3840 | 64 | 117.4 | $0.0313 |
| H100 (80 GB) | $2.50 | 64 | 2.84 | 1340.5 | $0.24 | 32 | 9.42 | 2099.99 | $0.3478 | 64 | 1416.0 | $0.0018 |
Quick Insights:

- Best value for LLMs: H100 remains the top choice — fastest throughput and lowest cost per token for Typhoon 2.1 Gemma 12B.
- Best value for OCR (Typhoon OCR 1.5, 2B):
  - L4 provides the best price–performance balance, delivering strong throughput at a low GPU cost.
  - RTX 2000 Ada is surprisingly competitive and extremely cheap per million tokens; great for smaller workloads.
  - H100 and A100 achieve the highest raw throughput, ideal for large enterprise pipelines.
- Best value for ASR: RTX 2000 Ada and L4 deliver the lowest cost per audio hour by a large margin, making them ideal for real-time or batch transcription.
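One way to make the LLM comparison concrete is generated tokens per dollar, computed from the summary table's tokens/sec and hourly cost (a utilization-agnostic sketch; it assumes the GPU is kept busy):

```python
# (tokens/sec at best concurrency, hourly cost in $) for Gemma 12B,
# taken from the summary table above.
gpus = {
    "RTX 2000 Ada": (56.4, 0.25),
    "L4": (160.0, 0.71),
    "A100": (725.6, 3.67),
    "H100": (1340.5, 2.50),
}
tokens_per_dollar = {name: tps * 3600 / cost for name, (tps, cost) in gpus.items()}
best = max(tokens_per_dollar, key=tokens_per_dollar.get)
print(best)  # H100, consistent with the "best value for LLMs" insight above
```

By this metric the H100 generates well over twice as many tokens per dollar as any other GPU tested, provided you can keep it saturated.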
## Other Setup Details

For consistency, all benchmarks were run with the following setup:

- Inference engine: vLLM, version v0.10.1.1
- Benchmarking repo & scripts: scb-10x/all-in-one-pref-benchmark