Typhoon Model Inference Benchmarks & Hardware Recommendations

This guide covers the minimum hardware required to run inference with Typhoon models, along with benchmarks and hardware recommendations.

Last Update: 14 November 2025

Minimum Requirements for Running Typhoon Models

  • Typhoon ASR Real-Time requires only a CPU and 8 GB of RAM; more CPU cores allow more concurrent requests

When deploying larger Typhoon models such as Typhoon OCR (7B) and Typhoon 2.1 Gemma (12B) in the cloud, the choice of GPU becomes critical because of their higher VRAM and compute requirements. Each cloud provider offers different GPU families, and availability may also vary by region.

  • AWS → Commonly provides L4 instances, suitable for high-throughput inference. For larger workloads or lower latency, A100 and H100 instances are also available in select regions.

  • GCP → Offers L4 GPUs as the most accessible option for inference, with A100 and H100 available for enterprise-scale workloads.

  • Azure → Typically provides A100 GPUs as the standard option for running models of this size, with H100 also available in specific regions for heavier workloads.

In practice, this means that:

  • If you’re on AWS or GCP, the L4 is the go-to choice for production inference.

  • If you’re on Azure, you’ll likely need to provision an A100 instance.

  • For enterprise-grade inference at scale, all providers support A100 or H100 instances, though these come at a higher cost.
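
As a concrete starting point, the sketch below shows how a model of this size might be served with vLLM on one of these GPUs. The model ID, precision, and memory settings are illustrative assumptions; check the model card for the exact repository name and your GPU's VRAM fit.

```python
# Hedged sketch: serving a 12B-class Typhoon model with vLLM on an
# A100-class GPU. The model ID below is a placeholder; in bf16 a 12B
# model will not fit on a 24 GB L4, so use a quantized variant there.
from vllm import LLM, SamplingParams

llm = LLM(
    model="scb10x/typhoon2.1-gemma3-12b",  # placeholder Hugging Face repo ID
    dtype="bfloat16",                      # half-precision weights
    gpu_memory_utilization=0.90,           # leave headroom for the CUDA context
    max_model_len=16000,                   # matches the context lengths benchmarked below
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Summarize the following paragraph: ..."], params)
print(outputs[0].outputs[0].text)
```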

Typhoon now offers three OCR model sizes optimized for different deployment environments. The newest Typhoon OCR 1.5 (2B) is significantly lighter and more efficient, making it the default recommendation for most users.

Below is the updated guidance:

Typhoon OCR 1.5 (2B) — Recommended Default

Runs on:

  • CPU-only servers (slow)

  • Mac M1/M2 (8–16 GB RAM)

  • Consumer GPUs (RTX 3060/4060 and up)

  • Cloud L4 (best price–performance)

VRAM required: 8–12 GB

Best for:

  • High-throughput workloads

  • Cost-sensitive deployments

  • Real-time OCR pipelines

  • On-premise deployments without large GPUs

Key benefits:

  • Up to 2–3× higher throughput than OCR 3B

  • Much lower running cost on L4, A100, H100
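
To make this concrete, here is a minimal sketch of calling a self-hosted Typhoon OCR 1.5 endpoint, assuming it is served behind an OpenAI-compatible API (for example via vLLM). The base URL and served-model name are placeholder assumptions.

```python
# Hedged sketch: sending one page image to a self-hosted Typhoon OCR 1.5
# endpoint through an OpenAI-compatible API. URL and model name are
# placeholders; adjust to your deployment.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="typhoon-ocr-1.5",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this document."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=512,  # matches the ~512-token output assumed in the benchmarks below
)
print(response.choices[0].message.content)
```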

| Model | Parameters | VRAM Needed | Hardware Tier | Notes |
|---|---|---|---|---|
| Typhoon OCR 1.5 (2B) | 2B | 8–12 GB | CPU / Mac / L4 / mid-range GPUs | Best cost-performance. New default. |
| Typhoon OCR 3B | 3B | 12–16 GB | Mac 16 GB / RTX 30xx+ / L4 | Mid-tier model. |
| Typhoon OCR 7B | 7B | ≥24 GB | RTX 4090 / A100 / H100 | Solid accuracy, highest compute. |
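
The VRAM tiers above follow from a standard back-of-the-envelope estimate: weights in 16-bit precision take about 2 bytes per parameter, and serving adds KV-cache and runtime overhead on top. The sketch below reproduces that arithmetic; the 1.5× overhead factor is an assumption, not a measured value.

```python
# Rough VRAM estimate: ~2 bytes/parameter for 16-bit weights, scaled by an
# assumed 1.5x factor for KV cache and runtime overhead.
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 2.0,
                     overhead: float = 1.5) -> float:
    return params_billion * bytes_per_param * overhead

for name, size_b in [("Typhoon OCR 1.5", 2), ("Typhoon OCR 3B", 3), ("Typhoon OCR 7B", 7)]:
    print(f"{name}: ~{estimate_vram_gb(size_b):.0f} GB VRAM")
# ~6 GB, ~9 GB, ~21 GB -- consistent with the 8-12 GB, 12-16 GB, and >=24 GB tiers.
```
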
| Model | Size | Local Dev (Laptop / Consumer GPU) | Recommended Hardware (Server/Enterprise) | Cloud GPU Equivalent | Notes |
|---|---|---|---|---|---|
| Typhoon ASR Real-Time | ~1B | ✅ Runs on CPU-only laptops with ≥8 GB RAM | Multi-core CPU servers (more cores = more concurrency) | N/A (GPU not required) | Lightweight, real-time speech recognition. Optimized for CPU. |
| Typhoon OCR 1.5 (2B) | 2B | ✅ 8–12 GB | CPU / Mac / mid-range GPUs | L4 | Best cost-performance. New OCR default. |
| Typhoon Text (Gemma 2.1) | 12B | ⚠️ Runs on RTX 3090/4090 (≥24 GB VRAM), or on laptops using the quantized version | A100 40 GB, L4 | AWS L4, GCP L4, Azure A100 | Ideal for production inference with medium latency. |
| Typhoon Text (Typhoon 2.5) | 30B | ⚠️ Runs on high-RAM laptops (≥32 GB RAM) via the Ollama quantized version (CPU-only inference) | A100 80 GB, H100 80 GB | AWS/GCP/Azure A100 or H100 | Large 30B model; production on A100/H100; best on H100. |

Running Typhoon Models — Test Results on Popular GPUs

We benchmarked Typhoon models on four popular NVIDIA GPUs in cloud environments. These GPUs are not the only ones compatible with Typhoon. Other GPUs with similar specs should deliver comparable results.

  • RTX 2000 Ada (16 GB VRAM)
  • L4 (24 GB VRAM)
  • A100 (80 GB VRAM)
  • H100 (80 GB VRAM)

Metrics

  • Throughput Metrics:

    • Requests / sec
    • Tokens / sec
  • Latency Metrics:

    • Avg Latency (sec)
    • Avg TTFT (time to first token) (sec)
  • Cost Metrics:

    • Cost/million tokens (dollars)
    • Cost/request
  • Resource Metrics:

    • Peak Memory (MB)
    • Avg CPU (%)
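
The two cost metrics follow directly from throughput and the GPU's hourly price. The sketch below reconstructs that arithmetic; note that the published tables apply their own token-accounting assumptions (prompt plus response), so their figures will not line up exactly with a naive output-token calculation.

```python
# Cost metrics derived from throughput: dollars per hour divided by the
# work done in that hour. Token accounting here is a simplification.
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000

def cost_per_request(hourly_usd: float, requests_per_sec: float) -> float:
    return hourly_usd / (requests_per_sec * 3600)

# Example: a $2.50/hour GPU generating 2,117.7 tokens/sec at 4.60 req/sec
print(f"${cost_per_million_tokens(2.50, 2117.7):.2f} per 1M output tokens")  # ~$0.33
print(f"${cost_per_request(2.50, 4.60):.5f} per request")                    # ~$0.00015
```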

The results below reflect our test setup and assumptions about model usage. Your actual performance may vary depending on workload and configuration.
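
A load test of this shape can be approximated with a small async client against an OpenAI-compatible streaming endpoint. The sketch below is illustrative only; the endpoint, model name, and prompt are assumptions, not the harness used to produce these tables.

```python
# Illustrative async load test: fire N concurrent streaming requests and
# record TTFT and end-to-end latency for each. All identifiers are placeholders.
import asyncio
import time
from statistics import mean

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> tuple[float, float]:
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="typhoon-model",  # placeholder served-model name
        messages=[{"role": "user", "content": "Write a short story."}],
        max_tokens=512,
        stream=True,
    )
    async for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    latency = time.perf_counter() - start       # end-to-end latency
    return (ttft if ttft is not None else latency), latency

async def main(concurrency: int = 8) -> None:
    results = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    print(f"avg TTFT: {mean(t for t, _ in results):.2f}s")
    print(f"avg latency: {mean(l for _, l in results):.2f}s")

asyncio.run(main())
```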

RTX 2000 Ada (16 GB VRAM)

💵 Cost per hour (RunPod): $0.25

Summary:

Best for ASR/OCR on a budget and local/dev work. Ultra-cheap to run; OK throughput for OCR; LLM latency is high, so not ideal for large text models.

  • Max context length: 8,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 14.5 | 28.1 | 0.5 | $0.0020 | $2.06 | 1047.0 | 7.3 |
| 4 | 0.11 | 50.0 | 34.5 | 0.4 | $0.0006 | $0.63 | 894.5 | 23.1 |
| 8 | 0.12 | 56.1 | 63.7 | 18.9 | $0.0006 | $0.58 | 897.5 | 13.8 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.20 | 45.22 | 4.898 | 0.164 | $0.0003 | $1.4953 | 812.4 | 16.5 |
| 17 | 1.96 | 436.22 | 8.161 | 0.882 | $0.0000 | $0.1550 | 1150.2 | 17.7 |
| 32 | 2.46 | 548.15 | 11.490 | 1.871 | $0.0000 | $0.1234 | 1122.3 | 17.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.06 | 30.8 | 16.5 | 0.18 | $0.0012 | $2.23 | 858.4 | 8.7 |
| 17 | 0.86 | 382.9 | 17.3 | 0.44 | $0.0001 | $0.18 | 1248.3 | 16.2 |
| 32 | 1.34 | 678.9 | 21.7 | 0.84 | $0.00004 | $0.10 | 1656.3 | 23.4 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 402.4 | 402.4 | $0.0006 |
| 64 | 981.1 | 981.1 | $0.0003 |
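
iRTF (inverse real-time factor) is the hours of audio processed per hour of wall-clock time, so the estimated cost of one audio hour is simply the GPU's hourly price divided by iRTF. A quick check against the table:

```python
# Estimated cost of transcribing one hour of audio: hourly GPU price
# divided by iRTF (audio hours processed per wall-clock hour).
def cost_per_audio_hour(hourly_usd: float, irtf: float) -> float:
    return hourly_usd / irtf

print(round(cost_per_audio_hour(0.25, 981.1), 4))  # 0.0003 -- matches the row above
```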

L4 (24 GB VRAM)

💵 Cost per hour: $0.71 (GCP, used for cost calculation) | $0.42 (RunPod, test environment)

Summary:

A great production sweet spot. Strong value for LLM (12B) at 16–32 concurrency and very good for OCR. Cheapest ASR at scale among cloud GPUs tested.

  • Max context length: 16,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.03 | 16.4 | 28.5 | 0.51 | $0.0057 | $5.62 | 918.8 | 13.7 |
| 16 | 0.36 | 168.3 | 41.2 | 0.51 | $0.0005 | $0.54 | 900.4 | 12.6 |
| 32 | 0.47 | 218.9 | 63.8 | 6.49 | $0.0004 | $0.41 | 900.3 | 14.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.22 | 49.16 | 4.497 | 0.267 | $0.0005 | $2.2008 | 797.8 | 10.4 |
| 17 | 2.17 | 484.49 | 7.234 | 0.976 | $0.0001 | $0.2233 | 1194.6 | 8.7 |
| 32 | 2.96 | 660.44 | 9.898 | 2.185 | $0.0000 | $0.1638 | 1251.1 | 7.8 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 16.4 | 27.5 | 0.81 | $0.0054 | $11.88 | 858.5 | 11.5 |
| 17 | 0.53 | 211.4 | 30.2 | 0.46 | $0.0004 | $0.92 | 1270.3 | 13.3 |
| 32 | 0.84 | 391.7 | 35.4 | 1.53 | $0.0002 | $0.50 | 1490.0 | 13.1 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 312.8 | 312.8 | $0.0023 |
| 64 | 1096.0 | 1096.0 | $0.0006 |

A100 (80 GB VRAM)

💵 Cost per hour: $3.67 (Azure, used for cost calculation) | $1.19 (RunPod, test environment)

Summary:

Enterprise workhorse. Scales well for both LLM and OCR, with solid latency and high throughput. Costs more per hour, so shines when you can keep it busy.

  • Max context length: 50,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.13 | 64.3 | 7.7 | 0.36 | $0.0079 | $7.62 | 902.8 | 10.7 |
| 16 | 1.32 | 625.8 | 11.3 | 0.31 | $0.0008 | $0.76 | 902.4 | 9.2 |
| 32 | 1.89 | 879.5 | 16.1 | 0.42 | $0.0005 | $0.53 | 903.5 | 9.9 |
| 64 | 2.21 | 1033.4 | 27.8 | 0.77 | $0.0005 | $0.45 | 904.6 | 13.1 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.69 | 154.06 | 1.404 | 0.173 | $0.0007 | $3.0726 | 785.1 | 4.6 |
| 17 | 4.63 | 1032.01 | 3.225 | 1.199 | $0.0001 | $0.4587 | 1112.6 | 5.2 |
| 32 | 5.53 | 1232.65 | 5.043 | 2.440 | $0.0001 | $0.3840 | 1109.9 | 4.8 |

  • Max context length: 32,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.14 | 66.7 | 6.9 | 1.09 | $0.0071 | $15.08 | 722.9 | 12.0 |
| 16 | 1.98 | 917.9 | 7.4 | 0.49 | $0.0005 | $1.10 | 1080.3 | 5.7 |
| 32 | 3.82 | 1327.5 | 7.6 | 0.90 | $0.0003 | $0.75 | 1406.1 | 12.8 |
| 64 | 4.31 | 1848.0 | 12.3 | 3.14 | $0.0002 | $0.54 | 1926.9 | 12.4 |

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 57.8 | 57.8 | $0.0635 |
| 64 | 117.4 | 117.4 | $0.0313 |

H100 (80 GB VRAM)

💵 Cost per hour: $2.50 (Together.ai, used for cost calculation)

Summary:

Top performance per token. Best overall for LLM and OCR (fastest + lowest cost/1M tokens). ASR is still cheap but not as cost-efficient as L4 due to higher hourly price.

  • Max context length: 50,000

  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.16 | 75.95 | 6.28 | 0.05 | $0.0037 | $4.30 | 1110.9 | 13.9 |
| 16 | 1.72 | 1016.1 | 8.48 | 0.14 | $0.0004 | $0.39 | 1112.6 | 14.2 |
| 32 | 3.05 | 1428.3 | 9.70 | 0.20 | $0.0002 | $0.22 | 1112.9 | 14.2 |
| 64 | 4.60 | 2117.7 | 13.09 | 0.76 | $0.0002 | $0.15 | 1113.6 | 13.4 |

  • Max context length: 16,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.92 | 206.26 | 1.026 | 0.225 | $0.0008 | $3.5409 | 797.8 | 6.4 |
| 17 | 7.68 | 1713.36 | 1.970 | 0.808 | $0.0001 | $0.4263 | 1084.0 | 5.7 |
| 32 | 9.42 | 2099.99 | 2.905 | 1.528 | $0.0001 | $0.3478 | 1227.9 | 6.4 |

  • Max context length: 32,000

  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 109.7 | 4.3 | 1.06 | $0.0030 | $6.25 | 924.9 | 15.0 |
| 16 | 3.32 | 1571.4 | 4.5 | 0.45 | $0.0002 | $0.44 | 1403.2 | 15.0 |
| 32 | 5.92 | 2702.1 | 4.9 | 0.69 | $0.0001 | $0.25 | 1683.2 | 15.7 |
| 64 | 7.24 | 3370.1 | 7.4 | 2.74 | $0.0001 | $0.20 | 2016.4 | 16.6 |
| 128 | 6.81 | 3104.9 | 14.2 | 7.55 | $0.0001 | $0.22 | 2545.1 | 27.0 |

  • Max context length: 32,000

  • Assumption: prompt 534 tokens + response ~435 tokens

BF16 Precision:

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.31 | 149.4 | 3.2 | 0.12 | $0.0022 | $2.19 | 919.3 | 30.9 |
| 16 | 2.40 | 1044.8 | 6.2 | 0.30 | $0.0003 | $0.30 | 921.2 | 21.8 |
| 32 | 3.96 | 1718.3 | 7.4 | 0.23 | $0.0002 | $0.18 | 921.4 | 21.5 |
| 64 | 5.92 | 2616.1 | 10.0 | 0.51 | $0.0001 | $0.12 | 923.5 | 19.7 |
| 128 | 8.12 | 3574.4 | 14.6 | 1.44 | $0.0001 | $0.09 | 930.8 | 41.8 |
| 256 | 7.91 | 3442.4 | 29.0 | 13.0 | $0.0001 | $0.09 | 932.9 | 47.1 |

FP8 Precision (Higher Throughput):

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M Tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 32 | 4.41 | 1966.4 | 6.7 | 0.25 | $0.0002 | $0.16 | 904.6 | 62.5 |
| 64 | 6.97 | 2959.6 | 8.5 | 0.42 | $0.0001 | $0.10 | 906.2 | 65.9 |
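
In these tables, FP8 trades a small amount of numerical precision for roughly 13–14% more throughput at the same concurrency. In vLLM, FP8 weight quantization can be enabled with a single option, as in the hedged sketch below; Hopper-class hardware such as the H100 is required, and the model ID is a placeholder.

```python
# Hedged sketch: enabling on-the-fly FP8 weight quantization in vLLM on an
# H100. The model ID is a placeholder for the Typhoon checkpoint you deploy.
from vllm import LLM

llm = LLM(
    model="scb10x/typhoon-30b",   # placeholder Hugging Face repo ID
    quantization="fp8",           # requires Hopper-class (e.g., H100) hardware
    gpu_memory_utilization=0.90,
)
```
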
| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h Audio |
|---|---|---|---|
| 1 | 416.5 | 416.5 | $0.0060 |
| 64 | 1416.0 | 1416.0 | $0.0018 |

GPU Comparison Overview (Best-Case Results)

| GPU (VRAM) | Hourly Cost | LLM (Gemma 12B) Best Conc. | LLM Req/sec | LLM Tokens/sec | LLM Cost / 1M Tokens | OCR 1.5 (2B) Best Conc. | OCR Req/sec | OCR Tokens/sec | OCR Cost / 1M Tokens | ASR Best Conc. | ASR Throughput (audio sec/sec) | ASR Est. $ / 1h Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTX 2000 Ada (16 GB) | $0.25 | 8 | 0.12 | 56.4 | $0.57 | 32 | 2.46 | 548.15 | $0.1234 | 64 | 981.1 | $0.0003 |
| L4 (24 GB) | $0.71 | 32 | 0.35 | 160.0 | $0.57 | 32 | 2.96 | 660.44 | $0.1638 | 64 | 1096.0 | $0.0006 |
| A100 (80 GB) | $3.67 | 32 | 1.46 | 725.6 | $0.67 | 32 | 5.53 | 1232.65 | $0.3840 | 64 | 117.4 | $0.0313 |
| H100 (80 GB) | $2.50 | 64 | 2.84 | 1340.5 | $0.24 | 32 | 9.42 | 2099.99 | $0.3478 | 64 | 1416.0 | $0.0018 |

Quick Insights:

  • Best value for LLMs: H100 remains the top choice — fastest throughput and lowest cost per token for Typhoon 2.1 Gemma 12B.

  • Best value for OCR (Typhoon OCR 1.5, 2B):

    • L4 provides the best price-performance balance, delivering strong throughput at a low GPU cost.
    • RTX 2000 Ada is surprisingly competitive and extremely cheap per million tokens, making it a great fit for smaller workloads.
    • H100 and A100 achieve the highest raw throughput, ideal for large enterprise pipelines.
  • Best value for ASR: RTX 2000 Ada and L4 deliver the lowest cost per audio hour by a large margin, making them ideal for real-time or batch transcription.
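
Encoded as data, the best-case figures above make the trade-off easy to query. The small helper below (names are illustrative) picks the cheapest GPU per workload and reproduces the insights above.

```python
# Best-case unit costs from the comparison table, keyed by GPU.
BEST_CASE = {
    # GPU: (hourly $, LLM $/1M tokens, OCR $/1M tokens, ASR $/1h audio)
    "RTX 2000 Ada": (0.25, 0.57, 0.1234, 0.0003),
    "L4":           (0.71, 0.57, 0.1638, 0.0006),
    "A100":         (3.67, 0.67, 0.3840, 0.0313),
    "H100":         (2.50, 0.24, 0.3478, 0.0018),
}
COLUMN = {"llm": 1, "ocr": 2, "asr": 3}

def cheapest_gpu(workload: str) -> str:
    """Return the GPU with the lowest unit cost for the given workload."""
    return min(BEST_CASE, key=lambda gpu: BEST_CASE[gpu][COLUMN[workload]])

print(cheapest_gpu("llm"))  # H100
print(cheapest_gpu("ocr"))  # RTX 2000 Ada
print(cheapest_gpu("asr"))  # RTX 2000 Ada
```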

For consistency, all benchmarks were run with the following setup: