Typhoon Model Inference Benchmarks & Hardware Recommendations

This guide covers the minimum hardware required to run inference with Typhoon models, along with benchmarks and hardware recommendations.

Last Update: September 2025

Minimum Requirements for Running Typhoon Models

  • Typhoon ASR Real-Time (~1B) requires only a CPU and 8 GB of RAM; more CPU cores allow more concurrent requests

When deploying larger Typhoon models such as Typhoon OCR (7B) and Typhoon 2.1 Gemma (12B) in the cloud, the choice of GPU becomes critical because of their higher VRAM and compute requirements. Each cloud provider offers different GPU families, and availability may also vary by region.

  • AWS → Commonly provides L4 instances, suitable for high-throughput inference. For larger workloads or lower latency, A100 and H100 instances are also available in select regions.

  • GCP → Offers L4 GPUs as the most accessible option for inference, with A100 and H100 available for enterprise-scale workloads.

  • Azure → Typically provides A100 GPUs as the standard option for running models of this size, with H100 also available in specific regions for heavier workloads.

In practice, this means that:

  • If you’re on AWS or GCP, the L4 is the go-to choice for production inference.

  • If you’re on Azure, you’ll likely need to provision an A100 instance.

  • For enterprise-grade inference at scale, all providers support A100 or H100 instances, though these come at a higher cost.
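
As a concrete starting point, here is a minimal sketch of serving a Typhoon model on one of these GPUs with vLLM's offline API. The Hugging Face model id, context length, and memory setting are assumptions for illustration; check the official Typhoon releases for the exact identifiers and recommended settings.

```python
# Minimal sketch (assumptions: vLLM installed, a single 24 GB+ GPU such as
# an L4 or A100, and the model id below -- verify the exact id on
# Hugging Face before use).
from vllm import LLM, SamplingParams

llm = LLM(
    model="scb10x/typhoon2.1-gemma3-12b",  # assumed model id
    max_model_len=16_000,                  # context length used in the L4 tests below
    gpu_memory_utilization=0.90,           # fraction of VRAM vLLM is allowed to use
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["สรุปข้อความต่อไปนี้ให้สั้นที่สุด: ..."], params)
print(outputs[0].outputs[0].text)
```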

| Model | Size | Local Dev (Laptop / Consumer GPU) | Recommended Hardware (Server/Enterprise) | Cloud GPU Equivalent | Notes |
|---|---|---|---|---|---|
| Typhoon ASR Real-Time | ~1B | ✅ Runs on CPU-only laptops with ≥8 GB RAM | Multi-core CPU servers (more cores = more concurrency) | N/A (GPU not required) | Lightweight, real-time speech recognition. Optimized for CPU. |
| Typhoon OCR | 3B | ✅ Runs on Mac M1/M2 (16 GB RAM) or RTX 3060+ | 16 GB RAM CPU server or mid-tier GPU (≥16 GB VRAM) | Small GPU instances (e.g., AWS T4, L4 low config) | GPU accelerates throughput, but CPU is usable. |
| Typhoon OCR | 7B | ⚠️ Needs high-VRAM GPU (RTX 4090, ≥24 GB VRAM) | A100 40GB, L4, enterprise-grade GPUs | AWS L4, GCP L4, Azure A100 | Large OCR workloads; not suitable for low-end laptops. |
| Typhoon 2.1 Gemma (text) | 12B | ⚠️ Runs on RTX 3090/4090 (≥24 GB VRAM), or on laptops using a quantized version | A100 40GB, L4 | AWS L4, GCP L4, Azure A100 | Ideal for production inference with medium latency. |
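
For the laptop route in the last row, the "quantized version" can be loaded roughly as follows. This is a sketch under assumptions: an NVIDIA consumer GPU, transformers with bitsandbytes installed, and an illustrative model id.

```python
# Sketch: loading the 12B model in 4-bit via transformers + bitsandbytes
# (assumptions: NVIDIA consumer GPU, bitsandbytes installed, model id below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "scb10x/typhoon2.1-gemma3-12b"  # assumed model id
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit weights, bf16 compute
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",  # place layers on the available GPU
)

inputs = tok("สวัสดีครับ", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

At 4-bit, the weights of a 12B model shrink to roughly 6–8 GB of VRAM, which is what makes laptop-class GPUs viable for this model.
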
Running Typhoon Models — Test Results on Popular GPUs

We benchmarked Typhoon models on four popular NVIDIA GPUs in cloud environments. These are not the only GPUs compatible with Typhoon; other GPUs with similar specs should deliver comparable results.

  • RTX 2000 Ada (16 GB VRAM)
  • L4 (24 GB VRAM)
  • A100 (80 GB VRAM)
  • H100 (80 GB VRAM)

Metrics

  • **Throughput Metrics:**

    • Requests / sec
    • Tokens / sec
  • **Latency Metrics:**

    • Avg Latency (sec)
    • Avg TTFT (time to first token) (sec)
  • **Cost Metrics:**

    • Cost / 1M tokens (dollars)
    • Cost / request
  • **Resource Metrics:**

    • Peak Memory (MB)
    • Avg CPU (%)

The results below reflect our test setup and assumptions about model usage. Your actual performance may vary depending on workload and configuration.
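
The cost columns in the tables that follow are derived from each GPU's hourly price and the measured throughput. The sketch below shows the arithmetic as we read it from the published numbers (the exact token accounting in the test harness may differ); the example values come from the RTX 2000 Ada results.

```python
# How the derived cost metrics relate to raw throughput (our reading of the
# tables below; treat these formulas as approximations, not the harness code).

def cost_per_request(hourly_usd: float, requests_per_sec: float) -> float:
    """Dollars per request = hourly price / requests completed per hour."""
    return hourly_usd / (requests_per_sec * 3600)

def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1M tokens at a sustained token rate."""
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000

def cost_per_audio_hour(hourly_usd: float, audio_sec_per_sec: float) -> float:
    """ASR: the hourly price divided by iRTF (audio seconds per wall second)."""
    return hourly_usd / audio_sec_per_sec

# Example: RTX 2000 Ada at $0.25/h, LLM at concurrency 8 (0.12 req/s,
# 512 prompt + 512 response = 1,024 tokens per request), ASR at concurrency 64.
print(round(cost_per_request(0.25, 0.12), 4))                # ~$0.0006 / request
print(round(cost_per_million_tokens(0.25, 0.12 * 1024), 2))  # ~$0.57 / 1M tokens
print(round(cost_per_audio_hour(0.25, 981.1), 4))            # ~$0.0003 / 1h audio
```

Note that the Tokens/sec column appears to count generated tokens only, while Cost / 1M tokens appears to count prompt + response tokens together.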

RTX 2000 Ada (16 GB VRAM)

💵 Cost per hour (RunPod): $0.25

Summary:

Best for ASR/OCR on a budget and local/dev work. Ultra-cheap to run; OK throughput for OCR; LLM latency is high, so not ideal for large text models.

Typhoon 2.1 Gemma (12B)

  • Max context length: 8,000
  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.03 | 13.8 | 30.8 | 2.2 | $0.0021 | $2.22 | 1047.0 | 7.3 |
| 8 | 0.12 | 56.4 | 62.0 | 21.8 | $0.0006 | $0.57 | 897.5 | 13.8 |
| 16 | 0.11 | 52.5 | 131.6 | 90.9 | $0.0006 | $0.61 | 897.3 | 13.0 |
Typhoon OCR (3B)

  • Max context length: 16,000
  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.06 | 30.8 | 16.5 | 0.18 | $0.0012 | $2.23 | 858.4 | 8.7 |
| 17 | 0.86 | 382.9 | 17.3 | 0.44 | $0.0001 | $0.18 | 1248.3 | 16.2 |
| 32 | 1.34 | 678.9 | 21.7 | 0.84 | $0.00004 | $0.10 | 1656.3 | 23.4 |
Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 402.4 | 402.4 | $0.0006 |
| 64 | 981.1 | 981.1 | $0.0003 |

L4 (24 GB VRAM)

💵 Cost per hour: $0.71 (GCP, used for cost calculation) | $0.42 (RunPod, test environment)

Summary:

A great production sweet spot. Strong value for LLM (12B) at 16–32 concurrency and very good for OCR. Cheapest ASR at scale among cloud GPUs tested.

Typhoon 2.1 Gemma (12B)

  • Max context length: 16,000
  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.03 | 16.3 | 28.5 | 0.63 | $0.0056 | $5.62 | 918.8 | 13.7 |
| 16 | 0.30 | 142.2 | 51.7 | 8.7 | $0.0007 | $0.65 | 900.4 | 12.6 |
| 32 | 0.35 | 160.0 | 86.0 | 17.1 | $0.0006 | $0.57 | 900.3 | 6.1 |
Typhoon OCR (7B)

  • Max context length: 16,000
  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 16.4 | 27.5 | 0.81 | $0.0054 | $11.88 | 858.5 | 11.5 |
| 17 | 0.53 | 211.4 | 30.2 | 0.46 | $0.0004 | $0.92 | 1270.3 | 13.3 |
| 32 | 0.84 | 391.7 | 35.4 | 1.53 | $0.0002 | $0.50 | 1490.0 | 13.1 |
Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 312.8 | 312.8 | $0.0023 |
| 64 | 1096.0 | 1096.0 | $0.0006 |

A100 (80 GB VRAM)

💵 Cost per hour: $3.67 (Azure, used for cost calculation) | $1.19 (RunPod, test environment)

Summary:

Enterprise workhorse. Scales well for both LLM and OCR, with solid latency and high throughput. Costs more per hour, so shines when you can keep it busy.

Typhoon 2.1 Gemma (12B)

  • Max context length: 50,000
  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.10 | 43.2 | 10.1 | 0.62 | $0.0103 | $10.61 | 902.8 | 10.7 |
| 16 (run 1) | 0.35 | 162.3 | 43.7 | 12.1 | $0.0029 | $2.89 | 903.0 | 10.1 |
| 16 (run 2) | 0.96 | 477.1 | 15.6 | 0.81 | $0.0011 | $1.03 | 902.4 | 9.2 |
| 32 | 1.46 | 725.6 | 20.4 | 0.44 | $0.0007 | $0.67 | 903.5 | 9.9 |
| 64 | 1.80 | 900.5 | 32.0 | 1.14 | $0.0006 | $0.55 | 904.6 | 13.1 |
Typhoon OCR (7B)

  • Max context length: 32,000
  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.14 | 66.7 | 6.9 | 1.09 | $0.0071 | $15.08 | 722.9 | 12.0 |
| 16 | 1.98 | 917.9 | 7.4 | 0.49 | $0.0005 | $1.10 | 1080.3 | 5.7 |
| 32 | 3.82 | 1327.5 | 7.6 | 0.90 | $0.0003 | $0.75 | 1406.1 | 12.8 |
| 64 | 4.31 | 1848.0 | 12.3 | 3.14 | $0.0002 | $0.54 | 1926.9 | 12.4 |
Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 57.8 | 57.8 | $0.0635 |
| 64 | 117.4 | 117.4 | $0.0313 |

H100 (80 GB VRAM)

💵 Cost per hour: $2.50 (Together.ai, used for cost calculation)

Summary:

Top performance per token. Best overall for LLM and OCR (fastest + lowest cost/1M tokens). ASR is still cheap but not as cost-efficient as L4 due to higher hourly price.

Typhoon 2.1 Gemma (12B)

  • Max context length: 50,000
  • Assumption: prompt 512 tokens + response 512 tokens

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.19 | 90.5 | 5.3 | 1.01 | $0.0037 | $3.61 | 1110.9 | 13.9 |
| 16 | 1.47 | 708.8 | 10.3 | 3.08 | $0.0005 | $0.46 | 1112.6 | 14.2 |
| 32 | 2.42 | 1131.7 | 12.5 | 4.62 | $0.0003 | $0.29 | 1112.9 | 14.2 |
| 64 | 2.84 | 1340.5 | 19.9 | 10.4 | $0.0002 | $0.24 | 1113.6 | 13.4 |
Typhoon OCR (7B)

  • Max context length: 32,000
  • Assumption: 1 input image → ~512 tokens output

| Concurrency | Requests/sec | Tokens/sec | Avg Latency (s) | Avg TTFT (s) | Cost / Req | Cost / 1M tokens | Peak Mem (MB) | Avg CPU % |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.23 | 109.7 | 4.3 | 1.06 | $0.0030 | $6.25 | 924.9 | 15.0 |
| 16 | 3.32 | 1571.4 | 4.5 | 0.45 | $0.0002 | $0.44 | 1403.2 | 15.0 |
| 32 | 5.92 | 2702.1 | 4.9 | 0.69 | $0.0001 | $0.25 | 1683.2 | 15.7 |
| 64 | 7.24 | 3370.1 | 7.4 | 2.74 | $0.0001 | $0.20 | 2016.4 | 16.6 |
| 128 | 6.81 | 3104.9 | 14.2 | 7.55 | $0.0001 | $0.22 | 2545.1 | 27.0 |
Typhoon ASR Real-Time

| Concurrency | Throughput (audio sec / sec) | iRTF | Est. Cost / 1h audio |
|---|---|---|---|
| 1 | 416.5 | 416.5 | $0.0060 |
| 64 | 1416.0 | 1416.0 | $0.0018 |

GPU Comparison Overview (Best-Case Results)

| GPU (VRAM) | Hourly Cost | LLM (Gemma 12B) Best Conc. | LLM Req/sec | LLM Tokens/sec | LLM Cost / 1M Tokens | OCR (7B/3B) Best Conc. | OCR Req/sec | OCR Tokens/sec | OCR Cost / 1M Tokens | ASR Best Conc. | ASR Throughput (audio sec/sec) | ASR Est. $ / 1h Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RTX 2000 Ada (16 GB) | $0.25 | 8 | 0.12 | 56.4 | $0.57 | 32 (OCR 3B) | 1.34 | 678.9 | $0.10 | 64 | 981.1 | $0.0003 |
| L4 (24 GB) | $0.71 | 32 | 0.35 | 160.0 | $0.57 | 32 (OCR 7B) | 0.84 | 391.7 | $0.50 | 64 | 1096.0 | $0.0006 |
| A100 (80 GB) | $3.67 | 32 | 1.46 | 725.6 | $0.67 | 64 (OCR 7B) | 4.31 | 1848.0 | $0.54 | 64 | 117.4 | $0.0313 |
| H100 (80 GB) | $2.50 | 64 | 2.84 | 1340.5 | $0.24 | 64 (OCR 7B) | 7.24 | 3370.1 | $0.20 | 64 | 1416.0 | $0.0018 |

Quick Insights:

  • Best value for LLMs: H100 (fastest, lowest cost per token).

  • Best value for OCR: H100 (massive throughput), with A100 also strong at scale.

  • Best value for ASR: RTX 2000 Ada and L4 (super cheap per audio hour).
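
To make these trade-offs concrete, the sketch below encodes the best-case cost figures from the comparison table and picks the cheapest GPU per workload. Note the caveat in the comments: the RTX 2000 Ada OCR row uses the 3B model, so it is not an apples-to-apples comparison with the 7B rows.

```python
# Cheapest GPU per workload, using the best-case numbers from the table above.
# Costs: $/1M tokens for LLM and OCR, $/hour of audio for ASR.
# Caveat: the RTX 2000 Ada OCR figure is for the 3B model; the other GPUs ran
# the 7B model, which is why the quick insights still favor H100 for OCR.
BEST_CASE_COST = {
    "RTX 2000 Ada": {"llm": 0.57, "ocr": 0.10, "asr": 0.0003},
    "L4":           {"llm": 0.57, "ocr": 0.50, "asr": 0.0006},
    "A100":         {"llm": 0.67, "ocr": 0.54, "asr": 0.0313},
    "H100":         {"llm": 0.24, "ocr": 0.20, "asr": 0.0018},
}

def cheapest_gpu(workload: str) -> str:
    """Return the GPU with the lowest best-case cost for the workload."""
    return min(BEST_CASE_COST, key=lambda gpu: BEST_CASE_COST[gpu][workload])

for w in ("llm", "ocr", "asr"):
    print(f"{w}: {cheapest_gpu(w)}")
# llm: H100 | ocr: RTX 2000 Ada (3B caveat) | asr: RTX 2000 Ada
```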

For consistency, all benchmarks were run with the following setup: