Tokens and Usage
Understanding how tokens work is essential for effectively using the OpenTyphoon.ai API. This page explains tokens, context windows, and provides tips for optimizing your token usage.
What are Tokens?
Tokens are the fundamental units of text processing in language models. The OpenTyphoon.ai API processes text by breaking it down into tokens before sending it to the model. A token can be as short as a single character or as long as a word, depending on the language and specific text.
For Thai text:
- A word is typically 1-3 tokens
- Space characters count as tokens
- Punctuation marks are separate tokens
- Numbers are typically broken down into individual digits
For example, the Thai phrase “สวัสดีครับ” might be tokenized into approximately 3-4 tokens.
Context Window
The context window refers to the maximum number of tokens that a model can process in a single request, including both the input (prompt) and the generated output. Each OpenTyphoon.ai model has a specific context window size:
| Model | Total Context Window | Input Token Limit | Output Token Limit |
|---|---|---|---|
| All Typhoon models | 8K tokens | Depends on output | Depends on input |
The total of input + output tokens cannot exceed the context window size. For example, with an 8K context window, if your input prompt uses 7K tokens, the model can only generate up to 1K tokens in response.
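To make the budgeting concrete, here is a small sketch of the arithmetic (8192 is the 8K window used throughout this page; the function name is illustrative):

```python
CONTEXT_WINDOW = 8192  # 8K tokens, shared between input and output

def max_output_budget(input_tokens, requested_output=1024):
    """Largest max_tokens value that still fits inside the context window."""
    remaining = CONTEXT_WINDOW - input_tokens
    if remaining <= 0:
        raise ValueError("Prompt already fills the context window.")
    return min(requested_output, remaining)

# A 7,000-token prompt leaves at most 1,192 tokens for the response.
print(max_output_budget(7000, requested_output=2000))  # 1192
```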
Token Counting
To estimate the number of tokens in your text, you can use this rough approximation:
- For Thai text: approximately 2-3 tokens per word
- For English text: approximately 1.3 tokens per word
For more precise token counting, especially in production applications, you should use a proper tokenizer. Unfortunately, the exact tokenizer used by OpenTyphoon.ai is not publicly available, but you can use similar tokenizers for estimation.
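As one option, the sketch below uses tiktoken's `cl100k_base` encoding purely as a stand-in; it is not OpenTyphoon.ai's tokenizer, so treat the counts as estimates:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is a stand-in encoding, NOT the actual Typhoon tokenizer;
# exact counts will differ, but the order of magnitude is usually close.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens("Hello, how are you?"))  # English text
print(count_tokens("สวัสดีครับ"))            # Thai text often yields more tokens per character
```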
Token Usage in API Responses
Each API response includes a `usage` field that provides information about the tokens used in your request:
{ "id": "cmpl-abc123", "object": "chat.completion", "created": 1677858242, "model": "typhoon-v2-70b-instruct", "usage": { "prompt_tokens": 25, "completion_tokens": 12, "total_tokens": 37 }, "choices": [...]}
- `prompt_tokens`: The number of tokens in your input messages
- `completion_tokens`: The number of tokens in the generated response
- `total_tokens`: The total number of tokens used in the request (prompt + completion)
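Since the API is OpenAI-compatible (the other examples on this page use `client.chat.completions.create`), these fields can be read directly off the response object; a minimal sketch, assuming `client` is already configured:

```python
response = client.chat.completions.create(
    model="typhoon-v2-70b-instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)

usage = response.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens:      {usage.total_tokens}")
```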
Best Practices for Token Usage
Optimize Prompts
Keep your prompts concise and focused:
- Remove unnecessary context: Include only the information that's directly relevant to your query.
- Use system messages efficiently: System messages can help set behavior without lengthy explanations in each user message.

  ```python
  # Efficient use of system message
  messages = [
      {"role": "system", "content": "You are a Thai language translator. Always translate English to Thai."},
      {"role": "user", "content": "Hello, how are you?"}
  ]
  ```

- Truncate conversation history: For long conversations, consider keeping only the most recent and relevant messages (one approach is sketched below).
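For the last point, a minimal sketch of a history-truncation helper; the `keep_last` value is an arbitrary choice for illustration:

```python
def truncate_history(messages, keep_last=6):
    """Keep the system message (if any) plus the last `keep_last` turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```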
Handling Long Inputs
When working with long documents or conversations that might exceed token limits:
- Chunk the text: Split long text into smaller chunks and process them separately.

  ```python
  def process_long_document(document, chunk_size=1000, overlap=100):
      # This is a simplified example - in practice, you'd want to split on sentence boundaries
      words = document.split()
      chunks = []
      for i in range(0, len(words), chunk_size - overlap):
          chunk = ' '.join(words[i:i + chunk_size])
          chunks.append(chunk)
      return chunks
  ```

- Summarize previous context: Instead of sending the entire conversation history, summarize earlier turns.

  ```python
  # Example summarization approach
  summary_messages = [
      {"role": "system", "content": "Summarize the following conversation in 2-3 sentences."},
      {"role": "user", "content": full_conversation_history}
  ]
  summary_response = client.chat.completions.create(
      model="typhoon-v2-70b-instruct",
      messages=summary_messages,
      max_tokens=100
  )
  summary = summary_response.choices[0].message.content

  # Now use the summary + recent messages
  new_conversation = [
      {"role": "system", "content": "Previous conversation summary: " + summary},
      # Add recent message exchanges...
  ]
  ```

- Use retrieval-based approaches: For question answering with large documents, use retrieval to fetch only the most relevant portions (see the sketch below).
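A bare-bones sketch of the retrieval idea, using a hypothetical `embed()` callable (any embedding model that returns a vector would do) and cosine similarity to rank chunks:

```python
import numpy as np

def top_k_chunks(question, chunks, embed, k=3):
    """Return the k chunks most similar to the question.

    `embed` is a hypothetical callable mapping text to a 1-D numpy array;
    plug in whichever embedding model you actually use.
    """
    q = embed(question)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        similarity = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        scored.append((similarity, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```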
Managing Token Limits
When working within token limits:
- Set appropriate `max_tokens`: Specify a reasonable value for the `max_tokens` parameter based on how long you expect the response to be.
- Monitor token usage: Keep track of token usage to ensure you stay within limits and optimize where necessary.
- Implement truncation strategies: Have a plan for handling cases where content exceeds token limits (one such strategy is sketched below).
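For the truncation strategy, here is a minimal sketch that reuses the `estimate_tokens` helper defined in the example below; the +4 per-message overhead mirrors that example and is itself an approximation:

```python
def fit_to_budget(messages, max_input_tokens):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    trimmed = list(messages)

    def total(msgs):
        # +4 per message mirrors the formatting overhead used in the estimator below
        return sum(estimate_tokens(m.get("content", "")) + 4 for m in msgs)

    while total(trimmed) > max_input_tokens and len(trimmed) > 1:
        # Keep the system message at index 0 (if present); drop the oldest turn after it
        drop_index = 1 if trimmed[0].get("role") == "system" else 0
        trimmed.pop(drop_index)
    return trimmed
```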
Example: Token Usage Estimator
```python
def estimate_tokens(text, lang="thai"):
    # Very rough estimation - use a proper tokenizer for production
    words = len(text.split())
    if lang.lower() == "thai":
        # Thai text: estimate 2.5 tokens per word (rough average).
        # NOTE: Thai is written without spaces, so split() undercounts words;
        # treat this as a floor rather than an accurate estimate.
        return int(words * 2.5)
    else:
        # English text: estimate 1.3 tokens per word
        return int(words * 1.3)


def check_token_limits(messages, model="typhoon-v2-70b-instruct", max_output_tokens=500):
    # Estimate total input tokens
    input_tokens = 0
    for message in messages:
        content = message.get("content", "")
        lang = "english" if all(ord(c) < 128 for c in content) else "thai"
        # Add 4 tokens per message to account for chat formatting overhead
        input_tokens += estimate_tokens(content, lang) + 4

    # Check against the model's context limit
    context_limit = 8192  # 8K tokens for all Typhoon models
    remaining_tokens = context_limit - input_tokens

    if remaining_tokens <= 0:
        print(f"Warning: Input exceeds context window of {context_limit} tokens.")
        return False

    if remaining_tokens < max_output_tokens:
        print(f"Warning: Only {remaining_tokens} tokens remaining for output "
              f"(requested {max_output_tokens}).")
        print("Consider reducing input length or requested output tokens.")
        return False

    print(f"Estimated input tokens: {input_tokens}")
    print(f"Remaining tokens for output: {remaining_tokens}")
    return True
```
Thai Language Considerations
Thai language has some specific characteristics that affect tokenization:
- No spaces between words: Thai doesn't use spaces between words, which affects tokenization differently than space-delimited languages like English.
- Character-level tokens: Many Thai characters or combinations of characters become individual tokens.
- Tone marks and vowels: These are often separate tokens from the consonants they modify.
These characteristics mean that Thai text may use more tokens than you might expect compared to English text of similar meaning. Keep this in mind when designing prompts and estimating token usage.
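To see the effect, compare rough counts for sentences of similar meaning using the stand-in encoding from earlier (again, an estimate, not the model's own tokenizer):

```python
import tiktoken

# Stand-in encoding, NOT the actual Typhoon tokenizer; counts are estimates only.
encoding = tiktoken.get_encoding("cl100k_base")

english = "Hello, how are you today?"
thai = "สวัสดีครับ วันนี้เป็นอย่างไรบ้าง"

print(len(encoding.encode(english)))  # comparatively few tokens
print(len(encoding.encode(thai)))     # typically more tokens for similar meaning
```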