Token Limit Scanner

This scanner ensures that prompts do not exceed a predetermined token count, helping to prevent resource-intensive operations and potential denial-of-service attacks on large language models (LLMs).

Attack scenario

The complexity and size of LLMs make them susceptible to heavy resource usage, especially when processing lengthy prompts. Malicious users can exploit this by feeding extraordinarily long inputs, aiming to disrupt service or incur excessive computational costs.

This vulnerability is cataloged in the OWASP Top 10 for LLM Applications as LLM04: Model Denial of Service.

How it works

The scanner works by calculating the number of tokens in the provided prompt using the tiktoken library. If the token count exceeds the configured limit, the prompt is flagged as too long.
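
The check itself is straightforward. Here is a minimal sketch of the idea using tiktoken directly (the exceeds_token_limit function is illustrative, not the library's actual implementation):

import tiktoken

def exceeds_token_limit(prompt: str, limit: int = 4096, encoding_name: str = "cl100k_base") -> bool:
    # Resolve the tokenizer for the chosen encoding and count the prompt's tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    token_count = len(encoding.encode(prompt))
    # The prompt is flagged when the count exceeds the configured limit.
    return token_count > limit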

One token usually equates to approximately 4 characters in common English text. Roughly speaking, 100 tokens are equivalent to about 75 words.
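
You can verify this rule of thumb directly with tiktoken; the sample sentence below is illustrative:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Tokenizers split common English text into subword units."
tokens = encoding.encode(text)
print(len(text) / len(tokens))  # typically close to 4 characters per token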

Usage

from llm_guard.input_scanners import TokenLimit

scanner = TokenLimit(limit=4096, encoding_name="cl100k_base")
sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
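
scan returns the sanitized prompt, a validity flag, and a risk score. In a request pipeline, the is_valid flag can gate whether the prompt is forwarded to the model; the handle_prompt helper below is an illustrative sketch, not part of the library:

def handle_prompt(prompt: str) -> str:
    sanitized_prompt, is_valid, risk_score = scanner.scan(prompt)
    if not is_valid:
        # Reject over-limit prompts instead of passing them to the LLM.
        raise ValueError(f"Prompt exceeds the token limit (risk score: {risk_score})")
    return sanitized_prompt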

Note

Models using the cl100k_base encoding include gpt-4, gpt-3.5-turbo, and text-embedding-ada-002.
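
If you are unsure which encoding a model uses, tiktoken can resolve it for you:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
print(encoding.name)  # cl100k_base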

Benchmarks

Environment:

  • Platform: Amazon Linux 2
  • Python Version: 3.11.6

Run the following script:

python benchmarks/run.py input TokenLimit

Results:

| Instance | Input Length | Test Times | Latency Variance | Latency 90 Percentile (ms) | Latency 95 Percentile (ms) | Latency 99 Percentile (ms) | Average Latency (ms) | QPS |
|---|---|---|---|---|---|---|---|---|
| AWS m5.xlarge | 282 | 5 | 0.00 | 0.69 | 0.86 | 1.01 | 0.31 | 914308.54 |
| AWS g5.xlarge GPU | 282 | 5 | 0.00 | 0.60 | 0.76 | 0.89 | 0.27 | 1039014.63 |
| Azure Standard_D4as_v4 | 282 | 5 | 0.00 | 0.98 | 1.26 | 1.48 | 0.41 | 683912.25 |