Toxicity Scanner

The Toxicity scanner assesses the toxicity level of content generated by language models, acting as a safeguard against potentially harmful or offensive output.

Attack scenario

Language models, when interacting with users, can sometimes produce responses that may be deemed toxic or inappropriate. This poses a risk, as such output can perpetuate harm or misinformation. By monitoring and classifying the model's output, potential toxic content can be flagged and handled appropriately.

How it works

The scanner uses the unitary/unbiased-toxic-roberta model from Hugging Face for binary classification of the text as toxic or non-toxic.

Toxicity Detection: If the text is classified as toxic, the toxicity score corresponds to the model's confidence in this classification.
Non-Toxicity Confidence: For non-toxic text, the toxicity score is the complement of the model's confidence, i.e., 1 − confidence score.
Threshold-Based Flagging: Text is flagged as toxic if the toxicity score exceeds a predefined threshold (default: 0.5).
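The three rules above can be sketched in a few lines of Python. This is an illustration only, not the scanner's actual implementation: the function names toxicity_score and is_flagged and the label string "toxic" are assumptions made for this example.

```python
def toxicity_score(label: str, confidence: float) -> float:
    """Map a binary classifier result to a toxicity score in [0, 1].

    `label` and `confidence` stand in for the classifier's predicted class
    and its confidence in that class. The label string "toxic" is an
    assumption made for this sketch.
    """
    if label == "toxic":
        # Toxic prediction: the score is the confidence itself.
        return confidence
    # Non-toxic prediction: the score is the complement of the confidence.
    return 1.0 - confidence


def is_flagged(score: float, threshold: float = 0.5) -> bool:
    """Flag text only when the toxicity score exceeds the threshold."""
    return score > threshold
```

Under these assumptions, a non-toxic prediction with confidence 0.9 gives a toxicity score of about 0.1, which stays below the default threshold, while a toxic prediction with the same confidence is flagged.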

Usage

from llm_guard.output_scanners import Toxicity
from llm_guard.output_scanners.toxicity import MatchType

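# Illustrative placeholder values; substitute your own prompt and model response.
prompt = "Summarize the customer feedback."
model_output = "The feedback was mostly positive."

scanner = Toxicity(threshold=0.5, match_type=MatchType.SENTENCE)
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)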

Match Types:

Sentence Type: In MatchType.SENTENCE mode, the scanner checks each sentence individually for toxicity.
Full Text Type: In MatchType.FULL mode, the entire text is scanned.
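The difference between the two modes can be illustrated with a small sketch. It is not the library's code: split_sentences, scan_units, and highest_score are hypothetical helpers, the regex-based sentence splitter is deliberately naive, and taking the highest per-sentence score as the overall result is an assumption made for this example.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def scan_units(text: str, sentence_mode: bool) -> List[str]:
    """Return the pieces of text that would each be classified."""
    return split_sentences(text) if sentence_mode else [text]


def highest_score(
    text: str, score_fn: Callable[[str], float], sentence_mode: bool
) -> float:
    """Score every unit with `score_fn` and keep the highest score.

    `score_fn` is a stand-in for the toxicity classifier. Aggregating
    with max() is an assumption of this sketch.
    """
    return max(
        (score_fn(unit) for unit in scan_units(text, sentence_mode)),
        default=0.0,
    )
```

In sentence mode, one toxic sentence in an otherwise harmless response is scored on its own, so it is less likely to be diluted by the surrounding text than when the whole response is classified at once.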

Optimization Strategies

Benchmarks

Test setup:

Platform: Amazon Linux 2
Python Version: 3.11.6
Input length: 217
Number of test runs: 5

Run the following script:

python benchmarks/run.py output Toxicity

Results:

Instance	Latency Variance	Latency 90th Percentile	Latency 95th Percentile	Latency 99th Percentile	Average Latency (ms)	QPS
AWS m5.xlarge	2.89	154.18	181.05	202.55	100.40	2161.43
AWS m5.xlarge with ONNX	0.00	49.61	49.98	50.28	48.77	4449.47
AWS g5.xlarge GPU	33.35	282.36	373.59	446.56	99.57	2179.37
AWS g5.xlarge GPU with ONNX	0.01	8.00	9.56	10.81	4.85	44719.38
Azure Standard_D4as_v4	3.90	182.94	213.16	237.33	118.62	1829.38
Azure Standard_D4as_v4 with ONNX	0.07	70.81	73.93	76.43	61.40	3534.14
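The QPS figures appear to be consistent with the input length (217) divided by the average latency in seconds. This reading is inferred from the numbers in the table and is not stated by the benchmark itself. The helper name throughput is hypothetical.

```python
def throughput(input_length: int, avg_latency_ms: float) -> float:
    """Input length divided by the average latency expressed in seconds."""
    return input_length / (avg_latency_ms / 1000.0)


# (average latency in ms, reported QPS) copied from the results table.
rows = {
    "AWS m5.xlarge": (100.40, 2161.43),
    "AWS m5.xlarge with ONNX": (48.77, 4449.47),
    "AWS g5.xlarge GPU": (99.57, 2179.37),
    "AWS g5.xlarge GPU with ONNX": (4.85, 44719.38),
    "Azure Standard_D4as_v4": (118.62, 1829.38),
    "Azure Standard_D4as_v4 with ONNX": (61.40, 3534.14),
}

for name, (avg_ms, reported) in rows.items():
    computed = throughput(217, avg_ms)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

The computed values differ slightly from the reported ones because the average latencies in the table are rounded to two decimal places.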