No Refusal Scanner

This scanner is designed to detect refusals in the output of language models.

It is especially useful for detecting attempts to force the model to produce harmful output: a refusal in the response signals that the prompt likely asked for something the model's safety policies forbid.

Attack scenario

To identify and mitigate the risk of harmful output, commercial LLM creators have constructed datasets of harmful prompts. They have also implemented safety mechanisms that restrict model behavior to a “safe” subset of capabilities, both through training-time interventions that align models with predefined values and through post hoc flagging and filtering of inputs and outputs.

Refusals are responses that language models produce when confronted with prompts that violate the policies set for the model. Such refusals are important safety mechanisms that guard against misuse of the model. Examples include statements like "Sorry, I can't assist with that" or "I'm unable to provide that information."

How it works

The scanner uses the ProtectAI/distilroberta-base-rejection-v1 model to classify the model's output.

Alternatively, a lighter version uses a simple rule-based approach to detect refusals. Such an approach is common in research papers that evaluate language models.
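To illustrate the rule-based idea, here is a minimal sketch that flags an output when it contains a known refusal phrase. The phrase list and the `looks_like_refusal` helper are hypothetical and chosen for illustration; they are not the list or implementation shipped with the library.

```python
# Hypothetical phrase list for illustration; not the list shipped with the library.
REFUSAL_PHRASES = (
    "i'm sorry",
    "i am sorry",
    "sorry, i can't",
    "i can't assist",
    "i cannot assist",
    "i'm unable to",
    "i am unable to",
)


def looks_like_refusal(output: str) -> bool:
    """Return True if the output contains any of the refusal phrases."""
    # Lowercase the text and map typographic apostrophes to plain ones
    # so that "I’m" and "I'm" match the same phrase.
    normalized = output.lower().replace("\u2019", "'")
    return any(phrase in normalized for phrase in REFUSAL_PHRASES)


if __name__ == "__main__":
    for text in ("Sorry, I can't assist with that.", "The capital of France is Paris."):
        print(f"{text!r} -> refusal: {looks_like_refusal(text)}")
```

Substring matching of this kind is fast and needs no model download, but it misses refusals phrased in ways the list does not anticipate, which is the trade-off against the classifier-based scanner.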

Usage

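from llm_guard.output_scanners import NoRefusal
from llm_guard.output_scanners.no_refusal import MatchType

# Example values for illustration; substitute your own prompt and model response.
prompt = "How do I pick a lock?"
model_output = "Sorry, I can't assist with that."

scanner = NoRefusal(threshold=0.5, match_type=MatchType.FULL)
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)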

Alternatively, a lighter version can be used:

from llm_guard.output_scanners import NoRefusalLight

scanner = NoRefusalLight()
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
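The snippets above return a `(sanitized_output, is_valid, risk_score)` tuple. As a rough sketch of how an application might act on it, the example below wraps the call in a small helper. It assumes that `is_valid` is `False` when a refusal is detected; `ScanDecision`, `check_for_refusal`, and `StubScanner` are hypothetical names introduced here so the example runs without the library installed.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


class OutputScanner(Protocol):
    """Any object exposing the scan(prompt, output) call shown above."""

    def scan(self, prompt: str, output: str) -> Tuple[str, bool, float]: ...


@dataclass
class ScanDecision:
    text: str
    refused: bool
    risk_score: float


def check_for_refusal(scanner: OutputScanner, prompt: str, model_output: str) -> ScanDecision:
    sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
    # Assumption: is_valid is False when the scanner flags a refusal.
    return ScanDecision(text=sanitized_output, refused=not is_valid, risk_score=risk_score)


class StubScanner:
    """Hypothetical stand-in for a real scanner, used only to make this sketch runnable."""

    def scan(self, prompt: str, output: str) -> Tuple[str, bool, float]:
        refused = output.lower().startswith("sorry")
        return output, not refused, 1.0 if refused else 0.0


decision = check_for_refusal(StubScanner(), "example prompt", "Sorry, I can't assist with that.")
if decision.refused:
    print(f"Refusal detected (risk score {decision.risk_score:.2f}); review the prompt.")
```

In a real application, `StubScanner()` would be replaced by a `NoRefusal` or `NoRefusalLight` instance, and a detected refusal could be logged for review of the originating prompt.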

Optimization Strategies

Benchmarks

Test setup:

Platform: Amazon Linux 2
Python Version: 3.11.6
Input length: 47
Test times: 5

Run the following script:

python benchmarks/run.py output NoRefusal

Results:

Instance	Latency Variance	Latency 90 Percentile	Latency 95 Percentile	Latency 99 Percentile	Average Latency (ms)	QPS
AWS m5.xlarge	2.65	109.78	135.49	156.06	58.27	806.66
AWS m5.xlarge with ONNX	0.00	12.20	12.55	12.84	11.36	4138.75
AWS g5.xlarge GPU	31.15	269.84	357.97	428.47	93.09	504.86
AWS g5.xlarge GPU with ONNX	0.11	18.09	23.41	27.67	7.41	6346.18
AWS r6a.xlarge (AMD)	0.00	26.33	27.07	27.66	24.61	1909.65
AWS r6a.xlarge (AMD) with ONNX	0.08	27.08	31.53	35.09	18.11	2595.73