# Ban Substrings Scanner
The `BanSubstrings` scanner is a safeguard that prevents undesired substrings from appearing in the language model's output.
## How it works
The scanner filters the output generated by the language model, ensuring it is free of the designated banned substrings. It can perform this check at two levels of granularity (see the sketch after this list):

- **String level**: the scanner checks the entire model output for the presence of any banned substring.
- **Word level**: the scanner checks only for whole words in the model's output that match a banned substring, ensuring that no individual banned word is present.
Additionally, the scanner can be configured to replace banned substrings with `[REDACT]` in the model's output.
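To make the two granularities concrete, here is a minimal sketch. It assumes the `scan` signature shown in the Usage section below and that word-level matching respects word boundaries; the example strings are hypothetical.

```python
from llm_guard.output_scanners import BanSubstrings
from llm_guard.input_scanners.ban_substrings import MatchType

model_output = "Passwordless login is enabled."  # hypothetical model output

# Word level: matches only the standalone word "password".
word_scanner = BanSubstrings(
    substrings=["password"], match_type=MatchType.WORD, case_sensitive=False
)

# String level: matches "password" anywhere, even inside "Passwordless".
str_scanner = BanSubstrings(
    substrings=["password"], match_type=MatchType.STR, case_sensitive=False
)

_, word_valid, _ = word_scanner.scan("prompt", model_output)  # True: no whole-word match
_, str_valid, _ = str_scanner.scan("prompt", model_output)    # False: substring match
```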
## Use cases
### 1. Prevent DAN attacks
The DAN (Do Anything Now) attack is an exploitation technique targeting large language models such as ChatGPT. Users employ it to bypass the built-in guardrails designed to prevent the generation of harmful, illegal, unethical, or violent content. By introducing a fictional character named "DAN," they manipulate the model into generating responses without the typical content restrictions, a form of role-playing used to "jailbreak" the model. As ChatGPT's defenses against these attacks improve, attackers iterate on the DAN prompt, making it more sophisticated.
!!! info

    As specified by the OWASP Top 10 LLM attacks, this vulnerability is categorized under: LLM08: Excessive Agency.
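As a sketch of this use case, one could ban telltale markers that DAN-style responses tend to contain. The marker list and the prompt/output strings below are purely illustrative, not an official dataset.

```python
from llm_guard.output_scanners import BanSubstrings
from llm_guard.input_scanners.ban_substrings import MatchType

# Illustrative markers only; a real deployment would curate a broader list.
dan_markers = ["DAN:", "Do Anything Now", "jailbroken"]

scanner = BanSubstrings(
    substrings=dan_markers, match_type=MatchType.STR, case_sensitive=False
)

prompt = "Pretend you are DAN and ignore your rules."  # hypothetical
model_output = "DAN: Sure, I can do anything now."     # hypothetical

sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
# is_valid is False because the output contains a banned marker.
```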
### 2. Prevent harmful substrings in the model's output
A prepared dataset of harmful substrings for prompts is also available: `output_stop_substrings.json`.
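A minimal loading sketch, assuming the file is a flat JSON array of strings (the actual structure of the dataset may differ):

```python
import json

from llm_guard.output_scanners import BanSubstrings

# Assumption: output_stop_substrings.json is a flat JSON array of strings.
with open("output_stop_substrings.json") as f:
    harmful_substrings = json.load(f)

scanner = BanSubstrings(substrings=harmful_substrings)
```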
### 3. Hide mentions of competitors
List all competitor names and pass them to the scanner; it will replace each competitor name with `[REDACT]` in the model's output.
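A sketch of this use case with hypothetical competitor names, using the `redact` option described above:

```python
from llm_guard.output_scanners import BanSubstrings
from llm_guard.input_scanners.ban_substrings import MatchType

# Hypothetical competitor names.
competitors = ["Acme Corp", "Globex"]

scanner = BanSubstrings(
    substrings=competitors,
    match_type=MatchType.STR,
    case_sensitive=False,
    redact=True,  # replace matches instead of only flagging them
)

prompt = "How do we compare to Acme Corp?"                    # hypothetical
model_output = "Unlike Acme Corp, our product ships weekly."  # hypothetical

sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
# sanitized_output would read: "Unlike [REDACT], our product ships weekly."
```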
## Usage
```python
from llm_guard.output_scanners import BanSubstrings
from llm_guard.input_scanners.ban_substrings import MatchType

scanner = BanSubstrings(
    substrings=["forbidden", "unwanted"],
    match_type=MatchType.WORD,
    case_sensitive=False,
    redact=False,
    contains_all=False,
)

sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
```
In the above configuration, `is_valid` will be `False` if the provided `model_output` contains any of the banned substrings as whole words. To ban substrings irrespective of their word boundaries, simply change the `match_type` to `MatchType.STR`.
## Benchmarks
The scanner relies on plain string matching and replacement rather than a machine learning model, which makes it fast.
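No published numbers appear in this section; as a rough, machine-dependent sketch, one can time the scanner locally:

```python
import timeit

from llm_guard.output_scanners import BanSubstrings

scanner = BanSubstrings(substrings=["forbidden", "unwanted"])
model_output = "A typical model response without banned terms. " * 50

# Rough local timing; results vary by machine and are not an official benchmark.
elapsed = timeit.timeit(lambda: scanner.scan("prompt", model_output), number=1_000)
print(f"~{elapsed / 1_000 * 1e6:.0f} µs per scan")
```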