URL Reachability Scanner

This scanner identifies URLs in the text and checks each one for accessibility, flagging links that are broken or otherwise unreachable.

Motivation

Large Language Models (LLMs) like GPT-4 have the capacity to generate a variety of content, including URLs. While these models are trained on extensive datasets to provide accurate and relevant information, there's a possibility of generating URLs that are either incorrect or no longer accessible. Ensuring the reachability of these URLs is crucial for maintaining the credibility and usefulness of the content produced by LLMs.

How it works

It scans the text for URLs and verifies each URL's accessibility. A URL is considered reachable if a request to it returns one of the configured success status codes (such as 200 OK). If the URL is not accessible (for instance, due to a broken link or server error), the scanner flags it as unreachable.
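
As a rough illustration of the idea, the sketch below extracts URLs and checks each one with an HTTP request. The check_urls helper, the naive regular expression, and the use of the requests library are assumptions made for this example, not the scanner's actual implementation.

import re

import requests

def check_urls(text, success_status_codes=(200, 201, 202, 301, 302), timeout=1.0):
    """Return True only if every URL found in the text answers with a success status code."""
    # Naive URL pattern, sufficient for illustration.
    urls = re.findall(r"https?://\S+", text)
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout, allow_redirects=False)
        except requests.RequestException:
            # DNS failure, connection error, timeout, etc.
            return False
        if response.status_code not in success_status_codes:
            return False
    return True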

Usage

from llm_guard.output_scanners import URLReachability

prompt = "Where can I read the Python documentation?"  # input given to the LLM
model_output = "You can find it at https://docs.python.org/3/."  # text generated by the LLM

scanner = URLReachability(success_status_codes=[200, 201, 202, 301, 302], timeout=1)
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)

In this example, model_output is the text generated by the LLM, and is_valid is a boolean indicating whether all URLs in the text are reachable.
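
Continuing the example above, the returned values can be used to gate the response; the handling below is purely illustrative.

if not is_valid:
    # At least one URL did not respond with a success status code.
    print(f"Output contains unreachable URLs (risk score: {risk_score})")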

Optimization Strategies

  • Timeout Settings: Configure an appropriate timeout for the HTTP requests to balance thorough checking against scan latency, as shown in the sketch below.
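
As a sketch of that trade-off (the timeout values are illustrative, not recommendations), a short timeout keeps scanning fast but may flag slow yet valid URLs, while a longer timeout is more thorough at the cost of latency.

from llm_guard.output_scanners import URLReachability

# Fast check: low latency, but slow servers may be flagged as unreachable.
fast_scanner = URLReachability(success_status_codes=[200, 201, 202, 301, 302], timeout=0.5)

# Thorough check: tolerates slow servers at the cost of scan latency.
thorough_scanner = URLReachability(success_status_codes=[200, 201, 202, 301, 302], timeout=5)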

Benchmarks

Benchmarks are not relevant for this scanner because its results depend on factors outside our control, such as the network connection and the availability of the target URLs.