# Best Practices

## Performance Optimization
- **Benchmark Analysis**: Before choosing scanners, it's crucial to understand how they perform on different instance types. Review the benchmarks for each scanner to make an informed decision based on your specific requirements.
- **Model Size Trade-off**: Opting for smaller models will expedite processing and reduce latency. However, this comes at the cost of accuracy. We are actively working on providing compact versions with minimal accuracy trade-offs.
- **Use ONNX Runtime for CPU inference**: ONNX Runtime is a high-performance inference engine for machine learning models. When possible, we recommend serving the models with ONNX Runtime (see the first sketch after this list).
- **Tune Transformers kwargs**: Transformers models expose a variety of parameters that can be tuned to optimize performance. For example, `low_cpu_mem_usage` reduces peak memory by loading weights through the Accelerate library (see the second sketch after this list). Read more about optimization strategies.
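The sketch below shows how ONNX Runtime might be enabled for a single scanner. It assumes the scanner exposes a `use_onnx` flag and that the ONNX extra is installed (for example via `pip install llm-guard[onnxruntime]`); check your installed version's scanner signatures before relying on it.

```python
# Sketch: enabling ONNX Runtime inference for one scanner.
# Assumption: the scanner accepts a `use_onnx` flag in your llm-guard version.
from llm_guard.input_scanners import Toxicity

scanner = Toxicity(threshold=0.5, use_onnx=True)  # falls back to PyTorch if False

# scan() returns (sanitized_prompt, is_valid, risk_score)
sanitized_prompt, is_valid, risk_score = scanner.scan("example user prompt")
```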
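For the Transformers kwargs, a minimal sketch of passing `low_cpu_mem_usage` (and a dtype hint) directly to a Hugging Face model load is shown below; the model name is illustrative only, and `low_cpu_mem_usage=True` requires the Accelerate library to be installed.

```python
# Sketch: tuning Transformers load-time kwargs. Model name is illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_kwargs = {
    "low_cpu_mem_usage": True,  # stream weights via Accelerate to reduce peak RAM
    "torch_dtype": "auto",      # use the precision stored in the checkpoint
}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", **model_kwargs
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```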
## Serving Configurations
- **Fail Fast Mode**: Enable the `fail_fast` mode while serving to ensure early exits, preventing the wait for all scanners to complete and thus optimizing response time (see the first sketch after this list).
- **Scanner Selection**: Assess the relevance of each scanner for your use case. Instead of running all scanners synchronously, which might overwhelm the system, consider running them asynchronously (see the second sketch after this list). This approach also enhances observability, aiding precise debugging and performance monitoring.
- **Request Sampling**: Run slower scanners on only a sample of requests to reduce overall latency (see the third sketch after this list). This approach is especially useful when the system is under heavy load.
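A minimal sketch of fail-fast scanning is shown below. It assumes `scan_prompt` accepts a `fail_fast` flag in your llm-guard version; the scanner choices are illustrative.

```python
# Sketch: exit as soon as the first scanner flags the prompt.
# Assumption: scan_prompt supports a `fail_fast` keyword argument.
from llm_guard import scan_prompt
from llm_guard.input_scanners import PromptInjection, Toxicity

scanners = [PromptInjection(), Toxicity()]

sanitized_prompt, results_valid, results_score = scan_prompt(
    scanners, "example user prompt", fail_fast=True
)

if not all(results_valid.values()):
    print("Prompt rejected before all scanners had to run")
```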
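For running selected scanners asynchronously, one option is to offload each synchronous `scan()` call to a worker thread and gather the results concurrently, as sketched below. The scanner choices are illustrative.

```python
# Sketch: running independent scanners concurrently instead of sequentially.
import asyncio

from llm_guard.input_scanners import BanTopics, Toxicity


async def scan_concurrently(prompt: str) -> dict:
    # .scan() is synchronous, so each call is offloaded to a worker thread.
    scanners = [Toxicity(), BanTopics(topics=["violence"])]
    results = await asyncio.gather(
        *(asyncio.to_thread(scanner.scan, prompt) for scanner in scanners)
    )
    # Each result is (sanitized_prompt, is_valid, risk_score).
    return {type(s).__name__: r for s, r in zip(scanners, results)}


print(asyncio.run(scan_concurrently("example user prompt")))
```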
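Request sampling can be as simple as routing a random fraction of requests through the slower, model-based scanners while cheap rule-based scanners run on every request. The split below and the 10% rate are illustrative choices, not recommendations.

```python
# Sketch: run expensive scanners on only a sample of requests.
import random

from llm_guard import scan_prompt
from llm_guard.input_scanners import BanSubstrings, Toxicity

# Illustrative split: rule-based scanner on every request, model-based on a sample.
FAST_SCANNERS = [BanSubstrings(substrings=["ignore previous instructions"])]
SLOW_SCANNERS = [Toxicity()]
SAMPLE_RATE = 0.1  # hypothetical 10% sampling rate


def scan_request(prompt: str):
    scanners = list(FAST_SCANNERS)
    if random.random() < SAMPLE_RATE:
        scanners = scanners + SLOW_SCANNERS
    return scan_prompt(scanners, prompt)
```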
## Observability and Debugging
- **Logging and Metrics**: Implement robust logging and metric collection to monitor the system's performance and health (a minimal sketch follows below).
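As a starting point, the sketch below records per-scanner latency and verdict with the standard `logging` module; the log fields and the wrapper function are assumptions, not part of the library.

```python
# Sketch: per-scanner latency and verdict logging with the standard library.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_guard_metrics")


def scan_with_metrics(scanners, prompt: str) -> str:
    for scanner in scanners:
        start = time.perf_counter()
        sanitized, is_valid, risk_score = scanner.scan(prompt)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "scanner=%s valid=%s risk=%.2f latency_ms=%.1f",
            type(scanner).__name__, is_valid, risk_score, elapsed_ms,
        )
        prompt = sanitized  # feed the sanitized text to the next scanner
    return prompt
```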
## Continuous Improvement
- **Feedback Loops**: Establish feedback loops with your system's users to understand how the library performs in real-world scenarios and to gather suggestions for improvements.
- **Regular Updates and Testing**: Stay updated with the latest versions of `llm-guard`, and ensure thorough testing in a staging environment before rolling out updates to a production setup.