Optimization Strategies
ONNX Runtime
ONNX Runtime is a high-performance inference engine for models in the ONNX (Open Neural Network Exchange) format, allowing faster and more efficient model execution. If an ONNX version of a model is available, using it can substantially speed up the scanner.
To leverage ONNX Runtime, you must first install the appropriate package:
pip install llm-guard[onnxruntime] # for CPU instances
pip install llm-guard[onnxruntime-gpu] # for GPU instances
Activate ONNX by initializing your scanner with the use_onnx parameter set to True:
scanner = Code(languages=["PHP"], use_onnx=True)
If you have issues installing the ONNX Runtime package, check the official documentation.
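Once initialized, scanning works the same as with the default PyTorch backend. A minimal usage sketch, assuming the (sanitized_prompt, is_valid, risk_score) return tuple used by LLM Guard scanners:

from llm_guard.input_scanners import Code

scanner = Code(languages=["PHP"], use_onnx=True)
# scan() returns the sanitized prompt, a validity flag, and a risk score.
sanitized_prompt, is_valid, risk_score = scanner.scan("Write a PHP function that prints 'hello'.")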
ONNX Runtime with Quantization
Although not built into the library, you can use quantized or optimized versions of the models. Quantization does not always improve latency, but it can reduce the model size.
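As a rough sketch of how such a model can be produced with Hugging Face Optimum (a separate dependency, not part of LLM Guard; the paths below are placeholders):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load an ONNX model previously exported with, for example, optimum-cli export onnx.
quantizer = ORTQuantizer.from_pretrained("path/to/exported-onnx-model")
# Dynamic (weight-only) int8 quantization; pick the config that matches your CPU.
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="path/to/quantized-onnx-model", quantization_config=qconfig)

Benchmark the quantized model on your own prompts, since the latency gain depends on hardware and sequence length.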
Enabling Low CPU/Memory Usage
To minimize CPU and memory usage:
from llm_guard.input_scanners.code import Code, DEFAULT_MODEL
# Ask transformers to load the model with low_cpu_mem_usage, reducing peak memory during loading.
DEFAULT_MODEL.kwargs["low_cpu_mem_usage"] = True
scanner = Code(languages=["PHP"], model=DEFAULT_MODEL)
For an in-depth understanding of this feature and its impact on large model handling, refer to the detailed Large Model Loading Documentation.
Alternatively, quantization can be used to reduce the model size and memory usage.
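For the PyTorch backend, dynamic quantization of the linear layers is one way to do this. A sketch using plain PyTorch and transformers (not an LLM Guard API; the checkpoint name is a placeholder):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("your-org/your-scanner-model")
# Convert nn.Linear weights to int8; activations stay in float and are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

This mainly shrinks the weights in memory; the latency impact varies by CPU.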
Use smaller models
For certain scanners, smaller model variants are available, e.g. distilbert, bert-small, and bert-tiny versions. These models offer reduced latency and a smaller memory footprint without significantly compromising accuracy.
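As an illustration, a smaller checkpoint can be passed through the same model= parameter shown above. The import path and checkpoint name below are assumptions, so check the reference page of each scanner for the variants it actually supports:

from llm_guard.input_scanners.code import Code
from llm_guard.model import Model  # assumed location of the Model wrapper behind DEFAULT_MODEL

# Placeholder checkpoint; substitute the smaller variant documented for the scanner you use.
small_model = Model(path="your-org/distilled-code-detection-model")
scanner = Code(languages=["PHP"], model=small_model)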
PyTorch hacks
To speed up inference and warm torch.compile times:
import torch
import torch._inductor.config

# Trade a little float32 matmul precision for speed on GPUs that support TF32.
torch.set_float32_matmul_precision('high')
# Cache compiled FX graphs so repeated torch.compile runs warm up faster.
torch._inductor.config.fx_graph_cache = True
Streaming mode
To optimize output scanning, you can analyze the output in chunks as it streams instead of waiting for the full response. The OpenAI guide demonstrates how to use LLM Guard to protect the OpenAI client with streaming enabled.
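A minimal sketch of chunked output scanning, assuming the (sanitized_output, is_valid, risk_score) return tuple of output scanners; the chunk size and the Toxicity scanner are illustrative choices, not requirements:

from llm_guard.output_scanners import Toxicity

scanner = Toxicity()
prompt = "Tell me about your day."

# Stand-in for a streamed LLM response, e.g. chunks yielded by the OpenAI streaming client.
stream_of_chunks = ["The weather ", "today is ", "sunny and ", "warm."]

buffer = ""
for chunk in stream_of_chunks:
    buffer += chunk
    if len(buffer) < 200:  # accumulate roughly 200 characters before scanning
        continue
    sanitized, is_valid, risk_score = scanner.scan(prompt, buffer)
    if not is_valid:
        break  # stop forwarding the stream as soon as a chunk is flagged
    buffer = ""

if buffer:  # scan whatever remains after the stream ends
    sanitized, is_valid, risk_score = scanner.scan(prompt, buffer)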