
API Deployment

From source

  1. Copy the code from llm_guard_api (see the setup sketch after this list)

  2. Install dependencies (preferably in a virtual environment)

    python -m pip install ".[cpu]"
    python -m pip install ".[gpu]" # If you have a GPU
    

  3. Alternatively, you can use the Makefile:

    make install
    
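For reference, a minimal end-to-end setup might look like the sketch below. The repository URL and folder layout are assumptions; adjust them to wherever you obtained the code.

# Get the code and switch to the API folder (paths are assumptions)
git clone https://github.com/protectai/llm-guard.git
cd llm-guard/llm_guard_api

# Create and activate a virtual environment, then install the CPU dependencies
python -m venv .venv
source .venv/bin/activate
python -m pip install ".[cpu]"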

Using uvicorn

Run the API locally:

make run

Or using the CLI:

llm_guard_api ./config/scanners.yml
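Once the server is running, you can verify it responds by fetching the OpenAPI specification (assuming the API listens on the default port 8000):

curl http://localhost:8000/swagger.json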

Using gunicorn

If you want to run the API with gunicorn, use the following command:

gunicorn --workers 1 --preload --worker-class uvicorn.workers.UvicornWorker 'app.app:create_app(config_file="./config/scanners.yml")'

With --preload, the models are loaded once before the workers are forked and shared among them, which reduces memory usage and speeds up worker startup.
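For example, a variant that binds to all interfaces and runs two workers (standard gunicorn flags; tune the worker count to the memory you have available):

gunicorn --bind 0.0.0.0:8000 --workers 2 --preload --worker-class uvicorn.workers.UvicornWorker 'app.app:create_app(config_file="./config/scanners.yml")'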

From Docker

Either build the Docker image or pull our official image from Docker Hub.

To build the Docker image, run the following command:

make build-docker-multi
make build-docker-cuda-multi # If you have a GPU

Or pull the official image:

docker pull laiyer/llm-guard-api:latest

Now, you can run the Docker container:

docker run -d -p 8000:8000 -e LOG_LEVEL='DEBUG' -e AUTH_TOKEN='my-token' laiyer/llm-guard-api:latest

This starts the API on port 8000. You can view the OpenAPI specification at http://localhost:8000/swagger.json.
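To call the API with the token you set in AUTH_TOKEN, pass it as a bearer token. A minimal sketch, assuming a prompt-scanning endpoint such as /analyze/prompt; check http://localhost:8000/swagger.json for the exact paths and request schemas exposed by your version:

curl -X POST http://localhost:8000/analyze/prompt \
  -H "Authorization: Bearer my-token" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'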

If you want to use a custom configuration, you can mount a volume to /home/user/app/config:

docker run -d -p 8000:8000 -e APP_WORKERS=1 -e AUTH_TOKEN='my-token' -e LOG_LEVEL='DEBUG' -v ./entrypoint.sh:/home/user/app/entrypoint.sh -v ./config/scanners.yml:/home/user/app/config/scanners.yml laiyer/llm-guard-api:latest
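To confirm the container picked up the mounted configuration, you can tail its logs with the standard Docker CLI:

docker logs -f $(docker ps -q --filter ancestor=laiyer/llm-guard-api:latest)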

Warning

We recommend allocating at least 16GB of RAM to Docker. We are working on optimizing memory usage at container startup.
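If you run the container with an explicit memory limit, make sure it is at least that large. A sketch using Docker's standard --memory flag:

docker run -d -p 8000:8000 --memory=16g -e AUTH_TOKEN='my-token' laiyer/llm-guard-api:latest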

Troubleshooting

Out-of-memory error

If you get an out-of-memory error, you can reduce the number of scanners in the YAML config file (e.g. scanners.yml). Alternatively, you can enable low_cpu_mem_usage in scanners that rely on HuggingFace models.

Failed HTTP probe

If an HTTP probe fails, it might be because the API is still starting and loading models. You can increase initialDelaySeconds in the Kubernetes deployment.

Alternatively, you can configure lazy_load in the YAML config file to load models only on the first request.