Environment Setup¶
This guide provides a complete walkthrough for setting up the environment, executing precompiled model binaries, and serving them using vLLM on Qualcomm Cloud AI accelerators.
Download Model¶
Download a pre-compiled model binary for Qualcomm Cloud AI accelerators from the Model Catalog.
Efficient Transformers Library¶
The Efficient Transformers Library provides APIs for running precompiled Qualcomm Cloud AI model binaries. It is included in the Cloud AI inference container or can be installed standalone; a usage sketch follows the installation options below.
Option 1: Docker Container

- Launch a container from cloud_ai_inference_ubuntu24:

sudo docker run -it \
--shm-size=2gb \
--workdir /workspace \
--network host \
--mount type=bind,source=${PWD},target=/workspace \
--mount type=bind,source=${HOME}/.cache,target=/cache \
--env HF_HOME=/cache/huggingface \
--env QEFF_HOME=/cache/qeff_models \
--device /dev/accel/ \
ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0

- Activate the Efficient Transformers Library environment.

Note: Cloud AI SDK installation is also required for Linux kernel 6.9 and earlier versions.
Option 2: Local Installation

- Install the Cloud AI SDK. Follow the instructions in the Cloud AI SDK Installation Guide; this covers both the Application (Apps) SDK and the Platform SDK.
- Install the Efficient Transformers Library.
- Activate the Efficient Transformers Library environment.
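With either option active, running a downloaded binary might look like the sketch below. This is illustrative only: QEfficient.cloud.execute is the Efficient Transformers command-line entry point for precompiled binaries, but its exact flags can differ between releases, and the QPC path here is a placeholder.

# Illustrative invocation (assumed flags; check the library docs for your release)
python -m QEfficient.cloud.execute \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--qpc_path /path/to/downloaded/qpc \
--prompt "Hello!" \
--device_group [0]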
Enable P2P between Cloud AI Devices¶
Run the following commands to enable peer-to-peer (P2P) communication between Cloud AI devices:
sudo sh -c 'echo 2600 > /sys/module/qaic/parameters/control_resp_timeout_s'
sudo /opt/qti-aic/tools/qaics-util -a
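To confirm the devices are healthy afterward, the Platform SDK ships a query utility; a quick check might look like this (assuming the default install path and tool name, qaic-util):

# Query device status; healthy cards report a Ready status
sudo /opt/qti-aic/tools/qaic-util -q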
Hugging Face Token¶
Some models on Hugging Face require an access token. Refer to https://huggingface.co/docs/hub/en/security-tokens for more information. Set your token with the following command:
export HF_TOKEN=<your_auth_token>
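If the huggingface_hub package is available in your environment (an assumption; the inference container may already include it), its CLI can persist and verify the token:

# Store the token in the Hugging Face cache and confirm which account it resolves to
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli whoami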
vLLM Inference Serving¶
The precompiled model binary (QPC) can also be served using vLLM via a REST API endpoint.
Launch a container from cloud_ai_inference_ubuntu24:
sudo docker run -it \
--shm-size=2gb \
--workdir /workspace \
--network host \
--mount type=bind,source=${PWD},target=/workspace \
--mount type=bind,source=${HOME}/.cache,target=/cache \
--env HF_HOME=/cache/huggingface \
--env QEFF_HOME=/cache/qeff_models \
--device /dev/accel/ \
ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0
Activate the vLLM environment.
Start API endpoint¶
Start an API serving endpoint using the instructions in the Example: API Endpoint section of the Model Catalog.
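For orientation, a generic vLLM launch is sketched below. The flags shown (--port, --api-key) are standard vLLM options; any Cloud AI-specific flags come from the Model Catalog instructions, and the model name and API key match the chat example that follows.

# Generic vLLM launch (Cloud AI-specific flags from the Model Catalog omitted)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--api-key test-key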
Chat Request Example
After starting the API endpoint, use curl to send a chat request. Replace the "model" value with the model card name from Hugging Face.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-key" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
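You can also query the model list, a standard OpenAI-compatible route that vLLM exposes, to confirm the server is up and to see the exact model name it serves:

# List served models; the "id" field is the name to use in chat requests
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer test-key"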