Environment Setup¶
This guide provides a complete walkthrough for setting up the environment, executing precompiled model binaries, and serving them using vLLM on Qualcomm Cloud AI accelerators.
Download Model¶
Download a pre-compiled model binary for Qualcomm Cloud AI accelerators from the Model Catalog.
Efficient Transformers Library¶
The Efficient Transformers Library provides APIs for running precompiled Qualcomm Cloud AI model binaries. It is included in the Cloud AI inference container or can be installed standalone; a usage sketch follows the installation options below.
Option 1: Docker Container

- Launch a container from cloud_ai_inference_ubuntu24:

sudo docker run -it \
--shm-size=2gb \
--workdir /workspace \
--network host \
--mount type=bind,source=${PWD},target=/workspace \
--mount type=bind,source=${HOME}/.cache,target=/cache \
--env HF_HOME=/cache/huggingface \
--env QEFF_HOME=/cache/qeff_models \
--device /dev/accel/ \
ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0

- Activate the Efficient Transformers Library environment.

Note: Cloud AI SDK installation is also required for Linux kernel 6.9 and earlier versions.
Option 2: Local Installation

- Install the Cloud AI SDK. Follow the instructions in the Cloud AI SDK Installation Guide; this covers both the Application (Apps) SDK and the Platform SDK.
- Install the Efficient Transformers Library.
- Activate the Efficient Transformers Library environment.
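With either option active, running a downloaded binary might look like the sketch below. This is illustrative only: QEfficient.cloud.execute is the Efficient Transformers command-line entry point for precompiled binaries, but its exact flags can differ between releases, and the QPC path here is a placeholder.

# Illustrative invocation (assumed flags; check the library docs for your release)
python -m QEfficient.cloud.execute \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--qpc_path /path/to/downloaded/qpc \
--prompt "Hello!" \
--device_group [0]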
Enable P2P between Cloud AI Devices¶
Run the following commands to enable peer-to-peer (P2P) communication between Cloud AI devices:
sudo sh -c 'echo 2600 > /sys/module/qaic/parameters/control_resp_timeout_s'
sudo /opt/qti-aic/tools/qaics-util -a
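To confirm the devices are healthy afterward, the Platform SDK ships a query utility; a quick check might look like this (assuming the default install path and tool name, qaic-util):

# Query device status; healthy cards report a Ready status
sudo /opt/qti-aic/tools/qaic-util -q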
Hugging Face Token¶
Some models on Hugging Face require an access token. Refer to https://huggingface.co/docs/hub/en/security-tokens for more information. Set your token with the following command:
export HF_TOKEN=<your_auth_token>
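If the huggingface_hub package is available in your environment (an assumption; the inference container may already include it), its CLI can persist and verify the token:

# Store the token in the Hugging Face cache and confirm which account it resolves to
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli whoami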
vLLM Inference Serving¶
The precompiled model binary (QPC) can also be served using vLLM via a REST API endpoint.
Launch a container from cloud_ai_inference_ubuntu24:
sudo docker run -it \
--shm-size=2gb \
--workdir /workspace \
--network host \
--mount type=bind,source=${PWD},target=/workspace \
--mount type=bind,source=${HOME}/.cache,target=/cache \
--env HF_HOME=/cache/huggingface \
--env QEFF_HOME=/cache/qeff_models \
--device /dev/accel/ \
ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0
Activate the vLLM environment.
Start API endpoint¶
Start an API serving endpoint using the instructions in the Example: API Endpoint section of the Model Catalog.
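For orientation, a generic vLLM launch is sketched below. The flags shown (--port, --api-key) are standard vLLM options; any Cloud AI-specific flags come from the Model Catalog instructions, and the model name and API key match the chat example that follows.

# Generic vLLM launch (Cloud AI-specific flags from the Model Catalog omitted)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--api-key test-key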
Chat Request Example
After starting the API endpoint, use curl to send a chat request. Replace the "model" value with the model card name from Hugging Face.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-key" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
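You can also query the model list, a standard OpenAI-compatible route that vLLM exposes, to confirm the server is up and to see the exact model name it serves:

# List served models; the "id" field is the name to use in chat requests
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer test-key"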