Environment Setup

This guide provides a complete walkthrough for setting up the environment, executing precompiled model binaries, and serving them using vLLM on Qualcomm Cloud AI accelerators.

Download Model

Download a pre-compiled model binary for Qualcomm Cloud AI accelerators from the Model Catalog.

Efficient Transformers Library

The Efficient Transformers Library provides APIs for running precompiled Qualcomm Cloud AI model binaries. It is included in the Cloud AI inference container or can be installed standalone.

Option 1: Docker Container

  1. Launch a container from the cloud_ai_inference_ubuntu24 image:

    sudo docker run -it \
      --shm-size=2gb \
      --workdir /workspace \
      --network host \
      --mount type=bind,source=${PWD},target=/workspace \
      --mount type=bind,source=${HOME}/.cache,target=/cache \
      --env HF_HOME=/cache/huggingface \
      --env QEFF_HOME=/cache/qeff_models \
      --device /dev/accel/ \
      ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0
    

  2. Activate the Efficient Transformers Library environment:

    source /opt/qeff-env/bin/activate
    

Note: For Linux kernel 6.9 and earlier, installing the Cloud AI SDK is also required.
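
To confirm the setup, you can try importing the library inside the activated environment. This is a minimal sanity check; it assumes the library's Python import name is QEfficient, as used by the quic/efficient-transformers project.

python -c "import QEfficient; print('Efficient Transformers Library is available')"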

Option 2: Local Installation

  1. Install Cloud AI SDK
    Follow the instructions in the Cloud AI SDK Installation Guide. This includes both the Application (Apps) SDK and the Platform SDK.
  2. Install the Efficient Transformers Library (see the installation sketch after this list).
  3. Activate the Efficient Transformers Library environment:
    source qeff_env/bin/activate
    
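The following is a minimal installation sketch, assuming the library is installed from the public quic/efficient-transformers GitHub repository into a virtual environment named qeff_env (matching step 3); check the repository README for the currently supported Python version.

# Create and activate a virtual environment, then install from GitHub
python3 -m venv qeff_env
source qeff_env/bin/activate
pip install -U pip
pip install git+https://github.com/quic/efficient-transformers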

Enable P2P between Cloud AI Devices

sudo sh -c 'echo 2600 > /sys/module/qaic/parameters/control_resp_timeout_s'
sudo /opt/qti-aic/tools/qaics-util -a
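
Afterwards, you can verify that the devices are visible and ready. This sketch assumes the Platform SDK's qaic-util is installed at its standard path; the exact output format may vary by SDK version.

# Query all Cloud AI devices and show their status
sudo /opt/qti-aic/tools/qaic-util -q | grep -i status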

Hugging Face Token

Some models on Hugging Face require an access token. Refer to https://huggingface.co/docs/hub/en/security-tokens for more information. Set your token with the following command:

export HF_TOKEN=<your_auth_token>
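
Alternatively, if the huggingface_hub CLI is available in your environment, you can log in interactively so the token is stored for future sessions:

huggingface-cli login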

vLLM Inference Serving

The precompiled model binary (QPC) can also be served with vLLM via a REST API endpoint.

Launch a container from the cloud_ai_inference_ubuntu24 image:

sudo docker run -it \
  --shm-size=2gb \
  --workdir /workspace \
  --network host \
  --mount type=bind,source=${PWD},target=/workspace \
  --mount type=bind,source=${HOME}/.cache,target=/cache \
  --env HF_HOME=/cache/huggingface \
  --env QEFF_HOME=/cache/qeff_models \
  --device /dev/accel/ \
  ghcr.io/quic/cloud_ai_inference_ubuntu24:1.20.4.0

Activate the vLLM environment:

source /opt/vllm-env/bin/activate
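
As a quick sanity check that the vLLM environment is active:

python -c "import vllm; print(vllm.__version__)"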

Start an API Endpoint

Start an API serving endpoint using the instructions in the Model Catalog's "Example: API Endpoint" section.
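
For reference, the sketch below shows a minimal launch with vLLM's OpenAI-compatible server. The model name and any Cloud AI-specific flags should be taken from the Model Catalog instructions; --api-key test-key matches the Authorization header used in the chat request below.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --api-key test-key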

Chat Request Example

After starting an API endpoint, use curl to send a chat request. Replace the "model" value with the model card name from Hugging Face.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
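
To confirm the server is up and see which model names it serves, query the models endpoint:

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer test-key"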