How to Use QPCs

The following steps show how to access and run the precompiled QPC (Qualcomm Program Container) models optimized for Qualcomm hardware:

Install the Cloud AI SDK:

This includes both the Application (Apps) SDK and the Platform SDK. Follow the instructions in the Cloud AI SDK Installation Guide.

Set up Efficient Transformers:

Use the Efficient Transformers library to port pretrained models and checkpoints from the Hugging Face Hub into formats that run efficiently on Qualcomm Cloud AI 100 accelerators. Detailed setup instructions are in the Efficient Transformers GitHub repository.

Execute the Precompiled QPC:

Pass the precompiled QPC to the execute API to run inference on different prompts. For example:

python -m QEfficient.cloud.execute \
  --model_name gpt2 \
  --qpc_path qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs/ \
  --prompt "Once upon a time in" \
  --device_group [0]
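Because the same QPC can be reused across prompts, the command above can be generated once per prompt. The following is a minimal sketch rather than part of the Efficient Transformers library: the helper name build_execute_cmd and the second prompt are illustrative assumptions, and only the flags shown in the command above are used.

```python
import shlex
import sys


def build_execute_cmd(model_name, qpc_path, prompt, device_group="[0]"):
    """Return the argv list for one QEfficient.cloud.execute run.

    Hypothetical helper: it only mirrors the flags of the command shown above.
    """
    return [
        sys.executable, "-m", "QEfficient.cloud.execute",
        "--model_name", model_name,
        "--qpc_path", qpc_path,
        "--prompt", prompt,
        "--device_group", device_group,
    ]


# Path copied from the example above; the second prompt is made up.
QPC_PATH = "qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs/"
PROMPTS = ["Once upon a time in", "The capital of France is"]

commands = [build_execute_cmd("gpt2", QPC_PATH, p) for p in PROMPTS]

for cmd in commands:
    # Print a shell-safe rendering of each command for inspection.
    print(shlex.join(cmd))
    # On a host with the Cloud AI SDK and QEfficient installed, run it with:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or with the brackets in the device group.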

Run QPC with vLLM via REST Endpoint

The precompiled QPC can also be served using vLLM via a REST API endpoint:

Download the Cloud AI Inference Container and follow its launch instructions to start the container.

Start the REST API server using vLLM:

Run the following command inside the container to launch the vLLM API server using the precompiled QPC:

VLLM_QAIC_MAX_CPU_THREADS=8 VLLM_QAIC_QPC_PATH=/path/to/qpcs /opt/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --max-model-len 8192 \
  --max-num-seq 1 \
  --max-seq_len-to-capture 128 \
  --device qaic \
  --device-group 0,1,2,3,4,5,6,7

In this example, the context length is 8192 (--max-model-len), the batch size is 1 (--max-num-seq), and the model runs tensor-sliced across 8 Cloud AI devices (--device-group 0,1,2,3,4,5,6,7).
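To make the mapping between these settings and the launch command explicit, the sketch below derives the environment variables and arguments from a context length, a batch size, and a device count. It is an illustration under stated assumptions, not an official launcher: build_vllm_server is a hypothetical helper, it assumes the devices are numbered contiguously from 0, and it reuses only the flags and variables from the command above.

```python
import os


def build_vllm_server(qpc_path, model, context_len, batch_size, num_devices,
                      cpu_threads=8, port=8000):
    """Return (env, argv) for the vLLM API server launch shown above.

    Hypothetical helper; assumes device IDs are 0..num_devices-1.
    """
    env = {
        "VLLM_QAIC_MAX_CPU_THREADS": str(cpu_threads),
        "VLLM_QAIC_QPC_PATH": qpc_path,
    }
    # Tensor slicing across N devices is expressed as a comma-separated ID list.
    device_group = ",".join(str(i) for i in range(num_devices))
    argv = [
        "/opt/vllm-env/bin/python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--model", model,
        "--max-model-len", str(context_len),   # context length
        "--max-num-seq", str(batch_size),      # batch size
        "--max-seq_len-to-capture", "128",
        "--device", "qaic",
        "--device-group", device_group,
    ]
    return env, argv


env, argv = build_vllm_server(
    qpc_path="/path/to/qpcs",
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    context_len=8192,
    batch_size=1,
    num_devices=8,
)
print(" ".join(argv))

# Inside the container, the server could then be started with:
# subprocess.Popen(argv, env={**os.environ, **env})
```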

Test the vLLM endpoint:

Use the following curl command to test the chat completion endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful AI assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
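The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from the previous step is listening on localhost:8000 and returns the OpenAI-style chat completion format (choices[0].message.content). The helper names build_chat_request and send_chat are illustrative.

```python
import json
import urllib.request

MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"


def build_chat_request(base_url, model, user_message,
                       system_message="You are a helpful AI assistant.",
                       api_key="test-key"):
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("http://localhost:8000", MODEL, "Hello!")

# Requires the vLLM server to be running:
# print(send_chat(request))
```

Building the request separately from sending it lets the payload be inspected or logged before any network call is made.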