Gpt oss 120b
Model Overview¶
OpenAI’s GPT-OSS models (gpt-oss-120b & gpt-oss-20b) are open-weight models designed for powerful reasoning, agentic tasks and versatile developer use cases. GPT-OSS-120B is used for production, general purpose, high reasoning use cases that fit into a single 80GB GPU.
- Model Architecture: 117B parameters with 5.1B active parameters. Trained on harmony response format and should only be used with the harmony format as it will not work correctly otherwise.
- Model Source: openai/gpt-oss-120b
- License: Apache 2.0 license. Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- Native MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, making gpt-oss-120b run on a single 80GB GPU.
QPC Configurations¶
| Precision | SoCs / Tensor slicing | NSP-Cores (per SoC) | Batch Size | Chunking Prompt Length | Context Length (CL) | Generated URL | Download |
|---|---|---|---|---|---|---|---|
| MXFP6 | 4 | 16 | 1 | 1 | 8192 | https://qualcom-qpc-models.s3-accelerate.amazonaws.com/SDK1.20.2/openai/gpt-oss-120b/qpc_120b_ts4_cl8k_bs1_1spec_decode.tar.gz | Download |
| MXFP6 | 8 | 16 | 1 | 1 | 8192 | https://qualcom-qpc-models.s3-accelerate.amazonaws.com/SDK1.20.2/openai/gpt-oss-120b/qpc_120b_ts8_cl8k_bs1_1spec_decode.tar.gz | Download |
| MXFP6 | 4 | 16 | 1 | 1 | 4096 | https://qualcom-qpc-models.s3-accelerate.amazonaws.com/SDK1.20.2/openai/gpt-oss-120b/qpc_120b_ts4_cl4k_bs1_1spec_decode.tar.gz | Download |
| MXFP6 | 8 | 16 | 1 | 1 | 4096 | https://qualcom-qpc-models.s3-accelerate.amazonaws.com/SDK1.20.2/openai/gpt-oss-120b/qpc_120b_ts8_cl4k_bs1_1spec_decode.tar.gz | Download |
| MXFP6 | 4 | 16 | 1 | 1 | 32768 | https://qualcom-qpc-models.s3-accelerate.amazonaws.com/SDK1.20.2/openai/gpt-oss-120b/qpc_120b_ts4_cl32k_bs1_1spec_decode.tar.gz | Download |
Run This Model¶
# Download QPC
mkdir -p openai/gpt-oss-120b
cd openai/gpt-oss-120b
wget <Download URL>
tar xzvf <downloaded filename.tar.gz>
# Run QPC
python3 -m QEfficient.cloud.execute --model_name openai/gpt-oss-120b --qpc_path <path/to/qpc> --device_group [0,1,2,3] --prompt "# shortest path algorithm\n" --generation_len 128
API Endpoint¶
# Start REST endpoint with vLLM
VLLM_QAIC_MAX_CPU_THREADS=8 VLLM_QAIC_QPC_PATH=/path/to/qpc python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model openai/gpt-oss-120b \
--max-model-len <Context Length> \
--max-num-seq <Full Batch Size> \
--max-seq_len-to-capture <Chunking Prompt Length> \
--device qaic \
--block-size 32