Skip to content

Gpt oss 20b

Model Overview

OpenAI’s GPT-OSS models (gpt-oss-120b & gpt-oss-20b) are open-weight models designed for powerful reasoning, agentic tasks and versatile developer use cases. GPT-OSS-20B is used for lower latency, and local or specialized use cases.

  • Model Architecture: 21B parameters with 3.6B active parameters. Trained on harmony response format and should only be used with the harmony format as it will not work correctly otherwise.
  • Model Source: openai/gpt-oss-20b
  • License: Apache 2.0 license. Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
  • Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
  • Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
  • Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
  • Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
  • Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making the gpt-oss-20b model run within 16GB of memory.

QPC Configurations

Precision SoCs / Tensor slicing NSP-Cores (per SoC) Full Batch Size Chunking Prompt Length Context Length (CL) Generated URL Download Generation Date
MXFP6 2 8 1 1 4096 https://dc00tk1pxen80.cloudfront.net/SDK1.20.4/openai/gpt-oss-20b/qpc_8cores_1pl_4096cl_1fbs_2devices_mxfp6_mxint8.tar.gz Download
MXFP6 2 16 1 1 8192 https://dc00tk1pxen80.cloudfront.net/SDK1.20.4/openai/gpt-oss-20b/gpt_oss_20b_qpc_16cores_1pl_8192cl_1fbs_2devices_mxfp6_mxint8.tar.gz Download 22-Jan-2026

Run This Model

# Download QPC
mkdir -p openai/gpt-oss-20b
cd openai/gpt-oss-20b
wget <Download URL>
tar xzvf <downloaded filename.tar.gz>

# Run QPC
python3 -m QEfficient.cloud.execute --model_name openai/gpt-oss-20b --qpc_path <path/to/qpc> --prompt "# shortest path algorithm\n" --generation_len 128

API Endpoint

# Start REST endpoint with vLLM
VLLM_QAIC_MAX_CPU_THREADS=8 VLLM_QAIC_QPC_PATH=/path/to/qpc python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model openai/gpt-oss-20b \
  --max-model-len <Context Length> \
  --max-num-seq <Full Batch Size>  \
  --max-seq_len-to-capture <Chunking Prompt Length>  \
  --device qaic \
  --block-size 32