News
protocol - vLLM
5+ hour, 29+ min ago (223+ words) Prompt token count for usage; defaults to 0 if omitted. Mirrors chat_request on DerenderChatRequest. Required by the parsing so parsers receive the full request context. One prompt token count per response; each defaults to 0 if omitted. Char-level (start, end) offsets…
vllm.model_executor.layers.attention.rswa_attention
9+ hour, 20+ min ago (86+ words) Attention layer that reports RSWASpec as its KV cache spec. Drop-in replacement for the standard Attention layer when the model is configured with Reference Sliding Window Attention (R-SWA, rswa_window > 0). The actual masking logic lives in the attention backend…
fused_ops - vLLM
9+ hour, 27+ min ago (98+ words) Fused ops for deepseek_v32 (eager / breakable-cudagraph path). These recover fusions that vLLM's torch.compile passes would normally do but that don't fire when running eager under the breakable CUDA graph. All-reduce + add residual + (standard) RMSNorm, fused via…
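For orientation, here is a minimal unfused reference of the all-reduce + add residual + RMSNorm sequence named in the snippet above. It is a pure-Python sketch, not vLLM's kernel: the function names, the list-based "ranks", and the `eps` default are all assumptions made for illustration.

```python
import math

def all_reduce_sum(shards):
    """Simulate a sum all-reduce over per-rank partial outputs (hypothetical helper)."""
    return [sum(vals) for vals in zip(*shards)]

def add_residual_rmsnorm(hidden, residual, weight, eps=1e-6):
    """Unfused reference: add the residual, then apply standard RMSNorm.

    Returns (normalized output, updated residual). A fused kernel would
    produce the same two results in a single pass over the data.
    """
    new_residual = [h + r for h, r in zip(hidden, residual)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normed, new_residual

# Two simulated ranks, each holding a partial result for the same hidden vector.
shards = [[1.0, 2.0], [2.0, 2.0]]
hidden = all_reduce_sum(shards)
normed, new_residual = add_residual_rmsnorm(hidden, [0.0, 0.0], [1.0, 1.0], eps=0.0)
```

The point of fusing these steps is to avoid writing the intermediate sum back to memory between operations; the arithmetic itself is unchanged.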
vllm.model_executor.models.openai_privacy_filter
8+ hour, 37+ min ago (46+ words) Inference-only OpenAI Privacy Filter model. gpt-oss reused as a bidirectional encoder for token classification: every layer runs non-causal attention with a banded sliding_window mask, and the LM head is replaced with a 33-class BIOES score head.
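As an illustration of what a banded, non-causal mask looks like, the sketch below builds one in plain Python. It assumes the window is a symmetric half-width (position i may attend to j when |i - j| <= window); the model's actual window convention is not stated in the snippet, and `banded_mask` is a hypothetical name.

```python
def banded_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask; True means attention is allowed.

    Assumption: `window` is a half-width applied on both sides of each token,
    so the mask is symmetric (non-causal), unlike a causal sliding window.
    """
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = banded_mask(4, 1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each token sees its neighbours on both sides within the band, which is what lets a decoder-style stack act as a bidirectional encoder for token classification.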
serving - vLLM
5+ hour, 35+ min ago (49+ words) Extract multimodal metadata from a rendered engine prompt. Returns None for text-only prompts. Validate the model and preprocess a chat completion request. Validate the model and preprocess a completion request. This is the authoritative implementation used directly by the GPU-less…
hpc_moe - vLLM
9+ hour, 41+ min ago (174+ words) MoE implementation powered by HPC. Only supported on NVIDIA Hopper GPUs (e.g. H20, H200), and currently limited to FP8 models such as Hy3-FP8, Qwen3-235B-A22B-FP8, etc. Compute the shapes for the temporary and final outputs of the two gemms. workspace_shapes(M, N, K, topk,…
vllm.model_executor.warmup.qwen_triton_warmup
7+ hour, 24+ min ago (26+ words) Warm up Qwen Triton kernels from the loaded model's compile keys. Warm Qwen Triton kernels reported by the JIT monitor.
mm_serde - vLLM
7+ hour, 25+ min ago (22+ words) Encode/decode utilities for multimodal tensors and field metadata over JSON/HTTP, used by the disaggregated generate endpoint.
vllm.v1.attention.backends.mla.prefill.aiter_flash_attn
7+ hour, 57+ min ago (54+ words) AITER Flash Attention backend for MLA prefill (ROCm). This backend calls aiter.flash_attn_varlen_func directly, which natively supports different q/k and v head dims (qk headdim 192, v headdim 128) without padding V, and dispatches to the fast aiter::fmha_fwd_ kernel…
vllm.entrypoints.scale_out.token_in_token_out
7+ hour, 26+ min ago (15+ words) Encode/decode utilities for multimodal tensors and field metadata…