News

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > token_in_token_out > protocol

protocol - vLLM

5+ hour, 29+ min ago  (223+ words) Prompt token count for usage; defaults to 0 if omitted. Mirrors chat_request on DerenderChatRequest. Required by the parsing so parsers receive the full request context. One prompt token count per response; each defaults to 0 if omitted. Char-level (start, end) offsets…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > attention > rswa_attention

vllm.model_executor.layers.attention.rswa_attention

9+ hour, 20+ min ago  (86+ words) Attention layer that reports RSWASpec as its KV cache spec. Drop-in replacement for the standard Attention layer when the model is configured with Reference Sliding Window Attention (R-SWA, rswa_window > 0). The actual masking logic lives in the attention backend…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > models > deepseek_v32 > nvidia > fused_ops

fused_ops - vLLM

9+ hour, 27+ min ago  (98+ words) Fused ops for deepseek_v32 (eager / breakable-cudagraph path). These recover fusions that vLLM's torch.compile passes would normally do but that don't fire when running eager under the breakable CUDA graph. All-reduce + add residual + (standard) RMSNorm, fused via…
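
For orientation, the computation that the "add residual + (standard) RMSNorm" fusion covers can be written unfused in a few lines. This is only a reference sketch in plain Python, not vLLM's kernel: the function name add_residual_rmsnorm, the eps default, and the choice to return the pre-norm sum as the new residual are assumptions made for illustration, and the all-reduce step is left out because it needs a multi-device setup.

```python
import math

def add_residual_rmsnorm(hidden, residual, weight, eps=1e-6):
    """Unfused reference: residual add followed by standard RMSNorm.

    Hypothetical helper for illustration; `hidden` stands for the
    hidden state after the all-reduce has already been applied.
    """
    # Step 1: add the residual stream to the hidden state.
    summed = [h + r for h, r in zip(hidden, residual)]
    # Step 2: RMSNorm: scale by the inverse root mean square, then by the weight.
    mean_sq = sum(x * x for x in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [x * inv_rms * w for x, w in zip(summed, weight)]
    # Return the normalized output and the pre-norm sum (assumed new residual).
    return normed, summed

normed, new_residual = add_residual_rmsnorm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
```

A fused kernel would produce the same two outputs in one pass over the data instead of materializing the intermediate sum separately.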

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > models > openai_privacy_filter

vllm.model_executor.models.openai_privacy_filter

8+ hour, 37+ min ago  (46+ words) Inference-only OpenAI Privacy Filter model. gpt-oss reused as a bidirectional encoder for token classification: every layer runs non-causal attention with a banded sliding_window mask, and the LM head is replaced with a 33-class BIOES score head....
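
A small sketch of the two ideas in this summary, a banded non-causal mask and a BIOES label set, may help. Both helpers below are hypothetical and not taken from vLLM: banded_mask assumes the window extends the same distance on both sides of each token, and bioes_labels assumes the usual scheme of one O label plus B/I/E/S per entity type. Under that scheme 33 classes would correspond to eight entity types (4 x 8 + 1), which is an inference from the arithmetic rather than something the summary states.

```python
def banded_mask(seq_len, window):
    """True where query i may attend to key j.

    Non-causal band: tokens on both sides within `window` positions
    are visible (symmetric window is an assumption).
    """
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def bioes_labels(entity_types):
    """Build a BIOES label list: 'O' plus B/I/E/S for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend(f"{prefix}-{entity}" for prefix in "BIES")
    return labels

mask = banded_mask(seq_len=4, window=1)
# Placeholder type names; the model's real label names are not given here.
labels = bioes_labels([f"TYPE{i}" for i in range(8)])
```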

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > render > serving

serving - vLLM

5+ hour, 35+ min ago  (49+ words) Extract multimodal metadata from a rendered engine prompt. Returns None for text-only prompts. Validate the model and preprocess a chat completion request. Validate the model and preprocess a completion request. This is the authoritative implementation used directly by the GPU-less…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > hpc_moe

hpc_moe - vLLM

9+ hour, 41+ min ago  (174+ words) MoE implementation powered by HPC. Only supported on NVIDIA Hopper GPUs (e.g. H20, H200), and currently limited to FP8 models such as Hy3-FP8, Qwen3-235B-A22B-FP8, etc. Compute the shapes for the temporary and final outputs of the two GEMMs. workspace_shapes(M, N, K, topk,…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > warmup > qwen_triton_warmup

vllm.model_executor.warmup.qwen_triton_warmup

7+ hour, 24+ min ago  (26+ words) Warm up Qwen Triton kernels from the loaded model's compile keys. Warm Qwen Triton kernels reported by the JIT monitor....

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > token_in_token_out > mm_serde

mm_serde - vLLM

7+ hour, 25+ min ago  (22+ words) Encode/decode utilities for multimodal tensors and field metadata over JSON/HTTP, used by the disaggregated generate endpoint....
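
To make the idea of sending tensors over JSON concrete, here is a minimal round-trip sketch using only the Python standard library. It is not the mm_serde wire format: the function names encode_tensor and decode_tensor, the field names dtype, shape, and data, and the choice of little-endian float32 with base64 are all assumptions made for illustration.

```python
import base64
import json
import struct

def encode_tensor(values, shape):
    """Pack a flat list of floats as little-endian float32, base64 inside JSON.

    Hypothetical format for illustration only.
    """
    raw = struct.pack(f"<{len(values)}f", *values)
    return json.dumps({
        "dtype": "float32",
        "shape": list(shape),
        "data": base64.b64encode(raw).decode("ascii"),
    })

def decode_tensor(payload):
    """Inverse of encode_tensor: returns (flat values, shape tuple)."""
    obj = json.loads(payload)
    raw = base64.b64decode(obj["data"])
    count = len(raw) // 4  # 4 bytes per float32 element
    values = list(struct.unpack(f"<{count}f", raw))
    return values, tuple(obj["shape"])

payload = encode_tensor([1.0, 2.5, -3.0, 0.0], shape=(2, 2))
values, shape = decode_tensor(payload)
```

Carrying the dtype and shape next to the raw bytes is what lets the receiver rebuild the tensor, since JSON itself has no binary or array type.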

vLLM docs
docs.vllm.ai > en > latest > api > vllm > v1 > attention > backends > mla > prefill > aiter_flash_attn

vllm.v1.attention.backends.mla.prefill.aiter_flash_attn

7+ hour, 57+ min ago  (54+ words) AITER Flash Attention backend for MLA prefill (ROCm). This backend calls aiter.flash_attn_varlen_func directly, which natively supports different q/k and v head dims (qk headdim 192, v headdim 128) without padding V, and dispatches to the fast aiter::fmha_fwd_ kernel…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > entrypoints > scale_out > token_in_token_out

vllm.entrypoints.scale_out.token_in_token_out

7+ hour, 26+ min ago  (15+ words) Encode/decode utilities for multimodal tensors and field metadata...
