Gentoo Portage Overlays - dev-python/vllm


dev-python/vllm

High-throughput, memory-efficient inference and serving engine for LLMs


vllm-0.25.1

~amd64

cpu cuda humming rocm rust debug python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 debug +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151

License: Apache-2.0 Apache-2.0 BSD-2 BSD CC0-1.0 CDLA-Permissive-2.0 ISC LGPL-3 MIT MPL-2.0 MPL-2.0 UoI-NCSA Unicode-3.0 Unicode-DFS-2016 Unlicense ZLIB

Overlay: stuff

vllm-0.25.0

~amd64

cpu cuda humming rocm rust debug python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 debug +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151

License: Apache-2.0 Apache-2.0 BSD-2 BSD CC0-1.0 CDLA-Permissive-2.0 ISC LGPL-3 MIT MPL-2.0 MPL-2.0 UoI-NCSA Unicode-3.0 Unicode-DFS-2016 Unlicense ZLIB

Overlay: stuff

vllm-0.24.0

~amd64

cpu cuda humming rocm rust debug python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 debug +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151

License: Apache-2.0 Apache-2.0 BSD-2 BSD CC0-1.0 CDLA-Permissive-2.0 ISC LGPL-3 MIT MPL-2.0 MPL-2.0 UoI-NCSA Unicode-3.0 Unicode-DFS-2016 Unlicense ZLIB

Overlay: stuff

vllm-0.23.0

~amd64

cpu cuda humming rocm rust debug python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 debug +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151

License: Apache-2.0 Apache-2.0 BSD-2 BSD CC0-1.0 CDLA-Permissive-2.0 ISC LGPL-3 MIT MPL-2.0 MPL-2.0 UoI-NCSA Unicode-3.0 Unicode-DFS-2016 Unlicense ZLIB

Overlay: stuff

vllm-0.22.1

~amd64

cpu cuda humming rocm rust debug python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 debug +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151

License: Apache-2.0 Apache-2.0 BSD-2 BSD CC0-1.0 CDLA-Permissive-2.0 ISC LGPL-3 MIT MPL-2.0 MPL-2.0 UoI-NCSA Unicode-3.0 Unicode-DFS-2016 Unlicense ZLIB

Overlay: stuff


ChangeLog

commit 4e09028504859cb074badc0de5976cadb9a8a928
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Tue Jul 14 15:52:04 2026 +0200

dev-python/vllm: add 0.25.1

commit 72781f84dec212aa4c250699b42c4153470c4211
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jul 12 10:23:40 2026 +0200

dev-python/vllm: add 0.25.0

Regenerate the vendored Rust CRATES from rust/Cargo.lock and update
GIT_CRATES: llm-multimodal moves to smg-project, oss-harmony
(harmony@v0.0.11) is new. Repin the flash-attention submodule
dd62dac->2c839c3 and carry the fa3-only-when-archs + py314 patches
forward onto it.

Track upstream requirements/*.txt: flashinfer-python ~0.6.13,
humming-kernels ~0.1.10, mistral-common >=1.11.5, and add
torchcodec>=0.14 (GPU video decode) to the cuda target. PyNvVideoCodec
and nvtx (both cuda-only) are omitted -- vllm imports them lazily.

adds-only: 0.24.0 stays for its consumers.

commit 0faf7dbc57bfc6964aeee386cbf35afc32b3a0a3
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jul 5 14:16:58 2026 +0200

dev-python/vllm: condense 0.22.1/0.23.0 build-notes comments

commit a33aa273844e9b5381724680207f61a7e0a5775b
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jul 5 13:26:28 2026 +0200

dev-python/vllm: condense 0.24.0 build-notes comments

commit 0719c68e7fcefc2d7165377d7a90556fc514facf
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jul 5 13:20:29 2026 +0200

dev-python/vllm: drop redundant humming guard patch for 0.24.0

vllm-0.24.0-humming-import-optional.patch wrapped the quant registry's
`from .humming import HummingConfig` in try/except so USE=-humming builds
could still resolve the other quant methods. Upstream #44921 (landed for
vllm > 0.23.0) restructured humming to lazily import the external `humming`
package via vllm.utils.humming, so the registry import no longer hard-fails
without humming-kernels. Verified 2026-07-05 against the installed 0.24.0
with humming-kernels absent: importing
vllm.model_executor.layers.quantization.humming succeeds (every `_hm.*`
access is deferred to a method body; the top-level import is TYPE_CHECKING-
only). A Humming-quantized model still errors at load time under
USE=-humming, which is the intended behaviour. src_prepare re-verified
clean without the patch. 0.23.0 and 0.22.1 predate #44921 and keep their
guard patches.

commit a20d1d292097e96cff5c9525d883022a4dea5b61
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Wed Jul 1 10:36:58 2026 +0200

dev-python/vllm: add 0.24.0

Track upstream 0.24.0's dependency shifts (requirements/common.txt,
cuda.txt, cpu.txt):

- transformers floor 4.56.0 -> 5.5.3
- fastapi floor 0.115.0 -> 0.133.0, add <0.137.0 cap and
>=starlette-1.0.1 (0.24.0 moves to the Starlette 1.0 line)
- xgrammar floor 0.2.0 -> 0.2.1
- prometheus-fastapi-instrumentator floor 7.0.0 -> 8.0.0
- add jsonschema >=4.23.0 (new upstream dep: MiniMax M3 tool schema
validation)
- drop gguf (upstream removed gguf >=0.17.0 from common.txt)
- torch stays ==2.11.0 (cpu.txt), so ~pytorch-2.11.0 /
~torchaudio-2.11.0 / caffe2-2.11.0-r90 carry over; compressed-tensors
~0.17.0, depyf ~0.20.0, llguidance <1.8 unchanged

Not a 0.24.0 shift, corrected while here: opencv floor restored 4.12.0
-> 4.13.0. Upstream has required opencv >=4.13.0 all along; the ebuild
had relaxed it to ::gentoo's then-max 4.12.0, which now ships 4.13.0.

USE=humming (cuda-only, default off) still pins ~humming-kernels-0.1.4;
upstream 0.24.0 cuda.txt moved to ==0.1.6, not yet packaged (overlay has
0.1.4 + an unconsumed 0.1.7). Repin once 0.1.6 lands.

CRATES regenerated from 0.24.0's rust/Cargo.lock (629 crates).

files/vllm-0.24.0-humming-import-optional.patch rebased for 0.24.0:
0.24.0 dropped the .gguf import from the quantization __init__ block
(consistent with removing the gguf dep), shifting context. Guards the
optional .humming quant import so USE=-humming still resolves the other
quantization methods.

commit 2e70cf58574d1e430f9b121002461b6f6a591349
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Tue Jun 16 15:28:30 2026 +0200

dev-python/vllm: drop 0.21.0

commit 3b69fe159a102802329fdeed703693e9634f1ca7
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Mon Jun 15 23:53:50 2026 +0200

dev-python/vllm: generalize CUDA host-compiler comments

Drop a maintainer-private notes reference and host-specific wording
from the CUDAHOSTCXX rationale comments. The reason is unchanged: CUDA
13 nvcc rejects gcc>15, so nvcc's host compiler is pinned to the gcc-15
slot when the active system gcc is newer.
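
As a rough illustration of that rule, the sketch below picks a host compiler from the active gcc major version. The function name and its return convention are assumptions made for the example; the ebuild does this in shell.

```python
def pick_cuda_host_cxx(active_gcc_major, max_supported_major=15):
    """Return the compiler name to pin for nvcc, or None to keep the default.

    max_supported_major=15 reflects the statement above that CUDA 13 nvcc
    rejects gcc newer than 15; adjust it for other CUDA releases.
    """
    if active_gcc_major > max_supported_major:
        # Active gcc is too new for nvcc: pin the newest supported slot.
        return f"g++-{max_supported_major}"
    # Active gcc is acceptable, no pin needed.
    return None
```

A caller would export the returned name as CUDAHOSTCXX only when it is not None.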

commit 215f771f6154bbfc7d892beec6186f3244d6e7e4
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Mon Jun 15 00:15:21 2026 +0200

dev-python/vllm: gate the humming quant backend behind USE=humming

vLLM's quantization registry imports the humming backend unconditionally
on CUDA builds (quantization/__init__.py pulls in humming.py, which
imports the external `humming` package under current_platform.is_cuda()
with no fallback). With humming-kernels absent, loading any quantized
model under vllm[cuda] aborts with ModuleNotFoundError regardless of the
method requested.

Rather than force humming-kernels -- and its import-time subprocess leak
(vllm-project/vllm#44904) -- on every cuda user, gate it behind
USE=humming and carry $-humming-import-optional.patch so the registry
tolerates a missing humming-kernels: other quant methods keep working,
and requesting humming without it raises a clear install hint. This
mirrors upstream's lazy-import fix (vllm-project/vllm#44921), which lands
after 0.23.0; drop the patch at that bump. Pins match requirements/
cuda.txt: 0.1.4 for 0.23.0, 0.1.2 for 0.22.1.

Bug: https://github.com/istitov/stuff/issues/274
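
The guard pattern this commit describes can be sketched as follows. The registry contents and function name are hypothetical, not vLLM's API: `json` stands in for an always-available method and `humming_not_installed_xyz` for an absent optional package.

```python
import importlib

# Hypothetical registry: quantization method name -> providing module.
QUANT_BACKENDS = {
    "builtin": "json",                       # stand-in, always importable
    "humming": "humming_not_installed_xyz",  # stand-in, deliberately absent
}


def resolve_backend(method):
    """Import the backend for one method, without touching the others."""
    module_name = QUANT_BACKENDS[method]
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Fail only for the method that was requested, with a clear hint.
        raise ImportError(
            f"quantization method {method!r} needs the optional package "
            f"{module_name!r}; install it or choose another method"
        ) from exc
```

Because each backend is imported only on request, a missing optional package no longer prevents the other methods from resolving.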

commit 8f7528077839f7be54a2c5e0534cb236d245e12f
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jun 14 11:48:22 2026 +0200

dev-python/vllm: require triton-bin on rocm, note USE_LIBUV=0

vllm's ROCm path runs its kernels (slot mapping, sampling, the
torch.compile/inductor path) through @triton.jit, and its custom
paged-attention falls back to a Triton attention kernel on gfx targets
without a ROCm custom-attention kernel (e.g. gfx1150). Gentoo's
source-built torch does not pull Triton the way upstream's wheels do,
so without it vllm[rocm] dies at first GPU inference with "'function'
object is not subscriptable" -- the same wall the cuda path hit. Add
~dev-python/triton-bin-3.6.0 to the rocm? deps; mainline triton's AMD
backend JITs the gfx kernels via hipcc.

vllm also opens a torch.distributed TCPStore at engine start, even for
a single GPU. Since torch 2.4 the TCPStore defaults to the libuv
backend, but caffe2's ROCm build ships no libuv: it rides in via
tensorpipe, which caffe2 disables for ROCm (USE_TENSORPIPE off when
rocm). The cuda build keeps tensorpipe, so this is rocm-specific. vllm
aborts with "DistStoreError: use_libuv was requested but PyTorch was
built without libuv support"; document USE_LIBUV=0 in pkg_postinst.
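
For callers that launch vllm from Python rather than a shell, a minimal sketch of applying the same workaround is shown below. The helper name is an assumption; only the USE_LIBUV=0 setting comes from the note above.

```python
import os


def env_without_libuv(base=None):
    """Return a copy of the environment with USE_LIBUV forced to "0"."""
    env = dict(os.environ if base is None else base)
    # Ask torch.distributed's TCPStore not to use the libuv backend.
    env["USE_LIBUV"] = "0"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the server process.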

Covers 0.22.1 and 0.23.0.

commit e39fff11679eb91e2e394ff7aceea5a1268d2822
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jun 14 09:26:23 2026 +0200

dev-python/vllm: require caffe2[distributed,gloo] and triton-bin on cuda

vllm resolves its runtime platform from the host hardware and, on a GPU
host, imports torch.distributed.PrefixStore/ProcessGroup unconditionally
at module load and builds a gloo CPU coordination group at engine init.
Our sci-ml/caffe2 builds CUDA with USE_NCCL=OFF, so the device group
also falls back to gloo. Both caffe2 USE flags are default-off, so vllm
imports cleanly but crashes at runtime without them -- first ImportError
for PrefixStore, then "Fallback Gloo backend is not available". Require
caffe2[distributed,gloo].
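
A quick way to check an installed torch for the two features named above is sketched here. The function name is hypothetical; `torch.distributed.is_available()` and `is_gloo_available()` are existing torch calls, and the latter is only queried once distributed support is confirmed.

```python
def torch_distributed_problems():
    """Return a list of human-readable problems, empty when all is well."""
    try:
        import torch.distributed as dist
    except ImportError:
        return ["torch is not importable"]
    if not dist.is_available():
        # Corresponds to a build without the distributed feature.
        return ["torch was built without distributed support"]
    problems = []
    if not dist.is_gloo_available():
        # Corresponds to a build without the gloo backend.
        problems.append("torch.distributed has no gloo backend")
    return problems
```

An empty list suggests the build has what the failure modes above require; it does not prove vllm will start.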

vllm's CUDA kernels (slot mapping, attention, sampling, the
torch.compile/inductor path) are @triton.jit, and Gentoo's source-built
torch does not pull Triton the way upstream's wheels do, so without it
vllm dies at first GPU inference with "'function' object is not
subscriptable". Add ~dev-python/triton-bin-3.6.0 to the cuda deps.

flashinfer JIT-compiles its kernels with nvcc at runtime, and CUDA 13.x
rejects host compilers newer than gcc 15; add a pkg_postinst note
pointing at NVCC_PREPEND_FLAGS / eselect gcc, since that alignment can't
be expressed as a dependency.

Covers 0.22.1 and 0.23.0 (istitov/stuff#274).

commit cb99058c7a7f8c4f37d2300a3607fec9c4e163e8
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat Jun 13 17:21:06 2026 +0200

dev-python/vllm: fix flash-attn python3_14 gate (cuda)

vllm's PYTHON_COMPAT enables python3_14, but the bundled vllm-flash-attn
hard-codes a supported-Python whitelist in its CMakeLists.txt and checks
it via find_python_constrained_versions(). On 3.14 the CMake configure
step aborts with a FATAL_ERROR before any extension is built, so
emerge vllm[cuda] fails under PYTHON_SINGLE_TARGET=python3_14
(istitov/stuff#274).

Add a per-feature patch, applied as a second eapply into the pre-staged
flash-attention source after the FA3-skip patch, that adds "3.14" to the
whitelist. The flash-attn extensions build against the Python stable
ABI (Development.SABIModule, USE_SABI 3), so the resulting abi3 module
is independent of the CPython minor version and widening the assertion
is safe. Scoped to 0.22.1 and 0.23.0, whose pinned flash-attn commits
(bce2942, dd62dac) carry the identical gate.

commit c2e9a3185e59fcb228adb4dc2779c46a7b774b5c
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat Jun 13 15:13:15 2026 +0200

dev-python/vllm: add 0.23.0

Minor bump. torch stays at 2.11.0 — upstream's build-system and
requirements/*.txt still pin torch==2.11.0.

requirements/common.txt re-pins compressed-tensors 0.15.0.1 -> 0.17.0
(packaged in the preceding commit) and raises mistral-common to >=1.11.3;
both updated. cuda? deps follow requirements/cuda.txt: flashinfer-python
-> ~0.6.12 and fastsafetensors -> >=0.3.2 (tilelang, nvidia-cutlass-dsl,
quack-kernels unchanged). humming-kernels (now 0.1.4) and tokenspeed-mla
stay omitted as before; the nvidia-cudnn-frontend floor that flashinfer
carries rose to >=1.19.1.

vllm-flash-attn pin moves bce2942 -> dd62dac (cuda?); the FA3-only-when-
archs patch applies unchanged to the new commit and is renamed to match.

The bundled vllm-rs frontend (USE=rust) gains mimalloc + libmimalloc-sys
in its vendored CRATES (both MIT, already covered by LICENSE); the rest
of the crate set and the llm-multimodal GIT_CRATES pin are unchanged.

commit cc6f5c742ce9577e032a90ac824f9c9c83b4e626
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat Jun 6 22:02:54 2026 +0200

dev-python/vllm: drop 0.20.2, 0.22.0

commit 1e631df034ec1c462a95dc1c94d0749e9a48ad30
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri Jun 5 18:19:06 2026 +0200

dev-python/vllm: point USE=cpu at the toolchain libgomp

vllm 0.22.x's cpu_extension.cmake locates OpenMP via
vllm_prepare_torch_gomp_shim(), which expects a libgomp vendored inside
PyTorch (torch.libs/libgomp-*.so, a PyPI-wheel artifact). Our
source-built sci-ml/caffe2 ships none, so cmake falls back to
find_library(NAMES gomp) — which can't see Gentoo's libgomp under the
gcc-internal dir, so USE=cpu died at configure. Set CMAKE_LIBRARY_PATH
from the active toolchain (tc-getCC -print-file-name) so the fallback
resolves. The older MKL-MPI link caveat is separately handled by the
>=sci-ml/caffe2-2.11.0-r90 pin.
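
The lookup can be approximated outside the ebuild as below. This is a sketch under assumptions: the helper names are invented, and `libgomp.so` as the argument to gcc's `-print-file-name=` option is a guess at what the ebuild queries. gcc prints the bare file name back when it cannot find the library, which the parser treats as "not found".

```python
import os
import subprocess


def library_dir_from_compiler_output(output):
    """Turn `cc -print-file-name=...` output into a directory, or None."""
    path = output.strip()
    if os.path.isabs(path):
        return os.path.dirname(path)
    # A non-absolute answer means the compiler did not locate the file.
    return None


def find_libgomp_dir(cc="gcc"):
    """Ask the compiler where its libgomp lives (requires cc on PATH)."""
    result = subprocess.run(
        [cc, "-print-file-name=libgomp.so"],
        capture_output=True, text=True, check=True,
    )
    return library_dir_from_compiler_output(result.stdout)
```

The resulting directory is the kind of value that would be handed to CMAKE_LIBRARY_PATH so that `find_library(NAMES gomp)` can resolve.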

commit 7d6ab8570593fa2711e5ebdca04a15cfabb54131
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri Jun 5 15:29:25 2026 +0200

dev-python/vllm: add 0.22.1

Patch bump with no build-interface change vs 0.22.0: every requirements
pin (torch 2.11.0, xgrammar, compressed-tensors, depyf, llguidance,
outlines-core, mistral-common) is unchanged, the flash-attention commit
pin is unchanged, and the Rust frontend's 619 CRATES + the llm-multimodal
GIT_CRATES commit are byte-identical. Pure copy.

commit 6bed04ae4dbe3c0ed81cceb043244becfa4de373
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri May 29 22:01:36 2026 +0200

dev-python/vllm: add 0.22.0

0.22.0 adds a Rust serving frontend (vllm-rs); it is opt-in at runtime
(VLLM_USE_RUST_FRONTEND=1), so its 600+-crate vendored build is gated
behind USE=rust (default off) rather than forced on every install.
Follows upstream to llguidance >=1.7.0, flashinfer 0.6.11_p2 and
nvidia-cutlass-dsl 4.5.2[cu13], and moves the vllm-flash-attn pin.
humming-kernels (new cuda quant dep) is left unpackaged for now; see the
ebuild comment.

commit 43944e2eb4b50f87e2a62ecb2276f23b329c2e64
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200

dev-python/vllm: skip _vllm_fa3_C when no Hopper arch at 0.21.0

Two changes to vllm-0.21.0[cuda] sharing a SRC_URI pre-stage.

(1) Skip _vllm_fa3_C target when CUDA_ARCHS has no Hopper member.
vllm-flash-attn intersects "9.0a;" with CUDA_ARCHS to compute
FA3_ARCHS, but adds the FA3 .cu files to _vllm_fa3_C unconditionally
even when FA3_ARCHS is empty — nvcc then compiles them at its default
arch, wasting ~30-60 min on Ampere/older. Wrap the target-definition
block in `if(FA3_ARCHS)` with an `add_custom_target(_vllm_fa3_C)`
empty-stub fallback (DeepGEMM pattern); vllm's setup.py drives ninja
with explicit `--target=_vllm_fa3_C` regardless of arch, so the
target must exist as a no-op. Apply via VLLM_FLASH_ATTN_SRC_DIR
pre-staging (vllm's vllm_flash_attn.cmake already honours that).
Runtime fallback is FA3_AVAILABLE=False → vllm picks FA2 backend.

(2) Make MAX_JOBS env-overridable.
Prior `export MAX_JOBS=4` clobbered caller env. Switch to
`MAX_JOBS="$"` so users on smaller/larger hosts can
adjust without ebuild-edit.

Verified end-to-end on sm_86: vllm.LLM imports, CUDA detected,
zero `_sm90.cu.o` builds, ~1h35m wallclock (was ~2h30m before
the FA3-skip patch). FA3-on-Hopper-CUDA-13.2 separately documented
as upstream-blocked — see feedback_flash_attn_fa3_broken_on_cuda_13.md.
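
The decision in (1) can be modelled in a few lines. This is a simplified sketch, not the CMake logic itself: it uses exact string matching on a semicolon-separated arch list, whereas the real helper may match more loosely, and the function names are invented for the example.

```python
# Archs FA3 can target, per the "9.0a;" list mentioned above.
FA3_SUPPORTED = ("9.0a",)


def fa3_archs(cuda_archs):
    """Intersect a semicolon-separated CUDA_ARCHS string with FA3's archs."""
    requested = [arch for arch in cuda_archs.split(";") if arch]
    return [arch for arch in requested if arch in FA3_SUPPORTED]


def fa3_target_kind(cuda_archs):
    """Say whether _vllm_fa3_C should be a real extension or an empty stub."""
    # The target must always exist because the build requests it by name,
    # so an empty intersection yields a no-op stub, not a missing target.
    return "extension" if fa3_archs(cuda_archs) else "stub"
```

On an Ampere-only arch list the intersection is empty, so the target becomes a stub and no FA3 sources are compiled.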

commit 15577d91a539802b579ec087e125a6af294d0b64
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200

dev-python/vllm: stamp cuda? verified on sm_86 at 0.21.0

Match the rocm verified-date stamp already present. Adds host
context (sm_86 Ampere, CUDA 13.2, CUDAHOSTCXX=g++-15, MAX_JOBS=4,
339 CUDA template files, ~2.5h wallclock, ~14 GiB peak RSS) and
notes the FA3-on-Ampere build-time quirk worth a follow-up patch.

commit a1d9fdf3fb13cf3a0f67e733e59b4e14d9f333d0
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200

dev-python/vllm: add missing uvloop runtime dep at 0.21.0

vllm/v1/utils.py:25 imports uvloop unconditionally — fires from
the `from vllm import LLM` lazy chain. Upstream forgot to declare
it in any requirements/*.txt; they likely rely on uvicorn[standard]
transitively, but gentoo ships uvicorn without [standard].

Without the dep, vllm.LLM raises ModuleNotFoundError at first import.
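
Undeclared imports of this kind can be caught before runtime with a small check such as the following sketch. The helper name is an assumption; it only inspects top-level module names via the standard `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None for an absent top-level module, without
    # importing the module itself.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running it with `["uvloop"]` on a target system would show whether the dependency added here is actually satisfied.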

commit 414a059b7a040fea2d4c11cde91db05548aad224
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200

dev-python/vllm: tighten cuda? branch pins at 0.21.0

Four edits to align with upstream cuda.txt:
- pin tilelang ~0.1.9 (upstream exact)
- add nvidia-cutlass-dsl ~4.4.2 (upstream exact; sibling
flashinfer-python-0.6.8_p1 commit enforces transitively, restated
here as belt-and-suspenders)
- remove apache-tvm-ffi from cuda? BDEPEND (vllm has zero direct
imports, grepped setup.py + *.py + *.cpp + *.cu; flashinfer's
own BDEPEND pulls it at the right time)
- omit tokenspeed-mla from cuda? RDEPEND (lazy try/except imports
with `pip install` hint, Blackwell SM100/SM103-only kernels,
transitively pulls tokenspeed-triton — mirrors the existing
amd-quark exclusion pattern)

Also drop the setuptools<81 cap from BDEPEND with inline comment.
Acknowledged tradeoff against feedback_version_handling.md ("drop
the version rather than relax the cap"): gentoo only ships 79.0.1
+ 82.0.1 (nothing in 80/81), downgrade trips a hard pkg-resources-81
block, and vllm setup.py uses only the standard setuptools surface
(no pkg_resources, no setuptools.command.* removed in 81+). Re-evaluate
the cap at the next vllm bump.

cudnn-frontend cap belongs in flashinfer-python (where it's
applied), not vllm — vllm has zero cudnn_frontend imports.

commit 95106ad7396fd9add959684d0e0238a657078fed
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 12:44:20 2026 +0200

dev-python/vllm: dated cap-relax note for opencv 4.13->4.12

Upstream 0.21.0 says opencv-python-headless>=4.13.0 but ::gentoo's
media-libs/opencv tops at 4.12.0. Empirically verified 2026-05-16 on a
Gentoo build host with media-libs/opencv-4.12.0-r1[python] freshly merged
that the full cv2 surface vllm imports is present in 4.12: resize,
cvtColor, COLOR_BGR2RGB, CAP_PROP_, VideoCapture incl. the 3-arg
bytes+backend constructor form added in opencv 4.10, VideoWriter,
VideoWriter_fourcc, and the videoio_registry submodule. The upstream
4.13.0 lower bound is wheel-publication churn, not an API extension vllm
depends on. Add the verification note to the USE-flag preamble -- comments
aren't allowed inside the python_gen_cond_dep block so the per-dep
position doesn't work.
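
A surface check of this kind can be sketched as below. It is illustrative only: the helper name is invented, the name list covers just the fully spelled-out cv2 names from the message, and a stand-in object is used so the example runs without opencv installed.

```python
from types import SimpleNamespace

# cv2 names listed in the commit message (CAP_PROP_ entries omitted).
REQUIRED_CV2_NAMES = (
    "resize", "cvtColor", "COLOR_BGR2RGB", "VideoCapture",
    "VideoWriter", "VideoWriter_fourcc", "videoio_registry",
)


def missing_names(module, names=REQUIRED_CV2_NAMES):
    """Return the required names that the given module does not expose."""
    return [name for name in names if not hasattr(module, name)]


# Stand-in for an opencv build that lacks one of the names.
fake_cv2 = SimpleNamespace(
    resize=None, cvtColor=None, COLOR_BGR2RGB=4, VideoCapture=None,
    VideoWriter=None, VideoWriter_fourcc=None,
)
gaps = missing_names(fake_cv2)
```

Against a real install the call would be `missing_names(cv2)` after `import cv2`; attribute presence alone does not confirm the 3-arg VideoCapture form.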

commit 32c30fe59e2a7b87297b954509fa256f60aba033
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 11:53:42 2026 +0200

dev-python/vllm: re-verify 0.21.0 rocm build on gfx1150

USE=rocm AMDGPU_TARGETS=gfx1150 build of 0.21.0 ran clean against
caffe2-2.11.0-r90[rocm,amdgpu_targets_gfx1150,-nccl,-cusparselt] on a Strix Point
host. Four HIP extensions (_C, _moe_C, _rocm_C, cumem_allocator) built and imported
from the install tree. The previous wording was honest about the prior-version-only
scope but now stale — collapse the two states into a single dated-evidence line that
records both runs.

commit bf16ae96a670ab2ddfc592839f2d1d28d9a61875
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 11:38:42 2026 +0200

dev-python/vllm: rescope 0.20.x verified-claims on 0.21.0

The three dated 'verified 20...' comments in the 0.21.0 ebuild were carried
over from the 0.20.x source unchanged, which falsely implied the
rocm/cuda/cpu paths were re-verified at this bump. In reality only
USE=-cpu -cuda -rocm (default) was build-checked on 0.21.0. Reword each to
mark the empirical date as evidence for 0.20.x only:

* gfx1150 rocm build -- verified for 0.20.1 on 2026-05-08; 0.21.0 adds
  tilelang as a hard rocm-target dep, not re-verified here.
* FetchContent network-sandbox -- verified for 0.20.1 on 2026-05-07;
  0.21.0's FetchContent set wasn't re-audited.
* MAX_JOBS=4 OOM threshold -- measured against 0.20.1 on 2026-05-07; the
  heavy CUDA template set (paged_attention, layernorm_quant, w8a8/fp8) is
  structurally unchanged in 0.21.0, so the value stays a conservative
  default but the underlying RSS profile wasn't re-measured.

No functional change.

commit 39e78517888d354db7278e803b0eeb44129aa57d
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 10:50:45 2026 +0200

dev-python/vllm: drop 0.20.1, retire cpu-system-libgomp patch

Retention: keep 0.20.2 and 0.21.0. The 0.20.1 cpu patch was subsumed by upstream
0.21.0's cmake/cpu_extension.cmake; with the last consumer dropped, retire the file.

commit 47719071ad5021590e3a0d17c0de8bbbb8773376
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 10:49:06 2026 +0200

dev-python/vllm: add 0.21.0

Common-dep refresh per 0.21.0 upstream requirements/common.txt:
- xgrammar lower bound 0.1.32 -> 0.2.0 (upper cap <1.0.0 preserved)
- mistral_common 1.11.0 -> 1.11.2
- model-hosting-container-standards 0.1.13 -> 0.1.14

requirements/rocm.txt added tilelang as a hard runtime dep ("required for
mhc module to be imported correctly"); add it to the rocm? branch. cuda?
branch already had it.

Drop the cpu-system-libgomp patch: upstream cmake/cpu_extension.cmake now
falls back to `find_library(OPEN_MP NAMES gomp REQUIRED)` when
VLLM_TORCH_GOMP_SHIM_DIR is empty, replacing what our local patch did.

Build-verified end-to-end via FEATURES=-xattr ebuild ... merge with
USE=-cpu -cuda -rocm (default). Lint clean.

Known gaps in the cuda? branch (out of scope here, deferred):
- upstream pins nvidia-cutlass-dsl==4.4.2 exactly; we only have
  4.5.0/4.5.1 in tree, and our cuda? branch never named it as a dep
  (transitive only). cuda users should be aware.
- flashinfer-cubin, nvidia-cudnn-frontend>=1.13.0<1.19.0, and the new
  tokenspeed-mla==0.1.2 pin are not in cuda? either.
- opencv lower bound stays at 4.12.0 -- upstream says >=4.13.0 but
  ::gentoo's opencv tops out at 4.12.0.

These deferred items are tracked in an internal vllm packaging-plan note.

commit 6fe9f62f76b21c712be5d735685d07a1582f32e4
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Wed May 13 14:35:38 2026 +0200

dev-python/vllm: disable py3.11

commit 1ad8828b49914211d59f9dfc0a50c6a16ba65c95
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 18:41:25 2026 +0200

dev-python/vllm: switch to DISTUTILS_SINGLE_IMPL

The whole pytorch/HF stack consumed here is SINGLE_IMPL: pytorch,
transformers, tokenizers, torchvision, plus the now-single-impl
dev-python/{compressed-tensors,xgrammar,flashinfer-python,tilelang,
quack-kernels,runai-model-streamer-bin,tensorizer}. Multi-impl
consumer with bare $ on them produces
python_targets_python3_*(-)? that the children can't expose, blocking
emerge resolution. Convert vllm to single-impl too: SINGLE_IMPL deps
on bare $, remainder wrapped in
python_gen_cond_dep.

--scan=n: pre-existing UnknownRestrict on network-sandbox is policy
(setup.py FetchContent of CUTLASS/spdlog/etc. needs the bypass for
cuda/rocm/cpu builds).

commit 25ffb50ea2ad6216c1e1dadfcd52bc46cbd28579
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 15:05:49 2026 +0200

dev-python/vllm: add 0.20.2

commit 51ef2b9d42a48f95d8751ca36f9c80e37a412c9f
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri May 8 12:54:18 2026 +0200

dev-python/vllm: add USE=rocm support

Sister to the USE=cuda landing — drives VLLM_TARGET_DEVICE=rocm and
compiles the _C / _moe_C / _rocm_C extensions (csrc/rocm/* +
hipify-converted CUDA sources) via hipcc against the system ROCm
toolchain at /opt/rocm.

Inherits sci-ml/caffe2's MKL-MPI scrub fork (>=2.11.0-r90) — same
public-link-interface pollution caveat as cuda; the cumem_allocator
extension's link step depends on it.

PYTORCH_ROCM_ARCH is derived from AMDGPU_TARGETS via rocm.eclass's
get_amdgpu_flags(); REQUIRED_USE adds the standard rocm? (
$ ) gate so the user's gfx target selection is
enforced. RESTRICT="rocm? ( network-sandbox )" mirrors the cpu/cuda
clauses (CMake FetchContent of CK / spdlog / etc. during the HIP
extension compile).

Build-verified end-to-end on this host's gfx1150 (Strix Point iGPU)
with caffe2[rocm,amdgpu_targets_gfx1150,-nccl,-cusparselt],
hip-7.2.3 + the full hipBLAS/hipBLASLt/hipFFT/hipRAND/hipSOLVER/
hipSPARSE/hipCUB stack, and AMDGPU_TARGETS=gfx1150. All three HIP
extensions link cleanly:
_C.abi3.so 103 MB
_rocm_C.abi3.so 50 MB
_moe_C.abi3.so 5.3 MB
and import + initialise in CPython 3.13 (vllm.platforms.rocm + the
extension modules).

amd-quark (in upstream's requirements/rocm.txt) is intentionally
omitted: vllm core never imports it directly, only the
vllm.model_executor.layers.quantization.quark internals reach for it
when Quark-quantized models are loaded — and dev-python/amd-quark-bin
in this overlay caps PYTHON_COMPAT to 3., which would block
vllm on 3.13/3.14. Users wanting Quark quantization install
amd-quark-bin separately and accept the python target restriction.

commit 5b1b120898e76ccae36b694df10a91f13cb7e49a
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri May 8 07:21:00 2026 +0200

dev-python/vllm: pin >=sci-ml/caffe2-2.11.0-r90 for MKL-MPI link fix

Both USE=cpu and USE=cuda were blocked by ::gentoo's caffe2 exposing
MKL MPI / cluster libs (scalapack, cdft, blacs_intelmpi, intel_thread)
in caffe2::mkl's public link interface, breaking cumem_allocator's
link step on hosts without Intel Cluster Edition + Compiler. The fix
sits in this overlay's sci-ml/caffe2-2.11.0-r90 fork — pin both the
cpu? and cuda? RDEPEND blocks at >=2.11.0-r90 so Portage's solver
won't silently fall back to ::gentoo's caffe2-2.11.0-r3.

Verified 2026-05-08: USE=cuda compile completes 340/340 ninja steps,
all CUDA C++ extension modules (_C, _C_stable_libtorch, cumem_allocator,
_moe_C, _vllm_fa2_C, _vllm_fa3_C) build and install cleanly under the
gcc-15 host pin + MAX_JOBS=4 throttle.

The CAVEAT block in the ebuild header is rewritten as historical:
the blocker is no longer present for users on this overlay. Drop the
>=r90 pin once an equivalent upstream fix lands in pytorch.

commit f0e0d829052c69481410903b8de35b89403e642b
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 20:25:30 2026 +0200

dev-python/vllm: add USE=cuda support

Tier 6 — wires up the full Tier-0..5 CUDA dependency stack we just
landed (apache-tvm-ffi, cuda-bindings, cuda-python, cuda-tile-bin,
flashinfer-cubin, flashinfer-python, nvidia-cudnn-frontend,
nvidia-cutlass-dsl + libs-base + libs-cu13, nvidia-ml-py, numba,
quack-kernels, fastsafetensors, tilelang, torch-c-dlpack-ext,
torchaudio, torchvision) under the new IUSE flag.

REQUIRED_USE pins cuda and cpu as mutually exclusive (VLLM_TARGET_DEVICE
is single-valued). src_configure exports the relevant target plus
the gcc-15 host-compiler pin for nvcc — CUDA 13.2's host_config.h
hard-#errors with __GNUC__>15, and on this overlay's reference host
the active gcc is 16. MAX_JOBS=4 throttles ninja's CUDA template
parallelism so the heavy paged_attention_v* / layernorm_quant_*
files don't OOM-kill cudafe++ (each peaks at 3-4 GiB; on a 31 GiB
host -j24 is fatal). Tune MAX_JOBS per host.
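
One way to reason about a per-host value is a back-of-the-envelope calculation like the sketch below. It is a heuristic, not what the ebuild does: the function names and the 8 GiB reserve for the rest of the system are assumptions, while the 3-4 GiB per-job peak comes from the figures above.

```python
def worst_case_gib(jobs, peak_per_job_gib=4.0):
    """Memory needed if every parallel job hits its peak at once."""
    return jobs * peak_per_job_gib


def suggest_max_jobs(total_ram_gib, peak_per_job_gib=4.0, reserve_gib=8.0):
    """Largest job count whose worst case fits in RAM minus a reserve."""
    usable = total_ram_gib - reserve_gib
    # Never go below one job, even on very small hosts.
    return max(1, int(usable // peak_per_job_gib))
```

For the 31 GiB host mentioned above, 24 jobs could demand up to 96 GiB at the 4 GiB peak, while this heuristic suggests 5; the ebuild's default of 4 is slightly more conservative.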

RESTRICT="cuda? ( network-sandbox )" mirrors the cpu? clause —
both targets FetchContent at CMake configure time (CUTLASS / spdlog
for cuda; oneDNN for cpu).

Verified 2026-05-07 against 0.20.1: all 339 CUDA-compiled objects
(_C, _moe_C, _vllm_fa2_C, _vllm_fa3_C — including the full Hopper
flash-attn instantiation matrix) build cleanly under the gcc-15
host pin. Same MKL-MPI link pollution as USE=cpu blocks the final
cumem_allocator.abi3.so link step on this partial-MKL host;
workaround documented in the caveat block. The vllm-side packaging
itself is complete; the link blocker is the existing sci-ml/pytorch
TorchConfig.cmake issue tracked in project_vllm_packaging_plan.md.

commit d54b7695dbea18d81e9d865345684b71e03e97c2
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 09:54:54 2026 +0200

dev-python/vllm: add USE=cpu support

USE=cpu (default off) flips VLLM_TARGET_DEVICE from "empty" to "cpu"
so the Python entrypoints can actually drive inference on CPU
hardware. Pulls sci-ml/torchaudio and dev-python/numba (both landed
this session for this purpose). USE=-cpu retains the empty target —
useful for development when only the API surface is needed.

Adds vllm-0.20.1-cpu-system-libgomp.patch: relax the find_library()
call in cmake/cpu_extension.cmake so it can pick up gcc's libgomp
when the upstream torch.libs/-style probe fails (which it does for
::gentoo's pytorch — the system gcc runtime is at
/usr/lib/gcc/x86_64-pc-linux-gnu/<ver>/, not in any torch
site-packages dir). HINTS list both gcc-15 and gcc-16 paths.

Adds RESTRICT="cpu? ( network-sandbox )" because cmake/cpu_extension
.cmake fetches oneDNN v3.10 from GitHub via FetchContent at configure
time. Same network-sandbox bypass pattern as the kokoros and lemonade
live ebuilds.

Caveat documented in the ebuild: ::gentoo sci-ml/pytorch's public
TorchConfig.cmake link interface exports MKL MPI/cluster libs that
require a full Intel oneAPI install. Hosts with a partial install
(MKL but no MKL-MPI) hit linker-not-found errors. This is a
sci-ml/pytorch packaging issue. Workarounds: build pytorch with
USE=-mkl, or install the full MKL stack.

commit bc8f0171e02bd78fcdf51e231acfe52e94ec02a7
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 08:19:32 2026 +0200

dev-python/vllm: new package, 0.20.1 (Python-only)

First-cycle landing of vLLM in our overlay. Built with
VLLM_TARGET_DEVICE=empty so only common.txt deps are required and no
per-device CMake C++ extensions are compiled.

What works:
* Python entrypoints import cleanly: `vllm.LLM`, `vllm serve …`,
the OpenAI-compatible HTTP API surface
* All common-tier deps resolve from the 24 stuff-overlay packages
landed in this session (partial-json-parser, openai-harmony,
model-hosting-container-standards, lm-format-enforcer, depyf,
gguf, mistral-common, compressed-tensors, the
opentelemetry-exporter-otlp + semantic-conventions-ai chain,
pybase64, outlines-core, llguidance, tensorizer, einops,
prometheus-fastapi-instrumentator, plus 9 guru forks: fastapi,
tiktoken, pydantic-extra-types, anthropic, openai, mcp,
httpx-sse, sse-starlette, jiter)

What doesn't (yet):
* Backend kernels fail at first model-load. Subsequent cycles will
add USE flags for cpu/cuda/rocm targets once the missing deps
(torchaudio, numba, intel-openmp; flashinfer/tilelang/
apache-tvm-ffi for cuda; amd-quark py3.13/3.14 gap for rocm)
are packaged.

Two upstream pins relaxed because ::gentoo doesn't ship the older
versions: lark==1.2.2 → >=1.2.2 (1.3.x API-compat); opencv >=4.13.0 →
>=4.12.0 (no functional regression).

NonsolvableDepsInStable is policy noise — pytorch and the sci-ml
stack are ~amd64 only too.