dev-python/vllm
High-throughput, memory-efficient inference and serving engine for LLMs
vllm-0.21.0 ~amd64 cpu cuda rocm python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151
License: Apache-2.0 Overlay: stuff
vllm-0.20.2 ~amd64 cpu cuda rocm python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14 +amdgpu_targets_gfx908 +amdgpu_targets_gfx90a +amdgpu_targets_gfx942 +amdgpu_targets_gfx950 +amdgpu_targets_gfx1030 +amdgpu_targets_gfx1100 +amdgpu_targets_gfx1101 +amdgpu_targets_gfx1200 +amdgpu_targets_gfx1201 amdgpu_targets_gfx803 amdgpu_targets_gfx900 amdgpu_targets_gfx906 amdgpu_targets_gfx940 amdgpu_targets_gfx941 amdgpu_targets_gfx1010 amdgpu_targets_gfx1011 amdgpu_targets_gfx1012 amdgpu_targets_gfx1031 amdgpu_targets_gfx1102 amdgpu_targets_gfx1103 amdgpu_targets_gfx1150 amdgpu_targets_gfx1151
License: Apache-2.0 Overlay: stuff
ChangeLog
commit 43944e2eb4b50f87e2a62ecb2276f23b329c2e64
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200
dev-python/vllm: skip _vllm_fa3_C when no Hopper arch at 0.21.0
Two changes to vllm-0.21.0[cuda] sharing a SRC_URI pre-stage.
(1) Skip _vllm_fa3_C target when CUDA_ARCHS has no Hopper member.
vllm-flash-attn intersects "9.0a;" with CUDA_ARCHS to compute
FA3_ARCHS, but adds the FA3 .cu files to _vllm_fa3_C unconditionally
even when FA3_ARCHS is empty — nvcc then compiles them at its default
arch, wasting ~30-60 min on Ampere/older. Wrap the target-definition
block in `if(FA3_ARCHS)` with an `add_custom_target(_vllm_fa3_C)`
empty-stub fallback (DeepGEMM pattern); vllm's setup.py drives ninja
with explicit `--target=_vllm_fa3_C` regardless of arch, so the
target must exist as a no-op. Apply via VLLM_FLASH_ATTN_SRC_DIR
pre-staging (vllm's vllm_flash_attn.cmake already honours that).
Runtime fallback is FA3_AVAILABLE=False → vllm picks FA2 backend.
(2) Make MAX_JOBS env-overridable.
Prior `export MAX_JOBS=4` clobbered caller env. Switch to
`MAX_JOBS="$"` so users on smaller/larger hosts can
adjust without ebuild-edit.
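The override semantics in (2) can be sketched as follows. This is a Python illustration of the behaviour only, not the ebuild's bash; the helper name `resolve_max_jobs` is hypothetical, and the fallback of 4 is an assumption taken from the previously hard-coded value.

```python
import os

DEFAULT_MAX_JOBS = 4  # assumption: mirrors the previously hard-coded value


def resolve_max_jobs(env=None, default=DEFAULT_MAX_JOBS):
    """Return the caller's MAX_JOBS if set and non-empty, else the default."""
    env = os.environ if env is None else env
    raw = env.get("MAX_JOBS", "")
    if not raw.strip():
        # Unset or empty: fall back instead of clobbering anything.
        return default
    jobs = int(raw)  # raises ValueError on non-numeric input
    if jobs < 1:
        raise ValueError("MAX_JOBS must be a positive integer")
    return jobs
```

With an empty environment the helper returns the default; with `MAX_JOBS=8` exported by the caller it returns 8, which is the property the prior unconditional export lacked.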
Verified end-to-end on sm_86: vllm.LLM imports, CUDA detected,
zero `_sm90.cu.o` builds, ~1h35m wallclock (was ~2h30m before
the FA3-skip patch). FA3-on-Hopper-CUDA-13.2 separately documented
as upstream-blocked — see feedback_flash_attn_fa3_broken_on_cuda_13.md.
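The arch gate from (1) can be sketched in Python. The real logic lives in vllm-flash-attn's CMake, so everything here is illustrative: the function names are hypothetical, and the matching rule (compare on the numeric part and ignore a trailing letter suffix) is an assumption rather than the upstream intersection routine.

```python
FA3_SUPPORTED = "9.0a;"  # the Hopper arch list quoted in the commit message


def parse_archs(value):
    """Split a CMake-style semicolon list, dropping empty entries."""
    return [item.strip() for item in value.split(";") if item.strip()]


def numeric_part(arch):
    """'9.0a' -> '9.0'. Assumption: suffix letters do not affect matching."""
    return arch.rstrip("abcdefghijklmnopqrstuvwxyz")


def fa3_archs(cuda_archs, supported=FA3_SUPPORTED):
    """Intersect the supported FA3 arches with the requested CUDA_ARCHS."""
    requested = {numeric_part(a) for a in parse_archs(cuda_archs)}
    return [a for a in parse_archs(supported) if numeric_part(a) in requested]


def plan_fa3_target(cuda_archs):
    """Return 'extension' when FA3 should be compiled, else 'stub'."""
    return "extension" if fa3_archs(cuda_archs) else "stub"
```

A "stub" result corresponds to the empty `add_custom_target(_vllm_fa3_C)` fallback: the target name still exists, so the explicit `--target=_vllm_fa3_C` resolves, but no FA3 sources are compiled.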
commit 15577d91a539802b579ec087e125a6af294d0b64
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200
dev-python/vllm: stamp cuda? verified on sm_86 at 0.21.0
Match the rocm verified-date stamp already present. Adds host
context (sm_86 Ampere, CUDA 13.2, CUDAHOSTCXX=g++-15, MAX_JOBS=4,
339 CUDA template files, ~2.5h wallclock, ~14 GiB peak RSS) and
notes the FA3-on-Ampere build-time quirk worth a follow-up patch.
commit a1d9fdf3fb13cf3a0f67e733e59b4e14d9f333d0
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200
dev-python/vllm: add missing uvloop runtime dep at 0.21.0
vllm/v1/utils.py:25 imports uvloop unconditionally — fires from
the `from vllm import LLM` lazy chain. Upstream forgot to declare
it in any requirements/*.txt; they likely rely on uvicorn[standard]
transitively, but gentoo ships uvicorn without [standard].
Without the dep, vllm.LLM raises ModuleNotFoundError at first import.
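A missing runtime dependency of this kind can be detected without importing the heavy package itself. The following is a minimal sketch using the standard library; the helper name `missing_modules` is hypothetical, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    find_spec() only locates the module; it does not execute it, so the
    check stays cheap even for large packages.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["uvloop"])` on a host without uvloop returns `["uvloop"]`, which flags the problem before the first `from vllm import LLM` does.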
commit 414a059b7a040fea2d4c11cde91db05548aad224
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 17 00:52:23 2026 +0200
dev-python/vllm: tighten cuda? branch pins at 0.21.0
Four edits to align with upstream cuda.txt:
- pin tilelang ~0.1.9 (upstream exact)
- add nvidia-cutlass-dsl ~4.4.2 (upstream exact; sibling
flashinfer-python-0.6.8_p1 commit enforces transitively, restated
here as belt-and-suspenders)
- remove apache-tvm-ffi from cuda? BDEPEND (vllm has zero direct
imports, grepped setup.py + *.py + *.cpp + *.cu; flashinfer's
own BDEPEND pulls it at the right time)
- omit tokenspeed-mla from cuda? RDEPEND (lazy try/except imports
with `pip install` hint, Blackwell SM100/SM103-only kernels,
transitively pulls tokenspeed-triton — mirrors the existing
amd-quark exclusion pattern)
Also drop the setuptools<81 cap from BDEPEND with inline comment.
Acknowledged tradeoff against feedback_version_handling.md ("drop
the version rather than relax the cap"): gentoo only ships 79.0.1
+ 82.0.1 (nothing in 80/81), downgrade trips a hard pkg-resources-81
block, and vllm setup.py uses only the standard setuptools surface
(no pkg_resources, no setuptools.command.* removed in 81+).
Re-evaluate the cap at the next vllm bump.
cudnn-frontend cap belongs in flashinfer-python (where it's
applied), not vllm — vllm has zero cudnn_frontend imports.
commit 95106ad7396fd9add959684d0e0238a657078fed
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 12:44:20 2026 +0200
dev-python/vllm: dated cap-relax note for opencv 4.13->4.12
Upstream 0.21.0 says opencv-python-headless>=4.13.0 but ::gentoo's media-libs/opencv
tops out at 4.12.0. Empirically verified 2026-05-16 on a Gentoo build host with
media-libs/opencv-4.12.0-r1[python] freshly merged that the full cv2 surface vllm
imports is present in 4.12: resize, cvtColor, COLOR_BGR2RGB, CAP_PROP_, VideoCapture
incl. the 3-arg bytes+backend constructor form added in opencv 4.10, VideoWriter,
VideoWriter_fourcc, and the videoio_registry submodule. The upstream 4.13.0 lower
bound is wheel-publication churn, not an API extension vllm depends on. Add the
verification note to the USE-flag preamble; comments aren't allowed inside the
python_gen_cond_dep block, so the per-dep position doesn't work.
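A surface check of this kind can be sketched as below. The helper names are hypothetical, and the name list contains only the fully spelled-out entries from the message above; it is not the author's verification script.

```python
import importlib

# Names taken from the commit message; the CAP_PROP_ family is left out
# because the message does not spell out individual constants.
REQUIRED_CV2_NAMES = [
    "resize",
    "cvtColor",
    "COLOR_BGR2RGB",
    "VideoCapture",
    "VideoWriter",
    "VideoWriter_fourcc",
    "videoio_registry",
]


def missing_attributes(module, names):
    """Return the names that the given module object does not expose."""
    return [name for name in names if not hasattr(module, name)]


def check_cv2(names=REQUIRED_CV2_NAMES):
    """Return the missing cv2 names, or None when cv2 is not installed."""
    try:
        cv2 = importlib.import_module("cv2")
    except ImportError:
        return None
    return missing_attributes(cv2, names)
```

An empty list from `check_cv2()` means every listed name is present in the installed opencv. It says nothing about behaviour, such as whether the 3-arg `VideoCapture` constructor form works, which still needs a real call.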
commit 32c30fe59e2a7b87297b954509fa256f60aba033
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 11:53:42 2026 +0200
dev-python/vllm: re-verify 0.21.0 rocm build on gfx1150
USE=rocm AMDGPU_TARGETS=gfx1150 build of 0.21.0 ran clean against
caffe2-2.11.0-r90[rocm,amdgpu_targets_gfx1150,-nccl,-cusparselt] on a Strix Point
host. Four HIP extensions (_C, _moe_C, _rocm_C, cumem_allocator) built and imported
from the install tree. The previous wording was honest about the prior-version-only
scope but now stale — collapse the two states into a single dated-evidence line that
records both runs.
commit bf16ae96a670ab2ddfc592839f2d1d28d9a61875
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 11:38:42 2026 +0200
dev-python/vllm: rescope 0.20.x verified-claims on 0.21.0
The three dated 'verified 20...' comments in the 0.21.0 ebuild were carried over from
the 0.20.x source unchanged, which falsely implied the rocm/cuda/cpu paths were
re-verified at this bump. In reality only USE=-cpu -cuda -rocm (default) was
build-checked on 0.21.0. Reword each to mark the empirical date as evidence for
0.20.x only:
* gfx1150 rocm build: verified for 0.20.1 on 2026-05-08; 0.21.0 adds tilelang as a
hard rocm-target dep, not re-verified here.
* FetchContent network-sandbox: verified for 0.20.1 on 2026-05-07; 0.21.0's
FetchContent set wasn't re-audited.
* MAX_JOBS=4 OOM threshold: measured against 0.20.1 on 2026-05-07; the heavy CUDA
template set (paged_attention, layernorm_quant, w8a8/fp8) is structurally unchanged
in 0.21.0, so the value stays a conservative default but the underlying RSS profile
wasn't re-measured.
No functional change.
commit 39e78517888d354db7278e803b0eeb44129aa57d
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 10:50:45 2026 +0200
dev-python/vllm: drop 0.20.1, retire cpu-system-libgomp patch
Retention: keep 0.20.2 and 0.21.0. The 0.20.1 cpu patch was subsumed by upstream
0.21.0's cmake/cpu_extension.cmake; with the last consumer dropped, retire the file.
commit 47719071ad5021590e3a0d17c0de8bbbb8773376
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 16 10:49:06 2026 +0200
dev-python/vllm: add 0.21.0
Common-dep refresh per 0.21.0 upstream requirements/common.txt:
- xgrammar lower bound 0.1.32 -> 0.2.0 (upper cap <1.0.0 preserved)
- mistral_common 1.11.0 -> 1.11.2
- model-hosting-container-standards 0.1.13 -> 0.1.14
requirements/rocm.txt added tilelang as a hard runtime dep ("required for mhc
module to be imported correctly"); add it to the rocm? branch. cuda? branch already
had it.
Drop the cpu-system-libgomp patch: upstream cmake/cpu_extension.cmake now falls back
to `find_library(OPEN_MP NAMES gomp REQUIRED)` when VLLM_TORCH_GOMP_SHIM_DIR is
empty, replacing what our local patch did.
Build-verified end-to-end via FEATURES=-xattr ebuild ... merge with
USE=-cpu -cuda -rocm (default). Lint clean.
Known gaps in the cuda? branch (out of scope here, deferred):
- upstream pins nvidia-cutlass-dsl==4.4.2 exactly; we only have 4.5.0/4.5.1 in tree,
and our cuda? branch never named it as a dep (transitive only). cuda users should
be aware.
- flashinfer-cubin, nvidia-cudnn-frontend>=1.13.0<1.19.0, and the new
tokenspeed-mla==0.1.2 pin are not in cuda? either.
- opencv lower bound stays at 4.12.0: upstream says >=4.13.0 but ::gentoo's opencv
tops out at 4.12.0.
These deferred items are tracked in an internal vllm packaging-plan note.
commit 6fe9f62f76b21c712be5d735685d07a1582f32e4
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Wed May 13 14:35:38 2026 +0200
dev-python/vllm: disable py3.11
commit 1ad8828b49914211d59f9dfc0a50c6a16ba65c95
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 18:41:25 2026 +0200
dev-python/vllm: switch to DISTUTILS_SINGLE_IMPL
The whole pytorch/HF stack consumed here is SINGLE_IMPL: pytorch,
transformers, tokenizers, torchvision, plus the now-single-impl
dev-python/{compressed-tensors,xgrammar,flashinfer-python,tilelang,
quack-kernels,runai-model-streamer-bin,tensorizer}. Multi-impl
consumer with bare $ on them produces
python_targets_python3_*(-)? that the children can't expose, blocking
emerge resolution. Convert vllm to single-impl too: SINGLE_IMPL deps
on bare $, remainder wrapped in
python_gen_cond_dep.
--scan=n: pre-existing UnknownRestrict on network-sandbox is policy
(setup.py FetchContent of CUTLASS/spdlog/etc. needs the bypass for
cuda/rocm/cpu builds).
commit 25ffb50ea2ad6216c1e1dadfcd52bc46cbd28579
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 15:05:49 2026 +0200
dev-python/vllm: add 0.20.2
commit 51ef2b9d42a48f95d8751ca36f9c80e37a412c9f
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri May 8 12:54:18 2026 +0200
dev-python/vllm: add USE=rocm support
Sister to the USE=cuda landing — drives VLLM_TARGET_DEVICE=rocm and
compiles the _C / _moe_C / _rocm_C extensions (csrc/rocm/* +
hipify-converted CUDA sources) via hipcc against the system ROCm
toolchain at /opt/rocm.
Inherits sci-ml/caffe2's MKL-MPI scrub fork (>=2.11.0-r90) — same
public-link-interface pollution caveat as cuda; the cumem_allocator
extension's link step depends on it.
PYTORCH_ROCM_ARCH is derived from AMDGPU_TARGETS via rocm.eclass's
get_amdgpu_flags(); REQUIRED_USE adds the standard rocm? (
$ ) gate so the user's gfx target selection is
enforced. RESTRICT="rocm? ( network-sandbox )" mirrors the cpu/cuda
clauses (CMake FetchContent of CK / spdlog / etc. during the HIP
extension compile).
Build-verified end-to-end on this host's gfx1150 (Strix Point iGPU)
with caffe2[rocm,amdgpu_targets_gfx1150,-nccl,-cusparselt],
hip-7.2.3 + the full hipBLAS/hipBLASLt/hipFFT/hipRAND/hipSOLVER/
hipSPARSE/hipCUB stack, and AMDGPU_TARGETS=gfx1150. All three HIP
extensions link cleanly:
_C.abi3.so 103 MB
_rocm_C.abi3.so 50 MB
_moe_C.abi3.so 5.3 MB
and import + initialise in CPython 3.13 (vllm.platforms.rocm + the
extension modules).
amd-quark (in upstream's requirements/rocm.txt) is intentionally
omitted: vllm core never imports it directly, only the
vllm.model_executor.layers.quantization.quark internals reach for it
when Quark-quantized models are loaded — and dev-python/amd-quark-bin
in this overlay caps PYTHON_COMPAT to 3., which would block
vllm on 3.13/3.14. Users wanting Quark quantization install
amd-quark-bin separately and accept the python target restriction.
commit 5b1b120898e76ccae36b694df10a91f13cb7e49a
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Fri May 8 07:21:00 2026 +0200
dev-python/vllm: pin >=sci-ml/caffe2-2.11.0-r90 for MKL-MPI link fix
Both USE=cpu and USE=cuda were blocked by ::gentoo's caffe2 exposing
MKL MPI / cluster libs (scalapack, cdft, blacs_intelmpi, intel_thread)
in caffe2::mkl's public link interface, breaking cumem_allocator's
link step on hosts without Intel Cluster Edition + Compiler. The fix
sits in this overlay's sci-ml/caffe2-2.11.0-r90 fork — pin both the
cpu? and cuda? RDEPEND blocks at >=2.11.0-r90 so Portage's solver
won't silently fall back to ::gentoo's caffe2-2.11.0-r3.
Verified 2026-05-08: USE=cuda compile completes 340/340 ninja steps,
all CUDA C++ extension modules (_C, _C_stable_libtorch, cumem_allocator,
_moe_C, _vllm_fa2_C, _vllm_fa3_C) build and install cleanly under the
gcc-15 host pin + MAX_JOBS=4 throttle.
The CAVEAT block in the ebuild header is rewritten as historical:
the blocker is no longer present for users on this overlay. Drop the
>=r90 pin once an equivalent upstream fix lands in pytorch.
commit f0e0d829052c69481410903b8de35b89403e642b
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 20:25:30 2026 +0200
dev-python/vllm: add USE=cuda support
Tier 6 — wires up the full Tier-0..5 CUDA dependency stack we just
landed (apache-tvm-ffi, cuda-bindings, cuda-python, cuda-tile-bin,
flashinfer-cubin, flashinfer-python, nvidia-cudnn-frontend,
nvidia-cutlass-dsl + libs-base + libs-cu13, nvidia-ml-py, numba,
quack-kernels, fastsafetensors, tilelang, torch-c-dlpack-ext,
torchaudio, torchvision) under the new IUSE flag.
REQUIRED_USE pins cuda and cpu as mutually exclusive (VLLM_TARGET_DEVICE
is single-valued). src_configure exports the relevant target plus
the gcc-15 host-compiler pin for nvcc — CUDA 13.2's host_config.h
hard-#errors with __GNUC__>15, and on this overlay's reference host
the active gcc is 16. MAX_JOBS=4 throttles ninja's CUDA template
parallelism so the heavy paged_attention_v* / layernorm_quant_*
files don't OOM-kill cudafe++ (each peaks at 3-4 GiB; on a 31 GiB
host -j24 is fatal). Tune MAX_JOBS per host.
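The per-host tuning advice can be turned into a rough upper bound. This is a minimal sketch, assuming the 3-4 GiB per-job peak quoted above; the function name and the 4 GiB reserve for the rest of the system are assumptions, not values from the ebuild.

```python
def max_jobs_for_memory(total_gib, peak_per_job_gib=4.0, reserve_gib=4.0):
    """Estimate how many heavy CUDA compile jobs fit into RAM.

    total_gib:        physical memory of the build host
    peak_per_job_gib: worst-case resident size of one compile job
    reserve_gib:      headroom kept free for everything else (assumption)
    """
    usable = max(total_gib - reserve_gib, 0.0)
    # Always allow at least one job so the build can make progress.
    return max(1, int(usable // peak_per_job_gib))
```

For the 31 GiB reference host this yields 6, so the ebuild's MAX_JOBS=4 sits below the estimate and leaves extra margin; -j24 would exceed it several times over.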
RESTRICT="cuda? ( network-sandbox )" mirrors the cpu? clause —
both targets FetchContent at CMake configure time (CUTLASS / spdlog
for cuda; oneDNN for cpu).
Verified 2026-05-07 against 0.20.1: all 339 CUDA-compiled objects
(_C, _moe_C, _vllm_fa2_C, _vllm_fa3_C — including the full Hopper
flash-attn instantiation matrix) build cleanly under the gcc-15
host pin. Same MKL-MPI link pollution as USE=cpu blocks the final
cumem_allocator.abi3.so link step on this partial-MKL host;
workaround documented in the caveat block. The vllm-side packaging
itself is complete; the link blocker is the existing sci-ml/pytorch
TorchConfig.cmake issue tracked in project_vllm_packaging_plan.md.
commit d54b7695dbea18d81e9d865345684b71e03e97c2
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 09:54:54 2026 +0200
dev-python/vllm: add USE=cpu support
USE=cpu (default off) flips VLLM_TARGET_DEVICE from "empty" to "cpu"
so the Python entrypoints can actually drive inference on CPU
hardware. Pulls sci-ml/torchaudio and dev-python/numba (both landed
this session for this purpose). USE=-cpu retains the empty target —
useful for development when only the API surface is needed.
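The flag-to-target mapping can be sketched as a small function. It is illustrative only: the function name is hypothetical, the cuda and rocm values come from the later commits in this ChangeLog, and treating all three flags as mutually exclusive is an assumption (the ChangeLog only states that cuda and cpu are).

```python
DEVICE_FLAGS = ("cpu", "cuda", "rocm")


def target_device(use_flags):
    """Map enabled USE flags to a single VLLM_TARGET_DEVICE value."""
    selected = sorted(set(DEVICE_FLAGS) & set(use_flags))
    if len(selected) > 1:
        # VLLM_TARGET_DEVICE is single-valued, so more than one is an error.
        raise ValueError("conflicting device flags: " + ", ".join(selected))
    # No device flag enabled keeps the Python-only 'empty' target.
    return selected[0] if selected else "empty"
```

`target_device([])` gives `"empty"` (the USE=-cpu case), while `target_device(["cpu"])` gives `"cpu"`.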
Adds vllm-0.20.1-cpu-system-libgomp.patch: relax the find_library()
call in cmake/cpu_extension.cmake so it can pick up gcc's libgomp
when the upstream torch.libs/-style probe fails (which it does for
::gentoo's pytorch — the system gcc runtime is at
/usr/lib/gcc/x86_64-pc-linux-gnu/<ver>/, not in any torch
site-packages dir). HINTS list both gcc-15 and gcc-16 paths.
Adds RESTRICT="cpu? ( network-sandbox )" because cmake/cpu_extension
.cmake fetches oneDNN v3.10 from GitHub via FetchContent at configure
time. Same network-sandbox bypass pattern as the kokoros and lemonade
live ebuilds.
Caveat documented in the ebuild: ::gentoo sci-ml/pytorch's public
TorchConfig.cmake link interface exports MKL MPI/cluster libs that
require a full Intel oneAPI install. Hosts with a partial install
(MKL but no MKL-MPI) hit linker-not-found errors. This is a
sci-ml/pytorch packaging issue. Workarounds: build pytorch with
USE=-mkl, or install the full MKL stack.
commit bc8f0171e02bd78fcdf51e231acfe52e94ec02a7
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Thu May 7 08:19:32 2026 +0200
dev-python/vllm: new package, 0.20.1 (Python-only)
First-cycle landing of vLLM in our overlay. Built with
VLLM_TARGET_DEVICE=empty so only common.txt deps are required and no
per-device CMake C++ extensions are compiled.
What works:
* Python entrypoints import cleanly: `vllm.LLM`, `vllm serve …`,
the OpenAI-compatible HTTP API surface
* All common-tier deps resolve from the 24 stuff-overlay packages
landed in this session (partial-json-parser, openai-harmony,
model-hosting-container-standards, lm-format-enforcer, depyf,
gguf, mistral-common, compressed-tensors, the
opentelemetry-exporter-otlp + semantic-conventions-ai chain,
pybase64, outlines-core, llguidance, tensorizer, einops,
prometheus-fastapi-instrumentator, plus 9 guru forks: fastapi,
tiktoken, pydantic-extra-types, anthropic, openai, mcp,
httpx-sse, sse-starlette, jiter)
What doesn't (yet):
* Backend kernels fail at first model-load. Subsequent cycles will
add USE flags for cpu/cuda/rocm targets once the missing deps
(torchaudio, numba, intel-openmp; flashinfer/tilelang/
apache-tvm-ffi for cuda; amd-quark py3.13/3.14 gap for rocm)
are packaged.
Two upstream pins relaxed because ::gentoo doesn't ship the older
versions: lark==1.2.2 → >=1.2.2 (1.3.x API-compat); opencv >=4.13.0 →
>=4.12.0 (no functional regression).
NonsolvableDepsInStable is policy noise — pytorch and the sci-ml
stack are ~amd64 only too.

