Gentoo Portage Overlays - sci-ml/lm-eval

sci-ml/lm-eval

A framework for evaluating language models (lm-evaluation-harness)

lm-eval-0.4.12

~amd64

+api ifeval math sentencepiece statsmodels vllm python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14

License: MIT

Overlay: stuff

lm-eval-0.4.11

~amd64

+api ifeval math sentencepiece statsmodels vllm python_single_target_python3_12 python_single_target_python3_13 python_single_target_python3_14

License: MIT

Overlay: stuff

ChangeLog

commit 3babf19bc10669f93e39b40b42769d2c0cdd5d60
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun Jul 5 14:30:48 2026 +0200

sci-ml/lm-eval: condense the dependency comments

commit 89137542e87669b8f82c5479b03dad4e6f53614d
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Wed May 13 14:03:03 2026 +0200

sci-ml/lm-eval: add 0.4.12

commit 09a61b5c85e760aae3a0d01a5fe1af9385f8a413
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Mon May 11 17:38:59 2026 +0200

sci-ml/lm-eval: wire ifeval USE flag

Upstream's [project.optional-dependencies].ifeval extra is
{langdetect, immutabledict, nltk>=3.9.1} at v0.4.11. immutabledict
is in ::gentoo and nltk is already in this overlay; langdetect was
just forked from ::guru in the previous commit. With all three
reachable, ifeval joins api/math/sentencepiece/statsmodels/vllm as
a wirable extra (default off — it pulls a language-detection model
and the punkt tokenizer that are only useful for the
leaderboard_ifeval task battery).

The >=nltk-3.9.1 bound is load-bearing, not advisory: lm_eval's
instructions_util.py asserts the version at module import (older
nltk has a remote-code-exec via the `punkt` tokenizer downloader,
see nltk/nltk#3266). This is noted inline in the ebuild header so
that a future bumper does not relax the bound, thinking it is cosmetic.
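
To illustrate why an import-time version assertion makes the bound load-bearing, here is a minimal sketch of such a guard. It is not the code in lm_eval's instructions_util.py; the helper names parse_version and require_at_least are hypothetical, and the parsing is deliberately simplified.

```python
def parse_version(text):
    """Turn '3.9.1' into (3, 9, 1); anything after the leading digits
    of a component (e.g. 'rc1') is dropped. Simplified on purpose."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def require_at_least(installed, minimum):
    """Raise if the installed version is below the minimum.

    Hypothetical helper: a module that runs a check like this at import
    time cannot be imported at all with an older dependency."""
    if parse_version(installed) < parse_version(minimum):
        raise RuntimeError(
            f"need >= {minimum}, found {installed}"
        )


# Tuples compare numerically, so 3.10.0 is correctly newer than 3.9.1
# (a plain string comparison would get this wrong).
require_at_least("3.10.0", "3.9.1")
require_at_least("3.9.1", "3.9.1")
```

With a guard of this shape, relaxing the ebuild bound below 3.9.1 would let the package install but leave the ifeval tasks failing at import.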

Verified 2026-05-11: USE='api ifeval math sentencepiece statsmodels'
emerge sci-ml/lm-eval solves and installs cleanly on python3_13;
lm_eval.tasks.TaskManager loads the leaderboard_ifeval task end to
end with no ModuleNotFoundError.

commit f2c612aff555b51090727951547f2b36ed9ae54b
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Mon May 11 00:48:02 2026 +0200

sci-ml/lm-eval: wire math USE flag

Adds support for the minerva_math / leaderboard math / hendrycks_math
task families that grade LLM math answers via symbolic equality.
lm_eval/tasks/minerva_math/utils.py asserts
version("antlr4-python3-runtime").startswith("4.11")
at task-load, so the antlr4-4.11.* pin under math? is load-bearing,
not advisory; flipping USE=math triggers an antlr4 downgrade from
4.13.2 to the overlay-local 4.11.0. End-to-end verified on this host:
parse + verify of a boxed LaTeX answer returns the correct verdict.
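
The prefix check quoted above can be reproduced in isolation. The sketch below assumes only the standard library's importlib.metadata; the helper names installed_version and antlr_pin_ok are hypothetical and are not part of lm_eval.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution,
    or None when the distribution is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def antlr_pin_ok(version_string):
    """Mirror the quoted check: only a 4.11 prefix passes."""
    return version_string is not None and version_string.startswith("4.11")


# The two versions named in the commit message:
print(antlr_pin_ok("4.11.0"))   # overlay-local version
print(antlr_pin_ok("4.13.2"))   # version that USE=math downgrades from

# On a real host, the same check against whatever is installed:
print(antlr_pin_ok(installed_version("antlr4-python3-runtime")))
```

Because the comparison is a string prefix test rather than a range, no 4.12 or 4.13 release can satisfy it, which is why the ebuild has to pin 4.11.* instead of using a lower bound.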

commit da3078db530336c99abe95bd226ffb26ca2afc9f
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Mon May 11 00:58:48 2026 +0200

sci-ml/lm-eval: default-on api

Most lm-eval users running --model openai-chat-completions / openai-completions
/ anthropic / textsynth / generic API backends hit a NameError on ClientSession
at first request because lm_eval/models/api_models.py wraps the imports in a
try/except ModuleNotFoundError that silently swallows the missing-aiohttp at
import-time and only surfaces later when ClientSession is referenced. The
api-extra deps (aiohttp, requests, tenacity, tiktoken, tqdm) are small and
most lm-eval consumers will want them; default-on matches expected ergonomics.
Users who only run HF/vLLM backends can still set USE=-api.
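
The failure mode described above, a swallowed import that only surfaces as a NameError on first use, can be reproduced with a small sketch. This is not the content of lm_eval/models/api_models.py; the module name hypothetical_missing_http_lib_for_demo and the function first_request are made up for illustration.

```python
# The import is wrapped so that a missing optional dependency does not
# break module import. The name ClientSession is simply never bound.
try:
    from hypothetical_missing_http_lib_for_demo import ClientSession
except ModuleNotFoundError:
    pass


def first_request():
    """Stands in for the first API call: the missing name is only
    looked up here, long after the import was silently skipped."""
    return ClientSession()


try:
    first_request()
except NameError as exc:
    # The error names ClientSession, not the missing package,
    # which makes the root cause harder to spot.
    print(f"NameError at first request: {exc}")
```

Enabling the api flag by default avoids this path for most users, since the optional dependency is then present and the import succeeds.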

commit 384e189360f1f03228878937e8c4342f2227578d
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 21:07:11 2026 +0200

sci-ml/lm-eval: move vllm dep to $

dev-python/vllm is now SINGLE_IMPL; keeping it inside python_gen_cond_dep
with $ silently auto-satisfies via [X(-)?]. Move the
vllm? optional path out of the multi-impl wrap and pin to
$.

commit 1d01d45249503a960546427cdcb8956fa9a6556a
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sun May 10 14:45:18 2026 +0200

sci-ml/lm-eval: switch to DISTUTILS_SINGLE_IMPL

sci-ml/{datasets,evaluate} are SINGLE_IMPL; depending on them from a
multi-impl ebuild yields python_targets_python3_*(-)? that the child
can't expose. Make lm-eval single-impl, split SINGLE_IMPL deps onto bare
$ and wrap the multi-impl remainder (including
optional-USE deps for api/sentencepiece/statsmodels/vllm) in
python_gen_cond_dep.

commit 5bb48f7173234083230f6146e701cac83bfb8cc5
Author: Ivan S. Titov <iohann.s.titov@gmail.com>
Date: Sat May 9 22:09:03 2026 +0200

sci-ml/lm-eval: new package, EleutherAI lm-evaluation-harness 0.4.11