Search, cherry-pick, and export examples from public AI evaluation datasets. Build custom eval suites for your models — from the terminal, the API, or your agent.
Building custom eval suites shouldn't mean writing one-off scripts for every dataset and format. Cherry Evals gives you a unified interface to search, curate, and export from any supported benchmark — so you can focus on what actually matters: evaluating your models.
Keyword, semantic, and hybrid search across multiple benchmark datasets — MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA, ARC, and more. Find exactly the examples you need.
Curate custom collections by selecting individual examples from search results. Build targeted eval suites for specific skills, domains, or difficulty levels.
Download collections as JSON, JSONL, or CSV. Push directly to Langfuse for tracing and evaluation. Your data, your format.
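To make the export formats concrete, here is a minimal sketch of turning a JSONL collection into CSV. The field names (`id`, `dataset`, `question`, `answer`) are illustrative assumptions, not Cherry Evals' actual export schema.

```python
# Sketch: converting a JSONL export to CSV.
# The field names below are illustrative assumptions,
# not Cherry Evals' documented export schema.
import csv
import io
import json

jsonl_export = """\
{"id": "mmlu-001", "dataset": "mmlu", "question": "What is 2+2?", "answer": "4"}
{"id": "gsm8k-042", "dataset": "gsm8k", "question": "A train travels...", "answer": "60"}
"""

# Parse one JSON object per line.
rows = [json.loads(line) for line in jsonl_export.splitlines()]

# Write the same records as CSV with a header row.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "dataset", "question", "answer"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

JSONL keeps one record per line, which suits streaming and appending; CSV flattens the same records for spreadsheets and BI tools.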
Cherry Evals is built for both humans and AI agents. Every capability is available through all four interfaces.
FastAPI-powered HTTP endpoints for programmatic access. Integrate search and export into any pipeline or workflow.
Full-featured command-line interface built with Click. Ingest datasets, run searches, and manage collections from your terminal.
Model Context Protocol server for AI agents. Let Claude, GPT, or any MCP-compatible agent search and curate evals autonomously.
React-based interface for interactive search and curation. Browse results visually, build collections, and export with a click.
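As a sketch of what programmatic access through the API might look like, the snippet below builds a search request. The `/search` path and its query parameters are assumptions for illustration, not the documented endpoint; consult the project README for the real API shape.

```python
# Sketch: constructing a hypothetical search request against the
# local API server. The /search path and parameter names are
# assumptions, not the documented endpoint.
from urllib.parse import urlencode
from urllib.request import Request

params = {
    "q": "multi-step arithmetic word problems",  # free-text query
    "dataset": "gsm8k",                          # restrict to one benchmark
    "mode": "hybrid",                            # keyword + semantic
    "limit": 20,
}
req = Request(f"http://localhost:8000/search?{urlencode(params)}")
print(req.full_url)
```

Any HTTP client (curl, requests, an agent's tool call) could issue the same GET, which is what makes one capability reachable from all four interfaces.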
Ingest any of the supported datasets and search across them uniformly. More datasets added regularly.
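Searching uniformly across benchmarks implies normalizing each dataset's native fields into one record shape. A minimal sketch of that idea, assuming hypothetical unified field names (`prompt`, `choices`, `target`) rather than Cherry Evals' actual internal schema:

```python
# Sketch: normalizing heterogeneous benchmark rows into one record
# shape so they can be searched uniformly. The unified field names
# are illustrative assumptions, not the project's internal schema.

def normalize_mmlu(row: dict) -> dict:
    # MMLU rows carry a question, answer choices, and an answer index.
    return {
        "dataset": "mmlu",
        "prompt": row["question"],
        "choices": row["choices"],
        "target": row["choices"][row["answer"]],
    }

def normalize_gsm8k(row: dict) -> dict:
    # GSM8K rows are free-form question/answer pairs with no choices.
    return {
        "dataset": "gsm8k",
        "prompt": row["question"],
        "choices": None,
        "target": row["answer"],
    }

record = normalize_mmlu({
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Madrid", "Paris", "Rome"],
    "answer": 2,
})
print(record["target"])
```

Once every dataset maps into the same record shape, a single keyword or embedding index can serve them all.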
Self-hosted, no account required. Requires uv and Docker Compose.
# Clone the repository
git clone https://github.com/marinone94/cherry-evals.git
cd cherry-evals

# Install dependencies (requires uv)
uv sync

# Start Postgres + Qdrant
docker compose up -d

# Run database migrations
uv run alembic upgrade head

# Ingest the MMLU benchmark dataset
uv run python -m cherry_evals.cli ingest mmlu

# Generate embeddings for semantic search
uv run python -m cherry_evals.cli embed mmlu

# Start the API server
uv run fastapi dev api/main.py
The API is now available at http://localhost:8000. See the README for full configuration options.