Web dashboard for regression testing of prompts and models. It runs a test set through two models or prompts and compares the results on 4 sub-scores.

Technically interesting aspects:
— LLM-as-judge through 5 providers (OpenRouter, Anthropic via tool-use, Gemini, Groq, mock)
— 4 sub-scores for each case: correctness, relevance, completeness, prompt_quality
— Cap on the final score when the prompt is poor, so a strong model cannot mask a weak prompt
— Per-provider throttling and retries with backoff that honor Retry-After
— Mock mode for running without API keys (CI-friendly, $0)
— Redaction of secrets in logs
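
As an illustration of the score cap from the list above, here is a minimal sketch. The description does not give the aggregation formula, the threshold, or the cap value, so the plain average, `PROMPT_QUALITY_THRESHOLD`, `POOR_PROMPT_CAP`, and the 0 to 1 scale are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical values: the source states neither the threshold nor the cap.
PROMPT_QUALITY_THRESHOLD = 0.5
POOR_PROMPT_CAP = 0.6


@dataclass
class SubScores:
    """The four per-case sub-scores, assumed here to lie in [0, 1]."""
    correctness: float
    relevance: float
    completeness: float
    prompt_quality: float


def final_score(s: SubScores) -> float:
    """Average the sub-scores, then cap the result when the prompt is weak."""
    mean = (s.correctness + s.relevance + s.completeness + s.prompt_quality) / 4
    if s.prompt_quality < PROMPT_QUALITY_THRESHOLD:
        # A strong answer cannot lift the score above the cap for a poor prompt.
        return min(mean, POOR_PROMPT_CAP)
    return mean
```

With these assumed numbers, a case with perfect answer scores but `prompt_quality=0.2` is held at the cap instead of its raw average.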

Stack: FastAPI, async SQLAlchemy, Alembic, httpx, Pydantic, vanilla JS, Docker.
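
The retry item in the feature list (backoff plus Retry-After) could compute its wait time roughly as follows. This is a sketch under assumptions, not the project's code: the function name `retry_delay`, the `base` and `cap` defaults, and the choice of full jitter are hypothetical. Only the numeric (seconds) form of the Retry-After header is handled here.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A numeric Retry-After header takes priority over the computed backoff.
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch; fall through.
    # Exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

The caller would sleep for the returned value (for example with `asyncio.sleep`) before reissuing the request.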
Work details
Added 15 June
4 views
Freelancer
Dmytro Staroselskyi
Ukraine, Lvov
No reviews

Available for hire
On the service 6 years