Web dashboard for regression testing of prompts and models. It runs a test set through two models or prompts and compares the results on 4 sub-scores.
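The comparison step could look roughly like the sketch below. It is an illustration only, not the project's actual code: the function name `compare_runs`, the list-of-dicts input shape, and the 0 to 10 score scale are all assumptions. Only the four sub-score names come from the description.

```python
from statistics import mean
from typing import Dict, List

# Sub-score names as listed in the project description.
SUB_SCORES = ("correctness", "relevance", "completeness", "prompt_quality")


def compare_runs(
    baseline: List[Dict[str, float]],
    candidate: List[Dict[str, float]],
) -> Dict[str, float]:
    """Return the mean (candidate - baseline) delta per sub-score.

    Hypothetical helper: both lists hold one dict of sub-scores per test
    case, in the same case order, on an assumed 0-10 scale.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same, non-empty test set")
    return {
        name: mean(c[name] - b[name] for b, c in zip(baseline, candidate))
        for name in SUB_SCORES
    }
```

A negative delta on any sub-score flags a regression for that dimension even when the other sub-scores improve.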

Technically interesting aspects:
— LLM-as-judge through 5 providers (OpenRouter, Anthropic via tool-use, Gemini, Groq, mock)
— 4 sub-scores for each case: correctness, relevance, completeness, prompt_quality
— Cap on the final score for a poor prompt, so that a strong model cannot mask it
— Per-provider throttle and retry with backoff + Retry-After
— Mock mode for running without API keys (CI-friendly, $0)
— Redaction of secrets in logs
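One way the score cap from the list above might work is sketched here. The description does not say how the final score is computed, so everything numeric is an assumption: the 0 to 10 scale, the plain average of the three answer sub-scores, and the hypothetical constants `PROMPT_QUALITY_FLOOR` and `CAP_FOR_POOR_PROMPT`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubScores:
    correctness: float
    relevance: float
    completeness: float
    prompt_quality: float


# Hypothetical thresholds on an assumed 0-10 scale.
PROMPT_QUALITY_FLOOR = 5.0   # below this, the prompt counts as poor
CAP_FOR_POOR_PROMPT = 6.0    # highest final score a poor prompt may get


def final_score(s: SubScores) -> float:
    """Average the answer sub-scores, then cap the result for a poor prompt."""
    answer_quality = (s.correctness + s.relevance + s.completeness) / 3
    if s.prompt_quality < PROMPT_QUALITY_FLOOR:
        # A strong model may still answer well; the cap keeps that from
        # hiding the weak prompt in the final number.
        return min(answer_quality, CAP_FOR_POOR_PROMPT)
    return answer_quality
```

With these assumed values, a case scoring 10 on all three answer sub-scores but 2 on prompt_quality ends at 6.0 instead of 10.0.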

Stack: FastAPI, async SQLAlchemy, Alembic, httpx, Pydantic, vanilla JS, Docker.
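The retry with backoff and Retry-After mentioned in the feature list can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: `retry_delay`, `call_with_retry`, the retryable status set, and the default timings are all hypothetical, and `send` stands in for whatever performs the real HTTP call (for example through httpx).

```python
import time
from typing import Callable, Mapping, Tuple

# Assumed set of status codes worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def retry_delay(attempt: int, headers: Mapping[str, str],
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based.

    A numeric Retry-After header wins over exponential backoff. The
    HTTP-date form of Retry-After is not handled in this sketch, and the
    header lookup is case-sensitive for a plain dict.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # fall back to exponential backoff
    return min(cap, base * 2 ** attempt)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(retry_delay(attempt, headers))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a per-provider throttle could wrap `send` in the same way.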
Work details
Added 15 June
45 views
Freelancer
Dmytro Staroselskyi
Ukraine, Lvov
No reviews

Available for hire
On the service 6 years