Now in preview · LLM observability + localization quality
Observability for the words your LLM ships.
LexiMetric grades the quality of what your model actually says — in every language you ship. OTel-native traces, CI release gates, online scoring, and the only platform with native COMET / BERTScore / BLEURT / chrF on every trace.
OTel-native ingest: Python + JS SDKs, OpenAI auto-instrument, OTLP/HTTP ingest.
CI release gates: GitHub Action that posts per-engine quality diffs on every PR.
Localization vertical: COMET, BERTScore, BLEURT, chrF, scored per locale, gated per locale.
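To make "scored per locale, gated per locale" concrete, here is a minimal sketch of what a per-locale gate could check in CI: compare each locale's mean score against a baseline and fail any locale that regresses past a tolerance. The function name `gate_per_locale`, the input shape, the 0.02 tolerance, and the sample scores are all assumptions for illustration, not LexiMetric's API or real results.

```python
from statistics import mean

def gate_per_locale(baseline, candidate, max_drop=0.02):
    """Compare mean scores per locale (hypothetical helper, not a LexiMetric API).

    baseline / candidate: dict mapping locale -> list of scores (e.g. COMET).
    A locale fails if its mean drops by more than max_drop,
    or if the candidate has no scores for it.
    Returns {locale: (passed, delta)}.
    """
    results = {}
    for locale, base_scores in baseline.items():
        cand_scores = candidate.get(locale)
        if not cand_scores:
            # Missing coverage for a shipped locale counts as a failure.
            results[locale] = (False, None)
            continue
        delta = mean(cand_scores) - mean(base_scores)
        results[locale] = (delta >= -max_drop, delta)
    return results

# Made-up scores, for illustration only.
baseline = {"es-MX": [0.84, 0.86], "de-DE": [0.80, 0.82]}
candidate = {"es-MX": [0.85, 0.87], "de-DE": [0.70, 0.74]}

report = gate_per_locale(baseline, candidate)
failed = sorted(loc for loc, (ok, _) in report.items() if not ok)
print(failed)
```

Gating each locale separately keeps a regression in one language from being averaged away by gains in another.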
Hosted access opening soon — join the waitlist
Or self-host today (FSL → Apache 2.0 in 2 years)
pip install leximetric leximetric-cli
import leximetric

leximetric.init(api_key="lxm_...")

with leximetric.trace("rag.answer", {"locale": "es-MX"}):
    with leximetric.span("llm.call", kind="llm") as s:
        s.set("gen_ai.request.model", "claude-sonnet-4-6")
        s.set_usage(prompt_tokens=120, completion_tokens=45)
    leximetric.score(value=0.84, scorer="comet")

Or try the playground below
Score one prompt across every major LLM.
Run a prompt across GPT, Claude, Gemini, Grok and DeepSeek at once. Score every response across 9 industry-standard metrics and see which model wins, and why.
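One simple way to decide "which model wins" is to average each model's metric scores and sort. The sketch below assumes every metric has already been normalized to the 0 to 1 range; the helper `rank_models`, the model names, and the numbers are hypothetical and do not describe how the playground actually ranks responses.

```python
from statistics import mean

def rank_models(scores):
    """Rank models by their mean score across metrics (hypothetical helper).

    scores: {model_name: {metric_name: value in [0, 1]}}
    Returns a list of (model_name, mean_score), best first.
    """
    averaged = {model: mean(metrics.values()) for model, metrics in scores.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Invented scores for two placeholder models.
scores = {
    "model_a": {"chrf": 0.6, "bertscore": 0.9},
    "model_b": {"chrf": 0.7, "bertscore": 0.7},
}

ranking = rank_models(scores)
print(ranking[0][0])
```

A plain average weights every metric equally; keeping the per-metric values next to the ranking is what shows why a given model came out ahead.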
Free tier: up to 100 words per run. The limit can be removed.