Now in preview · LLM observability + localization quality

Observability for the words your LLM ships.

LexiMetric grades the quality of what your model actually says — in every language you ship. OTel-native traces, CI release gates, online scoring, and the only platform with native COMET / BERTScore / BLEURT / chrF on every trace.

OTel-native ingest: Python + JS SDKs, OpenAI auto-instrumentation, OTLP/HTTP ingest.

CI release gates: a GitHub Action that posts per-engine quality diffs on every PR.

Localization vertical: COMET, BERTScore, BLEURT, and chrF, scored per locale and gated per locale.
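To make "scored per locale, gated per locale" concrete, here is a minimal sketch in plain Python. It does not use the LexiMetric SDK or its GitHub Action. The function name `gate_per_locale`, the input shapes, and the threshold values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def gate_per_locale(scores, thresholds, default_threshold=0.0):
    """Return {locale: (mean_score, passed)} for every locale in `scores`.

    `scores` is an iterable of (locale, value) pairs and `thresholds` maps a
    locale to its minimum acceptable mean score. Both shapes are assumptions
    made for this sketch, not a documented LexiMetric format.
    """
    by_locale = defaultdict(list)
    for locale, value in scores:
        by_locale[locale].append(value)

    report = {}
    for locale, values in by_locale.items():
        avg = mean(values)
        # A locale passes only if its own mean clears its own threshold.
        report[locale] = (avg, avg >= thresholds.get(locale, default_threshold))
    return report


# Hypothetical quality scores collected for two locales.
scores = [("es-MX", 0.84), ("es-MX", 0.80), ("de-DE", 0.61), ("de-DE", 0.67)]
report = gate_per_locale(scores, {"es-MX": 0.80, "de-DE": 0.70})

# The release is blocked if any single locale fails its gate.
release_ok = all(passed for _, passed in report.values())
```

Gating each locale separately, rather than on one global average, keeps a strong locale from masking a regression in a weaker one.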

Hosted access opening soon — join the waitlist

Or self-host today (FSL → Apache 2.0 in 2 years)

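pip install leximetric leximetric-cli

import leximetric
leximetric.init(api_key="lxm_...")

# Open a trace for the request and tag it with its locale.
with leximetric.trace("rag.answer", {"locale": "es-MX"}):
    # Record the model call as an LLM span with its model name and token usage.
    with leximetric.span("llm.call", kind="llm") as s:
        s.set("gen_ai.request.model", "claude-sonnet-4-6")
        s.set_usage(prompt_tokens=120, completion_tokens=45)
    # Attach a COMET quality score inside the trace.
    leximetric.score(value=0.84, scorer="comet")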

Or try the playground below

Score one prompt across every major LLM.

Run a prompt across GPT, Claude, Gemini, Grok and DeepSeek at once. Score every response across 9 industry-standard metrics and see which model wins, and why.

Free tier: up to 100 words per run. to remove the limit.

Provide a system prompt and source text. Each selected LLM generates a response, and every output is scored against your golden reference when you supply one.
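As an illustration of scoring several model outputs against one golden reference, the sketch below implements a simplified chrF (a character n-gram F-score) and uses it to rank candidates. It is not the playground's scoring code. Reference implementations such as sacreBLEU handle whitespace and edge cases differently, so absolute values will not match theirs. The model labels and sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2  # beta=2 weights recall more heavily than precision
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical golden reference and model outputs.
golden = "El gato duerme en el sofá."
outputs = {
    "model_a": "El gato duerme en el sofá.",
    "model_b": "El gato está durmiendo en el sillón.",
    "model_c": "Un perro corre por el parque.",
}

# Rank models from best to worst match against the golden reference.
ranking = sorted(outputs, key=lambda name: chrf(outputs[name], golden), reverse=True)
```

A surface-overlap metric like this rewards shared wording, which is one reason a platform would pair it with learned metrics such as COMET or BLEURT.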

System Prompt (required)

LLM Models

Source Content

SEG-1

Source text

Golden reference (optional)