🎯 LLM Consistency Tester

Paste N outputs — measure variance, similarity, stability

Workflow: Run your prompt 3-10× at your chosen temperature. Paste each output below. We measure the variance across them.

How to use this tool

  1. Run your prompt 3-10 times

     Same prompt, any model; save the outputs.

  2. Paste them below

     One output per text area. Use the + button to add more.

  3. See consistency metrics

     Length variance, Jaccard similarity, cosine similarity, stability score.
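To make the metrics concrete, here is a minimal sketch of cosine similarity computed over bag-of-words count vectors, one common way to implement the "cosine" metric (the function name and tokenization are illustrative, not this tool's actual code):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, treated as bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two outputs share.
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    # Product of the two vector lengths (Euclidean norms).
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical outputs score 1.0; outputs with no words in common score 0.0.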

Frequently Asked Questions

Why test LLM consistency?
For production LLMs, variance is a killer. A prompt that works 9/10 times but fails the 10th is production-fragile. Teams need to know: is the variance acceptable? Should I lower temperature? Is this task fundamentally non-deterministic?
What is Jaccard similarity?
Jaccard = |A ∩ B| / |A ∪ B| over the word sets of two outputs. 1.0 means identical word sets; 0 means no words in common. For most LLM outputs, 0.3-0.6 indicates semantically similar text with different wording, and 0.7+ indicates very close phrasing.
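The formula above maps directly to a few lines of code. A minimal sketch over lowercase word sets (naming and tokenization are illustrative):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # convention: two empty outputs count as identical
    return len(sa & sb) / len(sa | sb)
```

For example, "the cat sat" vs. "the cat ran" share 2 of 4 distinct words, giving 0.5.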
How to improve consistency?
Lower the temperature (0.0-0.3 for near-deterministic output). Add more few-shot examples. Constrain the output with a structured format (e.g. a JSON schema). Pick a lower-creativity model. For tasks that must be fully deterministic, consider fine-tuning.
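To check whether these knobs actually helped, the N runs can be collapsed into a single number. A minimal sketch, assuming the stability score is defined as the mean pairwise Jaccard across all runs (an illustrative definition, not necessarily this tool's exact formula):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard index over lowercase word sets; empty-vs-empty counts as 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def stability_score(outputs: list[str]) -> float:
    """Mean pairwise Jaccard across all N runs; 1.0 = perfectly stable."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run has nothing to vary against
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Re-run the prompt after each change (temperature, few-shot examples, schema) and compare scores: if the stability score rises, the change made the outputs more consistent.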

🔒 100% Privacy. This tool runs entirely in your browser. Your data is never uploaded to any server.