Conventional Strength Metrics — what existing tools say
Entropy
Measures how unpredictable the character choices are. Higher means more varied characters. But a password that perfectly alternates character types can still score maximum entropy — it doesn't check ordering patterns.
Shannon bits/char × password length
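As an illustration of the "Shannon bits/char × length" formula, here is a minimal sketch using empirical character frequencies. This is an assumption about the metric's form, not vibetell's or any specific meter's exact implementation:

```python
import math
from collections import Counter

def shannon_bits(password: str) -> float:
    """Empirical Shannon entropy (bits/char) times password length."""
    n = len(password)
    probs = [count / n for count in Counter(password).values()]
    bits_per_char = -sum(p * math.log2(p) for p in probs)
    return bits_per_char * n

# All 20 characters are distinct, so the string scores the maximum
# possible empirical entropy for its length; ordering is never checked.
print(round(shannon_bits("G7$kL9#mQ2&xP4!wN8@v"), 2))  # → 86.44
```

The key point: a perfectly class-alternating password and a shuffled copy of it score identically, because frequency-based entropy ignores order.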
KeePass (Simplified)
Estimates strength based on what types of characters are used and how long the password is. It doesn't look at the order of characters — so a perfectly alternating LLM password scores the same as a truly random one.
Pool-size × length bit estimate
zxcvbn
Estimates how long it would take to crack a password by looking for common patterns like dictionary words, keyboard walks, and dates. It's great at catching human patterns, but sees LLM passwords as random noise and rates them as strong.
Pattern-matching strength estimator
(beta)
Enter a credential and press Analyze. vibetell checks whether a credential's pattern is consistent with autoregressive generation.
Signal Strength
Shows how many detection signals fired and how strongly they agree. More filled boxes = more evidence. This is not a probability — it shows what was measured, not how likely a conclusion is.
SCT Rate
How often two neighboring characters are the same type (uppercase, lowercase, digit, symbol). LLMs almost never repeat the same type back-to-back (~0.9%), while truly random passwords do it ~28% of the time.
threshold < 0.024
E[SCT]
What the SCT rate should be if the characters were randomly shuffled, based on this password's own mix of character types. Random passwords land close to this number; LLM passwords fall far below it.
expected under randomness
Delta (Δ)
The gap between the actual SCT and what's expected. Negative means the password avoids repeating character types more than chance would explain. LLMs average around −0.21; random passwords hover near zero.
SCT − E[SCT]
Z-Score
How far the SCT falls below what's normal, measured in standard deviations. More negative = more unusual. Values below −2.2 are very rare for random passwords. Shown for context.
std. devs from baseline
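The structural metrics above can be sketched in a few lines, assuming a shuffle baseline E[SCT] = Σ n_c(n_c − 1) / (n(n − 1)) and a normal-approximation z-score. These are plausible formulations consistent with the descriptions here; the paper's exact definitions may differ:

```python
import math
from collections import Counter

def char_class(c: str) -> str:
    """Map a character to one of four classes: Upper, lower, Digit, Symbol."""
    if c.isupper():
        return "U"
    if c.islower():
        return "l"
    if c.isdigit():
        return "D"
    return "S"

def sct_stats(pw: str):
    """SCT rate, shuffle-baseline expectation, delta, and z-score."""
    classes = [char_class(c) for c in pw]
    n, pairs = len(classes), len(classes) - 1
    sct = sum(a == b for a, b in zip(classes, classes[1:])) / pairs
    counts = Counter(classes)
    # P(two adjacent positions share a class) if this password's own
    # characters were uniformly shuffled:
    e_sct = sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))
    delta = sct - e_sct
    var = e_sct * (1 - e_sct) / pairs  # Bernoulli-mean variance
    z = delta / math.sqrt(var) if var > 0 else 0.0
    return sct, e_sct, delta, z

sct, e, d, z = sct_stats("G7$kL9#mQ2&xP4!wN8@v")
print(sct, round(e, 3), round(d, 3), round(z, 2))  # → 0.0 0.211 -0.211 -2.25
```

For this perfectly alternating example, SCT is exactly zero while the shuffle baseline is ~0.21, putting the z-score below the −2.2 mark that random passwords rarely cross.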
LLR Total
Compares the specific characters used against what a random generator would pick. Positive means the character choices look more like an LLM's preferences than pure randomness.
threshold > 0
LLR Digits
How LLM-like the digit choices are. LLMs favor certain digits over others in ways that differ from a uniform random draw. Positive means the digits lean toward LLM preferences.
digit class component
LLR Letters
How LLM-like the letter choices are. LLMs favor certain letters over others in ways that differ from a uniform random draw. Positive means the letters lean toward LLM preferences.
letter class component
LLR Symbols
How LLM-like the symbol choices are. LLMs concentrate on a small subset of symbols and rarely use others. Positive means the symbols lean toward LLM preferences.
symbol class component
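The LLR components above compare each character against two hypotheses. This sketch shows the shape of that computation for digits; the frequency table is invented purely for illustration, since the real profiles are fitted from model output as described in the paper:

```python
import math

# Invented, illustrative "LLM digit preference" table. NOT the real
# fitted profile; it only demonstrates the log-likelihood-ratio shape.
LLM_DIGIT_P = {"0": 0.04, "1": 0.06, "2": 0.12, "3": 0.13, "4": 0.12,
               "5": 0.11, "6": 0.10, "7": 0.14, "8": 0.10, "9": 0.08}
UNIFORM_DIGIT_P = 0.10  # a CSPRNG draws each digit with equal probability

def llr_digits(pw: str) -> float:
    """Sum of log(P_llm / P_uniform) over the digits in pw.
    Positive means the digits fit the LLM profile better than uniform."""
    return sum(math.log(LLM_DIGIT_P[c] / UNIFORM_DIGIT_P)
               for c in pw if c.isdigit())

print(round(llr_digits("G7$kL9#mQ2&xP4!wN8@v"), 3))  # → 0.478
```

The letter and symbol components follow the same pattern with their own profiles, and the total LLR is the sum of the three.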
Soft Indicators
Extra context clues that don't affect the verdict. Rare symbols and repeated characters are more common in random passwords, but can appear in LLM output too. For longer passwords, repeats are expected by chance.
info only, no verdict effect
Max Run
The longest streak of consecutive characters of the same type. LLMs almost never go above 1 (they switch types every character). Random passwords average about 3, humans about 4.
longest same-class streak
Classes
How many character types are present: uppercase, lowercase, digits, symbols. LLMs almost always use all 4. Random passwords and human-typed passwords sometimes use fewer.
U · l · D · S present
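Max Run and Classes are straightforward to compute. A minimal sketch (a simplified re-implementation for illustration, not vibetell's code):

```python
from itertools import groupby

def char_class(c: str) -> str:
    """Map a character to Upper, lower, Digit, or Symbol."""
    if c.isupper():
        return "U"
    if c.islower():
        return "l"
    if c.isdigit():
        return "D"
    return "S"

def max_run(pw: str) -> int:
    """Length of the longest streak of consecutive same-class characters."""
    return max(len(list(g)) for _, g in groupby(pw, key=char_class))

def classes_present(pw: str) -> int:
    """How many of the four character classes appear at least once."""
    return len({char_class(c) for c in pw})

print(max_run("G7$kL9#mQ2&xP4!wN8@v"),
      classes_present("G7$kL9#mQ2&xP4!wN8@v"))  # → 1 4
```

A max run of 1 with all 4 classes present is exactly the rigid-cycling signature described above.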
Active Path
Which combination of signals triggered the result. When both structure and vocabulary signals fire together, the evidence is strongest. A single signal alone is weaker but still noteworthy.
signals fired

Introduction

vibetell is a tool designed to analyze credentials for indicators of LLM generation. It does not evaluate cryptographic randomness; instead, it identifies signatures of LLM generation that password-strength tools often miss.
Each analysis returns one of three verdicts:

  • LLM Likely — Multiple signals fired with strong agreement.
  • LLM Possible — One signal fired or partial agreement; warrants review.
  • Inconclusive — No detection signals fired.

These verdicts reflect signal intensity, not a conclusive determination of origin. A credential flagged as LLM Likely is consistent with LLM generation, but vibetell cannot prove how a credential was created.

The blind spot

The gap in existing tools

Modern strength meters measure what characters appear, not how they are ordered, and LLM-generated credentials score highly on all of them. Autoregressive generation appears to leave a structural fingerprint that CSPRNG passwords rarely carry — consistent character-class alternation. vibetell's core metric, the Same-Class Transition rate (SCT), measures exactly this. For the full methodology, see the paper.

G7$kL9#mQ2&xP4!wN8@v

Entropy
6.49 bits/char · 99% of theoretical max Strong ✓
KeePass (Simplified)
128.5 bits · pool 94 × length 20 Very strong ✓
zxcvbn
Score 4/4 · centuries to crack offline Very strong ✓
vibetell
SCT 0.000 · LLR +19.41 LLM_LIKELY ✗
Apparent entropy ~105 bits · what strength meters see
Minus class structure ~92 bits · assuming rigid class cycling
With known biases ~73 bits · in reality, likely lower still

Signal distribution

How LLM and random passwords score

Distribution chart: LLM-generated vs CSPRNG
← more LLM-like · more random →

The two distributions barely overlap. LLMs cycle character types so rigidly that the vast majority of their passwords have zero same-class adjacent pairs, pulling the entire distribution to the left. Toggle to multi-layered to add the other signals, which catch LLM-generated credentials that are harder to spot with structure alone.

Holdout validation

Tested on 18 models from 12 labs that it was never trained on

Holdout recall · 752 passwords · 97.5% detected at LIKELY + POSSIBLE across 18 models from 12 labs, with zero overlap with training data.
False positive rate · CSPRNG · <0.001% at LLM_LIKELY: fewer than 1 in 100,000 flagged. At LLM_POSSIBLE: about 9 in 1,000 (at length 16).

The holdout experiment suggests the detection approach generalizes well, though recall varies across models. The low FPR makes vibetell a valuable tool in an auditing pipeline.

Why should I care?

LLMs are embedded in the tools that write your code — Copilot, Claude Code, ChatGPT. If a coding agent generates credentials for a .env file or a service configuration on its own, the result looks strong by every conventional measure. A report by security lab Irregular ↗ (2026) estimated that an LLM-generated password carries roughly 27 bits of realistic entropy despite appearing to have ~98 bits — a gap large enough to make brute-force attacks feasible. An attacker who correctly guesses that a credential was LLM-generated can apply a mask brute-force attack and crack it in hours. The same problem applies to naive users who generate passwords with LLM tools. vibetell is the first tool to detect whether a credential seems LLM-generated.

How is vibetell useful?

  1. A new verification axis. Existing password strength tools measure one dimension: can this credential be guessed by a rule-based attack? None ask whether it has an anomalous structure. These are orthogonal questions. vibetell adds a missing axis to password strength assessment.
  2. Credential auditing. Scan codebases and config files for credentials silently generated by AI agents — ones that pass conventional strength checks but are actually weak. This could be integrated into tools like trufflehog.
  3. Verifying CSPRNG delegation. LLMs sometimes generate credentials directly instead of delegating to a secure generator as instructed. There is now a way to confirm that a key-generation task was most likely delegated to a secure generator.
  4. Breach forensics. Knowing whether a leaked credential is structurally consistent with LLM generation can narrow down how it was created and what else the same system may have produced. Ongoing testing suggests model-specific quirks might exist — particular character preferences and structural templates that differ by model.
  5. Research and education. A live demonstration of an easily measurable problem with LLM random sampling; and a concrete illustration of why high apparent entropy and actual strength are not the same thing.
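Point 3 above can be sketched as a batch sanity check: credentials drawn from a real CSPRNG over a 94-character pool should show a same-class transition rate near the ~28% baseline, while a batch sitting uniformly at zero suggests the agent generated them itself. `char_class` and `sct` here are simplified re-implementations for illustration, not vibetell's code:

```python
import secrets
import string

POOL = string.ascii_letters + string.digits + string.punctuation  # 94 chars

def char_class(c: str) -> str:
    """Map a character to Upper, lower, Digit, or Symbol."""
    if c.isupper():
        return "U"
    if c.islower():
        return "l"
    if c.isdigit():
        return "D"
    return "S"

def sct(pw: str) -> float:
    """Fraction of adjacent pairs that share a character class."""
    classes = [char_class(c) for c in pw]
    return sum(a == b for a, b in zip(classes, classes[1:])) / (len(pw) - 1)

# Properly delegated credentials land near the baseline for this pool;
# a batch whose SCT hugs zero is the LLM-generation fingerprint.
batch = ["".join(secrets.choice(POOL) for _ in range(20)) for _ in range(1000)]
mean_sct = sum(sct(p) for p in batch) / len(batch)
print(round(mean_sct, 2))  # typically ≈ 0.28
```

For this pool the exact baseline is (26² + 26² + 10² + 32²) / 94² ≈ 0.28, which is where a delegated batch should land.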

Can vibetell detect passwords from any LLM?

Yes. The structural signal (SCT) is parameter-free — it measures deviation from a mathematical baseline, not a fitted profile. The vocabulary signal (LLR) was built from Claude and GPT output but fires on models outside that set. Our most likely hypothesis is that it's not capturing model-specific quirks — it's capturing what instruction-tuned autoregressive generation does to character preferences in general. In the same sense that no one would say zxcvbn is "fitted to specific humans" because it was built on human password data, vibetell isn't "fitted to specific LLMs".

Why does this work on a single credential?

vibetell isn't measuring randomness — it's detecting indicators of autoregressive generation. One sample is enough when you know what pattern to look for, the same way malware signatures or EXIF metadata work. Most LLM passwords carry a specific, measurable structural fingerprint that genuinely random passwords rarely produce by chance.

Does this only apply to passwords?

The tool uses "password" throughout, but the detection applies to any gibberish-looking credential — API keys, secret tokens, .env values, signing keys, and so on. The structural bias is a property of how LLMs generate character sequences, not of how those sequences are used. If an LLM produced it, the fingerprint is there regardless of what it's called.

Why doesn't vibetell analyze passwords containing words or recognizable patterns?

vibetell is designed specifically for gibberish credentials — strings that look random to the eye. Passwords built from words, phrases, or word-plus-number combinations occupy a completely different structural space and require different detection methods. More importantly, they're already caught by existing tools: zxcvbn and similar analyzers are excellent at identifying dictionary words, keyboard walks, and predictable substitutions. The blind spot vibetell fills is the credential that defeats all of those checks — pure gibberish that scores maximum entropy everywhere yet was produced by an LLM.

What does INCONCLUSIVE mean?

No indicators of autoregressive generation were found. It does not mean the password is random or safe — the tool detects specific patterns, and their absence is honest silence, not a certificate of randomness.

Why LLM_POSSIBLE instead of LLM_LIKELY?

LLM_LIKELY requires both signals to agree. LLM_POSSIBLE means one fired without the other — usually the structure is LLM-like but the character choices don't match expected vocabulary, or vice versa. It's a real signal, not a near-miss.

Could a genuine random password get flagged?

Yes, but rarely. At LLM_LIKELY fewer than 1 in 100,000 genuinely random passwords are flagged — a threshold deliberately tuned for precision, so that a LIKELY verdict can be acted on directly. At LLM_POSSIBLE the net is wider by design, catching more LLM passwords at the cost of a higher false positive rate of about 3 in 1,000 across mixed lengths. The FPR is highest at shorter lengths and lower at longer lengths.

Does it work on short passwords?

Detection degrades below 16 characters — fewer adjacent pairs means less structural signal. The tool stays conservative rather than false-alarming: at length 12, the FPR at LLM_LIKELY is still near zero. The minimum supported length is 12 characters. Below that, the tool returns no verdict rather than guessing.

What if someone tries to evade detection?

The realistic threat is AI coding agents generating credentials silently, with no evasion intent. For deliberate evasion, the simplest path is just calling secrets.token_urlsafe() — which vibetell correctly classifies as random, so the problem solves itself. In our testing, explicitly instructing models to avoid the pattern did not restore independent class transitions.
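For reference, the delegation path mentioned above is a one-liner with Python's standard library:

```python
import secrets

# Delegate to the OS CSPRNG: 24 random bytes, URL-safe base64 encoded.
token = secrets.token_urlsafe(24)
print(len(token))  # → 32
```

Because the bytes come straight from the operating system's entropy source, the resulting string carries no autoregressive structure for vibetell to flag.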

Will this still work as models improve?

In our testing, the bias has been observed across different architectures, parameter scales, and labs. We don't have a clear answer to what would fix it short of training models to delegate credential generation to CSPRNG. Until then, we expect vibetell to correctly classify LLM-generated secrets in the vast majority of cases.

Is my password sent anywhere?

No. All analysis runs entirely in your browser. No data leaves your device. You can disconnect your internet after loading the page and it would still work.