Why your AI resume scores all cluster at 50–60%: what relevance-aware scoring fixes
Open the output of almost any AI resume screen and you’ll notice something odd: the scores barely spread. A strong candidate comes back at 61%. A wildly irrelevant one comes back at 53%. The whole field compresses into a narrow band between the low 50s and the low 60s, and wherever you set the rejection threshold, you end up rejecting nearly everyone or nearly no one. That clustering isn’t the model being careful. It’s the model measuring the wrong things, and once you see why, you can’t unsee it.
The root cause: relevance-blind scoring
Most resume scorers compute an overall score from a few sub-scores (usually skills, experience, and education) and then average them. The trap is in how experience and education get scored.
Experience is typically scored on total years. Education on degree level. Neither asks the one question that matters: is any of it relevant to this role?
So a fourteen-year facility manager applying for a customer success role gets a high experience sub-score (lots of years) and a high education sub-score (has a degree), even though none of that experience transfers. Skills might be the only sub-score that reflects reality, and it gets averaged in with two sub-scores that quietly reward the wrong things. The result: the non-skill factors alone lift almost every candidate to roughly 50–60%, regardless of fit. A relevance-blind engine doesn’t separate fit from non-fit. It compresses them into the same band.
This is exactly why no threshold works. The signal you’d reject on, relevance, was never in the score.
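To make the compression concrete, here is a minimal sketch of a relevance-blind scorer. Everything in it is an illustrative assumption rather than any vendor’s actual formula: the `Candidate` fields, the ten-year saturation point, and the degree values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    skill_match: float   # share of the role's required skills present, 0..1
    total_years: float   # all work experience, relevant or not
    has_degree: bool


def relevance_blind_score(c: Candidate) -> float:
    """Plain average of three sub-scores; only `skills` looks at the role."""
    skills = c.skill_match
    experience = min(c.total_years / 10.0, 1.0)  # assumed: saturates at 10 years
    education = 0.8 if c.has_degree else 0.3     # assumed degree-level values
    return round(100 * (skills + experience + education) / 3, 1)


facility_manager = Candidate("facility manager", 0.0, 14, True)
strong_fit = Candidate("strong fit", 0.8, 3, True)

for c in (facility_manager, strong_fit):
    print(c.name, relevance_blind_score(c))
```

Under these assumed numbers, the facility manager with zero matching skills scores 60.0 and the strong fit scores 63.3. Tenure and a degree carry the mismatched profile to within a few points of the good one.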
Why this is invisible until someone complains
The reason teams live with compressed scores for months is that the numbers look reasonable. 54%, 58%, 61%: nothing screams broken. It only surfaces when a recruiter looks at a specific candidate, sees a low-50s score next to a resume that obviously doesn’t fit the role, and asks why the tool didn’t reject it.
That’s usually the moment the penny drops: the screen wasn’t being conservative, it was being indiscriminate. And every “shortlist” it produced was really just the original pile with a thin coat of numbers on top.
The fix: score relevance, not volume
The repair is conceptual before it’s technical. Stop asking “how much has this person done” and start asking “how much of it is relevant to this role.”
Concretely, relevance-aware scoring changes three things:
- Irrelevant experience contributes close to nothing. Years in an unrelated field no longer inflate the experience sub-score. Fourteen years of the wrong thing scores like what it is.
- A profile with no relevant skills and no relevant experience is capped at a low score. If it doesn’t match on the dimensions that matter, it can’t float to the middle on tenure and a degree. It lands where it belongs, well below any sane threshold.
- The score finally spreads. When relevance drives the math, a strong candidate and a mismatched one stop looking alike. The distribution opens up, and a rejection threshold becomes meaningful again because there’s now real distance between fit and non-fit.
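The three changes above can be sketched in a few lines. This is an illustration under stated assumptions, not our production scorer: the function name, the five-year saturation point, the degree values, and the `LOW_CAP` of 15 are all hypothetical.

```python
LOW_CAP = 15.0  # assumed ceiling for profiles with nothing relevant


def relevance_aware_score(skill_match: float,
                          relevant_years: float,
                          has_relevant_degree: bool) -> float:
    """Average three sub-scores, each measured against the role."""
    skills = skill_match                              # 0..1, role-specific
    experience = min(relevant_years / 5.0, 1.0)       # only relevant years count
    education = 0.8 if has_relevant_degree else 0.0   # unrelated degree adds nothing
    score = 100 * (skills + experience + education) / 3
    # No relevant skills and no relevant experience: cap the score low.
    if skill_match == 0 and relevant_years == 0:
        score = min(score, LOW_CAP)
    return round(score, 1)


print(relevance_aware_score(0.0, 0, False))  # mismatched profile, unrelated degree
print(relevance_aware_score(0.0, 0, True))   # mismatched profile, related degree
print(relevance_aware_score(0.8, 3, True))   # strong fit
```

With these assumed values the three profiles come out at 0.0, 15.0 (held down by the cap), and 73.3. There is now real distance between fit and non-fit, so a threshold anywhere in between does useful work.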
We shipped this into production in June 2026 after exactly the feedback above: a customer success role where clearly off-target profiles were scoring in the 50s and surviving the filter. The hierarchy we settled on: use the job description’s own structured rubric first if one exists, fall back to the relevance-aware scorer, and only use a relevance-gated keyword matcher as a last resort. The throughline is that every layer is asking about fit, not volume.
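One way to picture that hierarchy is a chain of scorers tried in order, where each returns `None` when it has nothing to work with. The sketch below is a simplified stand-in: the three scorer functions, the dict shapes (`rubric`, `required_skills`, `text`, `skills`), and the scoring formulas are hypothetical placeholders, not the production implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

Scorer = Callable[[dict, dict], Optional[float]]


def rubric_scorer(resume: dict, job: dict) -> Optional[float]:
    """Weighted share of rubric criteria met; None if the JD has no rubric."""
    rubric = job.get("rubric")  # assumed shape: {criterion: positive weight}
    if not rubric:
        return None
    met = sum(w for crit, w in rubric.items() if crit in resume["skills"])
    return 100 * met / sum(rubric.values())


def relevance_aware_scorer(resume: dict, job: dict) -> Optional[float]:
    """Share of required skills present; None if none were extracted."""
    required = job.get("required_skills")
    if not required:
        return None
    return 100 * len(set(required) & set(resume["skills"])) / len(required)


def gated_keyword_scorer(resume: dict, job: dict) -> Optional[float]:
    """Last resort: count only resume skills that appear in the JD text."""
    jd_words = set(job.get("text", "").lower().split())
    hits = [s for s in resume["skills"] if s.lower() in jd_words]
    return 100 * len(hits) / max(len(resume["skills"]), 1)


def score_with_fallback(resume: dict, job: dict,
                        chain: Sequence[Tuple[str, Scorer]]) -> Tuple[str, float]:
    """Return (layer name, score) from the first layer that can score."""
    for name, scorer in chain:
        result = scorer(resume, job)
        if result is not None:
            return name, result
    raise ValueError("no scorer could produce a score")


CHAIN = [("rubric", rubric_scorer),
         ("relevance_aware", relevance_aware_scorer),
         ("gated_keywords", gated_keyword_scorer)]

resume = {"skills": ["onboarding"]}
job = {"required_skills": ["onboarding", "renewals"], "text": "onboarding renewals"}
print(score_with_fallback(resume, job, CHAIN))
```

Here the job has no rubric, so the first layer declines and the second one answers. Returning the layer name alongside the score makes it easy to audit which layer produced each decision.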
Relevance is also why one rubric per role matters
There’s a sibling problem worth naming. Even a relevance-aware scorer needs to know what “relevant” means for the role in front of it, and that definition can’t be a single template applied to every job. The skills that matter for a site reliability engineer are not the skills that matter for a customer success manager, and a screen that scores both against the same generic axes will drift no matter how good its relevance logic is. That’s the case we made in JD-aware AI screening: the rubric should come from the actual job description, not the vendor’s onboarding deck.
Put the two together and you get a screen that’s worth trusting with a reject decision: a role-specific definition of relevance, applied by a scorer that actually rewards relevance over raw tenure.
What to ask your vendor
Resume scoring built on keyword overlap rewards candidates who mirror the job description rather than those who actually fit it, a limitation researchers studying résumé–job matching have flagged for years. If you’re evaluating an AI screening tool, the diagnostic is simple. Pull the score distribution across a real campaign and look at the spread. If everything sits inside a 10-point band, ask the vendor one question: does experience score on total years, or on relevant years? The answer tells you whether you’re buying a screen or a random-number generator with a nice UI.
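If you can export the scores from a campaign, the spread check takes a few lines. This is a rough sketch: the function name is made up, the 10-point band mirrors the rule of thumb above, and the sample scores are invented for illustration.

```python
import statistics
from typing import Dict, Sequence, Union


def score_spread(scores: Sequence[float],
                 band_width: float = 10.0) -> Dict[str, Union[float, bool]]:
    """Summarize how widely a campaign's scores are distributed."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to measure spread")
    score_range = max(scores) - min(scores)
    return {
        "min": min(scores),
        "max": max(scores),
        "range": score_range,
        "stdev": statistics.pstdev(scores),
        # True when every score fits inside one band of the given width.
        "compressed": score_range <= band_width,
    }


compressed_campaign = [53, 54, 57, 58, 61]   # invented sample
healthy_campaign = [12, 35, 61, 78, 90]      # invented sample

print(score_spread(compressed_campaign)["compressed"])
print(score_spread(healthy_campaign)["compressed"])
```

The first sample spans only 8 points and is flagged as compressed; the second spans 78 and is not. On a real campaign, a single outlier can mask compression, so it is worth looking at the standard deviation as well as the range.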
A screen that can’t say “irrelevant” can’t say “no.” And a tool that can’t say “no” isn’t screening. It’s just sorting.
Want to see what an honest score distribution looks like on your own roles? Talk to us.