For the 2026 hiring market
Evaluator is the technical assessment that grades how skillfully candidates collaborate with AI — reading it, fixing it, prompting it, overriding it — on top of the fundamentals that still matter: reading, writing, debugging.
10 free / month · No card required · See what's tested
AI · Critique · Question 14 of 17
20 pts
An AI assistant produced this. It looks reasonable. It is not. Find every flaw and fix it.
async function fetchUserPosts(userId: string) {
  const res = await fetch(`/api/users/${userId}/posts`)
  const posts = res.json.parse()
  return posts.filter((p, i) => i <= posts.length)
}
Candidate found
res.json.parse(): hallucinated. It's await res.json().
i <= posts.length: off-by-one. Should be < or just drop the filter.
Caught the hallucination
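For reference, a corrected version along the lines the candidate describes might look like the sketch below. The `Post` shape, the injected `fetchImpl` parameter (added so the function can be exercised without a network), and the `res.ok` check are assumptions for illustration, not part of the original question.

```typescript
// Hypothetical post shape; the original question does not define one.
interface Post {
  id: string
  title: string
}

// Minimal subset of a Fetch API response that this sketch relies on.
interface ResponseLike {
  ok: boolean
  status: number
  json(): Promise<unknown>
}

// Assumption: the fetch implementation is passed in rather than read from globals.
type FetchLike = (url: string) => Promise<ResponseLike>

async function fetchUserPosts(userId: string, fetchImpl: FetchLike): Promise<Post[]> {
  const res = await fetchImpl(`/api/users/${userId}/posts`)
  // Assumption: surface HTTP failures instead of trying to parse an error body.
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`)
  // res.json() is a method that returns a promise; there is no res.json.parse().
  const posts = (await res.json()) as Post[]
  // The index filter kept every element, so it is dropped.
  return posts
}
```

In a browser, the global `fetch` can be passed as `fetchImpl`, since its response type provides `ok`, `status`, and `json()`.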
The shift
Every shop now has Copilot, Cursor, Claude Code. The bottom quartile of every team is the one that takes the AI's first answer. The top quartile catches the hallucinated import, rewrites the over-engineered class, and ships something that actually works. We test for the top quartile.
The differentiator
No other platform does this. Most still treat AI as a thing to detect. We treat it as a tool to grade.
01 Prompt quality
We give them a feature spec. They write the prompt they would actually send. We score for context, constraints, edge cases, and acceptance criteria — not for verbosity.
Strong candidate response
Implement a debounced search hook for the Postgres-backed /api/search endpoint we already use in SearchBar.tsx. 300ms debounce. Cancel in-flight requests on new input (use the AbortController we use elsewhere). Return { data, error, loading }. Don't introduce a new fetch library — we use native fetch. Cover the empty-query case (return early, no request).
+ context · + constraints · + edge case
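To show what that prompt pins down, here is a framework-free sketch of the behavior it asks for: a 300ms debounce, cancellation through `AbortController`, an early return on an empty query, and `{ data, error, loading }` state. It is written as a plain function rather than a React hook so it stands alone; `createDebouncedSearch`, `Searcher`, and the `onChange` callback are hypothetical names, not part of any real codebase referenced above.

```typescript
type SearchState<T> = { data: T | null; error: Error | null; loading: boolean }

// Hypothetical signature: the caller supplies the actual request (e.g. native fetch).
type Searcher<T> = (query: string, signal: AbortSignal) => Promise<T>

function createDebouncedSearch<T>(
  search: Searcher<T>,
  onChange: (state: SearchState<T>) => void,
  delayMs = 300,
) {
  let timer: ReturnType<typeof setTimeout> | undefined
  let controller: AbortController | undefined

  return function run(query: string): void {
    // Any new input invalidates the pending timer and the in-flight request.
    if (timer !== undefined) clearTimeout(timer)
    controller?.abort()

    // Empty-query case: return early, no request.
    if (query.trim() === "") {
      onChange({ data: null, error: null, loading: false })
      return
    }

    onChange({ data: null, error: null, loading: true })
    const current = new AbortController()
    controller = current

    timer = setTimeout(() => {
      search(query, current.signal)
        .then((data) => {
          if (!current.signal.aborted) onChange({ data, error: null, loading: false })
        })
        .catch((err: unknown) => {
          // Aborted requests were superseded; they are not reported as errors.
          if (current.signal.aborted) return
          const error = err instanceof Error ? err : new Error(String(err))
          onChange({ data: null, error, loading: false })
        })
    }, delayMs)
  }
}
```

Each constraint in the prompt maps to a visible line of code, which is what makes the prompt easy to check against the result.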
02 Reading AI code
We show them AI-written code that runs. They explain what it does, flag the AI-shaped tells — over-engineered classes, defensive try/catch eating real errors, non-idiomatic patterns — and say what they would change.
class UserDataManager {
  private cache: Map<string, User | null>
  constructor() {
    this.cache = new Map()
  }
  async getUserById(id: string | null): Promise<User | null> {
    if (!id) return null
    try {
      if (this.cache.has(id)) return this.cache.get(id)!
      return await fetchUser(id)
    } catch (e) { return null }
  }
}
Candidate
“A class for what should be a function. Swallows errors silently — caller can't tell a 500 from a missing user. Doesn't actually write to the cache, so it never warms.”
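One way to act on that critique is sketched below: a plain function instead of a class, no catch-all, and a cache that is actually written. The `User` shape, the `FetchUser` signature, and the `createGetUserById` factory are assumptions made so the example is self-contained; the nullable `id` parameter is also dropped here as a simplification.

```typescript
// Hypothetical user shape and fetcher signature, for illustration only.
interface User {
  id: string
  name: string
}
type FetchUser = (id: string) => Promise<User | null>

function createGetUserById(fetchUser: FetchUser) {
  const cache = new Map<string, User | null>()

  return async function getUserById(id: string): Promise<User | null> {
    // A cached null means "looked up before, no such user".
    if (cache.has(id)) return cache.get(id) ?? null
    // No try/catch: a failed request rejects, so the caller can tell
    // a server error from a missing user.
    const user = await fetchUser(id)
    // Write the result back so the cache warms, which the original never did.
    cache.set(id, user)
    return user
  }
}
```

Whether a missing user should be cached at all is a design choice; this sketch caches it to avoid repeated lookups, and a real system might prefer an expiry.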
03 Fixing AI code
We plant exactly one realistic bug in an AI-written function. They find it and patch it minimally. We penalize broad refactors that miss the actual problem.
04 Critique
We give them code with multiple planted flaws — fake APIs, off-by-ones, swallowed errors. We grade thoroughness: did they catch them all, or did they stop at the first one and say "looks good"?
Found by candidate · 3 / 3
lodash.deepFlatten doesn't exist; _.flattenDeep does.
catch (e) swallows the error. Should at least log or rethrow.
05 Live collaboration
On the final question, the candidate gets an AI sidebar built into the editor. We record every prompt they send, every suggestion they accept, every chunk they reject, and every keystroke they make on top. The transcript goes to you.
function debouncedSearch(query: string) {
  // accepted from AI
  if (!query) return
  if (controller) controller.abort()
  // candidate edit: was 200, made it 300
  timeout = setTimeout(...)
}
Sidebar transcript
You: use AbortController for cancellation
AI: <draft>
You: debounce is wrong — should be 300ms not 200ms
4 prompts · 2 accepts · 1 reject · 38% manual edits
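A summary line like the one above could be derived from a recorded event log roughly as follows. This is only a sketch: the `SidebarEvent` shape is hypothetical, and "manual edits" is assumed to mean the share of characters typed by hand out of all characters accepted or typed, which the page does not define.

```typescript
// Hypothetical event log shape; the real transcript format is not documented here.
type SidebarEvent =
  | { kind: "prompt"; text: string }
  | { kind: "accept"; chars: number }
  | { kind: "reject" }
  | { kind: "manualEdit"; chars: number }

interface TranscriptSummary {
  prompts: number
  accepts: number
  rejects: number
  manualEditPct: number
}

function summarize(events: SidebarEvent[]): TranscriptSummary {
  let prompts = 0
  let accepts = 0
  let rejects = 0
  let acceptedChars = 0
  let manualChars = 0
  for (const e of events) {
    switch (e.kind) {
      case "prompt":
        prompts++
        break
      case "accept":
        accepts++
        acceptedChars += e.chars
        break
      case "reject":
        rejects++
        break
      case "manualEdit":
        manualChars += e.chars
        break
    }
  }
  // Assumed definition: hand-typed characters as a share of all characters.
  const total = acceptedChars + manualChars
  const manualEditPct = total === 0 ? 0 : Math.round((manualChars / total) * 100)
  return { prompts, accepts, rejects, manualEditPct }
}

// Pluralize a count the way the summary line does ("1 reject", "2 accepts").
const plural = (n: number, word: string): string => `${n} ${n === 1 ? word : word + "s"}`

function formatSummary(s: TranscriptSummary): string {
  return [
    plural(s.prompts, "prompt"),
    plural(s.accepts, "accept"),
    plural(s.rejects, "reject"),
    `${s.manualEditPct}% manual edits`,
  ].join(" · ")
}
```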
Six dimensions
Every assessment is generated for the specific role you're hiring for, in the specific tech stack you use. The questions change. The dimensions don't.
AI
The differentiator
AI Collaboration
Five sub-tests: prompt quality, reading AI code, fixing AI code, critique, and live collaboration. The first assessment that grades AI fluency as a first-class skill.
The flow
01
Or describe the role in a sentence. We pick up seniority, tech stack, and what the candidate will actually be doing.
02
A custom test across all six dimensions, calibrated to the role. Reading, writing, debugging, communication, tradeoffs, AI collaboration.
03
Candidates take the test async. You get per-question feedback, integrity flags, and — for AI questions — the full collaboration transcript.
Integrity
On the AI Collaboration section, the sidebar is right there: we're scoring how they use it. On every other section, behavioral analysis, keystroke pacing, paste patterns, and LLM fingerprinting flag candidates trying to outsource the fundamentals.
allowed
On AI questions
Sidebar visible. Every prompt, accept, and edit logged for the reviewer.
flagged
LLM fingerprint on a no-AI question
Uniform structure, hedging language, suspiciously polished prose under time pressure.
flagged
Pure paste
Non-trivial answer arrived with zero keystrokes — pasted from somewhere off-page.
flagged
Burst pattern
Long idle, then a burst at 400 characters per minute, then submit. The 'they alt-tabbed to ChatGPT' fingerprint.
flagged
Tab switches
Five or more focus changes during a single question.
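The three behavioral flags above can be read as simple rules over an event log. The sketch below is an illustration of that idea, not the product's detection logic: the `AnswerEvent` shape, the 60-second idle threshold, the 80-character "non-trivial" cutoff, and the 20-key minimum burst size are all assumptions. Only the 400 characters per minute rate and the five-focus-change limit come from the text.

```typescript
// Hypothetical per-question event log; timestamps are ms since the question opened.
type AnswerEvent =
  | { kind: "keystroke"; atMs: number }
  | { kind: "paste"; atMs: number; chars: number }
  | { kind: "focusChange"; atMs: number }

type Flag = "pure-paste" | "burst-pattern" | "tab-switches"

const BURST_CPM = 400 // from the text: a 400 characters-per-minute burst
const MAX_FOCUS_CHANGES = 5 // from the text: five or more focus changes
const LONG_IDLE_MS = 60_000 // assumption: what counts as a "long idle"
const NON_TRIVIAL_CHARS = 80 // assumption: what counts as a "non-trivial" answer
const MIN_BURST_KEYS = 20 // assumption: ignore bursts too short to measure

// True when a long idle gap is followed by typing at or above BURST_CPM until the end.
function hasBurstAfterIdle(keyTimes: number[]): boolean {
  const times = [...keyTimes].sort((a, b) => a - b)
  let prev = 0 // the question start counts as the beginning of the first gap
  for (let i = 0; i < times.length; i++) {
    if (times[i] - prev >= LONG_IDLE_MS) {
      const burst = times.slice(i)
      const spanMs = burst[burst.length - 1] - burst[0]
      if (burst.length >= MIN_BURST_KEYS && spanMs > 0) {
        const cpm = (burst.length - 1) / (spanMs / 60_000)
        if (cpm >= BURST_CPM) return true
      }
    }
    prev = times[i]
  }
  return false
}

function flagAnswer(events: AnswerEvent[], answerLength: number): Flag[] {
  const flags: Flag[] = []
  const keyTimes = events.filter((e) => e.kind === "keystroke").map((e) => e.atMs)
  const focusChanges = events.filter((e) => e.kind === "focusChange").length

  // Pure paste: a non-trivial answer arrived with zero keystrokes.
  if (answerLength >= NON_TRIVIAL_CHARS && keyTimes.length === 0) flags.push("pure-paste")
  if (hasBurstAfterIdle(keyTimes)) flags.push("burst-pattern")
  if (focusChanges >= MAX_FOCUS_CHANGES) flags.push("tab-switches")
  return flags
}
```

Rules like these produce flags for a reviewer to weigh, not verdicts; a fast typist returning from a break could trip the burst rule legitimately.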
Pricing
The free tier is the full product. Upgrade when you're moving real volume, not before.
Start hiring the ones who can ship working software in 2026 — with AI, around AI, despite AI.