
Evaluator — Hire engineers who use AI well

For the 2026 hiring market

Every engineer uses AI now.
Hire the ones who use it well.

Evaluator is the technical assessment that grades how skillfully candidates collaborate with AI — reading it, fixing it, prompting it, overriding it — on top of the fundamentals that still matter: reading, writing, debugging.

10 free / month · No card required · See what's tested

AI · Critique · Question 14 of 17

20 pts

An AI assistant produced this. It looks reasonable. It is not. Find every flaw and fix it.

async function fetchUserPosts(userId: string) {
  const res = await fetch(`/api/users/${userId}/posts`)
  const posts = res.json.parse()
  return posts.filter((p, i) => i <= posts.length)
}

Candidate found

  • res.json.parse() — hallucinated. It's await res.json().
  • i <= posts.length — off-by-one. Should be < or just drop the filter.

Caught the hallucination
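For reference, here is a minimal sketch of what the repaired function might look like once both findings are applied. The `Post` shape, the `FetchLike` type, and the injected `fetchImpl` parameter are assumptions added so the sketch is self-contained and can be stubbed; the status check is an extra guard that is not among the candidate's findings.

```typescript
// Hypothetical post shape; the original snippet does not define one.
interface Post {
  id: string
  title: string
}

// Assumed minimal subset of the fetch API, so a stub can stand in for it.
type FetchLike = (url: string) => Promise<{
  ok: boolean
  status: number
  json(): Promise<unknown>
}>

async function fetchUserPosts(userId: string, fetchImpl: FetchLike): Promise<Post[]> {
  const res = await fetchImpl(`/api/users/${userId}/posts`)
  // Extra guard (an assumption, not one of the listed findings): surface HTTP failures.
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`)
  // Fix 1: res.json() is an async method and must be awaited; res.json.parse() does not exist.
  const posts = (await res.json()) as Post[]
  // Fix 2: the index filter kept every element, so it is dropped entirely.
  return posts
}
```

In a browser or a recent Node runtime this could be called as `fetchUserPosts(id, fetch)`.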

The shift

You've been screening for the wrong thing.

Every shop now has Copilot, Cursor, Claude Code. The bottom quartile of every team is the one that takes the AI's first answer. The top quartile catches the hallucinated import, rewrites the over-engineered class, and ships something that actually works. We test for the top quartile.

The differentiator

Five tests for how someone works with AI.

No other platform does this. Most still treat AI as a thing to detect. We treat it as a tool to grade.

  1. Prompt quality

    Can they brief an AI like they brief a junior?

    We give them a feature spec. They write the prompt they would actually send. We score for context, constraints, edge cases, and acceptance criteria — not for verbosity.

    Strong candidate response

    Implement a debounced search hook for the Postgres-backed /api/search endpoint we already use in SearchBar.tsx. 300ms debounce. Cancel in-flight requests on new input (use the AbortController we use elsewhere). Return { data, error, loading }. Don't introduce a new fetch library — we use native fetch. Cover the empty-query case (return early, no request).

    + context · + constraints · + edge case

  2. Reading AI code

    Can they tell "works" from "good"?

    We show them AI-written code that runs. They explain what it does, flag the AI-shaped tells — over-engineered classes, defensive try/catch eating real errors, non-idiomatic patterns — and say what they would change.

    class UserDataManager {
      private cache: Map<string, User | null>
      constructor() {
        this.cache = new Map()
      }
      async getUserById(id: string | null): Promise<User | null> {
        if (!id) return null
        try {
          if (this.cache.has(id)) return this.cache.get(id)!
          return await fetchUser(id)
        } catch (e) { return null }
      }
    }

    Candidate

    “A class for what should be a function. Swallows errors silently — caller can't tell a 500 from a missing user. Doesn't actually write to the cache, so it never warms.”
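One way the candidate's three points could be acted on is sketched below: a plain function instead of a class, errors left to propagate, and a cache that is actually written. The `User` shape and the injected `loadUser` parameter (standing in for the snippet's `fetchUser`) are assumptions made to keep the sketch self-contained.

```typescript
// Hypothetical user shape; the original snippet does not define `User`.
interface User {
  id: string
  name: string
}

// `loadUser` stands in for the snippet's `fetchUser` (assumed signature).
function createGetUserById(loadUser: (id: string) => Promise<User | null>) {
  const cache = new Map<string, User | null>()

  return async function getUserById(id: string | null): Promise<User | null> {
    if (!id) return null

    // `undefined` means "not cached yet"; a cached `null` means "known missing user".
    const cached = cache.get(id)
    if (cached !== undefined) return cached

    // No try/catch: a failed request rejects, so the caller can tell an
    // error apart from a missing user.
    const user = await loadUser(id)
    cache.set(id, user) // the original never wrote to the cache
    return user
  }
}
```

Caching `null` for missing users is a design choice in this sketch; a real service might prefer to re-query them.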

  3. Fixing AI code

    Can they surgically fix one bug?

    We plant exactly one realistic bug in an AI-written function. They find it and patch it minimally. We penalize broad refactors that miss the actual problem.

  4. Critique

    Can they catch every hallucination?

    We give them code with multiple planted flaws — fake APIs, off-by-ones, swallowed errors. We grade thoroughness: did they catch them all, or did they stop at the first one and say "looks good"?

    Found by candidate · 3 / 3

    • lodash.deepFlatten doesn't exist — _.flattenDeep does.
    • catch (e) swallows the error. Should at least log or rethrow.
    • Loop runs O(n²) — switch the outer to a Set lookup.
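The code behind these findings is not shown, so the following is only an illustration of the third point: replacing one of two nested loops with a Set lookup. The function names and the string-id inputs are hypothetical.

```typescript
// Quadratic version: for every wanted id, scan the whole `available` array.
function commonIdsQuadratic(wanted: string[], available: string[]): string[] {
  const result: string[] = []
  for (const w of wanted) {
    for (const a of available) {
      if (w === a) {
        result.push(w)
        break
      }
    }
  }
  return result
}

// Linear version: build the Set once, then each membership check is O(1) on average.
function commonIds(wanted: string[], available: string[]): string[] {
  const availableSet = new Set(available)
  return wanted.filter((w) => availableSet.has(w))
}
```

Both functions return the same ids in the same order; only the cost of the membership check changes.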

  5. Live collaboration

    Watch them work with the assistant.

    On the final question, the candidate gets an AI sidebar built into the editor. We record every prompt they send, every suggestion they accept, every chunk they reject, and every keystroke they make on top. The transcript goes to you.

    function debouncedSearch(query: string) {
      // accepted from AI
      if (!query) return
      if (controller) controller.abort()
      // candidate edit: was 200, made it 300
      timeout = setTimeout(...)
    }

    Sidebar transcript

    You: use AbortController for cancellation

    AI: <draft>

    You: debounce is wrong — should be 300ms not 200ms

    4 prompts · 2 accepts · 1 reject · 38% manual edits
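The debounced search that appears in both the prompt-quality example and the sidebar snippet could be sketched roughly as follows. This is a framework-agnostic version rather than the React hook the prompt asks for, and `createDebouncedSearch`, `SearchFn`, and the `onState` callback are hypothetical names. The 300ms delay, the AbortController cancellation, the `{ data, error, loading }` shape, and the empty-query early return come from the prompt.

```typescript
interface SearchState<T> {
  data: T | null
  error: Error | null
  loading: boolean
}

// Assumed signature: the caller supplies the actual request and honours the signal.
type SearchFn<T> = (query: string, signal: AbortSignal) => Promise<T>

function createDebouncedSearch<T>(
  search: SearchFn<T>,
  onState: (state: SearchState<T>) => void,
  delayMs = 300,
) {
  let timer: ReturnType<typeof setTimeout> | undefined
  let controller: AbortController | undefined

  return function run(query: string): void {
    // New input: cancel the pending timer and any in-flight request.
    if (timer !== undefined) clearTimeout(timer)
    controller?.abort()
    controller = undefined

    // Empty-query case: return early, no request.
    if (query.trim() === "") {
      onState({ data: null, error: null, loading: false })
      return
    }

    const current = new AbortController()
    controller = current
    onState({ data: null, error: null, loading: true })

    timer = setTimeout(() => {
      search(query, current.signal).then(
        (data) => {
          // Ignore results from requests that were superseded.
          if (!current.signal.aborted) onState({ data, error: null, loading: false })
        },
        (err: unknown) => {
          if (current.signal.aborted) return
          const error = err instanceof Error ? err : new Error(String(err))
          onState({ data: null, error, loading: false })
        },
      )
    }, delayMs)
  }
}
```

With native fetch, `search` might be something like `(q, signal) => fetch(url, { signal }).then((r) => r.json())`, where `url` is built from `/api/search` and the query; the exact query-string format is not given in the source.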

Six dimensions

Five fundamentals. Plus the one nobody else tests.

Every assessment is generated for the specific role you're hiring for, in the specific tech stack you use. The questions change. The dimensions don't.

AI

The differentiator

AI Collaboration

Five sub-tests: prompt quality, reading AI code, fixing AI code, critique, and live collaboration. The first assessment that grades AI fluency as a first-class skill.


The flow

From a JD to a scored candidate, in one sitting.

01

Paste a job description.

Or describe the role in a sentence. We pick up seniority, tech stack, and what the candidate will actually be doing.

02

Get an assessment in 30 seconds.

A custom test across all six dimensions, calibrated to the role. Reading, writing, debugging, communication, tradeoffs, AI collaboration.

03

Share a link. Get a scored report.

Candidates take the test async. You get per-question feedback, integrity flags, and — for AI questions — the full collaboration transcript.

Integrity

We allow AI where it's expected. We catch it where it isn't.

On the AI Collaboration section, the sidebar is right there: we're scoring how they use it. On every other section, behavioral analysis, keystroke pacing, paste patterns, and LLM fingerprinting flag candidates trying to outsource the fundamentals.

allowed

On AI questions

Sidebar visible. Every prompt, accept, and edit logged for the reviewer.

flagged

LLM fingerprint on a no-AI question

Uniform structure, hedging language, suspiciously polished prose under time pressure.

flagged

Pure paste

Non-trivial answer arrived with zero keystrokes — pasted from somewhere off-page.

flagged

Burst pattern

Long idle, then a 400-CPM (characters per minute) burst, then submit. The 'they alt-tabbed to ChatGPT' fingerprint.

flagged

Tab switches

Five or more focus changes during a single question.
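The three behavioral flags above (pure paste, burst pattern, tab switches) could in principle be expressed as simple rules over an event log. The sketch below is not Evaluator's implementation. The event shape, the `flagSession` name, and the idle, burst-size, and answer-length thresholds are assumptions; only the 400 CPM rate and the limit of five focus changes come from the text.

```typescript
// Hypothetical event log for one question; timestamps are in milliseconds,
// and events are assumed to be sorted by time.
type SessionEvent =
  | { kind: "key"; t: number } // one typed character
  | { kind: "blur"; t: number } // editor lost focus (counted as a focus change)

interface Session {
  startedAt: number
  events: SessionEvent[]
  answerLength: number // characters in the submitted answer
}

type Flag = "pure-paste" | "burst-pattern" | "tab-switches"

const BURST_CPM = 400 // from the text
const FOCUS_CHANGE_LIMIT = 5 // from the text
const LONG_IDLE_MS = 60000 // assumption: what counts as a "long idle"
const MIN_BURST_KEYS = 20 // assumption: ignore tiny bursts
const NON_TRIVIAL_CHARS = 80 // assumption: what counts as a "non-trivial answer"

function flagSession(session: Session): Flag[] {
  const flags: Flag[] = []
  const keyTimes = session.events.filter((e) => e.kind === "key").map((e) => e.t)
  const focusChanges = session.events.filter((e) => e.kind === "blur").length

  // Pure paste: a non-trivial answer with zero keystrokes.
  if (keyTimes.length === 0 && session.answerLength >= NON_TRIVIAL_CHARS) {
    flags.push("pure-paste")
  }

  // Burst pattern: find the last long idle gap, then measure the typing rate after it.
  const marks = [session.startedAt, ...keyTimes]
  let burstFrom = -1
  for (let i = 1; i < marks.length; i++) {
    if (marks[i] - marks[i - 1] >= LONG_IDLE_MS) burstFrom = i
  }
  if (burstFrom !== -1) {
    const burst = marks.slice(burstFrom) // keystroke times after the idle gap
    const minutes = (burst[burst.length - 1] - burst[0]) / 60000
    if (burst.length >= MIN_BURST_KEYS && minutes > 0 && burst.length / minutes >= BURST_CPM) {
      flags.push("burst-pattern")
    }
  }

  // Tab switches: five or more focus changes during a single question.
  if (focusChanges >= FOCUS_CHANGE_LIMIT) flags.push("tab-switches")

  return flags
}
```

Rules like these only produce flags for a reviewer to weigh; they say nothing on their own about whether a candidate actually used outside help.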

Pricing

Per assessment, not per seat.

The free tier is the full product. Upgrade when you're moving real volume, not before.

Stop hiring engineers who can ace a 2019 coding test.

Start hiring the ones who can ship working software in 2026 — with AI, around AI, despite AI.

Show HN: Don't ask if devs cheat with AI, test if they're good with it | AI.News