The Hiring Delusion: Why We Test for the Wrong Thing
There is a fundamental absurdity in modern tech hiring. You spend your career building distributed systems, debugging race conditions, and managing complex trade-offs. Then, to get a new job, you are forced to pause your actual work and spend weeks memorizing algorithmic patterns you haven't used since university. The industry has confused aptitude—can they do the job?—with compliance—will they jump through hoops?
The Solved Science of Performance
We don't need to guess what predicts job performance. We have over 85 years of data.
In a landmark 1998 meta-analysis, researchers Frank Schmidt and John Hunter reviewed thousands of employment studies to determine which selection methods actually correlate with future success. Their findings were later revisited in a 2016 update by Schmidt, Oh, and Shaffer, and then more critically reexamined by Sackett, Zhang, Berry, and Lievens in 2022. The story got more nuanced over time—but the relative ranking of methods held up.
The gold standard is the Work Sample Test. Schmidt and Hunter originally reported a validity of r = 0.54 (meaning it explains roughly a quarter of the variance in real-world outcomes—a strong signal by social science standards). A later meta-analysis by Roth, Bobko, and McFarland (2005) revised this downward to r = 0.33, and Sackett et al. (2022) argued that validity estimates for most selection methods had been systematically inflated due to overcorrection for range restriction. The absolute numbers are debated; the relative hierarchy is not. A work sample means giving the candidate a slice of the actual job. Not a puzzle, but a reality. "Here is a broken API endpoint. Debug it." "Here are messy requirements. Design a schema."
Validity coefficients reported for the two headline predictors:
- Work sample test: r = 0.54 (Schmidt & Hunter, 1998), revised to r = 0.33 (Roth et al., 2005)
- General cognitive ability: r = 0.51 (Schmidt & Hunter, 1998), revised to r = 0.31 (Sackett et al., 2022)
General Cognitive Ability has consistently ranked near the top—Schmidt and Hunter reported r = 0.51, though Sackett et al. revised this closer to r = 0.31. And Structured Interviews, with consistent, calibrated questions across candidates, remain among the strongest predictors regardless of which correction methodology you use. Campion, Palmer, and Campion (1997) demonstrated that adding structure to interviews—standardized questions, anchored rating scales, independent scoring—dramatically improves both reliability and validity.
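A validity coefficient translates into "variance explained" by simple squaring. A quick sketch, using only the estimates quoted above:

```python
# Variance explained by a validity coefficient is r squared.
# The coefficients are the estimates cited in the text
# (Schmidt & Hunter 1998 vs. the later revisions).
estimates = {
    "work sample (1998)": 0.54,
    "work sample (Roth et al. 2005)": 0.33,
    "cognitive ability (1998)": 0.51,
    "cognitive ability (Sackett et al. 2022)": 0.31,
}

for method, r in estimates.items():
    print(f"{method}: r = {r:.2f}, variance explained = {r ** 2:.0%}")
```

Squaring r = 0.54 gives about 29%, which is the "roughly a quarter of the variance" figure above; the revised r = 0.33 explains closer to a tenth.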
The 2016 Schmidt update found that some widely used methods, like the unstructured interview, performed even worse than originally estimated. And even after Sackett et al.'s downward revisions, the core message remains: work samples, cognitive ability tests, and structured interviews consistently outperform résumé screens, years of experience, and unstructured conversations. Test for the work, not for the performance of being tested.
The LeetCode Trap: IQ Wrapped in Syntax
If Work Samples are the best predictor, why is the industry obsessed with LeetCode?
Because LeetCode is not a Work Sample. LeetCode is a proxy for IQ.
Algorithmic puzzles test working memory, processing speed, and pattern recognition. They test how fast you can rotate a 3D shape in your mind (or a binary tree on a whiteboard).
While raw cognitive ability is a decent predictor for junior talent, who have few demonstrable skills yet, its predictive power fades as candidates accumulate years of experience. For a senior engineer, what matters is the hard-earned database of knowledge: maintainability instincts, system design judgment, the ability to navigate ambiguity. IQ measures CPU speed. It tells you very little about what's on the hard drive.
The "Google Defense": Scale Over Accuracy
So why do Big Tech companies stick to a method that is scientifically inferior?
Because they aren't optimizing for accuracy. They are optimizing for scale and risk mitigation. When you hire 10,000 engineers a year, you face a set of industrial constraints that dictate your tools, and those constraints are real.
Start with the leakage problem. A good work-sample project takes days to design. If a massive tech company rolled one out, the solution would be on GitHub within four hours. Algorithmic problems, by contrast, are standardized, interchangeable, and essentially infinite. Then there is the liability shield. Work samples are inherently subjective—"I didn't like his code structure"—and subjectivity invites lawsuits. Algorithms produce objective outputs: "She failed 3 out of 5 hidden test cases." Objective metrics are legally defensible. Finally, there is the false-positive asymmetry. For a trillion-dollar company, hiring a bad engineer is a disaster—costly severance, morale damage, bad code propagating through the monorepo. Rejecting a good engineer, on the other hand, is a rounding error. They have infinite applicants. Their system is rationally designed to be hyper-cautious, even at the cost of throwing away talent.
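The false-positive asymmetry falls out of a back-of-the-envelope expected-cost comparison. Every number below is an illustrative assumption for the sake of the sketch, not data from any of the studies:

```python
# Illustrative expected-cost comparison for a high-volume hirer.
# All dollar figures and error rates are assumptions, chosen only
# to show why a strict filter is rational at Big Tech scale.
BAD_HIRE_COST = 500_000    # severance, morale, cleanup (assumed)
MISSED_HIRE_COST = 20_000  # cost of re-running the pipeline once (assumed)


def expected_cost(false_positive_rate, false_negative_rate, candidates=1000):
    """Total expected cost of screening `candidates` people
    under the given error rates."""
    return (false_positive_rate * candidates * BAD_HIRE_COST
            + false_negative_rate * candidates * MISSED_HIRE_COST)


# Strict filter: almost no bad hires, but many rejected good candidates.
strict = expected_cost(false_positive_rate=0.01, false_negative_rate=0.30)
# Lenient filter: more bad hires slip through, fewer good ones are missed.
lenient = expected_cost(false_positive_rate=0.05, false_negative_rate=0.10)

print(f"strict: ${strict:,.0f}, lenient: ${lenient:,.0f}")
```

Under these assumed costs the strict filter wins comfortably, even though it throws away far more good candidates. Flip the cost ratio, as a small company should, and the lenient filter wins instead.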
These are legitimate engineering trade-offs. The problem is not that Google uses this system. The problem is what happens next.
The Arbitrage Opportunity
Small companies copy the Big Tech interview process without having Big Tech problems.
If you are a startup or a mid-sized company, you don't have the leakage problem—your work-sample project won't end up on GitHub because nobody is farming your interview. You don't need the liability shield—you're making judgment calls on ten hires a year, not ten thousand. And you absolutely cannot afford the false-positive asymmetry. Every rejected candidate who could have shipped product is a real cost to you, not a rounding error.
This creates a massive arbitrage opportunity. While the giants are filtering for "people who are good at interviewing," you can hire the False Negatives—the incredible builders who refuse to play the LeetCode game but can code circles around the competition.
What to Do Instead
The alternative is not complicated. It just requires caring more about signal than process theater.
Pay candidates for a short, realistic work sample. Give them a focused task—four hours, compensated—that mirrors actual work. A broken service to debug. A feature spec to turn into a technical design. A small PR to review with real trade-offs embedded in it. Paying for their time respects their expertise and widens your candidate pool to include people who won't (or can't) grind LeetCode for six weeks.
Pair with them on a real problem. Replace the whiteboard with a live collaboration session. Pull up a real (or lightly anonymized) codebase, describe a real bug or feature, and work on it together. You will learn more about how a person thinks, communicates, and navigates ambiguity in 90 minutes of pairing than in any number of algorithm rounds.
Run structured interviews that probe for judgment, not trivia. Ask about real decisions they've made: architecture trade-offs, times they chose boring technology over clever technology, how they handled a production incident. Use a consistent rubric. Score independently before debriefing.
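Independent scoring against a fixed rubric can be as mechanical as averaging per-dimension scores before anyone debriefs. A minimal sketch; the rubric dimensions, the 1-to-5 scale, and the example scores are all hypothetical:

```python
# Minimal sketch of independent rubric scoring before the debrief.
# Dimensions, scale, and scores are hypothetical examples.
from statistics import mean

RUBRIC = ["architecture trade-offs", "ambiguity navigation", "incident handling"]


def score_candidate(per_interviewer_scores):
    """Each interviewer scores every dimension 1-5 independently;
    aggregate per dimension, then overall, before anyone compares notes."""
    per_dimension = {
        dim: mean(scores[dim] for scores in per_interviewer_scores)
        for dim in RUBRIC
    }
    return per_dimension, mean(per_dimension.values())


interviewer_a = {"architecture trade-offs": 4, "ambiguity navigation": 5,
                 "incident handling": 3}
interviewer_b = {"architecture trade-offs": 3, "ambiguity navigation": 4,
                 "incident handling": 4}

per_dim, overall = score_candidate([interviewer_a, interviewer_b])
print(per_dim, overall)
```

The point of collecting scores before the debrief is to keep the loudest voice in the room from anchoring everyone else, which is exactly the failure mode Campion et al.'s structure recommendations guard against.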
None of this is novel. It is simply what the research has recommended for decades—the research that the industry has systematically ignored in favor of convenience.
Stop testing for binary trees. Start handing candidates a broken API endpoint and watching what they do.
References
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
- Schmidt, F. L., Oh, I.-S., & Shaffer, J. A. (2016). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research findings. Working Paper, Department of Management and Organizations, University of Iowa.
- Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107(11), 2040–2068.
- Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2023). Revisiting the design of selection systems in light of new findings regarding the validity of widely used predictors. Industrial and Organizational Psychology, 16(3), 283–300.
- Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58(4), 1009–1037.
- Campion, M. A., Palmer, D. K., & Campion, J. E. (1997). A review of structure in the selection interview. Personnel Psychology, 50(3), 655–702.
- Levashina, J., Hartwell, C. J., Morgeson, F. P., & Campion, M. A. (2014). The structured employment interview: Narrative and quantitative review of the research literature. Personnel Psychology, 67(1), 241–293.
- Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96(1), 72–98.