Usability testing: methods, types & best practices
How to run usability tests that actually surface UX friction. Methods, sample size, common mistakes, and the moderated/unmoderated trade-off.
Usability testing is the practice of giving real users a real task in your product and watching where they hesitate, misclick, or quit. It measures whether the interface works, not whether the idea is good.
What usability testing actually catches
Usability testing catches a specific kind of problem: the gap between what your design assumes and what a person actually does. That gap is invisible to most other methods.
Analytics tell you that 40% of users drop off at step three. They do not tell you why. A/B testing tells you that variant B converts 8% better. It does not tell you that both variants confuse half the users in ways neither version fixes. Surveys tell you what people remember feeling. Usability testing shows you what they actually do, in the moment, with their hands on the product.
Concretely, a good usability test surfaces things like: the user did not see the primary button because it sat below the fold on their laptop; the label "Settings" did not match their mental model of "Account"; they assumed the search bar was a chat input; they tapped the logo expecting it to go home and got dumped on a marketing page. None of those show up in a funnel chart.
If you want to understand the broader research stack this fits into, see user research and user interviews.
Moderated vs unmoderated
The first decision is whether a researcher is present. Both formats are legitimate. They optimize for different things.
Moderated
A facilitator runs the session live, on a call or in person. Strength: you can ask follow-ups in the moment, probe hesitation, and adjust tasks based on what you see. Weakness: slow, expensive, capped by your calendar. Best for early prototypes and complex flows where the "why" matters.
Unmoderated
The participant completes tasks alone using a tool that records screen, voice, and clicks. Strength: cheap, fast, scales to hundreds of sessions across time zones. Weakness: when the user pauses for 12 seconds on the pricing page, you cannot ask why. Best for validating known flows and comparing variants.
A reasonable default: moderated for discovery work, unmoderated for validation at scale. Many teams run both inside the same project.
The main test types
Usability testing is an umbrella. Underneath it sit several distinct formats, each suited to a different question.
- 5-second test:
Show a screen for five seconds, then ask what the user remembers. Measures first-impression hierarchy: what stood out, what they think the page is for, what action they expect to take. Useful for landing pages, dashboards, and empty states.
- First-click test:
Show a static screen and ask "where would you click to do X?" Research from Bob Bailey and others shows that users who get the first click right finish the task far more often. A cheap way to validate navigation and CTA placement before you build anything interactive.
- Task-based test:
The classic. Give the user a realistic scenario ("you want to cancel your subscription before the next renewal") and watch how they attempt it on a working prototype or live product. Measures completion rate, time on task, and error count (a tally sketch follows this list). If you are testing onboarding, ask the user to start fresh and say what they expect to happen at each step.
- Tree test:
Strip the UI away and test the information architecture in isolation. The user navigates a text-only hierarchy to find a target item. Tells you whether your taxonomy is the problem before you spend design cycles styling it.
These compose well. A common sequence: tree test the IA, first-click test the new nav, then a task-based test on the full prototype.
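The task-based format is where the metrics live, so it is worth being precise about how to compute them. Below is a minimal sketch of the tally; the session records and field names are hypothetical, not from any particular testing tool.

```python
from statistics import median

# Hypothetical session records for one task: whether the participant
# completed it, how long they took, and how many errors (wrong clicks,
# dead ends, backtracks) the note-taker logged.
sessions = [
    {"completed": True,  "seconds": 48,  "errors": 1},
    {"completed": True,  "seconds": 95,  "errors": 3},
    {"completed": False, "seconds": 180, "errors": 5},
    {"completed": True,  "seconds": 62,  "errors": 0},
    {"completed": True,  "seconds": 71,  "errors": 2},
]

completion_rate = sum(s["completed"] for s in sessions) / len(sessions)

# Median over completers only: an abandoned attempt is not a slow
# success, and one three-minute wander should not drag the number.
time_on_task = median(s["seconds"] for s in sessions if s["completed"])

errors_per_session = sum(s["errors"] for s in sessions) / len(sessions)

print(f"completion {completion_rate:.0%}, "
      f"median time {time_on_task}s, "
      f"errors/session {errors_per_session:.1f}")
```

Reporting time on task over completers only, and as a median rather than a mean, is a deliberate choice: both keep one outlier session from distorting the metric.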
The "five users" finding (and its caveats)#
Jakob Nielsen's well-known finding from the 1990s, building on work with Tom Landauer, is that around five users surface roughly 85% of the usability issues in a given flow. The math is intuitive. The first user finds the most obvious problems. The second finds some of the same ones plus a few new ones. By the fifth, additional sessions mostly repeat what you already saw. Steve Krug makes a similar argument in Rocket Surgery Made Easy: three users, run early and often, beats fifteen users run once a quarter.
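The model behind the number, from Nielsen and Landauer's 1993 paper, is that a single user uncovers a fixed share L of the problems in a flow, about 0.31 in their data, so n users find 1 - (1 - L)^n of them. A few lines make the diminishing returns visible (0.31 is their average; your product's L may differ):

```python
# Nielsen/Landauer model: n users surface 1 - (1 - L)^n of the
# usability problems, where L is the share one user finds alone.
L = 0.31  # average across Nielsen & Landauer's projects; varies by product

for n in range(1, 9):
    print(f"{n} users -> {1 - (1 - L) ** n:.0%} of problems found")
```

Five users land at roughly 84% under this model, and each session after that buys less, which is also why re-testing a revised design beats piling more sessions onto the first one.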
Five is enough when:
- You are testing a single, well-scoped flow with one user type.
- You plan to iterate and re-test, not ship and walk away.
- You care about the existence of issues, not their precise frequency.
You need more (15 to 30+) when:
- The product serves multiple distinct user segments. Run five per segment.
- You need quantitative metrics, like statistically reliable completion rates (see the interval sketch after this section).
- The flow is broad with many possible paths.
The common mistake is treating "five is enough" as a budget cap rather than a per-segment minimum. A B2B product with admins, end-users, and finance buyers needs at least fifteen sessions, not five.
The 85% number assumes a competent moderator and tasks that exercise the flow. Five users running through tasks that do not match real use will surface 85% of nothing useful.
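To put a number on why five users cannot give you a reliable completion rate: here is a sketch of the adjusted Wald interval, which Sauro and Lewis recommend for small-sample completion rates. The sample figures are made up.

```python
from math import sqrt

def adjusted_wald(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% adjusted Wald interval for a small-sample completion rate."""
    p = (successes + z * z / 2) / (n + z * z)
    half = z * sqrt(p * (1 - p) / (n + z * z))
    return max(0.0, p - half), min(1.0, p + half)

# 4 of 5 participants completed: an observed 80% completion rate...
lo, hi = adjusted_wald(4, 5)
print(f" 4/5 completed -> true rate in roughly {lo:.0%}..{hi:.0%}")

# ...but the interval spans about 36% to 98%. The same observed rate
# at n = 30 narrows to a range you can actually act on.
lo, hi = adjusted_wald(24, 30)
print(f"24/30 completed -> true rate in roughly {lo:.0%}..{hi:.0%}")
```

At five users the interval is so wide that the observed rate is nearly uninformative as a statistic; what five users reliably give you is the existence of issues, not their frequency.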
Common mistakes and how to recover
The mechanics of usability testing are easy. The discipline is in not contaminating the result.
Leading the user. Saying "you might want to look at the top right" or even "good!" after a correct click trains the user to perform for you. Recovery: write a script, read the task aloud once, then go silent. If they ask "am I doing this right?", deflect with "what would you do if I were not here?"
Testing on the team. Colleagues know the jargon, the product, and what you want to hear. They glide through tasks real users will fail. Internal tests are useful for catching prototype bugs, not usability findings. Recovery: recruit by behavior ("people who booked a flight in the last 60 days"), not demographics or convenience.
Confusing usability with desirability. A user can complete a task perfectly and still hate the product. A user can love a product they cannot use. Recovery: separate the questions. Run the task first ("can you?"), then ask the attitudinal questions ("would you?") in the debrief.
Watching but not synthesizing. Watching ten sessions is not the same as analyzing them. Without tagged transcripts or structured notes, you will remember the most recent session and forget the first three. Recovery: tag observations as you watch, then review the tags across sessions before you propose fixes. One person stuck is an anecdote, three is a finding.
Fixing every reported issue. Users will mention things that bother them but do not block them. Treat severity and frequency as separate axes. A blocker hit by one user beats a cosmetic complaint from five.
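One way to keep those two axes straight, and to stop an anecdote from masquerading as a finding, is to tally tagged observations across sessions before ranking anything. A minimal sketch, with hypothetical tags and a severity scale you would define for your own team:

```python
from collections import Counter

SEVERITY = {"blocker": 3, "major": 2, "minor": 1, "cosmetic": 0}

# Hypothetical tagged observations: (session_id, issue_tag, severity).
observations = [
    (1, "missed-primary-cta", "major"),
    (2, "missed-primary-cta", "major"),
    (3, "missed-primary-cta", "major"),
    (4, "settings-label-mismatch", "blocker"),
    (1, "logo-expected-home", "cosmetic"),
    (2, "logo-expected-home", "cosmetic"),
    (3, "logo-expected-home", "cosmetic"),
    (5, "logo-expected-home", "cosmetic"),
]

# Frequency = how many sessions hit the issue (one row per session here).
freq = Counter(tag for _, tag, _ in observations)
sev = {tag: severity for _, tag, severity in observations}

# Severity first, frequency second: one blocker outranks five cosmetics.
for tag in sorted(freq, key=lambda t: (SEVERITY[sev[t]], freq[t]), reverse=True):
    print(f"{sev[tag]:>8}  x{freq[tag]}  {tag}")
```

The sort key is the point: frequency only breaks ties within a severity band, so five cosmetic complaints never outrank one blocker.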
AI-moderated tools like Diaform can run friction interviews after a usability task, asking the user to explain where they got stuck and probing "what made you hesitate at that step" in real time. That recovers some of the depth that unmoderated testing usually loses, without booking a researcher's calendar. For more on the format, see AI-moderated interviews.
One last thing
Usability testing rewards frequency more than rigor. Five users every two weeks, on whatever is in front of you, will improve the product faster than a 30-participant study run twice a year. Pick the smallest version of the test you can run this week, run it, and ship the fix.