The Gap Between What People Say and What People Do
Anthropologist Margaret Mead is often credited with an observation that became a foundational problem in social science research: people frequently do not do what they say they do, and do not say what they do.
This observation is not about dishonesty. It reflects a genuine gap between self-reported behavior and actual behavior — a gap that arises from limited self-awareness, the distorting effects of language, social desirability bias, and the simple fact that much of what we do is automatic and unavailable to conscious inspection.
This gap is a central challenge in UX research. Users say a design is "intuitive" — then you watch them use it and see them struggle for 40 seconds to find the navigation. Users say they "had no problems with checkout" — then the session replay shows them abandoning the cart at the address field before returning. Users rate a page as "easy to use" — but their cursor traces erratic loops around a CTA they eventually clicked by accident.
Multimodal feedback capture is the methodological response to this gap: rather than choosing between what users say and what users do, you capture both simultaneously and look at them together.
This is a spoke article in our series on The Science of User Feedback: Behavioral Psychology in Web Design.
---
Defining the Two Streams
Direct feedback is any data that comes from a user explicitly stating their experience, opinion, or intent. Surveys, usability interviews, written bug reports, verbal narration, and NPS scores are all forms of direct feedback. The user is consciously constructing and transmitting a message about their experience.
Inferred feedback (sometimes called implicit or behavioral data) is derived from observed user actions without explicit self-report. Session replay footage, click heatmaps, scroll depth, hover patterns, time-on-task measurements, cursor movement, and error rates are inferred data. The user is not reporting their experience; they are simply doing something, and the system records what they do.
Both streams are valuable. Both are incomplete. Direct feedback tells you what users think, feel, and interpret — but is filtered through language, self-awareness, and the limitations of memory (see: the Ebbinghaus forgetting curve and memory decay). Inferred data tells you what users actually did — but provides no interpretation. A cursor loop above a navigation element might mean confusion, might mean accidental movement, or might mean the user was reading the content carefully.
Multimodal capture combines both streams in a single, synchronized artifact, enabling you to interpret each in light of the other.
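To make the pairing concrete, here is a minimal sketch of the two streams as data shapes, written in TypeScript. The type and field names are illustrative assumptions, not any particular tool's schema.

```typescript
// Direct feedback: the user's explicit, conscious report.
interface DirectFeedback {
  transcript: string;        // e.g. transcribed voice narration
  rating?: number;           // e.g. an ease-of-use or NPS rating
  submittedAt: number;       // epoch ms
}

// Inferred feedback: observed behavior, recorded without self-report.
type BehavioralEvent =
  | { kind: "click"; selector: string; at: number }
  | { kind: "scroll"; depthPercent: number; at: number }
  | { kind: "hover"; selector: string; durationMs: number; at: number }
  | { kind: "error"; message: string; at: number };

// A multimodal artifact links both streams on a shared timeline,
// so each stream can be interpreted in light of the other.
interface MultimodalCapture {
  sessionId: string;
  pageUrl: string;
  startedAt: number;         // shared time origin for both streams
  direct: DirectFeedback;
  inferred: BehavioralEvent[];
}
```

The shared time origin is what makes the contradictions discussed below visible: a sentence in the transcript can be lined up against the behavior recorded at the same moment.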
---
The Science of Dual Coding
The case for multimodal capture has a theoretical foundation in cognitive psychology. Allan Paivio's dual-coding theory (1971) proposes that the human mind processes verbal and visual information in separate but interconnected channels. Information encoded in both channels simultaneously is remembered more strongly, processed more deeply, and understood more completely than information encoded in either channel alone.
For feedback reviewers generating a report, dual coding works in their favor: speaking about what they see as they see it produces richer encoding than either writing about it later or recording silently. For developers consuming the report, dual coding works in their favor too: watching a screen recording while hearing the reviewer's voice produces better comprehension than either a written description or a silent video.
This is why the comparison between voice and text feedback consistently favors voice: it activates multiple cognitive channels simultaneously, in the reviewer and the recipient.
Multimodal capture extends this advantage by adding the behavioral channel. Now the developer is not just hearing the reviewer's voice over a screen recording — they are watching the reviewer's actual cursor movements, seeing which elements they interact with, and noticing the behavioral signals that neither the reviewer's words nor the visual framing would have conveyed alone.
---
What You Find When the Streams Contradict
The most valuable moments in multimodal feedback are the contradictions — when what a user says and what they do diverge.
Contradiction type 1: "No problems" + visible struggle. A reviewer narrates: "Okay, I'm on the checkout page, everything looks good." Meanwhile, the session replay shows their cursor circling the promo code field three times before they locate the submit button. The verbal report said no problems. The behavioral data showed a navigation issue. Without the session replay, the issue would never have been reported.
Contradiction type 2: "It was confusing" + successful task completion. A reviewer says "I wasn't sure where to click here" but the behavioral data shows they found and clicked the correct element within five seconds on their first attempt. The verbal report flagged a potential issue; the behavioral data suggests it may be a labeling problem (the element was hard to identify by name) rather than a discoverability problem (they found it immediately). The combination narrows the diagnosis significantly.
Contradiction type 3: Confident narration + rage clicking. A reviewer narrates calmly but the session replay records rapid repeated clicks on a non-responsive button. The rage clicking signal — discussed in detail in our companion piece on frustration analytics and emotional tracking — tells you the reviewer experienced more frustration than their tone communicated. This is diagnostically critical for prioritization.
In each case, the contradiction is the finding. Single-stream feedback would have produced an incomplete or misleading picture; the paired streams reveal what actually happened and how the user framed it.
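The rage-clicking signal from contradiction type 3 shows how inferred data can be operationalized. The sketch below is one common heuristic, rapid repeated clicks on the same element, written in TypeScript; the thresholds are illustrative assumptions, not a standard definition and not any specific tool's algorithm.

```typescript
// A common rage-click heuristic: N or more clicks on the same element
// within a short window. The thresholds (4 clicks / 2 seconds) are
// illustrative, not a standard.
interface ClickEvent {
  selector: string; // CSS selector of the clicked element
  at: number;       // timestamp in ms
}

function findRageClicks(
  clicks: ClickEvent[],
  minClicks = 4,
  windowMs = 2000
): ClickEvent[][] {
  const bursts: ClickEvent[][] = [];
  const byElement = new Map<string, ClickEvent[]>();

  // Group clicks by target element.
  for (const click of clicks) {
    const group = byElement.get(click.selector) ?? [];
    group.push(click);
    byElement.set(click.selector, group);
  }

  // Slide a fixed-size window over each element's clicks, sorted by time.
  // A real implementation would also merge overlapping bursts.
  for (const group of byElement.values()) {
    group.sort((a, b) => a.at - b.at);
    for (let start = 0; start + minClicks <= group.length; start++) {
      const end = start + minClicks - 1;
      if (group[end].at - group[start].at <= windowMs) {
        bursts.push(group.slice(start, end + 1));
      }
    }
  }
  return bursts;
}
```

A burst like this in the replay, paired with calm narration, is exactly the kind of say-versus-do contradiction the paragraphs above describe.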
---
The Practical Architecture of Multimodal Capture
For multimodal capture to work in practice, three conditions need to be met:
Simultaneous recording. The voice narration and the screen session must be captured in real time, synchronized, and stored as a linked artifact. If they are captured separately — a voice memo and a screen recording taken at different times — the interpretive value of the pairing is lost.
In-context triggering. The capture must happen in the environment where the experience occurs, for reasons explored in our article on in-situ capture and memory decay. A voice recording narrated away from the site is narrating a memory, not an experience. A session replay without accompanying narration has no interpretive layer.
Low barrier to entry. The capture mechanism must be frictionless enough that reviewers use it reflexively, not selectively. If capture requires deliberate effort, reviewers will invoke it only for issues they consciously notice and decide are worth reporting. The most valuable signals — the unnoticed confusion, the near-miss, the subtle hesitation — are precisely those that fall below the threshold of conscious reporting. A low-CES capture mechanism (see our piece on Customer Effort Score and feedback quality) makes it more likely that these subtle signals are captured.
GiveFeedback's widget meets all three conditions: it records voice and screen simultaneously, it runs in the staging site context, and its low-friction interface means reviewers trigger it for small issues they might not bother to formally report through other channels.
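For the first condition, simultaneous recording, standard browser APIs are enough to sketch the idea. The following minimal illustration uses getDisplayMedia, getUserMedia, and MediaRecorder; it is not GiveFeedback's implementation, just a demonstration that voice and screen can share one recording timeline.

```typescript
// Minimal sketch of simultaneous, synchronized voice + screen capture
// using standard browser APIs. Illustrative only; production code would
// handle permissions errors and check MediaRecorder.isTypeSupported().
async function recordVoiceAndScreen(): Promise<Blob> {
  // Screen video and microphone audio, requested together so they
  // share one recording timeline.
  const screen = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Combine both into a single stream: one linked artifact, not two
  // files stitched together after the fact.
  const combined = new MediaStream([
    ...screen.getVideoTracks(),
    ...mic.getAudioTracks(),
  ]);

  const recorder = new MediaRecorder(combined, { mimeType: "video/webm" });
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  return new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: "video/webm" }));
    recorder.start();
    // Stop when the user ends screen sharing.
    screen.getVideoTracks()[0].onended = () => recorder.stop();
  });
}
```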
---
Multimodal Feedback in the Broader Research Toolkit
Multimodal capture as implemented in feedback tools like GiveFeedback sits in the middle of a spectrum of user research methods:
At one end, observational lab studies (think-aloud protocols, usability labs) capture rich multimodal data but require scheduled sessions, trained facilitators, and artificial environments that may not reflect real-world behavior.
At the other end, quantitative analytics (heatmaps, funnel data, A/B tests) capture behavioral data at scale but provide no interpretive layer — you see what happened, but not what users thought about it.
In-context multimodal feedback capture occupies a valuable middle ground: it captures rich, layered data (voice + behavior) in the real environment where experience occurs, without requiring formal study conditions. It is what researchers call an ecologically valid method — data collected in the user's natural habitat.
This is not a replacement for formal usability studies or quantitative analytics; it is a complement that fills the gap between them. The real-world context that lab studies cannot replicate, and the interpretive layer that analytics tools cannot provide, are both present in a 45-second voice-and-screen feedback clip.
---
From Multimodal Capture to Actionable Tickets
Multimodal recordings are only useful if they are efficiently converted into actionable work. This is where AI plays a critical role.
Raw voice recordings and session replays require someone to watch or listen — which is time-consuming at scale. AI-powered transcription and task extraction can convert a 60-second voice recording into a structured ticket: title, page, description, steps to reproduce, priority indicator. The session replay is attached as evidence.
GiveFeedback's AI extraction layer does exactly this: voice narration is transcribed, the key issue is isolated, and the result is a structured task a developer can act on without watching the full recording. The recording remains available for cases where the structured ticket raises questions — but for the majority of issues, the extracted task is sufficient.
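As a rough sketch of what such an extracted task might look like as data, the TypeScript shape below uses the fields named above (title, page, description, steps to reproduce, priority) plus a link to the replay. The field names and the example values are illustrative assumptions, not GiveFeedback's actual schema.

```typescript
// A structured ticket shaped like the fields described above.
// Field names are illustrative, not a specific tool's schema.
interface ExtractedTask {
  title: string;
  page: string;               // URL or route where the issue occurred
  description: string;        // what the reviewer described
  stepsToReproduce: string[]; // reconstructed from narration + replay
  priority: "low" | "medium" | "high";
  replayUrl: string;          // the session replay attached as evidence
}

// Example of what a 60-second narration might reduce to
// (values invented for illustration):
const example: ExtractedTask = {
  title: "Submit button hard to find near promo code field on checkout",
  page: "/checkout",
  description:
    "Reviewer reported no problems verbally, but the replay shows the cursor circling the promo code field before locating the submit button.",
  stepsToReproduce: [
    "Add any item to the cart",
    "Open /checkout",
    "Attempt to submit the order",
  ],
  priority: "medium",
  replayUrl: "https://example.com/replays/abc123", // placeholder
};
```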
This closes the loop from capture to action, making multimodal feedback scalable rather than aspirational.
---
Conclusion
The gap between what users say and what users do is not a flaw in users — it is a fundamental characteristic of human self-reporting, well documented across decades of psychology and UX research. Single-stream feedback methods (text, surveys, voice alone, analytics alone) capture half the picture at best.
Multimodal capture — pairing direct voice narration with inferred session replay data — produces a form of evidence that neither stream could generate alone. Contradictions between the streams are often the most valuable findings, revealing hidden struggles, unspoken frustrations, and behavioral patterns that conscious self-report misses entirely.
For the full framework this fits into, see the hub article: The Science of User Feedback: Behavioral Psychology in Web Design. For a complementary look at how emotional signals in behavioral data enable urgency-based triage, read Rage Clicking and Sentiment: Tracking the Emotional State of Your Users.