Data science, done correctly, can help teams avoid common pitfalls
Self-report survey data is everywhere, but bias and subjectivity confound interpretation and predictive modeling. Here we describe common pitfalls and outline best practices for handling surveys toward optimal data science outcomes.
Predicting human psychology remains one of the toughest challenges for data science, but also one of the most valuable, with far-reaching consequences in all industries that interface with humans. Some examples of psychology-driven machine learning include:
Self-report surveys* are the most common way to quantify how people think, and there are so many applications that we often fail to realize we are looking at or even taking one! Product and restaurant reviews on a 5-star scale, rating the call quality of your Zoom meeting, and surveys administered by clinicians to assess pain and diagnose and monitor neurological disorders are all forms of self-report that measure relative experiences on a common scale.
Despite their prevalence, surveys present some frustrating downsides that impact machine learning models and their predictions. The biggest problem is that we lack an objective measure of individual human thoughts and experiences that shape a person’s psychology, so we must rely on subjective self-reported measures. In this post, we discuss the recommended strategies to identify and address pitfalls in survey data.
*In this article, we focus on self-report surveys, which encompass subjective questions about individual thoughts and experiences rather than objective questions about location, age, etc.
Bias is a confounding factor when analyzing survey data. Carefully analyzing survey responses for clusters and outliers can help identify candidates for exclusion when training machine learning models. Nevertheless, because bias is necessarily built into the development and administration of surveys, it is important to consider the following:
Survey scales influence responses — Most surveys we take allow responses on recognizable 2-, 3-, 4-, 5-, 7-, or 9-point “Likert” scales, e.g. “strongly agree, agree, neutral, disagree, strongly disagree.” These scales assume, intuitively, that our perceptions can be represented on a linear scale. The number of response choices can influence the distribution of answers, and subtleties such as the sequence of response options can measurably influence how survey takers respond. Likewise, the number and order of questions can influence survey responses. Even the time of day at which a survey taker chooses to respond can affect the outcome.
Humans are heterogeneous — It is impossible to entirely remove bias when developing, administering, and analyzing surveys; sampling bias is one of the most common forms. While it would in theory be possible to acquire a uniform sample on which generalizable models could be constructed, in reality humans are far too complex for us to know what constitutes a truly uniform sample. Demographic breakdowns can provide a sense of the populations to which subsequent machine learning models are likely to apply. However, the statistical assumptions of machine learning provide no guarantee of success on data that differs significantly from the training population.
Human nature biases survey responses — When analyzing surveys, we must assume that the survey takers are honest, but our psychology does not always encourage us to reveal our truest selves. Generally, we want to be seen as agreeable. This tendency, often called social desirability bias and closely related to acquiescence, reveals itself as a bias toward reporting positive feelings and culturally accepted character traits. Some of these factors can be reduced through anonymity; however, it is ultimately impossible to deconvolve our psychological tendencies from survey responses, meaning that models built on them will include rather than explain these aspects of our human nature.
Survey interpretation depends on tabulation — Each question is most often scored on a Likert scale where 0 represents the least severity or frequency and the maximum value represents the greatest. Clinical interpretations of scales vary depending on the assessment. For example, one can use the total score of the Beck Depression Inventory-II (the sum over all questions) to categorize survey takers into levels of depression from minimal to severe. However, some surveys such as the Five-Factor Inventory are broken down into sub-scales for personality traits where only certain questions are totaled for each trait, making a total score across the whole survey clinically meaningless. It is important to understand how a survey is meant to be scored clinically before analysis to ensure that the data used in AI or data science is clinically meaningful and interpretable.
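To make the difference between total and sub-scale scoring concrete, here is a minimal Python sketch. The six-item survey, the sub-scale names, and the item-to-sub-scale mapping are all invented for illustration and do not correspond to any real instrument; a real survey's scoring manual should always be the source of the mapping.

```python
from typing import Dict, List

# Hypothetical mapping: which question indices belong to which sub-scale.
# Invented for illustration only; not taken from any published instrument.
SUBSCALES: Dict[str, List[int]] = {
    "trait_a": [0, 2, 4],
    "trait_b": [1, 3, 5],
}

def score_subscales(responses: List[int],
                    subscales: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum only the items that belong to each sub-scale."""
    return {name: sum(responses[i] for i in items)
            for name, items in subscales.items()}

# One respondent's answers on a 0-4 scale (made-up data).
responses = [4, 0, 3, 1, 4, 0]

scores = score_subscales(responses, SUBSCALES)
total = sum(responses)

print(scores)  # {'trait_a': 11, 'trait_b': 1}
print(total)   # 12
```

The single total of 12 hides the fact that this respondent scores very high on one sub-scale and very low on the other, which is why a whole-survey sum can be misleading for sub-scaled instruments.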
While we cannot normalize survey responses to compare individual experiences directly, we can reduce the impact of such differences on models built from survey data using the following data science strategies:
Validate the instrument — Validating a survey is a critical step in removing unwanted biases and noise. Many validated surveys already exist to assess psychological states and experiences. Choosing one of these surveys negates the need to develop and validate one’s own, which would involve many steps of validation including evaluating content, consistency, reliability, and relationships with similar measures. In other words, it is necessary to ensure that a survey is assessing what you think it is by determining if the content makes sense, is inclusive of the entire domain you are assessing, is replicable, is consistent, and correlates with similar measures as expected.
Measure test-retest and cross-survey validity — Even when using a validated and standardized survey, it is good practice to check the internal consistency of the scale (questions are related predictably) and the test-retest reliability of the scale (scores from two separate time points are related predictably). Also, because scores are sometimes expected to change over time with new circumstances, it is helpful to correlate the survey scores with scores from related surveys. By using within-subject correlations across time and between-subject correlations across related surveys, outliers can be detected and evaluated to determine whether individual data points are valid. This quality-control step helps cut through the noise.
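As a rough illustration of these checks, the sketch below computes Cronbach's alpha (one common measure of internal consistency, chosen here as an assumption since the post does not name a specific statistic) and a Pearson correlation between total scores at two time points. All response data is made up, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cronbach_alpha(rows):
    """Internal consistency; rows holds one list of item responses per respondent."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up data: 5 respondents, 4 questions, answered at time point 1.
time1 = [
    [3, 3, 2, 3],
    [1, 0, 1, 1],
    [2, 2, 2, 1],
    [0, 1, 0, 0],
    [3, 2, 3, 3],
]
# Made-up total scores for the same respondents at time point 2.
time2_totals = [10, 4, 8, 2, 11]

alpha = cronbach_alpha(time1)
retest_r = pearson_r([sum(r) for r in time1], time2_totals)

print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Test-retest r:    {retest_r:.2f}")
```

The same `pearson_r` helper could be applied between total scores on two related surveys; respondents who sit far from the overall trend are candidates for a closer look rather than automatic exclusion.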
Model individual questions — Avoiding the temptation to consider survey responses in aggregate can lead to better outcomes. When using a sum across all question responses, individuals with very low and very high cumulative scores can offer insight into these sub-populations; however, intermediate scoring individuals become more challenging to interpret. There are numerous ways to achieve an intermediate score whereas low and high scores require the majority of responses to be low or high. For this reason, we recommend breaking surveys into individual questions and modeling responses on their intended scales.
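The claim that intermediate totals are ambiguous can be checked by counting how many distinct response patterns produce each total score. The sketch below does this for a hypothetical survey; the function name and the small survey size are illustrative choices.

```python
def ways_per_total(n_questions, max_response):
    """Return a list where index t holds the number of distinct response
    patterns (each answer in 0..max_response) that sum to a total of t."""
    counts = [1]  # with zero questions there is exactly one way to score 0
    for _ in range(n_questions):
        new_counts = [0] * (len(counts) + max_response)
        for total, n_ways in enumerate(counts):
            for response in range(max_response + 1):
                new_counts[total + response] += n_ways
        counts = new_counts
    return counts

# Hypothetical survey: 3 questions, each answered 0, 1, or 2.
small = ways_per_total(3, 2)
print(small)  # [1, 3, 6, 7, 6, 3, 1]

# A longer hypothetical survey: 10 questions on a 0-4 scale.
large = ways_per_total(10, 4)
print(large[0], large[40])  # 1 1  (only one pattern gives the minimum or maximum)
print(large[20] > 100_000)  # True (the midpoint total has a very large number of patterns)
```

Even in the three-question case, the middle total of 3 can arise from seven different response patterns while the extremes each have only one, and the imbalance grows quickly with survey length.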
Regression from classification — Total survey response regression models can be constructed from individual question models with enhanced interpretability and specificity. Even in cases where classification is used at the individual question level, a total score regression output can be generated based either on binarized model predictions or continuous model probabilities. Breaking survey score sums down in this way offers the ability to investigate which questions provide the strongest signal for the prediction task or tasks at hand.
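One simple way to picture this is shown below, assuming each question has its own binary classifier that outputs the probability that a respondent endorses that item. The probabilities, the 0.5 threshold, and the function names are hypothetical stand-ins for whatever per-question models are actually trained.

```python
# Hypothetical per-question model outputs for one respondent:
# probability that each of 5 items is endorsed (made-up numbers).
question_probs = [0.9, 0.2, 0.6, 0.4, 0.8]

def total_from_probabilities(probs):
    """Continuous total: sum of the per-question probabilities."""
    return sum(probs)

def total_from_binarized(probs, threshold=0.5):
    """Discrete total: count of questions predicted as endorsed."""
    return sum(1 for p in probs if p >= threshold)

soft_total = total_from_probabilities(question_probs)
hard_total = total_from_binarized(question_probs)

print(round(soft_total, 2))  # 2.9
print(hard_total)            # 3

# Per-question contributions stay visible, so it is easy to see
# which items drive the predicted total.
ranked = sorted(enumerate(question_probs), key=lambda item: item[1], reverse=True)
print(ranked[0])  # (0, 0.9): question 0 contributes most
```

Because the total is assembled from per-question outputs, the contribution of each question can be inspected directly rather than being hidden inside a single aggregate model.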
Consider perceived severity — While there is no acknowledged strategy to normalize survey responses on commonly administered Likert scales, one can improvise in situations where survey takers also answer questions about how much their reported experiences affect their day-to-day life. For example, when diagnosing behavioral disorders, the Diagnostic and Statistical Manual of Mental Disorders (DSM) requires that “symptoms cause clinically significant distress or impairment in social, occupational, or other important areas of functioning.” Therefore, if an individual reports severe symptoms but indicates that they are impacted very little, either the survey taker’s symptoms should be interpreted with less weight or the survey taker is not responding truthfully.
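A minimal sketch of this consistency check follows. The field names, the 0-4 scales, and both cutoffs are assumptions made for illustration; appropriate thresholds depend on the instrument and should not be read as clinical guidance.

```python
def flag_inconsistent(records, symptom_cutoff=3, impact_cutoff=1):
    """Return ids of respondents who report severe symptoms but little
    day-to-day impact. Both cutoffs are illustrative assumptions."""
    return [
        r["id"]
        for r in records
        if r["symptom_severity"] >= symptom_cutoff
        and r["daily_impact"] <= impact_cutoff
    ]

# Made-up responses, both fields on a hypothetical 0-4 scale.
records = [
    {"id": "p01", "symptom_severity": 4, "daily_impact": 4},
    {"id": "p02", "symptom_severity": 4, "daily_impact": 0},
    {"id": "p03", "symptom_severity": 1, "daily_impact": 1},
    {"id": "p04", "symptom_severity": 3, "daily_impact": 1},
]

flagged = flag_inconsistent(records)
print(flagged)  # ['p02', 'p04']
```

Flagged respondents are not necessarily invalid; the flag simply marks records whose symptom reports might be down-weighted or reviewed before modeling.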
Human psychology can be incredibly challenging to measure, even with the best tools available. While we are continually learning more about the connections between human thoughts and behaviors, self-report surveys remain a blunt instrument for assessing nuanced individual details. Careful analysis like the methods described above can help reduce bias and noise in survey data, but self-reports will always have inherent limitations.
These limitations are pushing scientists and business people alike to search for more objective methods of quantifying psychology. One promising example is combining self-report with video analytics, which augments individual experiences with contextual behavioral details that enhance interpretability. In an upcoming blog post, we will discuss opportunities to leverage video footage for human behavior analysis.