It’s Not What You Say But How You Say It

Speech and video analytics to characterize human behavior

We are constantly sending signals that communicate emotion, intent, and even underlying medical conditions. These signals can take the form of facial macro- and micro-expressions, variations in posture and body movement, and minute fluctuations in vocal pitch, frequency, and pronunciation. AI has now reached the point where we can read and interpret these signals for both business and healthcare purposes.

Ongoing development of digital diagnostics and the growing adoption of telemedicine are opening the door to visual, auditory, and text-based biomarkers that point toward diagnoses, both mental and physical, improving patient access and outcomes. Beyond healthcare, the current acceleration in video-based conferencing and recruiting creates opportunities for interesting applications and novel datasets outside of medicine and research. For instance, detailed real-time feedback on emotional cues in employee and customer interactions could be used to optimize hiring and sales.

Comprehensive AI-based video analytics requires a multimodal approach, combining insights from video, audio, and text into a single data stream. Individual features are often limited to predicting only the most obvious human behaviors (e.g. smiling means happy, or saying “I’m sad” means sad). Likewise, when analyzed separately, modes like transcribed text can be brittle, as they are subject to deceit, sarcasm, or misinterpretation. Combining modes and passing them through statistical machine learning models enables the identification of important features and the creation of higher-dimensional features, unlocking deeper insight and better predictive ability.
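To make this concrete, here is a minimal late-fusion sketch in Python: per-mode feature vectors (assumed to have been extracted upstream) are concatenated and passed to an off-the-shelf scikit-learn classifier. The array shapes, feature names, and labels are illustrative placeholders rather than a prescription for any particular application.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder per-mode feature matrices; in practice these would come from
# the video, audio, and text pipelines described in the next section.
n_samples = 200
rng = np.random.default_rng(0)
video_feats = rng.random((n_samples, 32))   # e.g. blink rate, gaze, head movement
audio_feats = rng.random((n_samples, 24))   # e.g. pitch statistics, MFCC summaries
text_feats = rng.random((n_samples, 16))    # e.g. sentiment scores, keyword weights
labels = rng.integers(0, 2, n_samples)      # e.g. 1 = engaged, 0 = not engaged

# Late fusion: concatenate modes into one higher-dimensional feature vector
# per sample, then let the model learn which combinations matter.
X = np.hstack([video_feats, audio_feats, text_feats])

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, labels, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```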

Leveraging the power of multimodal analysis

The power of a multimodal approach stems from each mode contributing unique information that complements features from the other types. Here, we unpack some of the key insights conveyed by each mode.

Video: Video features tend to be fairly intuitive, focusing on facial expressions and body posture in each frame (e.g. pupil dilation, blink rate, gaze location, eye movements, head orientation, and head movements). Tracking gaze and identifying subtle emotional differences between facial expressions are both possible with trained neural networks. Further, state-of-the-art approaches apply transfer learning from existing face analysis models to predict more complex phenomena, including visible physiological conditions such as strokes, seizures, and other movement disorders of the face.
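As a simplified illustration, the sketch below uses OpenCV’s classic Haar-cascade face detector as a stand-in for the trained neural networks described above, tracking the face position frame by frame and using its displacement as a crude head-movement feature. The input file name is hypothetical, and a production pipeline would substitute landmark- or embedding-based models.

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV; a simple stand-in for learned face models.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("interview.mp4")  # hypothetical input video
face_centers = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]  # first detection; real pipelines would track identity
        face_centers.append((x + w / 2.0, y + h / 2.0))

cap.release()

# Frame-to-frame displacement of the face center as a crude head-movement feature.
centers = np.array(face_centers)
if len(centers) > 1:
    movement = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    print(f"Mean head movement per frame: {movement.mean():.1f} px")
```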

Audio: Frequency information can be extracted from audio recordings of speech or other sounds to measure intonation and inflection. This helps distinguish how people speak, as opposed to simply what they say. From the time and frequency representations of an audio signal, we can also extract features related to phonation (the way in which a speaker produces sounds) and use these to quantify and predict emotions or physiological conditions.
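A minimal sketch of this kind of feature extraction, assuming the librosa library and a hypothetical single-speaker recording: the pYIN pitch track captures intonation, while MFCC summaries give a compact description of phonation.

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical recording

# Pitch track via the pYIN algorithm; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
pitch_mean = np.nanmean(f0)
pitch_std = np.nanstd(f0)  # variability is a rough proxy for intonation

# Mel-frequency cepstral coefficients summarize the spectral envelope,
# which reflects how sounds are produced (phonation).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_summary = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])

print(f"Mean pitch: {pitch_mean:.1f} Hz, pitch std: {pitch_std:.1f} Hz")
print(f"Audio feature vector length: {len(mfcc_summary)}")
```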

Text: Text can be transcribed from video and audio recordings of human speech. Cloud service providers now host a variety of API services with state-of-the-art tools that not only provide transcripts in real time but also additional features such as word timing, transcription confidence, and alternative interpretations. Machine learning models and natural language processing techniques can then be applied on top of the raw text to extract keywords, sentiment, and semantic context, enabling topic modeling and automated summarization.
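The snippet below sketches this downstream text analysis on a transcript that has already been produced (the transcription call itself is omitted, since it depends on the chosen cloud API). NLTK’s VADER analyzer and a TF-IDF vectorizer stand in for the sentiment and keyword models; the transcript segments are invented examples.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("vader_lexicon", quiet=True)

# Illustrative transcript segments, e.g. from a recorded interview.
transcript_segments = [
    "I'm really excited about this new role.",
    "The last project was frustrating, honestly.",
    "Overall I think the team did a great job.",
]

# Sentiment score per transcript segment (compound score in [-1, 1]).
analyzer = SentimentIntensityAnalyzer()
sentiments = [analyzer.polarity_scores(s)["compound"] for s in transcript_segments]

# TF-IDF weights as a simple keyword signal across segments.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(transcript_segments)
top_terms = sorted(zip(vectorizer.get_feature_names_out(), tfidf.sum(axis=0).A1),
                   key=lambda t: -t[1])[:5]

print("Segment sentiment:", sentiments)
print("Top keywords:", [term for term, _ in top_terms])
```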

Combining these modalities enables us to quantify human behaviors in novel ways. In the coming years, progress in real-time video analytics will move us toward more human-like interactions with the machines we use every day. Through advances not only in processing speed but also in our ability to decode natural language and extract context and sentiment, we are laying the foundation for more conversant AI platforms. Additionally, as cloud computing companies continue to devote more resources to healthcare applications, we expect to see more API offerings for audio-based speech analytics, which will be required to drive automation in telemedicine.

Developing a custom, scalable pipeline remains necessary for specialized video analytics applications. While numerous software packages are available, significant barriers stand in the way of adoption: organizations must identify the right software, train or hire qualified developers, carry out the development, deploy the solution, and then maintain it. With full-stack experience building custom video analytics solutions, Mercury Data Science helps companies accelerate application development from ideation to deployment.

The Bottom Line

  • The use of video analytics is poised to bring advances across many disciplines including digital health, recruiting, and sales.
  • Measuring and predicting human behavior remains a challenge, but video analytics are moving us toward better prediction outcomes, often in real time.
  • Many technologies exist to rapidly and easily analyze audio and video content to extract mode-specific features for use within behavioral AI applications.
Published on: June 17, 2020