Paper notes (2017): A review of affective computing: From unimodal analysis to multimodal fusion

Contents
- Abstract
- 1 Introduction
- 2 Affect analysis
- 3 Datasets
  - (1) Sentiment analysis datasets
  - (2) Emotion recognition datasets
- 4 Unimodal features
  - (1) Visual
  - (2) Audio
  - (3) Textual
- 5 Multimodal features
  - (1) Information fusion techniques
  - (2) Recent results

Paper content

A review of affective computing: From unimodal analysis to multimodal fusion¹
The notes below are only the author's record from reading the paper. My knowledge is limited, so corrections are welcome.

Abstract

Affective computing is an emerging interdisciplinary research field bringing together researchers and practitioners from various fields, ranging from artificial intelligence and natural language processing to cognitive and social sciences.
With the proliferation of videos posted online (e.g., on YouTube, Facebook, Twitter) for product reviews, movie reviews, political views, and more, affective computing research has increasingly evolved from conventional unimodal analysis to more complex forms of multimodal analysis.
This is the primary motivation behind our first-of-its-kind, comprehensive literature review of the diverse field of affective computing.
Furthermore, existing literature surveys lack a detailed discussion of the state of the art in multimodal affect analysis frameworks, which this review aims to address.
Multimodality is defined by the presence of more than one modality or channel, e.g., visual, audio, text, gestures, and eye gaze.
In this paper, we focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities.
Following an overview of different techniques for unimodal affect analysis, we outline existing methods for fusing information from different modalities.
As part of this review, we carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis.
A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.

1 Introduction

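At the time, most research focused on multimodal emotion recognition from visual and aural information. Multimodal work on sentiment analysis was scarce and sat almost entirely within natural language processing, even though the growth of social media means people now widely express their opinions through images, video, and audio.
Humans also rely on multimodal information to understand the world. For example, when listening to a speech we infer the speaker's intent from their tone and facial expressions; in a car accident, even without seeing flames, we can tell from the smell of burning rubber that we need to leave the scene.
Open challenges in the field: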
- Continuous data from real noisy sensors may generate incorrect data.
- Identifying whether the extracted audio and utterance refer to the same content.
- Multimodal affect analysis models should be trained on Big Data from diverse contexts, in order to build generalized models.
- Effective modeling of temporal information in the Big Data.
- For real-time analysis of multimodal Big Data, an appropriately scalable Big Data architecture and platform needs to be designed, to effectively cope with the heterogeneous Big Data challenges of growing space and time complexity.

2 Affect analysis

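Affective computing is a set of techniques for recognizing affect from data, across different modalities and at different levels of granularity. For example, sentiment analysis is coarse-grained affect recognition, usually a binary classification task (positive vs. negative), whereas emotion recognition is a fine-grained task that sorts a large set of data into different emotion labels.
Development of research on emotions:
- The philosophical study of emotions dates back to the ancient Greeks and Romans.
- Following the early Stoics, Cicero enumerated emotions and organized them into four basic categories: metus (fear), aegritudo (pain), libido (lust), and laetitia (pleasure).
- In the late 19th century, Darwin initiated research on the evolutionary theory of emotions. He argued that emotions evolved through natural selection and are therefore universal across cultures.
- In the early 1970s, Ekman found evidence that humans share six basic emotions: happiness, sadness, fear, anger, disgust, and surprise.
- Around 2000, a few tentative efforts were also made to detect non-basic affective states such as fatigue, anxiety, satisfaction, confusion, or frustration.
- In 1980, Averill argued that emotions cannot be explained strictly in physiological or cognitive terms. He claimed instead that emotions are primarily social constructs, so a social level of analysis is necessary to truly understand their nature. Constructivists and anthropologists questioned the universality of Ekman's studies on the basis of the relationship between emotion and language, arguing that his work is US-centric because the English emotion labels he used do not exist in some languages (for example, some languages have no word for fear).
- The categorical approaches above struggle to cover the complex emotions of everyday communication, whereas dimensional approaches represent an emotion as coordinates in a multidimensional affective space. An early example is Russell's circumplex model, which uses the two dimensions of arousal and valence to plot 150 affective labels.
- Similarly, Whissell treats emotions as a continuous 2D space whose dimensions are evaluation and activation.
- In 1980, Plutchik proposed the wheel of emotions, a comprehensive theory grounded in evolutionary principles. It consists of 8 basic emotions and 8 advanced emotions, each advanced emotion being composed of 2 basic ones. In this model, the vertical dimension represents intensity and the radial dimension represents the degree of similarity among emotions.
- Beyond 2D spaces there are 3D ones, such as arousal-valence-dominance (called evaluation-activation-power in other literature), as well as spaces of higher dimension. Dimensional representations are easier to process than words and handle changes of emotional state over time better; their drawback is that compound emotional states are hard to represent.
- The Hourglass of Emotions, built on Plutchik's wheel of emotions, overcomes these limitations and is the first explicit attempt to connect sentiment analysis and emotion recognition.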
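
As a minimal sketch of the dimensional approach described in the list above, the snippet below places a few emotion labels at (valence, arousal) coordinates and maps an arbitrary point to its nearest label. The coordinates, the label set, and the helper names `nearest_emotion` and `coarse_sentiment` are illustrative assumptions, not values or methods from Russell's model or from the paper.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; illustrative only,
# not taken from Russell's circumplex model or from the reviewed paper.
EMOTION_COORDS = {
    "happiness": (0.8, 0.5),
    "anger": (-0.6, 0.8),
    "sadness": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}


def nearest_emotion(valence, arousal):
    """Return the label whose coordinates are closest (Euclidean) to the point."""
    return min(
        EMOTION_COORDS,
        key=lambda label: math.dist(EMOTION_COORDS[label], (valence, arousal)),
    )


def coarse_sentiment(valence):
    """Assumed simplification: collapse the valence axis into a binary polarity."""
    return "positive" if valence >= 0 else "negative"
```

For example, `nearest_emotion(0.7, 0.4)` falls closest to the assumed "happiness" point. The second helper illustrates why sentiment analysis can be seen as a coarser view of the same space, under the assumption that polarity follows the sign of valence.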

3 Datasets

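There are two main ways to collect datasets: recording natural videos, and recording videos of scripted acting (subjects perform according to an emotional script). Multimodal frameworks improve over unimodal ones more on datasets built the latter way.
Drawbacks of these methods: building a dataset is time-consuming and the labels can be biased. Morency proposed another collection method: crawl product review videos from popular social websites, then label them with emotion and sentiment.
Both approaches label the data at the utterance level.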

(1) Sentiment analysis datasets

(2) Emotion recognition datasets

4 Unimodal features

(1) Visual

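1.1 Facial action coding system
1.2 Main facial expression recognition techniques
1. Active Appearance Models (AAM): decouple an object's shape and texture using a gradient-based model-fitting approach;
2. Optical flow models: compute, from image gradients, the motion of objects or the motion between two image frames;
3. Active Shape Models (ASM): statistical models that match data or objects in an image in a way consistent with the training data, mainly used to improve automatic analysis of images in noisy or cluttered environments;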
4. 3D Morphable Models (3DMM)
5. Muscle-based models
6. 3D wireframe models
7. Elastic net model
8. Geometry-based shape models
9. 3D Constrained Local Model (CLM-Z)
10. Generalized Adaptive View-based Appearance Model (GAVAM)
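
To make the gradient-based idea behind optical flow (item 2 above) concrete, here is a rough sketch in the style of the Lucas-Kanade method, which is one specific technique chosen for illustration and not necessarily the one used in the works the paper surveys. It estimates a single motion vector for a small synthetic frame pair. The synthetic image and the function names `make_frame` and `estimate_flow` are assumptions made for this example.

```python
import math


def make_frame(width, height, dx=0.0, dy=0.0):
    """Synthetic smooth image (hypothetical test pattern), shifted by (dx, dy) pixels."""
    return [
        [math.sin(0.2 * (x - dx)) + math.sin(0.3 * (y - dy)) for x in range(width)]
        for y in range(height)
    ]


def estimate_flow(frame1, frame2):
    """Estimate one (u, v) motion vector for the whole frame from image gradients."""
    height, width = len(frame1), len(frame1[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences for spatial gradients, frame difference for time.
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0
            it = frame2[y][x] - frame1[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Least-squares solution of ix*u + iy*v + it = 0 over all pixels:
    # [sxx sxy; sxy syy] [u v]^T = -[sxt syt]^T
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients are degenerate; flow is not determined")
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v
```

Calling `estimate_flow(make_frame(20, 20), make_frame(20, 20, 0.2, -0.1))` should recover a vector close to the applied shift. The linearization only holds for small displacements, which is why practical systems typically work on local windows and image pyramids.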
1.3 Methods for extracting temporal features from video
1.4 Body gestures
1.5 Deep learning methods

(2) Audio

Some important audio features:
1. Mel Frequency Cepstral Coefficients (MFCC)
2. Spectral centroid
3. Spectral flux
4. Beat Histogram
5. Beat sum
6. Strongest beat
7. Pause duration
8. Pitch
9. The Perceptual Linear Predictive Coefficients (PLP)
Audio feature extraction toolkit: OpenSMILE
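
Two of the features listed above, spectral centroid and spectral flux, are simple enough to sketch directly. The snippet below uses a naive DFT from the standard library so it stays self-contained. It is not how OpenSMILE computes these features, and definitions of spectral flux vary across toolkits; the version here (Euclidean distance between consecutive magnitude spectra) is one common choice and an assumption of this example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / total


def spectral_flux(prev_frame, frame):
    """Euclidean distance between the magnitude spectra of two consecutive frames."""
    prev_mags = magnitude_spectrum(prev_frame)
    mags = magnitude_spectrum(frame)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(mags, prev_mags)))
```

For a pure 1000 Hz tone sampled at 8000 Hz over 64 samples, the centroid lands at roughly 1000 Hz, and the flux between two identical frames is zero. A real pipeline would window the frames and use an FFT instead of the quadratic-time loop.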
2.1 Local features vs. global features
2.2 Speaker-independent applications
2.3 Deep learning methods

(3) Textual

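So far, text-based sentiment and emotion recognition has relied mainly on rule-based techniques, on bag-of-words (BoW) models that use large sentiment or emotion lexicons, or, assuming large datasets are available, on statistical approaches trained on data annotated with polarity or emotion labels.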
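
The rule-based and lexicon-driven bag-of-words approaches mentioned above can be sketched in a few lines. The tiny lexicon, the negation list, and the single negation rule below are hypothetical and only meant to show the shape of such a system; the lexicons the paper refers to are far larger.

```python
from collections import Counter

# Hypothetical miniature polarity lexicon and negation list; illustrative only.
POLARITY_LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "boring": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = [tok.strip(".,!?;:") for tok in text.lower().split()]
    return [tok for tok in tokens if tok]


def bag_of_words(text):
    """Bag-of-words representation: token counts, ignoring word order."""
    return Counter(tokenize(text))


def polarity(text):
    """Sum lexicon scores, flipping a word that directly follows a negation."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        value = POLARITY_LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value  # simple rule: "not good" counts as negative
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this lexicon, `polarity("The plot was not good.")` is classified as negative because of the negation rule, while a plain bag-of-words count would lose that word-order information.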

5 Multimodal features

(1) Information fusion techniques

Feature-level or early fusion
Decision-level or late fusion
Hybrid multimodal fusion
Model-level fusion
- Rule-based fusion methods
- Classification-based fusion methods
- Estimation-based fusion methods
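
The difference between feature-level (early) and decision-level (late) fusion listed above can be illustrated with a small sketch. The function names, the feature vectors, and the per-modality probabilities are hypothetical. The weighted average used for late fusion is just one simple rule-based scheme, assumed here for illustration, not the paper's recommended method.

```python
def early_fusion(audio_features, visual_features, text_features):
    """Feature-level fusion: concatenate per-modality vectors into one vector."""
    return list(audio_features) + list(visual_features) + list(text_features)


def late_fusion(modality_probs, weights=None):
    """Decision-level fusion: weighted average of per-modality class probabilities.

    modality_probs maps a modality name to a dict of {label: probability}.
    Returns the winning label and the fused probabilities.
    """
    names = list(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assume equal trust by default
    total_weight = sum(weights[name] for name in names)
    labels = list(modality_probs[names[0]])
    fused = {
        label: sum(weights[n] * modality_probs[n][label] for n in names) / total_weight
        for label in labels
    }
    return max(fused, key=fused.get), fused
```

In early fusion a single classifier is trained on the concatenated vector, while in late fusion each modality has its own classifier and only their outputs are combined. Hybrid fusion combines both ideas.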

(2) Recent results

Omitted.

Poria S, Cambria E, Bajpai R, et al. A review of affective computing: From unimodal analysis to multimodal fusion[J]. Information Fusion, 2017, 37: 98-125. ↩︎
