Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics
Human Language Technologies: Tutorial Abstracts, pages 33 - 38
July 10-15, 2022 ©2022 Association for Computational Linguistics
Tutorial on Multimodal Machine Learning
Louis-Philippe Morency, Paul Pu Liang, Amir Zadeh
Carnegie Mellon University
{morency,pliang,abagherz}@cs.cmu.edu
https://cmu-multicomp-lab.github.io/mmml-tutorial/naacl2022/
Abstract
Multimodal machine learning involves integrating and modeling information from multiple heterogeneous and interconnected sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, HCI, and healthcare. This tutorial, building upon a new edition of a survey paper on multimodal ML as well as previously-given tutorials and academic courses, will describe an updated taxonomy of multimodal machine learning, synthesizing its core technical challenges and major directions for future research.
1 Introduction
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some original goals of AI by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.
This tutorial builds upon the annual course on Multimodal Machine Learning taught at Carnegie Mellon University and is a revised version of the previous tutorials on multimodal learning at CVPR 2021, ACL 2017, CVPR 2016, and ICMI 2016. These previous tutorials were based on our earlier survey on multimodal machine learning, which introduced an initial taxonomy for core multimodal challenges (Baltrusaitis et al., 2019). The present tutorial is based on a revamped taxonomy of the core technical challenges and updated concepts about recent work in multimodal machine learning (Liang et al., 2022). The tutorial will be centered around six core challenges in multimodal machine learning:
1. Representation: A first fundamental challenge is to learn representations that exploit cross-modal interactions between individual elements of different modalities. The heterogeneity of multimodal data makes it particularly challenging to learn multimodal representations. We will cover fundamental approaches for (1) representation fusion (integrating information from two or more modalities, effectively reducing the number of separate representations), (2) representation coordination (interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization), and (3) representation fission (creating a new disjoint set of representations, usually a larger number than the input set, that reflects knowledge about internal structure such as data clustering or factorization).
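The distinction between fusion and coordination can be sketched in a few lines of NumPy. This is a minimal illustration, not any specific method from the survey: the embedding dimensions, random projections, and the cosine-based coordination objective are all hypothetical choices standing in for learned encoders and contrastive training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unimodal embeddings (hypothetical sizes, for illustration only).
text_emb = rng.standard_normal((4, 8))    # 4 samples, 8-dim text features
image_emb = rng.standard_normal((4, 16))  # 4 samples, 16-dim image features

# (1) Representation fusion: concatenate, then project to ONE joint vector
# per sample, reducing two representations to a single one.
W_fuse = rng.standard_normal((8 + 16, 12))
fused = np.concatenate([text_emb, image_emb], axis=1) @ W_fuse  # shape (4, 12)

# (2) Representation coordination: keep TWO representations per sample, but
# map both into a shared space and encourage cross-modal similarity
# (a cosine objective, in the spirit of contrastive coordination).
W_text = rng.standard_normal((8, 12))
W_img = rng.standard_normal((16, 12))
zt = text_emb @ W_text   # coordinated text representation, shape (4, 12)
zi = image_emb @ W_img   # coordinated image representation, shape (4, 12)
cos = np.sum(zt * zi, axis=1) / (
    np.linalg.norm(zt, axis=1) * np.linalg.norm(zi, axis=1)
)
coordination_loss = float(np.mean(1.0 - cos))  # minimized when pairs align
```

Fusion yields one joint vector per sample, whereas coordination preserves one representation per modality and instead shapes the space they share; fission (not shown) would go the other way and produce more representations than inputs.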
2. Alignment: A second challenge is to identify the connections between all elements of different modalities using their structure and cross-modal interactions. For example, when analyzing the speech and gestures of a human subject, how can we align specific gestures with spoken words or utterances? Alignment between modalities is challenging since it may exist at different (1) granularities (words, utterances, frames, videos), involve varying (2) correspondences (one-to-one, many-to-many, or no correspondence at all), and depend on long-range (3) dependencies.
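A common starting point for cross-modal alignment is a similarity matrix between the elements of two modalities, from which either soft (many-to-many) or hard (one best match) correspondences can be read off. The sketch below is a toy illustration under assumed dimensions; the shared-space embeddings are random stand-ins for learned encoders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical elements: 5 spoken words and 7 gesture frames, both already
# projected into a shared 6-dim space (sizes chosen for illustration).
words = rng.standard_normal((5, 6))
frames = rng.standard_normal((7, 6))

# Cross-modal similarity matrix: entry (i, j) scores word i against frame j.
sim = words @ frames.T  # shape (5, 7)

# Soft alignment: each word distributes attention over all frames (row-wise
# softmax), which accommodates many-to-many correspondences.
attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Hard alignment: pick the single best-matching frame per word, which forces
# a one-to-one-style correspondence (and cannot express "no correspondence").
hard = sim.argmax(axis=1)  # one frame index per word, shape (5,)
```

The choice between soft and hard alignment mirrors the correspondence subchallenge above: soft attention tolerates many-to-many and weak matches, while argmax-style alignment commits to one match per element even when none truly exists.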
3. Reasoning is defined as composing knowledge from multimodal evidence, usually through multiple inferential steps, to exploit multimodal alignment and problem structure for a specific task. This relationship often follows some hierarchical structure, where more abstract concepts are defined higher in the hierarchy as a function of less abstract concepts. Multimodal reasoning involves the subchallenges of capturing this (1) structure (through domain knowledge or discovered from