Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts, pages 33-38, July 10-15, 2022. ©2022 Association for Computational Linguistics
Tutorial on Multimodal Machine Learning
Louis-Philippe Morency, Paul Pu Liang, Amir Zadeh
Carnegie Mellon University
{morency,pliang,abagherz}@cs.cmu.edu
https://cmu-multicomp-lab.github.io/mmml-tutorial/naacl2022/
Abstract
Multimodal machine learning involves integrating and modeling information from multiple heterogeneous and interconnected sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, HCI, and healthcare. This tutorial, building upon a new edition of a survey paper on multimodal ML as well as previously-given tutorials and academic courses, will describe an updated taxonomy on multimodal machine learning synthesizing its core technical challenges and major directions for future research.
1 Introduction
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some original goals of AI by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.
This tutorial builds upon the annual course on Multimodal Machine Learning taught at Carnegie Mellon University and is a revised version of the previous tutorials on multimodal learning at CVPR 2021, ACL 2017, CVPR 2016, and ICMI 2016. These previous tutorials were based on our earlier survey on multimodal machine learning, which introduced an initial taxonomy for core multimodal challenges (Baltrusaitis et al., 2019). The present tutorial is based on a revamped taxonomy of the core technical challenges and updated concepts about recent work in multimodal machine learning (Liang et al., 2022). The tutorial will be centered around six core challenges in multimodal machine learning:
1. Representation: A first fundamental challenge is to learn representations that exploit cross-modal interactions between individual elements of different modalities. The heterogeneity of multimodal data makes it particularly challenging to learn multimodal representations. We will cover fundamental approaches for (1) representation fusion (integrating information from 2 or more modalities, effectively reducing the number of separate representations), (2) representation coordination (interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization), and (3) representation fission (creating a new disjoint set of representations, usually a larger number than the input set, that reflects knowledge about internal structure such as data clustering or factorization).
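As a rough illustration of representation fusion, the sketch below encodes each modality separately and merges the unimodal embeddings into a single multimodal representation by concatenation followed by a projection. The module name, feature dimensions, and the concatenate-then-project design are assumptions made for this sketch, not the specific fusion architectures covered in the tutorial.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Minimal representation-fusion sketch (hypothetical example):
    encode each modality separately, then reduce the three unimodal
    embeddings to one fused multimodal representation."""

    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35, hidden_dim=128):
        super().__init__()
        # Per-modality encoders; placeholders for real encoders
        # such as transformers or CNNs.
        self.text_enc = nn.Linear(text_dim, hidden_dim)
        self.audio_enc = nn.Linear(audio_dim, hidden_dim)
        self.visual_enc = nn.Linear(visual_dim, hidden_dim)
        # Fusion step: three representations in, one representation out.
        self.fuse = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, text, audio, visual):
        h = torch.cat([
            torch.relu(self.text_enc(text)),
            torch.relu(self.audio_enc(audio)),
            torch.relu(self.visual_enc(visual)),
        ], dim=-1)
        return self.fuse(h)  # single fused multimodal representation


# Example usage with random features for a batch of 4 samples.
fused = LateFusion()(torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35))
print(fused.shape)  # torch.Size([4, 128])
```

By contrast, a coordination approach would keep one representation per modality and tie them together (for example with a contrastive objective), while fission would expand the set of representations beyond the number of input modalities.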
2. Alignment: A second challenge is to identify the connections between all elements of different modalities using their structure and cross-modal interactions. For example, when analyzing the speech and gestures of a human subject, how can we align specific gestures with spoken words or utterances? Alignment between modalities is challenging since it may exist at different (1) granularities (words, utterances, frames, videos), involve varying (2) correspondences (one-to-one, many-to-many, or no correspondence at all), and depend on long-range (3) dependencies.
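To make element-level alignment concrete, here is a minimal sketch that scores every (word, frame) pair and normalizes over frames, producing a soft correspondence matrix between spoken words and video frames. The function name, embedding dimensions, and the dot-product-plus-softmax formulation are assumptions of this sketch rather than a method prescribed by the tutorial.

```python
import torch
import torch.nn.functional as F

def soft_alignment(words, frames, dim=64):
    """Sketch of cross-modal alignment (hypothetical example):
    score every (word, frame) pair and normalize over frames,
    giving each word a distribution over the frames it most
    likely corresponds to.

    words:  (num_words, dim)   e.g. word embeddings
    frames: (num_frames, dim)  e.g. gesture/video-frame embeddings
    """
    scores = words @ frames.T / dim ** 0.5   # (num_words, num_frames)
    return F.softmax(scores, dim=-1)         # soft correspondences per word


# Example: 7 spoken words aligned against 40 video frames.
alignment = soft_alignment(torch.randn(7, 64), torch.randn(40, 64))
print(alignment.shape)  # torch.Size([7, 40]); each row sums to 1
```

Each row of the resulting matrix can be read as a soft, possibly many-to-many correspondence from one word to the video frames, illustrating the granularity and correspondence sub-challenges described above.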
3. Reasoning is defined as composing knowledge from multimodal evidence, usually through multiple inferential steps, to exploit multimodal alignment and problem structure for a specific task. This relationship often follows some hierarchical structure, where more abstract concepts are defined higher in the hierarchy as a function of less abstract concepts. Multimodal reasoning involves the subchallenges of capturing this (1) structure (through domain knowledge or discovered from