Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics
Human Language Technologies: Tutorial Abstracts, pages 33 - 38
July 10-15, 2022 ©2022 Association for Computational Linguistics
Tutorial on Multimodal Machine Learning
Louis-Philippe Morency, Paul Pu Liang, Amir Zadeh
Carnegie Mellon University
{morency,pliang,abagherz}@cs.cmu.edu
https://cmu-multicomp-lab.github.io/mmml-tutorial/naacl2022/
Abstract
Multimodal machine learning involves integrating and modeling information from multiple heterogeneous and interconnected sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, HCI, and healthcare. This tutorial, building upon a new edition of a survey paper on multimodal ML as well as previously-given tutorials and academic courses, will describe an updated taxonomy of multimodal machine learning, synthesizing its core technical challenges and major directions for future research.
1 Introduction
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some original goals of AI by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.
This tutorial builds upon the annual course on Multimodal Machine Learning taught at Carnegie Mellon University and is a revised version of the previous tutorials on multimodal learning at CVPR 2021, ACL 2017, CVPR 2016, and ICMI 2016. These previous tutorials were based on our earlier survey on multimodal machine learning, which introduced an initial taxonomy for core multimodal challenges (Baltrusaitis et al., 2019). The present tutorial is based on a revamped taxonomy of the core technical challenges and updated concepts about recent work in multimodal machine learning (Liang et al., 2022). The tutorial will be centered around six core challenges in multimodal machine learning:
1. Representation: A first fundamental challenge is to learn representations that exploit cross-modal interactions between individual elements of different modalities. The heterogeneity of multimodal data makes it particularly challenging to learn multimodal representations. We will cover fundamental approaches for (1) representation fusion (integrating information from two or more modalities, effectively reducing the number of separate representations), (2) representation coordination (interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization), and (3) representation fission (creating a new disjoint set of representations, usually a larger number than the input set, that reflects knowledge about internal structure such as data clustering or factorization).
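The distinction between fusion and coordination can be sketched in a few lines of NumPy. This is a minimal illustration, not any specific method from the survey: the embedding dimensions, random projections, and the cosine-based coordination objective are all hypothetical choices standing in for learned encoders and contrastive training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unimodal embeddings (hypothetical sizes, for illustration only).
text_emb = rng.standard_normal((4, 8))    # 4 samples, 8-dim text features
image_emb = rng.standard_normal((4, 16))  # 4 samples, 16-dim image features

# (1) Representation fusion: concatenate, then project to ONE joint vector
# per sample, reducing two representations to a single one.
W_fuse = rng.standard_normal((8 + 16, 12))
fused = np.concatenate([text_emb, image_emb], axis=1) @ W_fuse  # shape (4, 12)

# (2) Representation coordination: keep TWO representations per sample, but
# map both into a shared space and encourage cross-modal similarity
# (a cosine objective, in the spirit of contrastive coordination).
W_text = rng.standard_normal((8, 12))
W_img = rng.standard_normal((16, 12))
zt = text_emb @ W_text   # coordinated text representation, shape (4, 12)
zi = image_emb @ W_img   # coordinated image representation, shape (4, 12)
cos = np.sum(zt * zi, axis=1) / (
    np.linalg.norm(zt, axis=1) * np.linalg.norm(zi, axis=1)
)
coordination_loss = float(np.mean(1.0 - cos))  # minimized when pairs align
```

Fusion yields one joint vector per sample, whereas coordination preserves one representation per modality and instead shapes the space they share; fission (not shown) would go the other way and produce more representations than inputs.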
2. Alignment: A second challenge is to identify the connections between all elements of different modalities using their structure and cross-modal interactions. For example, when analyzing the speech and gestures of a human subject, how can we align specific gestures with spoken words or utterances? Alignment between modalities is challenging since it may exist at different (1) granularities (words, utterances, frames, videos), involve varying (2) correspondences (one-to-one, many-to-many, or no correspondence at all), and depend on long-range (3) dependencies.
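A common starting point for cross-modal alignment is a similarity matrix between the elements of two modalities, from which either soft (many-to-many) or hard (one best match) correspondences can be read off. The sketch below is a toy illustration under assumed dimensions; the shared-space embeddings are random stand-ins for learned encoders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical elements: 5 spoken words and 7 gesture frames, both already
# projected into a shared 6-dim space (sizes chosen for illustration).
words = rng.standard_normal((5, 6))
frames = rng.standard_normal((7, 6))

# Cross-modal similarity matrix: entry (i, j) scores word i against frame j.
sim = words @ frames.T  # shape (5, 7)

# Soft alignment: each word distributes attention over all frames (row-wise
# softmax), which accommodates many-to-many correspondences.
attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Hard alignment: pick the single best-matching frame per word, which forces
# a one-to-one-style correspondence (and cannot express "no correspondence").
hard = sim.argmax(axis=1)  # one frame index per word, shape (5,)
```

The choice between soft and hard alignment mirrors the correspondence subchallenge above: soft attention tolerates many-to-many and weak matches, while argmax-style alignment commits to one match per element even when none truly exists.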
3. Reasoning is defined as composing knowledge from multimodal evidence, usually through multiple inferential steps, to exploit multimodal alignment and problem structure for a specific task. This relationship often follows some hierarchical structure, where more abstract concepts are defined higher in the hierarchy as a function of less abstract concepts. Multimodal reasoning involves the subchallenges of capturing this (1) structure (through domain knowledge or discovered from