Citation: Bhowmik, K.; Ralescu, A. Clustering of Monolingual Embedding Spaces. Digital 2023, 3, 48–66. https://doi.org/10.3390/digital3010004
Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis
Received: 16 January 2023
Revised: 13 February 2023
Accepted: 14 February 2023
Published: 23 February 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
Clustering of Monolingual Embedding Spaces
Kowshik Bhowmik 1,* and Anca Ralescu 2
1 The College of Wooster, Mathematical and Computational Sciences, Wooster, OH 44691, USA
2 Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA
* Correspondence: kbhowmik@wooster.edu
Abstract:
Suboptimal performance of cross-lingual word embeddings for distant and low-resource
languages calls into question the isomorphic assumption integral to the mapping-based methods
of obtaining such embeddings. This paper investigates the comparative impact of typological
relationship and corpus size on the isomorphism between monolingual embedding spaces. To that
end, two clustering algorithms were applied to three sets of pairwise degrees of isomorphism. The
paper also aims to determine the combination of isomorphism measure and clustering
algorithm that best captures the typological relationships among the chosen set of languages. Of the
three measures investigated, Relational Similarity seemed to capture best the typological information
of the languages encoded in their respective embedding spaces. These language clusters can help us
identify, without any pre-existing knowledge of the real-world linguistic relationships
within a group of languages, the higher-resource languages related to a given low-resource language.
Including such related languages in a cross-lingual embedding space can help improve the performance
of the low-resource languages in that space.
Keywords:
cross-lingual word embeddings; low-resource languages; bilingual lexicon induction;
degree of isomorphism
1. Introduction
Mapping-based methods of inducing cross-lingual word embeddings are based on
the assumption that semantic concepts are language-independent [1]. This assumption
led to learning an orthogonal or isomorphic map from one monolingual embedding space
to another using known translation word pairs between two languages. A cross-lingual
embedding space induced in this manner can enable tasks such as Bilingual Lexicon Induction
and Machine Translation and can also support the transfer of knowledge from
one language to another. However, reported results on such tasks are suboptimal
for low-resource languages, which are also often
typologically distant from English and other resource-rich European languages [2]. These
results show the weakness of the isomorphic assumption. Researchers have since proposed
several explanations for the varying degrees of isomorphism between independently
trained monolingual embedding spaces, notable among which are typological differences
among the languages in question and the comparative resources on which the word embeddings
were trained [3]. To investigate the comparative impact of these two factors, this
research employs two clustering algorithms, Hierarchical and Fuzzy C-Means, on pairwise
similarity/distance values computed among the chosen set of languages. The languages
are diverse both in terms of the language families they belong to and the amount of available
resources. Three measures of isomorphism reported in the literature were utilized:
Eigensimilarity [1], Gromov–Hausdorff distance [4], and Relational Similarity [3]. Another
aim of this research was to determine the combination of the measure of isomorphism
and clustering algorithm that best aligns with our existing knowledge of the language
families. This was done with a view to identifying the measure of isomorphism
that can substitute for a measure of linguistic similarity among a group of languages.
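
As an illustration of clustering languages from pairwise distance values, the following is a minimal sketch of agglomerative (hierarchical) clustering in plain Python. It is not the implementation used in this work: the average-linkage criterion, the function name `average_linkage`, the placeholder labels `lang_a` to `lang_e`, and every distance value are assumptions chosen only to make the example small and checkable. In practice the distances would come from one of the isomorphism measures named above.

```python
from itertools import combinations

# Hypothetical pairwise distances between five placeholder "languages".
# These numbers are invented for illustration and are not measured results.
RAW = {
    ("lang_a", "lang_b"): 0.10, ("lang_a", "lang_c"): 0.15,
    ("lang_b", "lang_c"): 0.12, ("lang_d", "lang_e"): 0.20,
    ("lang_a", "lang_d"): 0.80, ("lang_a", "lang_e"): 0.85,
    ("lang_b", "lang_d"): 0.75, ("lang_b", "lang_e"): 0.90,
    ("lang_c", "lang_d"): 0.70, ("lang_c", "lang_e"): 0.78,
}
LABELS = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e"]
# Key each distance by an unordered pair so lookups are symmetric.
DIST = {frozenset(pair): d for pair, d in RAW.items()}


def average_linkage(labels, dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the mean of all cross-cluster
    pairwise distances (average linkage, an assumed choice).
    """
    clusters = [frozenset([name]) for name in labels]

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the pair of current clusters with the smallest linkage.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    # Return a deterministic, readable form: sorted lists of names.
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    print(average_linkage(LABELS, DIST, 2))
```

With these toy distances, requesting two clusters groups `lang_a`, `lang_b`, and `lang_c` together and leaves `lang_d` and `lang_e` as the second group. Comparing such groups against known language families is the kind of check described above.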