Citation: Bhowmik, K.; Ralescu, A. Clustering of Monolingual Embedding Spaces. Digital 2023, 3, 48–66. https://doi.org/10.3390/digital3010004

Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis

Received: 16 January 2023

Revised: 13 February 2023

Accepted: 14 February 2023

Published: 23 February 2023

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article

Clustering of Monolingual Embedding Spaces

Kowshik Bhowmik * and Anca Ralescu

The College of Wooster, Mathematical and Computational Sciences, Wooster, OH 44691, USA

Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA

* Correspondence: kbhowmik@wooster.edu

Abstract: Suboptimal performance of cross-lingual word embeddings for distant and low-resource languages calls into question the isomorphic assumption integral to the mapping-based methods of obtaining such embeddings. This paper investigates the comparative impact of typological relationship and corpus size on the isomorphism between monolingual embedding spaces. To that end, two clustering algorithms were applied to three sets of pairwise degrees of isomorphism. A further goal of the paper is to determine the combination of isomorphism measure and clustering algorithm that best captures the typological relationship among the chosen set of languages. Of the three measures investigated, Relational Similarity seemed to best capture the typological information of the languages encoded in their respective embedding spaces. These language clusters can help us identify, without any pre-existing knowledge of the real-world linguistic relationships shared among a group of languages, the related higher-resource languages of low-resource languages. The presence of such related languages in a cross-lingual embedding space can help improve the performance of the low-resource languages in that space.

Keywords: cross-lingual word embeddings; low-resource languages; bilingual lexicon induction; degree of isomorphism

1. Introduction

Mapping-based methods of inducing cross-lingual word embeddings are based on the assumption that semantic concepts are language-independent. This assumption led to learning an orthogonal or isomorphic map from one monolingual embedding space to another using known translation word pairs between two languages. A cross-lingual embedding space induced in this manner can enable tasks such as Bilingual Lexicon Induction and Machine Translation and can also support the transfer of knowledge from one language to another. However, reported results on such tasks are suboptimal for low-resource languages, which are also often typologically distant from English and other resource-rich European languages. These results expose the weakness of the isomorphic assumption. Researchers have since proposed several explanations for the varying degrees of isomorphism between independently trained monolingual embedding spaces, notable among which are typological differences among the languages in question and the comparative resources on which the word embeddings were trained. To investigate the comparative impact of these two factors, this research applies two clustering algorithms, Hierarchical and Fuzzy C-Means, to pairwise similarity/distance values computed among the chosen set of languages. The languages are diverse both in terms of the language families they belong to and the amount of available resources. Three measures of isomorphism reported in the literature were utilized: Eigensimilarity, Gromov–Hausdorff distance, and Relational Similarity. Another aim of this research was to determine the combination of isomorphism measure and clustering algorithm that best aligns with our existing knowledge of the language families. This was done with a view to finding the measure of isomorphism that can substitute for a measure of linguistic similarity among a group of languages.
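To make the clustering step concrete, the following is a minimal sketch of agglomerative (hierarchical) clustering applied to a matrix of pairwise distances between languages. It is not the authors' implementation: the average-linkage criterion, the function name `average_linkage`, the placeholder language labels, and the toy distance values are all assumptions chosen for illustration, and none of the numbers come from the paper.

```python
from itertools import combinations

def average_linkage(dist, labels, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Cluster-to-cluster distance is the mean of all cross-pair distances
    (average linkage, an assumption made for this sketch).
    """
    clusters = [[name] for name in labels]

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of cluster indices with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical languages from two made-up families, A and B.
labels = ["A1", "A2", "B1", "B2"]

# Toy pairwise distances (illustrative only, not results from the paper).
toy_distances = {
    frozenset(("A1", "A2")): 0.10,
    frozenset(("B1", "B2")): 0.20,
    frozenset(("A1", "B1")): 0.90,
    frozenset(("A1", "B2")): 0.80,
    frozenset(("A2", "B1")): 0.85,
    frozenset(("A2", "B2")): 0.95,
}

clusters = average_linkage(toy_distances, labels, n_clusters=2)
print(clusters)
```

With these toy values, the two within-family pairs are far closer to each other than any cross-family pair, so the procedure groups each family together. In practice the distance matrix would hold one of the isomorphism measures computed between every pair of monolingual embedding spaces.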
