Citation: Bhowmik, K.; Ralescu, A. Clustering of Monolingual Embedding Spaces. Digital 2023, 3, 48–66. https://doi.org/10.3390/digital3010004
Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis
Received: 16 January 2023
Revised: 13 February 2023
Accepted: 14 February 2023
Published: 23 February 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article
distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).
Article
Clustering of Monolingual Embedding Spaces
Kowshik Bhowmik 1,* and Anca Ralescu 2
1 The College of Wooster, Mathematical and Computational Sciences, Wooster, OH 44691, USA
2 Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA
* Correspondence: kbhowmik@wooster.edu
Abstract: Suboptimal performance of cross-lingual word embeddings for distant and low-resource
languages calls into question the isomorphic assumption integral to the mapping-based methods
of obtaining such embeddings. This paper investigates the comparative impact of typological
relationship and corpus size on the isomorphism between monolingual embedding spaces. To that
end, two clustering algorithms were applied to three sets of pairwise degrees of isomorphism. The
paper also aims to determine the combination of isomorphism measure and clustering
algorithm that best captures the typological relationships among the chosen set of languages. Of the
three measures investigated, Relational Similarity seemed to best capture the typological information
of the languages encoded in their respective embedding spaces. These language clusters can help us
identify, without any pre-existing knowledge about the real-world linguistic relationships shared
among a group of languages, the higher-resource languages related to a given low-resource language.
The presence of such related languages in a shared cross-lingual embedding space can help improve
the performance of low-resource languages.
Keywords: cross-lingual word embeddings; low-resource languages; bilingual lexicon induction;
degree of isomorphism
1. Introduction
Mapping-based methods of inducing cross-lingual word embeddings are based on
the assumption that semantic concepts are language-independent [1]. This assumption
led to learning an orthogonal or isomorphic map from one monolingual embedding space
to another using known translation word pairs between the two languages. A cross-lingual
embedding space induced in this manner can enable tasks such as Bilingual Lexicon
Induction and Machine Translation and can also support the transfer of knowledge from
one language to another. However, the reported results on such tasks are suboptimal
for low-resource languages, which are also often typologically distant from English
and other resource-rich European languages [2]. These results show the weakness of the
isomorphic assumption. Consequently, researchers have proposed several explanations for
the varying degrees of isomorphism between independently trained monolingual embedding
spaces, notable among which are typological differences among the languages in question
and the comparative resources on which the word embeddings were trained [3]. To
investigate the comparative impact of these two factors, this research applies two
clustering algorithms, Hierarchical and Fuzzy C-Means, to pairwise similarity/distance
values computed among the chosen set of languages. The languages are diverse both in
terms of the language families they belong to and the amount of available resources.
Three measures of isomorphism reported in the literature were utilized:
Eigensimilarity [1], Gromov–Hausdorff distance [4], and Relational Similarity [3]. Another
aim of this research was to determine the combination of isomorphism measure
and clustering algorithm that best aligns with existing knowledge of the language
families. This was done with a view to identifying the measure of isomorphism
that can substitute for a measure of linguistic similarity among a group of languages. The