Citation: Li, H.; Liu, B. An Open Relation Extraction System for Web Text Information.
Appl. Sci. 2022, 12, 5718. https://doi.org/10.3390/app12115718
Academic Editors: Katia Lida Kermanidis, Phivos Mylonas and Manolis Maragoudakis
Received: 6 May 2022
Accepted: 31 May 2022
Published: 4 June 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open
access article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
An Open Relation Extraction System for Web Text Information
Huagang Li and Bo Liu *
College of Computer Science and Technology, National University of Defense Technology,
Changsha 410073, China; lihuagang21@163.com
* Correspondence: kyle.liu@nudt.edu.cn
Abstract:
Web texts typically undergo the open-ended growth of new relations. Traditional relation
extraction methods lack automatic annotation and perform poorly on new relation extraction tasks.
We propose an open-domain relation extraction system (ORES) based on distant supervision and
few-shot learning to solve this problem. More specifically, we utilize tBERT to design instance selector
1, implementing automatic labeling in the data mining component. Meanwhile, we design example
selector 2 based on K-BERT in the new relation extraction component. The real-time data management
component outputs new relational data. Experiments show that ORES selects higher-quality
and more diverse instances, which improves the learning of new relations. It achieves a
significant improvement over Neural Snowball while using fewer seed sentences.
Keywords: open relation extraction; few-shot learning; knowledge extraction; tBERT; K-BERT
1. Introduction
Information and knowledge are the basis for the development of human society. Text
records 80 percent of the information of human civilization
(https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/,
accessed on 5 May 2022). The core task of information extraction (IE) is to obtain structured
triples from unstructured text. It relies on two fundamental tasks: entity recognition and
relation extraction (RE). Li et al. [1] proposed an entity recognition method that performs
well. For relation extraction, predicting new relations remains a challenge. Traditional
relation extraction mainly adopts supervised learning methods for predefined relations; in
essence, it reduces relation extraction to relation classification. There are two paradigms:
pipeline relation extraction [2] and joint relation extraction [3]. Traditional RE performs
well but faces two challenges. First, classifiers built on predefined relation types do not
work well on new relation extraction tasks. Second, relational data relies heavily on manual
cleaning and labeling, which is costly; for large-scale knowledge bases such as Wikidata,
manual annotation is difficult to accomplish.
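The closed-set limitation described above can be illustrated with a minimal sketch. This is
not the method of this paper: the `Triple` type, the `PREDEFINED_RELATIONS` table, and the
`classify_relation` function are hypothetical names introduced only for illustration, and a
simple keyword lookup stands in for a trained supervised classifier. The sketch assumes the
entity pair is already known and that the head entity appears before the tail entity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """A structured fact (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


# Hypothetical closed label set: textual cue -> predefined relation label.
PREDEFINED_RELATIONS = {
    "born in": "place_of_birth",
    "founded": "founder_of",
}


def classify_relation(sentence: str, head: str, tail: str) -> Optional[Triple]:
    """Toy closed-set relation classifier (assumption: keyword lookup
    replaces a trained model). Returns None when no predefined relation
    matches, which is what happens for any new relation type."""
    head_start = sentence.find(head)
    tail_start = sentence.find(tail)
    if head_start == -1 or tail_start == -1 or tail_start < head_start:
        return None
    # Only the text between the two entities is inspected.
    context = sentence[head_start + len(head):tail_start].lower()
    for cue, label in PREDEFINED_RELATIONS.items():
        if cue in context:
            return Triple(head, label, tail)
    return None


known = classify_relation("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw")
unseen = classify_relation("Marie Curie discovered polonium.", "Marie Curie", "polonium")
print(known)   # a Triple with relation "place_of_birth"
print(unseen)  # None: "discovered" is not in the predefined label set
```

The second call returns no triple even though the sentence clearly expresses a relation,
because the label set was fixed in advance. Covering the new relation would require new
labeled data, which is the annotation cost noted above.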
To solve these problems, Banko [4] first proposed the concept of open information extraction,
that is, extracting structured relational facts from open and growing unstructured text.
Information extraction should not be limited to a small set of known relations; RE should be
able to extract a wide variety of relations from text. In this setting, the entity pair of a
relation is known, while the relation type between the two entities is unrestricted.
Open-domain relation extraction should meet three academic requirements: automation,
non-homologous corpus, and high efficiency.
• Automation
The open relation extraction system should execute automatically, and the algorithm should
need only a single pass over the corpus to extract triples. It should be based on an
unsupervised extraction strategy and must not depend on predefined relations. In addition,
the cost of manually constructing training samples should be small: only a few initialization
seeds need to be labeled, or a small number of extraction templates need to be defined.