Citation: Li, H.; Liu, B. An Open Relation Extraction System for Web Text Information. Appl. Sci. 2022, 12, 5718. https://doi.org/10.3390/app12115718
Academic Editors: Katia Lida Kermanidis, Phivos Mylonas and Manolis Maragoudakis
Received: 6 May 2022
Accepted: 31 May 2022
Published: 4 June 2022
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Applied Sciences
Article
An Open Relation Extraction System for Web Text Information
Huagang Li and Bo Liu *
College of Computer Science and Technology, National University of Defense Technology,
Changsha 410073, China; lihuagang21@163.com
* Correspondence: kyle.liu@nudt.edu.cn
Abstract:
The set of relations expressed in Web text grows in an open-ended way. Traditional relation
extraction methods lack automatic annotation and perform poorly on new relation extraction tasks.
We propose an open-domain relation extraction system (ORES) based on distant supervision and
few-shot learning to solve this problem. More specifically, we utilize tBERT to design instance selector
1, implementing automatic labeling in the data mining component. Meanwhile, we design example
selector 2 based on K-BERT in the new relation extraction component. The real-time data management
component outputs new relational data. Experiments show that ORES can select higher-quality
and more diverse instances for better new relation learning. It achieves significant improvement
over Neural Snowball while using fewer seed sentences.
Keywords: open relation extraction; few-shot learning; knowledge extraction; tBERT; K-BERT
1. Introduction
Information and knowledge are the basis for the development of human society. Text
records 80 percent of the information of human civilization
(https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/,
accessed on 5 May 2022).
The core task of information extraction (IE) is to obtain structured triples from unstructured
text. It relies on two fundamental tasks: entity recognition and relation extraction (RE).
Li et al. [1] proposed an entity recognition method that performs well. For relation
extraction, predicting new relations is a challenge. Traditional relation extraction mainly
adopts supervised learning methods for predefined relations. In essence, it transforms
relation extraction into relation classification. There are two paradigms: pipeline relation
extraction [2] and joint relation extraction [3]. Traditional RE performs well but faces two
challenges. First, classifiers built on predefined relations do not work well on
new relation extraction tasks. Second, relational data relies too heavily
on manual cleaning and labeling, which is costly. For large-scale knowledge
bases such as Wikidata, manual annotation is particularly difficult to accomplish.
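The first challenge can be illustrated with a minimal sketch. It is not the authors' method: the label set, the cue words, and the `classify_relation` helper are all hypothetical, and the cue-word scorer merely stands in for a trained classifier. The point is only that a classifier over predefined relations must answer with one of its known labels, even when the sentence expresses a new relation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A structured (head entity, relation, tail entity) fact."""
    head: str
    relation: str
    tail: str


# Hypothetical closed label set and cue words (assumptions for illustration).
PREDEFINED_RELATIONS = ("founder_of", "born_in")
CUES = {"founder_of": ("founded",), "born_in": ("born in",)}


def classify_relation(sentence: str, head: str, tail: str) -> Triple:
    """Toy stand-in for a supervised relation classifier.

    It scores each predefined relation by counting cue words and returns
    the best-scoring label, so the output is always a predefined relation.
    """
    text = sentence.lower()
    scores = {
        rel: sum(text.count(cue) for cue in CUES[rel])
        for rel in PREDEFINED_RELATIONS
    }
    best = max(PREDEFINED_RELATIONS, key=lambda rel: scores[rel])
    return Triple(head, best, tail)


known = classify_relation("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw")
unseen = classify_relation("Paris is the capital of France.", "Paris", "France")
print(known)   # relation is "born_in"
print(unseen)  # forced into a predefined label although the relation is new
```

In this sketch the sentence about Paris is assigned a predefined label even though "capital of" is not in the label set, which is the failure mode that motivates open relation extraction.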
To address these challenges, Banko [4] first proposed the concept of open information extraction,
that is, extracting structured relational facts from open and growing unstructured
text. Information extraction should not be limited to a small set of known relations; RE
should be able to extract a wide variety of relations from text. The research scope assumes that
the entity pair of a relation is known, while the relation type between the entities
is unrestricted. Open-domain relation extraction should meet three academic requirements:
automation, non-homologous corpus, and high efficiency.
Automation
The open relation extraction system should execute automatically, and the algorithm should
need only a single pass over the corpus to extract triples. It should be based on
an unsupervised extraction strategy and must not rely on predefined relations. In addition,
the cost of manually constructing training samples should be small: only a few
initialization seeds need to be labeled, or a few extraction templates need
to be defined.
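The automation requirement can be sketched as follows. This is only an illustration under stated assumptions, not the ORES pipeline (which relies on distant supervision and few-shot learning): a single hand-written template, here a hypothetical regular expression named `TEMPLATE`, is applied in one pass over the corpus, and the relation phrase is taken from the text itself rather than from a predefined label set.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical extraction template (an assumption for illustration):
# a capitalized span, a run of lowercase words, and another capitalized span.
TEMPLATE = re.compile(
    r"(?P<head>[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"(?P<rel>[a-z]+(?: [a-z]+)*) "
    r"(?P<tail>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def extract_triples(corpus: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (head, relation phrase, tail) triples in a single corpus pass.

    Each sentence is read exactly once, and the relation phrase is whatever
    the text contains, so no relation inventory is fixed in advance.
    """
    for sentence in corpus:
        for match in TEMPLATE.finditer(sentence):
            yield (match.group("head"), match.group("rel"), match.group("tail"))


corpus = [
    "Paris is the capital of France.",
    "Bill Gates founded Microsoft.",
]
triples = list(extract_triples(corpus))
print(triples)
```

A surface pattern like this is brittle on real Web text; it is used here only because it makes the single-pass, low-annotation-cost property easy to see.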