Applied Sciences
Article
Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing
Ye Seul Bae 1, Kyung Hwan Kim 2,3,*, Han Kyul Kim 1, Sae Won Choi 1, Taehoon Ko 4, Hee Hwa Seo 1, Hae-Young Lee 5 and Hyojin Jeon 1

 
Citation: Bae, Y.S.; Kim, K.H.; Kim, H.K.; Choi, S.W.; Ko, T.; Seo, H.H.; Lee, H.-Y.; Jeon, H. Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing. Appl. Sci. 2021, 11, 8812. https://doi.org/10.3390/app11198812
Academic Editor: Keun Ho Ryu
Received: 23 August 2021
Accepted: 17 September 2021
Published: 22 September 2021
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea; byeye1313@gmail.com (Y.S.B.); hank110@snu.ac.kr (H.K.K.); swc1@snu.ac.kr (S.W.C.); heehwaseo@gmail.com (H.H.S.); tarahjjeon@naver.com (H.J.)
2 Department of Thoracic & Cardiovascular Surgery, Seoul National University Hospital, Seoul 03080, Korea
3 Department of Thoracic & Cardiovascular Surgery, College of Medicine, Seoul National University, Seoul 03080, Korea
4 Department of Medical Informatics, The Catholic University of Korea, Seoul 06591, Korea; taehoonko@snu.ac.kr
5 Department of Internal Medicine, Seoul National University Hospital, Seoul 03080, Korea; hylee612@snu.ac.kr
* Correspondence: kkh726@snu.ac.kr
Featured Application: This study presents an improved and readily obtainable method for automatically classifying smoking status from unstructured bilingual electronic health records.
Abstract: Smoking is an important variable for clinical research, but few studies have addressed automatically obtaining smoking classification from unstructured bilingual electronic health records (EHRs). We aim to develop an algorithm that classifies smoking status from unstructured EHRs using natural language processing (NLP). Using acronym replacement and the Python package Soynlp, we normalize 4711 bilingual clinical notes. Each EHR note was classified into one of four categories: current smoker, past smoker, never smoker, or unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) is used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used to classify the four smoking statuses in our bilingual EHRs. Given an identical SVM classifier, the F1 score improves by as much as 1.8% compared to those of the unigram and bigram Bag of Words. Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired for clinical practice and research.
Keywords: smoking; natural language processing; electronic health records; document classification; lifestyle modification
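The keyword extraction step summarized in the abstract (SPPMI word vectors compared by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy notes, the function names, the use of whole-note co-occurrence as the context window, and the shift value `shift_k` are all assumptions made for illustration. Here SPPMI(w, c) = max(PMI(w, c) - log k, 0), so k = 1 reduces to plain positive PMI.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical, already-tokenized bilingual notes (not data from the study).
NOTES = [
    ["current", "smoker", "pack"],
    ["current", "흡연", "pack"],
    ["never", "nonsmoker", "denies"],
    ["never", "비흡연", "denies"],
]


def sppmi_vectors(docs, shift_k=1.0):
    """Build sparse SPPMI vectors from note-level co-occurrence counts.

    Assumption: two words co-occur if they appear in the same note.
    """
    vocab = set()
    pair_counts = Counter()
    for tokens in docs:
        unique = sorted(set(tokens))
        vocab.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1  # keep the counts symmetric
            pair_counts[(b, a)] += 1

    total_pairs = sum(pair_counts.values())
    row_totals = Counter()
    for (a, _), n in pair_counts.items():
        row_totals[a] += n

    vectors = {word: {} for word in vocab}
    for (a, b), n in pair_counts.items():
        pmi = math.log(n * total_pairs / (row_totals[a] * row_totals[b]))
        sppmi = pmi - math.log(shift_k)
        if sppmi > 0:  # keep only positive entries (the "P" in SPPMI)
            vectors[a][b] = sppmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_keywords(vectors, seed, top_n=3):
    """Rank all other words by cosine similarity to the seed word."""
    scores = [(w, cosine(vectors[seed], vec))
              for w, vec in vectors.items() if w != seed]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]
```

Calling `similar_keywords(sppmi_vectors(NOTES), "smoker")` ranks the Korean token "흡연" first, because it shares exactly the same contexts as "smoker" in the toy notes. This is the sense in which words denoting the same smoking status end up close together, even across languages.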
1. Introduction
Smoking is a major risk factor in developing coronary artery disease, chronic kidney disease, cancer, and cardiovascular disease (CVD) [1,2]. It is also considered a modifiable risk factor for CVDs and other conditions associated with premature death worldwide [3–6]. Consequently, smoking status can be used to assess the risk of certain diseases and to suggest first-line interventions based on clinical guidelines.
Despite the effectiveness and importance of smoking cessation for disease prevention, smoking information is under-utilized and not easily measured. It is often buried in a narrative text rather than in a consistent coded form. The rapid adoption of electronic health