Session 35: Text Analytics: You Need More than NLP
Eric Just
Senior Vice President, Health Catalyst

Learning Objectives
• Why text search is an important part of clinical text analytics
• The fundamentals of how search works
• How clinical text search can be refined with natural language processing (NLP) and other techniques

Poll Question #1
For my organization, I see text analytics as:
a) Completely unnecessary for analytics
b) A “nice to have” for analytics
c) Very important for a few key areas of analytics
d) Mission critical across nearly all areas of analytics
e) Unsure or not applicable

High-Risk Population: Peripheral Arterial Disease (PAD)
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
For organizations trying to understand their risk, not being able to find high-risk patients is a problem.
Patients identified:
• ICD/CPT: N=9,592
• Natural Language Processing (NLP): N=41,741
Terms shown: Peripheral artery disease, Claudication, Rest pain, Ischemic limb
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.

Analytics has a problem
• Most organizations ignore text analytics because it is expensive and difficult
• Up to 80% of clinical data is stored in text
• Most text analytics requires advanced technical skillsets

Typical Scenario
“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”
1. Data scientist develops PAD text mining algorithm
2. Algorithm validated
3. The patient cohort is defined
4. Results returned to investigator

Better Scenario
“As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes.”
1. Data scientist develops PAD algorithm
2. Algorithm validated
3. PAD algorithm run nightly and stored in data warehouse
4. PAD algorithm output combined with coded data to create PAD registry

How to Best Leverage Text Analytics
• PAD Algorithm
• Diabetes Algorithm
• Ejection Fraction Algorithm
• PreDiabetes Algorithm
• CHF Algorithm
• Hypertension
• Breast Cancer

Poll Question #2
I see text analytics being most important in the area of: (Choose 3, if applicable)
a) Clinical care improvement
b) Regulatory reporting
c) Research
d) Operational improvement
e) Financial analytics
f) Unsure or not applicable

Google

Why do we love Google?
• Simple, effective interface
• Fast
• Accurate

How Would You Build ‘Google’ For Clinical Text?

The Basis of Text Search: The Inverted Index
Document 0: Patient is a 67 year old female with NIDDM and hypertension.
Document 1: The patient has no diabetes or hypertension.
Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.
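The three example documents above can be turned into an inverted index with a few lines of Python. This is only a minimal sketch: the stop-word list and the `TOY_STEMS` lookup table are assumptions made for illustration (a real engine such as Lucene uses a proper analyzer and stemmer), and the function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = [
    "Patient is a 67 year old female with NIDDM and hypertension.",
    "The patient has no diabetes or hypertension.",
    "Patient's mother is diabetic. Patient's sister is diabetic.",
]

# Assumed stop words: skipped in the index, but they still occupy a position.
STOP_WORDS = {"is", "a", "with", "and", "the", "has", "or"}

# Toy stand-in for a real stemmer (illustrative lookup table only).
TOY_STEMS = {"diabetes": "diabet", "diabetic": "diabet", "hypertension": "hyperten"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop a trailing possessive 's."""
    text = text.lower().replace("\u2019", "'")
    return [re.sub(r"'s$", "", tok) for tok in re.findall(r"[a-z0-9']+", text)]


def build_index(docs):
    """Map each stemmed word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, tok in enumerate(tokenize(text)):
            if tok in STOP_WORDS:
                continue
            index[TOY_STEMS.get(tok, tok)].append((doc_id, pos))
    return dict(index)


def search(index, term):
    """Return the sorted ids of documents containing the (stemmed) term."""
    stem = TOY_STEMS.get(term.lower(), term.lower())
    return sorted({doc_id for doc_id, _ in index.get(stem, [])})


INDEX = build_index(DOCS)
```

Calling `search(INDEX, "diabetes")` returns documents 1 and 2, because the lookup only touches the postings list for the stem "diabet" rather than scanning every document. That is what makes index-based search fast, and it is also why a mention of NIDDM in document 0 is missed without synonym handling.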
Inverted index for the three documents (word positions counted from 0; stop words omitted)
Word | Documents | Inverted index (document, position)
67 | 0 | {(0,3)}
diabet | 1,2 | {(1,4),(2,3),(2,7)}
female | 0 | {(0,6)}
hyperten | 0,1 | {(0,10),(1,6)}
mother | 2 | {(2,1)}
niddm | 0 | {(0,8)}
no | 1 | {(1,3)}
old | 0 | {(0,5)}
patient | 0,1,2 | {(0,0),(1,1),(2,0),(2,4)}
sister | 2 | {(2,5)}
year | 0 | {(0,4)}

Tools To Quickly Index Text and Provide Search Capability
Apache Lucene
• Originally written in 1999
• Open-Source Java API
• Create index
• Maintain index
• Search index
• Hit ranking
• Result sorting
• ... Much more
Provides the foundation for more advanced search engine capabilities. Most users use it through Solr or Elasticsearch. Used directly by Twitter.
Apache Solr
• Originally written in 2004
• Open Source Enterprise Search
• Built on Lucene
• Scalability (distributed indexes)
• REST APIs
• Plugin architecture
• Additional features over Lucene
Elasticsearch
• Originally written in 2010
• Open Source Enterprise Search
• Built on Lucene
• Scalability (distributed indexes)
• REST APIs
• Plugin architecture
• Additional features over Lucene

Search: diabetes [Go]
Results: 2 records, 0.0 ms
Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.
Document 1: The patient has no diabetes or hypertension.
• Found both diabetes and diabetic (word stemming)
• Missed mention of NIDDM (synonyms)
• Neither result is relevant to a medical cohort query for diabetics (context)

What Works?
• Simple, familiar interface
• Using inverted index means fast results
What Doesn’t
• Results display not optimized for use cases: need better ability to view aggregate results
• Want more results! Medical language has many synonyms. (How do we find NIDDM?)
• Want fewer results! Context matters for different search types. (How do we exclude ‘no diabetes’?)

Showing the results
• Many users are more interested in exploring aggregate results than reviewing individual records
• Aggregating results opens up search to users without access to PHI

Get More Results: Synonyms
When you say ‘diabetes’ what do you really mean?
"diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus"

Leveraging Medical Terminologies

Expanding Search With Terminologies

A more complex example: Diabetic patients who are on an ACE/ARB or who had their microalbumin checked during the calendar year
• Queries free text for all reports that contain “Diabetes” AND “(ace OR arb)” AND “microalbumin”
• Filtered for reports within the last year
• Note: terms are selected by the synonym finder, or are grouped terms of all trade names, generic names, or active medication ingredients
("diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus")
AND (
("benazepril" OR "lotensin" OR "captopril" OR "capoten" OR "enalapril" OR "vasotec" OR "epaned" OR "fosinopril" OR "monopril" OR "lisinopril" OR "prinivil" OR "zestril" OR "moexipril" OR "univasc" OR "perindopril" OR "aceon" OR "quinapril" OR "accupril" OR "ramipril" OR "altace" OR "trandolapril" OR "mavik")
OR
("azilsartan" OR "edarbi" OR "candesartan" OR "atacand" OR "eprosartan" OR "teveten" OR "irbesartan" OR "avapro" OR "telmisartan" OR "micardis" OR "valsartan" OR "diovan" OR "losartan" OR "cozaar" OR "olmesartan" OR "benicar")
)
AND ("albumin urine" OR "urine microalbumin" OR "urine microalbumin present")

Get Fewer Results: ConText Matters
ConText is an NLP pattern matching algorithm published in 2009.
“To be useful for clinical applications such as looking for genotype/phenotype correlations, retrieving patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical conditions in the text is not sufficient—information described in the context of the clinical condition is critical for understanding the patient’s state.” J Biomed Inform. 2009 Oct; 42(5): 839–851.
Detects conditions and whether they are:
• Negated (e.g., “ruled out pneumonia”)
• Historical (e.g., “past history of pneumonia”)
• Experienced by someone else (e.g., “family history of pneumonia”)

ConText Algorithm
Wendy W. Chapman, David Chu, John N. Dowling. J Biomed Inform. 2009 Oct; 42(5): 839–851.
Example sentence: “No history of chest tightness but family history of CHF.”
Annotations marked in the sentence: negation trigger, historical trigger, other experiencer trigger, condition, termination
Condition | Property | Before ConText | After ConText
Chest tightness | Negation | affirmed | negated
Chest tightness | Experiencer | patient | patient
Chest tightness | Temporality | recent | historical
CHF | Negation | affirmed | affirmed
CHF | Experiencer | patient | other
CHF | Temporality | recent | historical

ConText: Negation
Sentence: “The patient had no diabetes or hypertension.”
Annotations marked in the sentence: experiencer, negation trigger, clinical conditions, termination
• Diabetes: Negation: negated; Experiencer: patient; Temporality: recent
• Hypertension: Negation: negated; Experiencer: patient; Temporality: recent

ConText: Experiencer
Annotations marked in each sentence: experiencer, clinical conditions, termination
Sentence: “Patient’s mother has diabetes.”
• Diabetes: Negation: affirmed; Experiencer: other; Temporality: recent
Sentence: “Patient’s sister has hypertension.”
• Hypertension: Negation: affirmed; Experiencer: other; Temporality: recent

How?
• Analysis of context uses a sentence as an operand
• Identifying sentences in clinical text is not straightforward
• Have you ever seen punctuation in a clinical note?
An NLP analysis pipeline ties it all together:
Search results → Sentence detection → Entity recognition (e.g., diabetes) → ConText algorithm → Present user with additional filters
NLP Pipeline Frameworks
• Apache Unstructured Information Management Architecture (UIMA)
• General Architecture for Text Engineering (GATE)
• Natural Language Toolkit (NLTK)

ConText: Apply to Search Results

Filter Diabetes Results

Other Pieces to the NLP Pipeline: Extract Values
ef_phrase | qualifiers | ef_low | ef_high | ef_mid | ef_word
ejection fraction is at least 70-75 | is at least | 70 | 75 | 72.5 | NULL
ejection fraction of about 20 | of about | 20 | 20 | 20 | NULL
ejection fraction of 60 | of | 60 | 60 | 60 | NULL
ejection fraction of greater than 65 | of greater than | 65 | 65 | 65 | NULL
ejection fraction of 55 | of | 55 | 55 | 55 | NULL
ejection fraction by visual inspection is 65 | by visual inspection is | 65 | 65 | 65 | NULL
LVEF is normal | is | NULL | NULL | NULL | normal
Regular expression shown on the slide:
\b(((LV)?EF)|(Ejection\s+Fraction))\s+(?<qualifiers>([^\s\d]+\s+){0,5})\(?(((?<ef_low>\d+)(?<ef_high>\d+))|(?<ef_mid_txt>\d+)|(?<ef_word>([^\s]*?normal)|(moderate)|(severe)))

Other Extraction Projects
• Aortic Root Size
• Blood Pressure
• Breast Cancer ER Biomarker
• Cancer Staging, TNM, and stage
• Abdominal fistula
• Height/Weight/BMI
• Hypoglycemia with low blood sugars
• Microalbumin
• Ankle Brachial Index

High-Risk Population: Peripheral Arterial Disease (PAD)
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
• Measured by Ankle Brachial Index (ABI)
Patients identified:
• ICD/CPT: N=9,592
• Natural Language Processing (NLP): N=41,741
• ABI < 0.9: N=4,349
Terms shown: Peripheral artery disease, Claudication, Rest pain, Ischemic limb
This is a precise patient registry!
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiologists Scientific Session. Chicago, IL, April 2016.
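Value extraction of the kind shown for ejection fraction can be sketched in Python as follows. This is a simplified adaptation and not the slide's exact expression: Python writes named groups as `(?P<name>...)`, the hyphen between the low and high values of a range is an assumption (the separator is not legible in the slide's pattern), and the `ef_word` alternatives are reduced to three literal words. The function name `extract_ef` is hypothetical.

```python
import re

# Simplified pattern: EF keyword, up to five non-numeric qualifier words,
# then a range ("70-75"), a single number, or a descriptive word.
EF_PATTERN = re.compile(
    r"\b(?:(?:LV)?EF|ejection\s+fraction)\s+"
    r"(?P<qualifiers>(?:[^\s\d]+\s+){0,5}?)"
    r"(?:(?P<ef_low>\d+)\s*-\s*(?P<ef_high>\d+)"  # assumed hyphen separator
    r"|(?P<ef_mid>\d+)"
    r"|(?P<ef_word>normal|moderate|severe))",
    re.IGNORECASE,
)


def extract_ef(text):
    """Return the first ejection fraction mention as a dict, or None."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    if match.group("ef_low"):
        # A range such as "70-75": keep both ends.
        low, high = float(match.group("ef_low")), float(match.group("ef_high"))
    elif match.group("ef_mid"):
        # A single value: low, high, and mid are all the same number.
        low = high = float(match.group("ef_mid"))
    else:
        # Only a descriptive word such as "normal" was found.
        low = high = None
    word = match.group("ef_word")
    return {
        "qualifiers": match.group("qualifiers").strip(),
        "ef_low": low,
        "ef_high": high,
        "ef_mid": None if low is None else (low + high) / 2,
        "ef_word": word.lower() if word else None,
    }
```

For "ejection fraction is at least 70-75" this yields low 70, high 75, and midpoint 72.5, matching the first row of the extraction table; "LVEF is normal" yields only the word "normal". Keeping the qualifier words alongside the number lets a reviewer see phrases such as "at least" or "about" during validation.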
Validation
• Build studies to review results of query
• Assign to team members to review results
• Randomly selects records to represent study
• Highlights key words for easy chart review

Text Analytics Must Be Interoperable!
Validated Text Analytics
• Diabetes Cohort
• PAD Cohort
• Ejection Fraction
• Tumor Sizes
Data Warehouse
• Population Analytics
• Care Improvement
• Operational Improvement
• Financial Improvement
• Research
• …

A ‘Late Binding’ Approach to Text Analytics
Search
• Easy starting point
Synonym Finding
• Uses terminologies
• Allow user to find synonyms
Context Filtering
• Excludes negated concepts
• Good for cohort queries
Regular Expression
• Extraction of discrete values: Ejection Fractions, ABI
Validation
• Expert review of algorithm output
• Performance measurement
Integration
• Operationalize algorithm
• Incorporate into analytics
Many More Techniques
• Section tagging
• Entity recognition
• N-gram analysis
• Document clustering

Final Thought
To leverage the power of text analytics… Make the data accessible first!

Lessons Learned
• Using search technology for clinical text is an engaging and accessible entry point for text analytics problems. Searching clinical text is powered by an inverted index that catalogs words present in the documents, which documents they are present in, and their position in the documents.
• Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration.
• NLP algorithms that are based on the context surrounding clinical terms can identify when the term is negated (“no evidence of pneumonia”) or applies to another person (“patient's grandmother had breast cancer”). Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.
• Text analytics should be validated and integrated with an enterprise data warehouse where the information extracted from text can be combined with discrete, coded data.

Analytic Insights
Questions & Answers

What You Learned…
Write down the key things you’ve learned related to each of the learning objectives after attending this session

Thank You
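As a closing illustration of the context lesson above, the following Python sketch flags negation, experiencer, and temporality for a condition using trigger phrases that appear before it in the same sentence. It is a toy written in the spirit of ConText, not the published algorithm: the trigger and termination lists are assumed, heavily abbreviated examples, and the function names are hypothetical.

```python
import re

# Assumed, abbreviated trigger lists for illustration only.
NEGATION_TRIGGERS = ("no", "denies", "ruled out")
HISTORICAL_TRIGGERS = ("history of",)
EXPERIENCER_TRIGGERS = ("family history", "mother", "sister", "grandmother")
TERMINATIONS = ("but",)  # a termination term ends the scope of earlier triggers


def scope_before(text, start):
    """Text between the last termination term and the condition at `start`."""
    window = text[:start]
    cut = 0
    for term in TERMINATIONS:
        for match in re.finditer(rf"\b{re.escape(term)}\b", window):
            cut = max(cut, match.end())
    return window[cut:]


def has_trigger(window, triggers):
    """True if any trigger phrase occurs as whole words inside the window."""
    return any(re.search(rf"\b{re.escape(t)}\b", window) for t in triggers)


def annotate(sentence, conditions):
    """Assign negation, experiencer, and temporality to each condition found."""
    text = sentence.lower()
    results = {}
    for condition in conditions:
        match = re.search(rf"\b{re.escape(condition.lower())}\b", text)
        if match is None:
            continue
        window = scope_before(text, match.start())
        results[condition] = {
            "negation": "negated" if has_trigger(window, NEGATION_TRIGGERS) else "affirmed",
            "experiencer": "other" if has_trigger(window, EXPERIENCER_TRIGGERS) else "patient",
            "temporality": "historical" if has_trigger(window, HISTORICAL_TRIGGERS) else "recent",
        }
    return results
```

With these lists, `annotate("No evidence of pneumonia.", ["pneumonia"])` marks pneumonia as negated, and "Patient's grandmother had breast cancer." assigns the condition to someone other than the patient. Because the function works one sentence at a time, it depends on the sentence detection step that comes earlier in the pipeline.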