Chapter 2 Computational Classification Techniques for Biomedical and Clinical Big Data

Kelsey Bourque, Cognitive Technologies, Spring 2018

2.1 Abstract

In this paper, we review popular computational classification methods and their application to biomedical and clinical data. These techniques are necessary to make sense of biomedical “big data”, as more of it becomes available every day. No individual could possibly keep up with all of this information. Computational classification is therefore needed to create and sustain systems that will ultimately benefit the medical community.

2.2 Introduction

“Big data” is a popular buzzword thrown into seemingly every conversation about data mining technology. Simply put, “big data” refers to massive, often unstructured data sources collected from many odds and ends, but most typically from the internet. Much time and effort go into finding effective and efficient ways to make sense of this data, specifically ways to classify it accurately at minimal computational expense. Computational classification techniques have been around since long before big data was a buzzword, and some existed even before the age of the internet. These techniques can be supervised or unsupervised, meaning that they either learn from human-annotated data or learn on their own without annotation. While supervised learning tends to yield superior results, the human time and effort required for annotation is extremely expensive and not always practical. Especially given the scale of big data, few researchers have the time to annotate, and unsupervised learning has therefore become more popular in recent years. Slowly but surely, unsupervised learning is evolving and becoming more accurate.

One field that would benefit greatly from computational classification techniques is the biomedical domain. Medical resources are rich in information, and many are publicly available, such as journals published on PubMed and Medline, two online databases for medical and clinical scholarly texts. Clinical data, however, is much more difficult to come by given issues of privacy and consent. Despite this, there are publicly available datasets large enough to support more complicated systems, such as neural networks. There is a true need for research on biomedical and clinical texts, not only as a practical task for computational classification but as a life-saving one too. Artificial intelligence systems such as IBM Watson use classification to help physicians provide top care for their patients. It would be humanly impossible to read and retain everything published in any medical subfield, but Watson can analyze thousands of documents daily. Not only can Watson maintain and update a database, but it can also help provide better care by keeping doctors updated on relevant findings that may help a given patient.

Computational classification is also useful on the clerical side of the medical field, including improving the medical billing and coding system. Currently, medical billers and coders work with specialized vocabularies to properly annotate clinical charts for billing insurance companies. While this system is necessary, medical billers and coders are only human and can make mistakes. Automated systems for medical billing and coding have been experimented with, and it remains difficult to execute this task as well as trained professionals can. In this paper, we argue that the medical field would benefit considerably from more computational classification applications, reviewing relevant previous general work on these algorithms as well as their application to the medical field to date.

2.3 Previous Work

Computational classification techniques have been around for decades and their applications are nothing short of diverse. In this section, we review text-specific classification techniques: those that fall under machine learning, are associated with natural language processing, and have been widely tested on text. Much medical and clinical data is written text, ranging from clinical notes and patient charts to papers in medical journals. Machine learning applications such as classification have a wide range of practical uses that help make sense of all this data. Popular classification techniques include topic modeling, neural networks, and clustering, though these algorithms often need the support of word sense disambiguation.

2.3.1 Topic Modeling

Topic modeling (earlier referred to as Latent Semantic Analysis/Indexing, LSA/LSI) is a statistical method that works at the word or sentence level to classify documents into similar categories, or “topics”. Landauer et al wrote their 1997 paper “Introduction to Latent Semantic Analysis” as a broad introduction to LSA and its potential. They described the method as an application for extracting and representing the semantic meaning of words through statistical computations applied to a large corpus of text, and argued that, in a way, LSA mimics human sorting and categorization of words. For example, LSA has been found capable of simulating aspects of human cognition such as vocabulary acquisition, word recognition, sentence-word semantic priming, discourse comprehension, and judgment of essay quality.

The knowledge derived from LSA can be described as sufficient but lacking in experience. While humans often understand their world through experience, human knowledge is not limited to experience-only learning. LSA’s uniqueness lies not just in its comparability to human learning, but also in being unlike other traditional natural language processing or artificial intelligence applications of its time. LSA takes raw text as input and does not utilize dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphology. Instead, the raw text is parsed into words represented as unique character strings, which are organized into a matrix where each row is a unique word and each column is a text passage (most typically a document); singular value decomposition is then applied to the matrix. Landauer et al tested LSA on multiple judgment tasks and reported good performance, but concluded that its lack of raw experience makes it somewhat incomparable to human cognition. As an overall computational classification technique, however, LSA led the way for more sophisticated topic modeling.
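The core of the procedure described above can be sketched in a few lines: build a term-document matrix, apply singular value decomposition, and compare documents in the truncated latent space. The matrix below is a toy example with invented counts, not data from Landauer et al.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Hypothetical terms: "patient", "tumor", "gene", "movie", "actor"
X = np.array([
    [2, 1, 0],   # patient
    [1, 2, 0],   # tumor
    [1, 1, 0],   # gene
    [0, 0, 2],   # movie
    [0, 0, 1],   # actor
], dtype=float)

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top k latent dimensions ("topics").
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 1 share biomedical vocabulary; document 2 does not.
sim_01 = cos(doc_vectors[0], doc_vectors[1])
sim_02 = cos(doc_vectors[0], doc_vectors[2])
```

Documents that share vocabulary end up close together in the latent space even after the dimensionality is drastically reduced, which is what allows LSA to group passages by topic.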

Just two years after the publication of “An Introduction to Latent Semantic Analysis”, the paper “Latent Semantic Indexing: A Probabilistic Analysis” by Papadimitriou et al sought to improve LSA by introducing random projection, a mathematical technique for reducing dimensionality, which Papadimitriou et al believed would increase speed while maintaining accuracy. Applying random projection to the initial corpus was intended to reduce the bottleneck that often comes with LSI, which they achieved with some success. While the model performed faster, Papadimitriou et al were somewhat dissatisfied with its performance. Both Papadimitriou et al and Landauer et al agreed that LSA handles polysemy and synonymy poorly. The bag-of-words model employed by LSA could be to blame, given that bag of words treats the context in which a word appears as independent from the word itself.

Despite the effort that LSA puts toward classifying documents and retrieving relevant information, it is indeed limited in disambiguating word senses. More recent successors to LSA, now called topic models, have attempted to address these issues with polysemy and synonymy. Wallach attempted to bridge this gap in her 2006 paper “Topic Modeling: Beyond Bag-of-Words” by combining bag-of-words and n-gram statistics. She extended latent Dirichlet allocation (Blei et al., 2003), which represents documents as random mixtures over latent topics, each topic characterized by a distribution over words, by introducing a bigram model that takes word order into account. Her results showed that the predictive accuracy of her model is significantly better than that of either latent Dirichlet allocation or the hierarchical Dirichlet language model. Her model also automatically infers a separate topic for function words, meaning that the other topics are less dominated by these words. This contribution is especially important because Wallach’s model uses a larger number of topics than either Dirichlet model and achieves a greater information rate reduction as more topics are added, again while maintaining accuracy.
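As a minimal sketch of the latent Dirichlet allocation model that Wallach extends (this uses scikit-learn’s implementation on an invented document-term count matrix, not her bigram model):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: rows = documents, columns = vocabulary.
# Hypothetical vocabulary: ["gene", "tumor", "cell", "film", "actor", "scene"]
counts = np.array([
    [4, 3, 2, 0, 0, 0],   # biomedical document
    [3, 4, 3, 0, 0, 0],   # biomedical document
    [0, 0, 0, 4, 3, 2],   # movie document
    [0, 0, 1, 3, 4, 3],   # movie document
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures

# Each row of doc_topics is a probability distribution over the two latent
# topics, so documents with shared vocabulary receive similar mixtures.
```

Because LDA models each document as a mixture of topics rather than assigning it to a single category, a clinical note can be, say, mostly “infection” with a smaller “medication” component.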

2.3.2 Neural Networks

Another popular approach to text analysis and classification is neural networks, currently favored for their speed and accuracy across many applications. Partially responsible for the recent rise of neural network applications in natural language processing is the popular Word2Vec model from Mikolov et al of Google. Mikolov et al published their paper “Efficient Estimation of Word Representations in Vector Space” and made the Word2Vec algorithm publicly available in 2013.

Part of Word2Vec’s attractiveness is the speed with which the word vectors are developed. This is due to the structure from which the word vectors are derived: shallow neural networks. The Word2Vec model creates a two-layer network, in which one layer is hidden. This structure is supported by the log-linear architectures proposed by Mikolov et al, which learn distributed representations of words while minimizing computational complexity. These include a continuous bag-of-words model (CBOW) and a continuous skip-gram model.

The CBOW architecture is like that of a feedforward neural net language model where the non-linear hidden layer is removed and the projection layer is shared for all words; all words are projected into the same position and their vectors are averaged. The continuous skip-gram model is like the CBOW model, but instead tries to maximize classification of a word based on another word in the same sentence. Each current word is used as input to a log-linear classifier with a continuous projection layer. The resulting word vectors are capable of various tasks, such as generating similar words and deciding which word does not belong in a set.
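The difference between the two architectures can be illustrated by the training pairs each derives from a sentence: CBOW predicts the center word from its context, while skip-gram predicts each context word from the center word. A small sketch (the sentence and helper function are invented for illustration):

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Generate (input, target) pairs as Word2Vec would see them.

    CBOW: input is the bag of context words, target is the center word.
    Skip-gram: input is the center word, target is each context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((tuple(context), center))
        else:
            pairs.extend((center, c) for c in context)
    return pairs

sent = ["the", "patient", "reported", "chest", "pain"]
cbow = training_pairs(sent, window=1, mode="cbow")   # one pair per position
sg = training_pairs(sent, window=1, mode="skipgram") # one pair per (center, context) combination
```

Skip-gram generates more training pairs per sentence, which is one reason it tends to do better on rare words at some extra computational cost.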

Perhaps the most interesting contribution of the Word2Vec model is its ability to conceptualize words. This is explained best through the famous king and queen example: simply, v(king) - v(man) + v(woman) ≈ v(queen), where the vector for man is subtracted from the vector for king and the vector for woman is added, resulting in a vector close to that of queen. While Word2Vec’s application to natural language is general and flexible, there exist other neural network systems targeted at specific natural language processing applications. For example, Chen et al proposed a neural network dependency parser that utilizes part-of-speech tag and arc label embeddings to yield 1,000 parses per second with 92.2% accuracy. Since Word2Vec was released, many other algorithms have been published building on the model. For example, Sense2Vec, published two years after Word2Vec, uses part-of-speech tagging to help disambiguate word senses.
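The analogy can be demonstrated with hand-built toy vectors; the two dimensions and all values below are invented for illustration, whereas real Word2Vec embeddings have hundreds of opaque dimensions.

```python
import numpy as np

# Hand-built toy embeddings with two interpretable axes:
# dimension 0 ~ "royalty", dimension 1 ~ "gender" (all values invented).
vocab = {
    "king":   np.array([0.9,  0.8]),
    "queen":  np.array([0.9, -0.8]),
    "man":    np.array([0.1,  0.8]),
    "woman":  np.array([0.1, -0.8]),
    "doctor": np.array([0.2,  0.0]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude):
    # Return the vocabulary word most similar to vec by cosine similarity,
    # skipping the words that formed the analogy query.
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], vec))

# v(king) - v(man) + v(woman) lands nearest v(queen).
target = vocab["king"] - vocab["man"] + vocab["woman"]
result = nearest(target, exclude={"king", "man", "woman"})
```

Subtracting v(man) removes the "male" component from v(king) while leaving its "royalty" component intact, and adding v(woman) supplies the "female" component, which is why the result lands near v(queen).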

Trask et al argue that Word2Vec does not do enough linguistic preprocessing to accurately disambiguate words such as “duck”, which, depending on the context, can be either a noun or a verb. Sense2Vec maintains Word2Vec’s general architecture but adds more linguistic features to enhance the word embeddings. Another popular natural language processing topic is morphology, which Luong et al addressed in their paper “Better Word Representations with Recursive Neural Networks for Morphology”. Luong et al argue that while vector space representations have had success over the past few years, morphological relations have been lacking. Their solution was a recursive neural network capable of finding word similarity and distinguishing rare words. Luong et al used both supervised and unsupervised approaches in their experiment and were able to yield comparable results between the two. Neural networks continue to prove to be a practical and useful tool in natural language processing and in many areas of machine classification.

2.3.3 Clustering

The final computational classification technique discussed here is clustering. Clustering is an unsupervised learning task of mapping data to find where it “clusters”, or where data situates itself near other similar data. Clustering is a popular classification technique especially because it can be performed unsupervised. Unsupervised learning techniques have become more popular in the age of the internet, where researchers have a constant stream of raw, accessible data to analyze. One especially popular source is the microblogging site Twitter, with its easy-to-access APIs and rich textual content.

In the paper “Social Network Data Mining Using Natural Language Processing and Density Based Clustering”, Khanaferov et al proposed a system to mine Twitter data for information relevant to obesity and health. Their goal of demonstrating a practical approach to solving a healthcare issue through a computational method focused on mining useful patterns out of public data. First, they used a data warehouse with three distinct layers as a staging area for the mining process. After the collected data was cleaned and standardized, a density-based clustering algorithm was applied to find relevant patterns. The output was a collection of transactions, each with a set of search terms associated with it. To better visualize the cluster data, it was plotted onto a map using the Google Maps API, which helped show that tweets coming out of the United States and Europe had a negative sentiment, while those coming out of South Asia, Canada, and Central Africa had a positive sentiment. Overall, they were able to cluster tweets in a somewhat meaningful way, although the loose relationship between healthcare and social media makes it tricky to extract meaningful results.

Cardie et al experimented with noun phrase clustering in their paper “Noun Phrase Coreference as Clustering”. They introduced a new, unsupervised algorithm for this task by treating each group of coreferential noun phrases as an equivalence class. They identified various features of each noun phrase, such as individual words, head noun, position, and pronoun type, then defined a distance measure between two noun phrases. The clustering algorithm worked backwards through the document, since noun phrases tend to refer to noun phrases preceding them. This approach was somewhat accurate, with results ranging from 41.3% to 64.9%, leaving room for improvement.
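A density-based clustering step like the one Khanaferov et al describe can be sketched with scikit-learn’s DBSCAN; the two-dimensional points below are invented stand-ins for vectorized tweets, not their data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D points standing in for vectorized tweets: two dense groups
# plus one isolated outlier (all values invented).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],   # dense group B
    [9.0, 0.0],                                        # isolated point
])

# DBSCAN grows clusters from points with at least min_samples neighbors
# within radius eps; points in no dense region are labeled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
```

Unlike k-means, density-based clustering does not require fixing the number of clusters in advance and can label sparse outliers as noise, which suits uneven social media data.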

2.3.4 Word Sense Disambiguation

An important distinction in computational classification of text is that of word sense disambiguation: choosing the correct sense of a polysemous word given the context in which it occurs. Word sense disambiguation within the medical field, called biomedical text normalization, is especially relevant given the specialized nature of the data at hand. Applications of biomedical text normalization that work within a medical and clinical context may not transfer successfully to outside subjects, and vice versa. For example, the abbreviation “CA” can mean two things within the medical field: “cancer” or “carbohydrate antigen”. Outside of the medical field, however, “CA” could very likely stand for “California”, or other words that have nothing to do with “cancer” or “carbohydrate antigen”. For this reason, research in biomedical text normalization is an important task that should lead to higher accuracy in classification applications.

Given how much medical data exists, techniques range from supervised to semi-supervised and unsupervised learning, although in recent years unsupervised learning is favored by many researchers. Tulkens et al managed to create a successful biomedical text normalization program in their paper “Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts”. Their approach was an unsupervised learning method that classified concepts by clustering. To achieve this, they utilized the Word2Vec continuous skip-gram model to create their word representations. Those representations were then transformed via compositional functions into concept vectors, essentially an entire concept representation in one vector. Every ambiguous concept tested in their experiment was defined by having more than one concept unique identifier (CUI) in the Unified Medical Language System (UMLS). Much like the “cancer” or “carbohydrate antigen” example above, the test concepts had multiple, distinct meanings. Tulkens et al managed to obtain between 69% and 89% accuracy by transforming both the training and test data into concept vectors and measuring the cosine distance between them.
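The concept-vector idea can be sketched as follows: average word vectors into a single concept vector, then pick the sense whose vector is closest, by cosine similarity, to the context. All embeddings and words below are invented for illustration; the actual system of Tulkens et al used skip-gram vectors trained on biomedical text and UMLS concept definitions.

```python
import numpy as np

# Toy 3-d word embeddings (invented values); a real system would use
# skip-gram vectors trained on a large biomedical corpus.
emb = {
    "cancer":    np.array([0.9, 0.1, 0.0]),
    "tumor":     np.array([0.8, 0.2, 0.0]),
    "malignant": np.array([0.9, 0.0, 0.1]),
    "antigen":   np.array([0.0, 0.9, 0.1]),
    "blood":     np.array([0.1, 0.8, 0.0]),
    "test":      np.array([0.0, 0.7, 0.2]),
    "biopsy":    np.array([0.8, 0.1, 0.1]),
}

def concept_vector(words):
    """Compose word vectors into one concept vector by averaging."""
    return np.mean([emb[w] for w in words], axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two candidate senses for the ambiguous abbreviation "CA".
senses = {
    "cancer": concept_vector(["cancer", "tumor", "malignant"]),
    "carbohydrate antigen": concept_vector(["antigen", "blood", "test"]),
}

# Context words surrounding "CA" in a hypothetical note.
context = concept_vector(["tumor", "biopsy"])
best_sense = max(senses, key=lambda s: cos(senses[s], context))
```

Because both senses and contexts live in the same vector space, disambiguation reduces to a nearest-neighbor lookup, requiring no annotated training examples for the ambiguous term itself.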

In a semi-supervised approach, Siu et al experimented with semantic type classification of complex noun phrases. Often in medical text, complex noun phrases consist of specific names (diseases, drugs, etc.) and common words such as “condition”, “degree”, or “process”. The common words can have different semantic types depending on their context in the noun phrase, and in their experiment Siu et al attempted to classify these common words into fine-grained semantic types. Siu et al argue that it is crucial to consider these common nouns in information extraction because while they can carry biomedical meaning, they can also be used in a general, uninformative sense. Their semi-supervised method labeled each target word within a noun phrase with a suitable semantic type or tagged it as uninformative. Experiments with this method yielded a 91.34% micro-average and an 83.57% macro-average over 50 frequently appearing target words.

Another unsupervised approach to biomedical word sense disambiguation is that of Henry et al in their paper “Evaluating Feature Extraction Methods for Knowledge-Based Biomedical Word Sense Disambiguation”. They compared vector representations in the 2-MRD WSD algorithm and evaluated four dimensionality reduction methods: continuous bag of words, skip-gram, singular value decomposition, and principal component analysis. Like Tulkens et al, Henry et al measured accuracy with cosine similarity. Singular value decomposition performed well in their experiments, though it may not scale as well to larger data sets. Regarding dimensionality, low vector dimensionality was sufficient for the continuous bag-of-words and skip-gram models, but higher dimensionality achieved better results for singular value decomposition. Although principal component analysis is commonly used for dimensionality reduction, in this case it did not improve results for word sense disambiguation. Regardless of the method, normalization of biomedical and clinical text remains a nuanced and necessary step in processing for information retrieval and document classification. Part of the urgency to make advancements in this task is that computational methods are being applied to medical and clinical data every day; their effectiveness relies on the ability to properly disambiguate terms and classify accurately, as a human would, or they risk being rendered useless.

2.4 Medical and Clinical Applications

Computational classification techniques have been applied to medical data for years in the hope of contributing new methods for helping patients by identifying and diagnosing diseases, as well as preventing illnesses from occurring in the first place. This work is predictive and requires copious amounts of data to be proven effective. As with any other computational classification task, there are both supervised and unsupervised approaches. Since there is an ever-growing amount of medical data available, many researchers are turning to semi-supervised and unsupervised techniques to wrangle more data more efficiently. While supervised learning is extremely practical and yields highly accurate results, there is always the cost of annotating data: annotation must be conducted by people and, depending on the size of the data set or task, can be very time consuming. In this section, we review various contributions researchers are making to the biomedical natural language processing community and the techniques they use.

The ability of a computer or artificial intelligence to aid health care providers in diagnosing patients sounds like something out of a science fiction novel. However, experimental applications for machine diagnosis have become popular in recent years, taking medical data from patients with specific ailments and using their history to learn models that can predict the same occurrence in future patients. Apostolova et al sought to create a system to detect sepsis and septic shock in patients early, when treatment is effective. They report that sepsis has an approximately 50% mortality rate worldwide, and often the infection can be detected through clues in nurses’ notes. Using the MIMIC-III corpus, a publicly available data set of ICU patient records, alone was unsuccessful. However, they noticed that when a patient has an infection or is suspected of having one, nurses tend to mention that the patient is on an antibiotic. Using this heuristic along with a list of commonly prescribed antibiotics, they were able to extract the language used to describe a patient’s state during infection. Notes with infection-hinting and infection-confirmed language were combined with notes where infection was not present as training data for Support Vector Machines (SVMs), a supervised machine learning algorithm, for binary classification of these free-form notes. Using this technique, they achieved F1-scores ranging between 79% and 96%. These results are a good start toward their end goal of an automated system for detecting early sepsis in at-risk patients.
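A binary SVM classifier over free-text notes, in the spirit of this setup, can be sketched with scikit-learn; the note fragments and labels below are invented, not MIMIC-III data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented mini-corpus of nurse-note fragments; 1 = infection-related
# language present, 0 = absent.
notes = [
    "patient febrile started on vancomycin for suspected infection",
    "white count elevated antibiotics continued sepsis watch",
    "on ceftriaxone temperature spiking possible infection source",
    "patient resting comfortably vitals stable no acute distress",
    "ambulating well tolerating diet discharge planning underway",
    "pain controlled dressing clean and dry no drainage",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorize the free text with tf-idf weights, then fit a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(notes, labels)

pred = model.predict(["febrile on antibiotics infection suspected"])[0]
```

The tf-idf step turns each note into a sparse weighted word-count vector, and the linear SVM then finds a maximum-margin hyperplane separating the two classes in that space.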

SVMs have also been tried for cancer diagnosis, with mixed reviews. In one paper from 2002, “Gene Selection for Cancer Classification Using Support Vector Machines”, Guyon et al proposed a method to make sense of the massive amount of data DNA microarrays generate. Briefly, DNA microarrays are microscopic collections of DNA attached to a surface. Researchers use them to classify and predict the diagnostic category of a sample based on its gene expression profile; in this case, the category is cancer. Guyon et al used samples from both cancer patients and patients without cancer to train their model, an SVM based on Recursive Feature Elimination, which uses weight magnitude as its ranking criterion. Their technique was able to extract biologically relevant genes from patients with colon cancer or leukemia and yielded a high classification accuracy of 98% on colon cancer, compared to the baseline system’s 86%. Guyon et al argued that SVMs lend themselves well to this type of gene classification because of their ability to handle large feature sets (here, thousands of genes) with a small number of patterns (dozens of patients). However, Koo et al argued in their 2006 paper “Structured polychotomous machine diagnosis of multiple cancer types using gene expression” that even though SVMs are a popular and accurate classification technique, their results are implicit and therefore difficult to interpret. To address this drawback, Koo et al proposed an extension of import vector machines using an analysis-of-variance decomposition and structured kernels, called the structured polychotomous machine.
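The SVM-RFE idea can be sketched on synthetic data: fit a linear SVM, repeatedly discard the features with the smallest weight magnitudes, and keep the survivors. Everything below (the expression matrix and which "genes" are informative) is synthetic, not the Guyon et al data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic "expression" matrix: 40 samples x 20 genes. Only genes 0 and 1
# actually separate the two classes; the rest are pure noise.
X = rng.normal(size=(40, 20))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 3.0
X[y == 1, 1] -= 3.0

# Recursive feature elimination: fit the SVM, drop the lowest-|weight|
# feature, and refit until n_features_to_select remain.
selector = RFE(LinearSVC(), n_features_to_select=2, step=1)
selector.fit(X, y)

selected = np.where(selector.support_)[0]   # indices of surviving genes
```

Eliminating one feature at a time and refitting is what distinguishes RFE from simply ranking the weights of a single fit: correlated features can mask each other in one pass but are handled as the set shrinks.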

Import vector machines are like SVMs, but they are typically computationally cheaper and can provide estimates of posterior probabilities. The DNA microarray data Koo et al used came from a few sources, including a small round blue cell tumor data set and a leukemia data set. They wanted to create a system that not only improved upon import vector machines but also provided a method for finding genes that accurately discriminate cancer subtypes. Overall, Koo et al were able to achieve 0% error rates, and their method selected a smaller set of genes while successfully classifying among samples. Their model also outperformed the SVM baseline in several tests, as they expected. As seen in these two comparative studies, robust machine learning systems are necessary for biomedical classification applications.

DNA microarrays are one example of an expensive data set commonly used in biomedical classification; another commonly used data source is Twitter. As mentioned earlier, Twitter is a frequently used source of text data. In a study from Nadeem et al, “Identifying Depression on Twitter”, data was crowdsourced from Twitter users who had been diagnosed with Major Depressive Disorder in an effort to measure and predict depression in users. From these users, along with a general demographic, tweets from up to a year prior were extracted, and a bag-of-words model was applied to quantify each tweet.

Finally, statistical classifiers were applied to the tweets to analyze the risk of depression. Linguistic features were extracted from the tweets of those with depression in an attempt to see how the language of people with depression differs from the language of those without. From this analysis, Nadeem et al found approximately 20 words (and one sad-face emoticon) used at a much higher rate by depressed users. Nadeem et al employed Decision Tree, Support Vector Machine, Logistic Regression, Ridge Classifier, and two Naïve Bayes classifiers. Of the six, the Logistic Regression classifier had the highest precision and F1-score, while the SVM had the highest recall and the Naïve Bayes with 1-grams had the highest overall accuracy of 86%. The statistical classifiers here were trained with supervised learning, resulting in accuracies comparable to the experiments of Koo et al and Guyon et al. Both approaches proved useful and accurate in their classification tasks of diagnosing disease.

Like Nadeem et al, Gorrell et al attempted to identify first episodes of psychosis, in their case in psychiatric patient records. They filtered thousands of records and obtained 9,109 individual clinical records. Of those, 560 screened positive for psychosis, 5,234 screened negative (but remained at risk), and 3,315 were excluded for various reasons. Gorrell et al chose SVM, Random Forests, and JRip algorithms to classify their data, for speed and accuracy reasons. They used two- and three-fold validation to define their features. Three-fold features included missing demographic information, such as borough, ethnicity, gender, postcode, first primary diagnosis, and age. Where available, however, first primary diagnosis was included (bipolar hypomanic/unspecified and severe depressive with psychotic symptoms).
Text features included in three-fold validation consisted of “olanzapine”, “risperidone”, “auditory hallucinations”, “voices”, “paranoid”, “psychotic” and “psychosis”. Two-fold validated features were somewhat less specific, including first primary diagnosis (bipolar, organic delusional schizophrenia-like disorder, organic mood disorder) and text features such as “aripiprazole”, “quetiapine”, “persecutory”, and “schizophrenia”. The three algorithms, with varying feature set sizes, obtained decent results, ranging from 66.46% to 82.2% accuracy. Surprisingly, the Random Forests classifier had both the weakest and strongest accuracy, scoring 66.46% with the full feature set plus unigrams and 82.2% with a reduced feature set.

While text classification is a useful tool in diagnosing depression and other mental illnesses, researchers have also experimented with multimodal tools to identify and classify these illnesses. Morales et al explored this technique in “OpenMM: An Open-source Multimodal Feature Extraction Tool”, where they used text, speech, and face-mapping features to identify depression in individuals. Morales et al argue that to usefully model situational awareness, machines must have access to the same visual and verbal cues that humans have. To this end, they built a pipeline that extracts visual and acoustic features, performs automatic speech recognition, and uses that data to transcribe speech and extract relevant linguistic features. OpenMM was tested on deception, depression, and sentiment classification, showing promising results. Depression detection had a baseline of 55.36% accuracy, and OpenMM’s acoustic feature set was able to produce an accuracy of 76.79%. OpenMM is publicly available for other researchers to experiment with and build upon, which is necessary for making use of all these classification techniques.

Just as machines can help diagnose disease, machine learning can also be leveraged to help prevent disease through predictive models. This goal can be achieved through text mining techniques on clinical and medical data. Jacobson et al experimented with detecting healthcare-associated infections in patients by applying deep learning techniques to Swedish medical records. The data, from the Swedish Health Record Research Bank, contained two million patient records from over 800 clinical units between 2006 and 2014. They also used a special subset, the Stockholm EPR Detect-HAI Corpus, which contains 213 patient records classified and gold-annotated by two domain experts. After the necessary preprocessing, the records were transformed into numerical vectors: bag-of-words and tf-idf representations on the one hand, and Word2Vec word vectors on the other. Artificial neural networks were then built from these representations, including stacked sparse autoencoders and stacked restricted Boltzmann machines. The results were somewhat diverse, ranging from 66% to 91%, but most of the scores hovered between 70% and 80%. Jacobson et al admitted that deep learning techniques are often expensive to train, and researchers usually sacrifice some agility for increased accuracy, which was unfortunately not seen in this experiment. Despite the shortcomings, this research is still useful and future work is promising.

Having a broad range of potential diseases to identify and classify is attractive, but narrow topics are also necessary. To this point, Abdinurova et al sought to create a model that could classify epilepsy, namely its various stages. The stages of epilepsy include absence of seizure, pre-seizure, seizure, and seizure-free, and all are used in clinical data. Their system utilized artificial neural networks and SVMs as supervised learning algorithms, and k-means clustering as an unsupervised technique. The various techniques they experimented with showed favorable results, all with high accuracies. As expected, the supervised methods performed better than their unsupervised counterparts, though the unsupervised results were not drastically worse. This experiment is an excellent comparison of state-of-the-art supervised and unsupervised learning methods and shows that both can yield comparable results. In addition to being excellent classification systems, the models also performed well on information retrieval tasks, an important function of machine classification.
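The unsupervised side of such a system can be sketched with k-means on invented two-feature summaries of EEG-like windows (the features and all values are hypothetical, not from Abdinurova et al):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-feature summaries of signal windows, e.g. mean amplitude and
# variance, with seizure-like windows occupying a separate region.
X = np.array([
    [0.2, 0.1], [0.3, 0.2], [0.25, 0.15], [0.2, 0.2],   # baseline-like windows
    [2.0, 1.8], [2.1, 1.9], [1.9, 2.0], [2.2, 1.7],     # seizure-like windows
])

# k-means partitions the points into k groups by iteratively assigning each
# point to its nearest centroid and recomputing the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```

Because k-means receives no stage labels, the clusters must be mapped to clinical stages after the fact, which is exactly where unsupervised approaches tend to lose a little accuracy relative to their supervised counterparts.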

2.5 Clerical Applications

Computational classification can also be applied to clerical work in the medical field. Currently, medical billers and coders review doctors' notes and patient charts, annotate them for relevant information, and send that information to insurance companies for proper billing. Medical billers and coders are skilled workers trained in the vocabularies of the Unified Medical Language System (UMLS), which include the International Classification of Diseases (ICD) and Current Procedural Terminology (CPT), among others. These vocabularies unambiguously identify medical concepts so that insurance companies know exactly what a customer visited the doctor for and can bill accordingly. While medical billers and coders are necessary to the modern healthcare system, they can make mistakes in their annotations, and efficiency can always be improved.

Researchers have been applying computational classification methods to medical billing and coding problems in order to automate these tasks. Karimi et al experimented with such tasks in their paper “Automatic Diagnosis Coding of Radiology Reports: A Comparison of Deep Learning and Conventional Classification Methods”. Their system applies deep learning to the auto-coding of radiology reports with the ICD vocabulary and examines how deep learning fares with a smaller data set. Interestingly, they chose both domain-specific and out-of-domain data for training: the in-domain set was a corpus of radiology reports coded with ICD-9, and the out-of-domain set was IMDB, a movie review data set used for sentiment analysis. Their best deep learning result was comparable to that of the SVM and logistic regression classifiers also used in the experiment. Automatic billing and coding classification systems have been attempted by other researchers, including Pestian et al in their paper “A Shared Task Involving Multi-label Classification of Clinical Free Text”. Much like a medical biller and coder would do manually, Pestian et al used classification and text extraction techniques to pull relevant information from clinical data, label it with an ICD code, and store the result in an XML schema. This approach was reasonably successful, but overall still not as accurate as a manual coder.
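A hedged sketch of the conventional side of such a comparison: a tiny multinomial Naive Bayes classifier, in pure Python, assigning a single diagnosis code to a report. The report snippets and the ICD-9-style codes ("486" for pneumonia, "813" for a radius fracture) are illustrative toys, not data or code from Karimi et al or Pestian et al, and real diagnosis coding is multi-label rather than the single-label choice shown here.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def log_prob(c):
            # log prior plus smoothed log likelihood of each token
            prior = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            return prior + sum(
                math.log((self.word_counts[c][w] + 1) / total) for w in doc
            )
        return max(self.class_counts, key=log_prob)

# toy radiology snippets paired with illustrative ICD-9-style codes
reports = [
    "opacity in right lower lobe consistent with pneumonia".split(),
    "bilateral infiltrates suggest pneumonia".split(),
    "simple fracture of the left radius".split(),
    "displaced fracture of distal radius".split(),
]
codes = ["486", "486", "813", "813"]
model = NaiveBayes().fit(reports, codes)
```

With realistic corpora, classifiers of this family served as the conventional baselines against which the deep learning models were compared.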

Another popular part of the UMLS often used in computational classification is the set of Concept Unique Identifiers, or CUIs. As mentioned earlier in this paper, CUI codes map unique concepts to specific identifiers, eliminating ambiguity among terms, especially abbreviations and acronyms. CUI codes are used in medical billing and coding, but they are more widely used in medical journal databases such as PubMed and Medline. Jimeno-Yepes et al built a test data set from the medical subject headings (MeSH) of Medline articles by extracting 203 ambiguous terms, including abbreviations and acronyms. Each of the 203 terms had at least two associated CUI codes, with the proper sense depending on context. For example, the ambiguous acronym “AA” had two associated CUI codes, one for “alcoholics anonymous” and another for “amino acid”. Jimeno-Yepes et al then collected approximately 200 Medline abstracts for each ambiguous term and tagged them with the proper CUI code. They called the final data set “MSH-WSD”, for MeSH word sense disambiguation, and it is now a popular test set in the biomedical text normalization community.
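A minimal sketch of the kind of context-based sense selection that MSH-WSD is used to evaluate: each candidate sense gets a profile of words typically seen with it, and the sense whose profile overlaps the context most wins. The profiles and the identifier strings below are invented placeholders, not actual UMLS CUIs.

```python
# invented sense profiles for the ambiguous acronym "AA";
# the identifiers are placeholders, not real UMLS CUIs
SENSE_PROFILES = {
    "CUI_ALCOHOLICS_ANON": {"alcohol", "meeting", "sobriety", "group", "support"},
    "CUI_AMINO_ACID": {"protein", "sequence", "residue", "peptide", "substitution"},
}

def disambiguate(context_tokens, profiles):
    """Return the sense whose profile shares the most words with the context."""
    context = set(context_tokens)
    return max(profiles, key=lambda sense: len(profiles[sense] & context))

context = "the AA sequence of the protein showed a conserved residue".split()
# overlap with the amino-acid profile (sequence, protein, residue) wins
```

Published systems use richer evidence (learned embeddings, MeSH indexing, supervised classifiers), but the underlying decision, picking the CUI best supported by surrounding words, is the same.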

A similar application of classifying ambiguous acronyms in the UMLS was studied by Liu et al in their paper “A Study of Abbreviations in the UMLS”. Liu et al took advantage of the typical format in many papers where an abbreviation is introduced with its expanded form in parentheses next to it. Using this method, they extracted 163,666 unique abbreviations and their full forms from the UMLS with a precision of 97.5% and a recall of 96%. About 33% of the extracted abbreviations of six or fewer characters were ambiguous, having multiple possible meanings. This method for extracting abbreviations and full forms has since been applied by other researchers, including Jimeno-Yepes et al when they created the MSH-WSD data set.
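The parenthetical pattern lends itself to a simple regular-expression sketch. The heuristic below (take as many preceding words as the abbreviation has characters and check their initials) is a deliberately crude approximation, not Liu et al's actual method; real extractors also handle stopwords, punctuation, and partial letter matches.

```python
import re

def extract_abbreviations(text):
    """Find 'long form (ABBR)' pairs: for each parenthesized all-caps token,
    take as many preceding words as the abbreviation has characters, and keep
    the pair only if those words' initials spell the abbreviation."""
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Z0-9]{1,9})\)", text):
        abbr = match.group(1)
        preceding = text[:match.start()].split()
        candidate = preceding[-len(abbr):]
        if len(candidate) == len(abbr):
            initials = "".join(word[0].upper() for word in candidate)
            if initials == abbr:
                pairs.append((abbr, " ".join(candidate)))
    return pairs

text = ("Each electronic health record (EHR) was screened for "
        "healthcare associated infections (HAI).")
```

Even this toy version illustrates why the approach achieves high precision: a pair is only emitted when the surrounding words independently confirm the abbreviation.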

Clerical tasks in the clinical field are not limited to medical billing and coding; patients also answer various questions about their chart and family history at a doctor's visit. Llanos et al proposed an automatic classification system for doctor-patient questions, focusing on questions whose answers would need to be looked up in a chart, such as “do you cough every day?” or “are your parents still alive?”. Questions were annotated with rule-based semantic labels; for example, “do you cough every day?” receives the annotations “symptom” and “frequency”. To test question understanding, in hopes of eventually generating responses, Llanos et al used a linear SVM and two Naïve Bayes classifiers (Multinomial and Gaussian). Results across all classifiers were comparable, ranging between 65% and 87%. Classifying questions and answers in this way can support future applications such as remote question answering when a patient has a question but their physician isn't available.
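A minimal illustration of rule-based semantic annotation of this kind: map trigger words to labels and collect every label whose triggers appear in the question. The rule table is invented for the example and is not taken from Llanos et al.

```python
# invented trigger-word rules mapping vocabulary to semantic labels
RULES = {
    "symptom": {"cough", "pain", "fever", "headache"},
    "frequency": {"every", "daily", "often", "sometimes"},
    "family": {"parents", "mother", "father", "siblings"},
    "vital_status": {"alive", "deceased"},
}

def annotate(question):
    """Return the sorted list of semantic labels triggered by the question."""
    tokens = set(question.lower().rstrip("?").split())
    return sorted(label for label, triggers in RULES.items() if tokens & triggers)
```

Annotations like these can then serve as features or targets for the statistical classifiers (SVM, Naive Bayes) the study compared.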

2.6 Real World Applications and Future Work

Arguably the most famous example of artificial intelligence is IBM Watson. Ever since winning the game show Jeopardy!, Watson has applied its skills to more than strategy games. Watson has recently taken to oncology and is helping doctors and researchers diagnose cancer. The motivation behind Watson's involvement is that oncologists need to stay up to date on the latest cancer research and studies for the maximum benefit of their patients, yet thousands of medical journals publish oncological studies daily, and no one person could possibly keep up with them all. Watson has therefore been trained to maintain a knowledge base of relevant cancer research that physicians can draw on to give their patients the best care possible. Doyle-Lindrud explains in “Watson Will See You Now: A Supercomputer to Help Clinicians Make Informed Treatment Decisions” that Watson has been paired with large hospitals and healthcare companies around the country, including Memorial Sloan Kettering Cancer Center, the University of Texas MD Anderson Cancer Center, and WellPoint, Inc. In all these locations, Watson has been used as a tool to help customize patient care by analyzing medical literature and ranking potential treatment options based on evidence. Along with oncology, Watson has participated in other experiments meant to benefit patients.

Chen et al (2) describe various studies Watson has been part of, including a drug repurposing study in which Watson searched for drugs approved for human use and cross-referenced them with statements suggesting efficacy in treating malaria. With this cross-referenced list in hand, Watson then examined the company's existing compounds and identified similarities to known malaria treatments, in hopes of finding drugs that were not intended to treat malaria but possibly could. Watson also participated in a study at Baylor College of Medicine aimed at enhancing insight into cancer kinases. First, Watson read articles discussing known kinases; then, using graph- and text-based features, it found text similarity patterns between kinases. Those models were applied to Medline abstracts published through 2002 to determine whether Watson could identify kinases actually discovered between 2003 and 2013. Watson successfully identified nine potential kinases, and of these, Baylor validated seven. Watson's clinical work is a good example of machine classification and information retrieval put to accurate use; it is a valuable tool for many physicians, and its daily expanding knowledge will hopefully help save more lives.

2.7 Conclusion

The needs and applications for computational classification techniques in the medical and clinical domain are diverse; however, these systems are only as useful as their computational intelligence. Accuracies in the 70%-80% range are a good start, but higher accuracy is needed for an impactful industry standard. Outside of clinical applications, machine intelligence is being tested by anyone with a smartphone. There is a difference between asking Siri what the symptoms of a heart attack are and asking Siri to call emergency services because you are having a heart attack. Siri can call emergency services with various commands, such as “call emergency services”, “dial 911”, or “phone 911”, among others (OS X Daily), but the distinction between the two kinds of request is subtle and vital to make correctly. While it may seem small, the need for machines to understand the difference between phrases like “Siri, what are the symptoms of a heart attack?” and “Siri, I'm having a heart attack” is just as important as the biomedical classification techniques discussed in this paper.
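A toy sketch of the distinction at issue: separating an informational question from a first-person emergency statement using crude surface cues. Real assistants use trained intent classifiers; the marker lists here are invented purely for illustration.

```python
def triage(utterance):
    """Crudely split utterances into 'emergency' vs 'informational' intents
    using invented surface markers; anything else is 'unknown'."""
    u = utterance.lower().strip()
    emergency_markers = ("i'm having", "i am having",
                         "call emergency", "dial 911", "phone 911")
    if any(marker in u for marker in emergency_markers):
        return "emergency"
    # question words or a trailing question mark suggest a lookup request
    if u.startswith(("what", "how", "when", "why")) or u.endswith("?"):
        return "informational"
    return "unknown"
```

Even this simplistic rule set shows why the problem is a classification task at heart: the two utterances share almost all their vocabulary and differ only in intent.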

All of the techniques and systems reviewed here are extremely important for driving efficiency and accuracy in the medical field, where mistakes and oversights are rarely forgiven. By providing healthcare professionals with dependable, robust systems, hopefully more people can be helped, and more institutions can have systems like Watson assisting with diagnostic decisions. Big data will surely remain a buzzword for years to come, but embracing its practical use can be difficult. The medical field is often regarded as a groundbreaking one, yet its professionals can be slow to embrace such advancements. Trusting computers to manage our data and help us make informed decisions depends on ensuring top results and allowing little room for error. For this reason, computational classification of biomedical and clinical concepts is a crucial foundational layer that needs to be standardized: only when a computational system can classify data correctly can other important results be produced from that data. Technology is ever-changing and ever-expanding, and hopefully, with further advances in computational classification techniques and achievable replication of results, more computational systems will be developed and trusted to help improve our health and save lives.

2.8 References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Abdinurova, Nazgul, et al. “Classification of Epilepsy Using Computational Intelligence Techniques.” 2015 Twelfth International Conference on Electronics Computer and Computation (ICECCO), 2015, doi:10.1109/icecco.2015.7416877.

Apostolova, Emilia, and Tom Velez. “Toward Automated Early Sepsis Alerting: Identifying Infection Patients from Nursing Notes.” BioNLP 2017, 2017, doi:10.18653/v1/w17-2332.

Cardie, Claire, and Kiri Wagstaff. “Noun Phrase Coreference as Clustering.” Association for Computational Linguistics.

Chen, Danqi, and Christopher D Manning (1). “A Fast and Accurate Dependency Parser Using Neural Networks.” Association for Computational Linguistics.

Chen, Ying, et al. “IBM Watson: How Cognitive Computing Can Be Applied to Big Data Challenges in Life Sciences Research.” Clinical Therapeutics, vol. 38, no. 4, 2016, pp. 688–701., doi:10.1016/j.clinthera.2015.12.001.

Doyle-Lindrud, S. “Watson Will See You Now: A Supercomputer to Help Clinicians Make Informed Treatment Decisions.” Clinical Journal of Oncology Nursing, U.S. National Library of Medicine, Feb. 2015.

Gorrell, Genevieve, et al. “Identifying First Episodes of Psychosis in Psychiatric Patient Records Using Machine Learning.” Association for Computational Linguistics.

Guyon, I., Weston, J., Barnhill, S. et al. “Gene Selection for Cancer Classification Using Support Vector Machines”. Machine Learning (2002) 46: 389.

Henry, Sam, et al. “Evaluating Feature Extraction Methods for Knowledge-Based Biomedical Word Sense Disambiguation.” BioNLP 2017, 2017, doi:10.18653/v1/w17-2334.

Jacobson, Olof, and Hercules Dalianis. “Applying Deep Learning on Electronic Health Records in Swedish to Predict Healthcare-Associated Infections.” Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 2016, doi:10.18653/v1/w16-2926.

Jimeno-Yepes, Antonio J, et al. “Exploiting MeSH Indexing in MEDLINE to Generate a Data Set for Word Sense Disambiguation.” BMC Bioinformatics, vol. 12, no. 1, 2011, p. 223., doi:10.1186/1471-2105-12-223.

Karimi, Sarvnaz, et al. “Automatic Diagnosis Coding of Radiology Reports: A Comparison of Deep Learning and Conventional Classification Methods.” BioNLP 2017, 2017, doi:10.18653/v1/w17-2342.

Khanaferov, David, et al. “Social Network Data Mining Using Natural Language Processing and Density Based Clustering.” 2014 IEEE International Conference on Semantic Computing, 2014, doi:10.1109/icsc.2014.48.

Koo, J.-Y., et al. “Structured Polychotomous Machine Diagnosis of Multiple Cancer Types Using Gene Expression.” Bioinformatics, vol. 22, no. 8, 2006, pp. 950–958., doi:10.1093/bioinformatics/btl029.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Luong, Minh-Thang, et al. “Better Word Representations with Recursive Neural Networks for Morphology.” Association for Computational Linguistics, www.aclweb.org/anthology/W13-3512.

Llanos, Leonardo Campillos, et al. “Automatic Classification of Doctor-Patient Questions for a Virtual Patient Record Query Task.” Association for Computational Linguistics.

Liu, H., Y. A. Lussier, and C. Friedman. “A Study of Abbreviations in the UMLS.” Proceedings of the AMIA Symposium (2001): 393–397. Print.

Mikolov, Tomas et al. “Efficient Estimation of Word Representations in Vector Space.” CoRR abs/1301.3781 (2013): n. pag.

Morales, Michelle Renee et al. “OpenMM: An Open-Source Multimodal Feature Extraction Tool.” INTERSPEECH (2017).

Nadeem, Moin, et al. “Identifying Depression on Twitter.” ArXiv:1607.07384v1, 25 July 2016.

Papadimitriou, Christos H. et al. “Latent Semantic Indexing: A Probabilistic Analysis.” J. Comput. Syst. Sci. 61 (1998): 217-235.

Pestian, John P., et al. “A Shared Task Involving Multi-Label Classification of Clinical Free Text.” Proceedings of the Workshop on BioNLP 2007 Biological, Translational, and Clinical Language Processing - BioNLP ’07, 2007, doi:10.3115/1572392.1572411.

“Siri Can Call Emergency Services For You with IPhone If Need Be.” OS X Daily, 20 Mar. 2017.

Trask, Andrew, et al. “sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.” 19 Nov. 2015.

Tulkens, Stephan, et al. “Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts.” Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 2016, doi:10.18653/v1/w16-2910.

“UMLS - Metathesaurus.” U.S. National Library of Medicine, National Institutes of Health, 12 Apr. 2016,

Wallach, Hanna M. “Topic Modeling: Beyond Bag-of-Words.” Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), ACM, 25 June 2006.

“What Is Medical Billing and Coding?” 2018.