Applying CRISP-DM on the MIMIC-3 Dataset


This post was originally published by Marcelo Cunha at Medium [AI]

Having access to accurate, detailed data in healthcare is crucial: it has the potential to scale high-quality healthcare services to everyone. Open standards such as CRISP-DM (Cross-Industry Standard Process for Data Mining) capture the iterative nature of the Machine Learning workflow.

This process can be broken into 6 main phases:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

In this post I follow this methodology and apply each of those phases to the healthcare field.

This project is set in the context of Intensive Care Units (ICUs). Every day, thousands of patients in critical condition are admitted to ICUs in need of the best possible medical care: specialized treatments and medical exams, with close monitoring by doctors and nurses, in order to survive.

In this complex environment, patients with different demographics and genetics receive many different combinations of medicines and treatments, which makes it incredibly difficult for doctors and nurses to make decisions. These professionals depend on decades of studies and clinical trials to decide on the optimal treatment for a specific patient. Moreover, the amount of evidence supporting clinical decisions remains embarrassingly low [1] [2].

This lack of evidence supporting clinical decisions could be explained by the fact that we are dealing with lives. Understandably, most of us don’t want to share our personal data, especially data related to our health. As a result, access to this kind of data is restricted, even when it is anonymized. In healthcare, protecting patient data is a must (HIPAA in the USA and several other laws make this clear), yet at the same time we need access to data and tools that empower healthcare professionals and researchers in their day-to-day work. In the end, we have 2 giant competing forces: the democratization of data and the need for governance and control of that data.

Moreover, traditional healthcare systems are not easy to integrate, meaning that one patient’s data could be spread across several systems that don’t talk to each other (and doctors rely on piles of paper and Word documents that were joined manually). And that is before considering cases where a patient starts treatment in one city or country and has to continue it in another. That’s one of the reasons why the need for an integrated health service is constantly debated [3]. This topic is so complex that I’d like to leave it for another blog post (or a series).

Coming back to our ICU problem (sorry for the digression), it becomes clear that to rapidly improve treatments and the healthcare system we need reliable and accurate data, not just good intentions [4]. With data, we could develop more effective treatments, deliver more personalized care, mitigate risks, make quantified, data-driven decisions, scale human work with automation (it’s clear that we don’t have enough healthcare professionals), and so on.

The MIMIC-III dataset is provided by the MIT Laboratory for Computational Physiology (LCP) and comprises health-related data associated with over 60K admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2001 and 2012. The data is freely available to researchers who complete the access requirements, and it supports numerous analyses and studies.

The dataset contains de-identified data in which each patient is represented by a unique ID: temporal data, lab results, measurements, diagnostic codes, electronic documentation, hospital length of stay, survival data, and more. This data is organized in a database with over 20 tables.

How to get access to this data is very well explained by Andrew Long in his post. You can also go to the MIMIC-III website for more information. The data was de-identified to conform with HIPAA, and prior to accessing it you must sign a data use agreement, promising not to use the data for any unlawful purpose, and complete an interesting training course covering data, ethics, and history, among other topics in healthcare research.

Given the richness of this dataset, one could come up with an endless number of questions and see limitless possibilities. Here we just touch the tip of the iceberg:

  1. How diverse is the dataset? Does it include multiple genders, demographics, religions, or just a specific population in Boston, Massachusetts, that was admitted to the Beth Israel Deaconess Medical Center?
  2. What is the most common diagnosis for being admitted to the ICU?
  3. What is the average length of stay in the ICUs?
  4. Given the data, would it be possible to predict whether a patient in the ICU will die?

To prepare the data with optimal cost and performance, I used the copy of the data that MIT LCP provides through AWS in an Amazon S3 bucket (stored in the optimized Parquet format), ran SQL queries with Amazon Athena without having to provision a database, and processed and analyzed the results from a Jupyter notebook hosted in Amazon SageMaker. This way I could use managed cloud services and scale if necessary according to the volume of data. Instructions for this setup are provided in this AWS Blog post. To run SQL queries in Athena from the Jupyter notebook, I used a library called AWS Wrangler, which makes it easy to interact with Athena in a serverless way and to visualize the results with multiple tools.
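As a rough sketch of what such a query could look like, the snippet below builds an SQL statement and runs it through AWS Wrangler's `wr.athena.read_sql_query`. The database name `mimiciii`, the table names `admissions` and `patients`, and the column names are assumptions based on the public MIMIC-III schema and may differ in your Athena catalog; the helper function names are hypothetical.

```python
def build_gender_by_religion_query(limit=None):
    """Return SQL that counts admissions per (religion, gender).

    Table and column names are assumptions based on the public
    MIMIC-III schema; adjust them to your own Athena catalog.
    """
    sql = (
        "SELECT a.religion, p.gender, COUNT(*) AS n_admissions\n"
        "FROM admissions a\n"
        "JOIN patients p ON a.subject_id = p.subject_id\n"
        "GROUP BY a.religion, p.gender\n"
        "ORDER BY n_admissions DESC"
    )
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


def run_query(sql, database="mimiciii"):
    """Run the query on Athena and return a pandas DataFrame.

    Requires AWS credentials and the awswrangler package, so the import
    is kept inside the function. The database name is an assumption.
    """
    import awswrangler as wr

    return wr.athena.read_sql_query(sql=sql, database=database)
```

Inside a SageMaker notebook with the right permissions, `run_query(build_gender_by_religion_query())` would return a DataFrame ready for plotting.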

Answering the questions

1. How diverse is the dataset? Does it include multiple genders, demographics, religions, or just a specific population in Boston, Massachusetts, that was admitted to the Beth Israel Deaconess Medical Center?

From the plots, it looks like gender is fairly balanced among admissions, independent of religion. In addition, there are individuals with a variety of marital statuses.

2. What is the most common diagnosis for being admitted to the ICU?

It looks like heart diseases are by far the most common (coronary, artery, aortic, infarction, myocardial). We can also see some other causes, such as overdoses and brain-related conditions.
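One simple way to get at such a ranking is to count the words that appear in the free-text diagnosis field. The sketch below does this with the standard library only; the sample strings are made up for illustration and are not taken from the dataset, and the stopword list is an arbitrary assumption.

```python
import re
from collections import Counter


def top_diagnosis_words(diagnoses, n=5,
                        stopwords=frozenset({"of", "and", "with", "the"})):
    """Count the most frequent words across free-text diagnosis strings."""
    counts = Counter()
    for text in diagnoses:
        if not text:
            continue  # skip missing diagnoses
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Hypothetical diagnosis strings, for illustration only.
sample = [
    "CORONARY ARTERY DISEASE",
    "CORONARY ARTERY DISEASE; CORONARY ARTERY BYPASS GRAFT",
    "ACUTE MYOCARDIAL INFARCTION",
    "OVERDOSE",
    None,
]
print(top_diagnosis_words(sample, n=2))
```

Counting single words is crude (it splits "coronary artery" into two entries), which is consistent with the word-level list above; grouping by diagnostic code would be a more rigorous alternative.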

3. What is the average length of stay in the ICUs?

To my surprise (probably because of my lack of knowledge in the healthcare field), most patients stay just a few days in the ICU.
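Because a few very long stays can pull the average up, it helps to look at the median next to the mean. A minimal sketch, assuming the length of stay is available as a list of fractional days (the values below are invented for illustration):

```python
from statistics import mean, median


def summarize_los(los_days):
    """Summarize ICU length of stay (in days), ignoring missing values."""
    values = [v for v in los_days if v is not None]
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        # Share of stays shorter than 3 days (threshold chosen arbitrarily).
        "share_under_3_days": sum(v < 3 for v in values) / len(values),
    }


# Hypothetical lengths of stay, for illustration only.
sample_los = [0.8, 1.2, 1.9, 2.1, 2.5, 3.4, 4.0, 21.7, None]
summary = summarize_los(sample_los)
print(summary)
```

In this toy sample a single three-week stay roughly doubles the mean relative to the median, which is why "most patients stay just a few days" and a higher average can both be true.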

A few more interesting plots:

The areas of the hospital from which patients are discharged, and the distribution of discharges across the different segments and areas (Radiology, Surgery, Anesthesia, etc.):

With SciPy’s hierarchical clustering and dendrograms, we can cluster columns by their missing data points and understand how they are related. The Missingno library makes this easy:
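To make the idea concrete, here is a small sketch of clustering columns by their missingness pattern with SciPy directly. It illustrates the technique and is not Missingno's exact implementation; the column names and values are hypothetical.

```python
import numpy as np
from scipy.cluster import hierarchy


def cluster_missingness(columns, data):
    """Cluster columns by where their values are missing.

    data: 2-D array-like (rows x columns) where NaN marks a missing value.
    Returns the SciPy linkage matrix and the dendrogram leaf order.
    """
    nullity = np.isnan(np.asarray(data, dtype=float)).astype(int)
    # One observation per column: its 0/1 missingness pattern over the rows.
    linkage = hierarchy.linkage(nullity.T, method="average")
    tree = hierarchy.dendrogram(linkage, labels=list(columns), no_plot=True)
    return linkage, tree["ivl"]


nan = float("nan")
# Hypothetical measurements: heart rate and respiratory rate go missing together.
columns = ["heart_rate", "resp_rate", "gcs"]
data = [
    [80, 18, 15],
    [nan, nan, 14],
    [75, 16, nan],
    [nan, nan, 15],
    [90, 20, nan],
]
linkage, leaf_order = cluster_missingness(columns, data)
print(leaf_order)
```

Columns that are always missing in the same rows merge at distance zero, which is the kind of relationship the dendrogram makes visible.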

For a single patient, we can show how his/her condition evolved in the ICU: the heart rate and respiratory rate over time, and whether an alarm for a high or low rate was triggered (pink lines):

The patient’s level of consciousness, as measured by the Glasgow Coma Scale (GCS):

For more information, please check my notebook for these 3 steps of CRISP-DM.

Finally, let’s remember the final question:

  4. Given the data, would it be possible to predict whether a patient in the ICU will die?

To answer that, I created a simple model that tests whether, given only the minimum and maximum respiratory rate and heart rate, we can predict if a patient will die. It is a simple Scikit-Learn pipeline using a Logistic Regression model, with cross-validation, imputation, and z-scaling of the inputs. In the SageMaker notebook instance, I first created and trained the model locally.
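A pipeline of this kind could be sketched as follows. The data here is synthetic and generated only so the example runs on its own; the median imputation strategy, the feature order, and the helper names are assumptions, not details taken from the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_icu_data(n=400, seed=0):
    """Generate fake features (hr_min, hr_max, rr_min, rr_max) and outcomes."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.normal(65, 10, n),   # minimum heart rate
        rng.normal(110, 15, n),  # maximum heart rate
        rng.normal(12, 3, n),    # minimum respiratory rate
        rng.normal(26, 5, n),    # maximum respiratory rate
    ])
    # Outcome loosely tied to the two maxima (purely illustrative).
    logits = 0.03 * (X[:, 1] - 110) + 0.08 * (X[:, 3] - 26) - 1.5
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    # Knock out about 10% of the values to mimic missing measurements.
    X[rng.random(X.shape) < 0.1] = np.nan
    return X, y


def build_pipeline():
    """Imputation, z-scaling, then logistic regression."""
    return make_pipeline(
        SimpleImputer(strategy="median"),  # assumed strategy
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )


X, y = make_synthetic_icu_data()
scores = cross_val_score(build_pipeline(), X, y, cv=5, scoring="roc_auc")
print(f"mean AUROC over 5 folds: {scores.mean():.2f}")
```

Putting the imputer and scaler inside the pipeline means they are re-fitted on each training fold, so no information from the held-out fold leaks into preprocessing.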

After training the model locally, without any hyperparameter optimization and without testing other models or other features, the average Area Under the Receiver Operating Characteristic curve (AUROC) over 5 folds was 0.62.
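To put that number in perspective: AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The short sketch below computes it directly from that definition (a plain pairwise count, fine for small inputs but not how optimized libraries do it):

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Read this way, 0.62 means the model ranks a patient who died above a survivor in roughly 62% of such pairs.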

Given the simplicity of this model, there is clearly much to explore and improve here: creating new features, testing other algorithms, and tuning hyperparameters!

Hence we can finally answer the 4th question: yes! It is possible (even though we have more “CRISP-DM cycles” ahead of us). After testing it locally (inside the notebook), we can train in the AWS cloud and deploy the model with SageMaker. All the infrastructure will be managed, and we will have a REST API with our model hosted and ready to be invoked.
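Once such an endpoint exists, calling it from Python could look roughly like the sketch below, which uses the `invoke_endpoint` call of boto3's `sagemaker-runtime` client. The endpoint name, the CSV payload layout, and the helper names are hypothetical, and the response format depends on the serving container.

```python
def to_csv_payload(hr_min, hr_max, rr_min, rr_max):
    """Serialize the four features as one CSV row (assumed feature order)."""
    return ",".join(str(float(v)) for v in (hr_min, hr_max, rr_min, rr_max))


def predict_risk(endpoint_name, payload):
    """Send one CSV row to a deployed SageMaker endpoint and return the raw reply.

    Requires AWS credentials and boto3, so the import is kept inside
    the function. The endpoint name is whatever was chosen at deployment.
    """
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    return response["Body"].read().decode("utf-8")


print(to_csv_payload(60, 120, 10, 30))
```

A call such as `predict_risk("my-icu-endpoint", to_csv_payload(60, 120, 10, 30))` would then return the model's output as text, with `"my-icu-endpoint"` standing in for a real endpoint name.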

In this post we explored the MIMIC-III dataset in the broader context of the healthcare sector, data ethics, data access, tools for analyzing data in AWS, and much more, starting from where it all begins: a real-world problem (not purely technical, but involving people, organizations, and society).

A final note: although this post was presented in a fairly “linear” way, in reality there was a lot of back-and-forth. For example, once I started analyzing the MIMIC-III data, more questions came up as I began to better understand the healthcare segment (which I don’t know well). Also, the next step to improve our model could be creating more features (just 4 obviously isn’t enough in this toy example). The CRISP-DM process reveals exactly this iterative nature of Machine Learning.

Finally, to impact the business and help with the problem, we saw how we could deploy our model and create a system that could tell us within milliseconds whether a patient is in a critical state.

Thanks for reading!!
