Thanks to Taipei Medical University for sponsoring this trip to Australia.


  • Lecture: how to ask clinically significant and answerable questions?
  • Lecture: Philips eICU collaborative research database
  • During datathon
  • Notes: data manipulation: the steps
  • Thoughts on this datathon

Lecture: How to Ask Clinically Significant and Answerable Questions?

  • Move from facts (e.g. how many people in MIMIC-III received insulin?) to information and knowledge.
  • A deeper question: what is the relationship between hypoglycemia and insulin dosage in ICU patients?
    • A research process might look like this:
      • All episodes of hypoglycemia in insulin-treated patients were preceded by an insulin bolus in the previous XX hours.
      • Then we discover that those who didn’t receive a bolus had no hypoglycemic events.
      • This may lead to a more rigorous randomized clinical trial.
    • Deeper!
      • A literature review of bolus-associated hypoglycemia?
      • What might be the mechanism and pathophysiology? Can we prove this by using the same dataset (MIMIC-III)?
  • We need knowledge that enhances previous understanding; expands it, challenges it! Don’t tell people what they already know.
  • This datathon aims to publish papers that have an impact.
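
To make the "preceded by an insulin bolus in the previous XX hours" idea concrete, here is a minimal Python sketch. It is only an illustration: the function name, the timestamps and the 6-hour window are made up, since the lecture left the window as XX.

```python
from datetime import datetime, timedelta

def preceded_by_bolus(hypo_times, bolus_times, window_hours):
    """For each hypoglycemia time, report whether any insulin bolus
    was given within the preceding `window_hours` hours."""
    window = timedelta(hours=window_hours)
    return [
        any(timedelta(0) <= hypo - bolus <= window for bolus in bolus_times)
        for hypo in hypo_times
    ]

# Made-up timestamps for one hypothetical patient.
boluses = [datetime(2100, 1, 1, 8, 0), datetime(2100, 1, 2, 8, 0)]
hypos = [datetime(2100, 1, 1, 10, 30), datetime(2100, 1, 3, 9, 0)]

# window_hours=6 is an arbitrary placeholder for the XX in the lecture.
flags = preceded_by_bolus(hypos, boluses, window_hours=6)
print(flags)
```

An episode flagged as not preceded by a bolus would be a counterexample to the hypothesis, so counting those flags is the first thing to look at.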

Lecture: Philips eICU Collaborative Research Database

  • What is Philips eICU?
    • All information is collected from Philips’ standardized data integration system.
    • ICU staff are moving toward more data-driven healthcare.
  • Centralized tele-ICU model
    • Connected with more than 400 hospitals, across >40 US states.
    • The system holds information on more than 3.5 million admissions, with 300 million lab values.
  • The website has the same webpage layout as MIMIC-III lol
  • Some example questions:
    • Descriptive studies
    • Epidemiological studies - hypothesis generating with expanding insights (by choosing the correct variables and visualization)
      • Hyperglycemia with mortality rate
      • PRIS-like syndrome case with propofol usage
    • Predictive algorithms (ICU discharge readiness score (DRS)): a combination of basic information, oxygen saturation and hemodynamics
      • The histogram showed an increased curve in non-survivors: could this be another APACHE-II or SAPS score?

Notes: Data Manipulation: The Steps

The waveform data had already been downloaded by a postdoc at TMU, so our job was to pick out the right files to work on. According to our inclusion criteria, only adult patients younger than 65 years old were included, so we compared the MATCHED list against MIMIC-III and found that only around 1000 patients fit our criteria. Then we filtered out those who were readmitted to the ICU. After these steps, 860 patients were left who were discharged from the ICU and 91 who died during the admission. We call the id list of survivors SUR and that of non-survivors DIE.
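
The filtering above can be sketched in a few lines of Python. This is a toy illustration, not the query we ran: the field names (`id`, `age`, `icu_stays`, `died`) are hypothetical, and "adult" is assumed to mean 18 or older.

```python
def build_cohorts(patients, min_age=18, max_age=65):
    """Split eligible patients into survivor and non-survivor id lists.

    `patients` is a list of dicts with hypothetical keys:
    id, age, icu_stays (number of ICU stays), died (bool).
    """
    sur, die = [], []
    for p in patients:
        if not (min_age <= p["age"] < max_age):
            continue  # outside the age criterion
        if p["icu_stays"] > 1:
            continue  # readmitted to the ICU
        (die if p["died"] else sur).append(p["id"])
    return sur, die

# Made-up records, only to show the behaviour.
patients = [
    {"id": 1, "age": 55, "icu_stays": 1, "died": False},
    {"id": 2, "age": 70, "icu_stays": 1, "died": False},
    {"id": 3, "age": 40, "icu_stays": 2, "died": True},
    {"id": 4, "age": 61, "icu_stays": 1, "died": True},
]
SUR, DIE = build_cohorts(patients)
print(SUR, DIE)
```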


Originally we wanted to perform propensity score matching to eliminate the influence of confounding factors, but this failed because these patients showed no difference in any therapy. Instead, I modified a small matching script that I had used before to fit our needs. The algorithm is simple: divide the patients into several groups based on their properties (age, gender, APACHE III score on ICU admission), then randomly choose survivors (the control group) from each group at the same magnification, i.e. the same number of controls per non-survivor.

## reading data.frame
die = read.csv('die.csv')
sur = read.csv('sur.csv')

## grouping, the non-survivor
diefactor <- with(die, interaction(sex, age, apacheiii))
diesplit <- split(die, diefactor, drop = TRUE)
dieid <- lapply(diesplit, function(subtable) {length(subtable$id)})

## grouping, the survivor
surfactor <- with(sur, interaction(sex, age, apacheiii))
sursplit <- split(sur, surfactor, drop = TRUE)
surid <- lapply(sursplit, function(subtable) {length(subtable$id)})

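## largest magnification: the smallest survivor-to-non-survivor ratio across groups
## (groups are looked up by name, because split(..., drop = TRUE) can leave
## the two lists with different groups in a different order)
surcount <- sapply(names(dieid), function(k) NROW(sursplit[[k]]))
magnification <- as.integer(min(surcount / as.numeric(dieid)))

## matching: sample magnification * (number of non-survivors) survivors per group
result <- lapply(names(dieid), function(k) {sample(sursplit[[k]]$id, magnification * dieid[[k]])})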
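
For readers who prefer Python, the same matching idea can be sketched as follows. This is a rough re-expression under assumptions, not the script used at the datathon: the function names are my own, and age and APACHE III are assumed to be already binned into categories, otherwise almost every patient ends up in a group of one.

```python
import random
from collections import defaultdict

def stratify(rows):
    """Group patient ids by the (sex, age, apacheiii) combination."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["sex"], r["age"], r["apacheiii"])].append(r["id"])
    return groups

def match(die_rows, sur_rows, seed=0):
    """Return (magnification, {group: sampled survivor ids})."""
    die_groups, sur_groups = stratify(die_rows), stratify(sur_rows)
    # Largest k such that every group has k survivors per non-survivor.
    k = min(len(sur_groups.get(g, [])) // len(ids)
            for g, ids in die_groups.items())
    rng = random.Random(seed)
    matched = {g: rng.sample(sur_groups.get(g, []), k * len(ids))
               for g, ids in die_groups.items()}
    return k, matched

# Made-up data: group A has 2 non-survivors and 5 survivors,
# group B has 1 non-survivor and 4 survivors.
A = {"sex": "M", "age": "50-64", "apacheiii": "high"}
B = {"sex": "F", "age": "50-64", "apacheiii": "low"}
die_rows = [dict(A, id=1), dict(A, id=2), dict(B, id=3)]
sur_rows = [dict(A, id=i) for i in range(10, 15)] + \
           [dict(B, id=i) for i in range(20, 24)]

k, matched = match(die_rows, sur_rows)
print(k, {g: len(ids) for g, ids in matched.items()})
```

In this toy case group A allows 5 // 2 = 2 controls per case and group B allows 4, so the common magnification is 2: four survivors are drawn from A and two from B.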

Steps for getting data

A small copy-by-id script copies the needed files from the waveform data folder:

for id in `cat id`; do
    if [ ! -d "/mimic-waveforms/dat/$id" ]; then
        mkdir "/mimic-waveforms/dat/$id";
    fi
    cp /transcend/Data/matched/"$id"/[0-9]*.* "/mimic-waveforms/dat/$id";
done

Next, put in place the corrected headers, which link the MIMIC-III database to the waveform data. This is simple: I just unzipped the headers into the corresponding folders. A RESOURCE file written by Alistair, which lists the files to convert, was used in this step. The script ignores files without a waveform, so we didn’t need to filter out the unnecessary headers; we simply didn’t copy the waveforms for them. Afterwards, I used a one-line command to list all RR interval files:

find . -type f -printf "%T@ %p\n" | sort -nr | cut -d\  -f2- | grep rr_ > filelist
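
The one-liner lists regular files newest first and keeps the paths containing rr_. For anyone without GNU find (its -printf option is not available everywhere), roughly the same listing can be sketched in Python; `list_rr_files` is a name I made up for this illustration.

```python
from pathlib import Path

def list_rr_files(root="."):
    """Return regular files under `root` whose relative path contains
    'rr_', sorted by modification time, newest first."""
    root = Path(root)
    files = [
        p for p in root.rglob("*")
        if p.is_file() and "rr_" in str(p.relative_to(root))
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

Writing one path per line from `list_rr_files(".")` gives a file similar to filelist, although the paths will not carry the leading `./` that find prints, so the index-based path splitting used later would need adjusting.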

The most complicated step:

  1. There are many RR interval files for the same patient at the same time, because they come from different leads. From my clinical experience I settled on an order of preference: lead II > lead I or aVF > lead III or aVR > other leads.
  2. Prof. Shabbir emphasized that a single HRV value is useless; what matters is the trend of HRV. Therefore, I divided the RR interval data into several files, each containing only five minutes of RR intervals.
  3. Generate a final shell script (expected to have as many rows as there are final five-minute RR interval files):
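storage = {}
with open('filelist', 'r') as f:
    for originline in f.readlines():
        # each line looks like ./<folder>/<id>/<filename>.<filetype>
        line = originline.split('/')
        patient_id = line[2]
        filename = line[3].split('.')[0]
        filetype = line[3].split('.')[1]
        if filename in storage.keys():
            # already have a file for this record: only replace it
            # when the new file's extension contains '02'
            if '02' in filetype:
                storage[filename] = originline.replace('\n', '')
        else:
            # first file seen for this record
            storage[filename] = originline.replace('\n', '')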

from datetime import datetime
from datetime import timedelta

def f2i(total):
    # index of the five-minute (300 s) window that a timestamp in seconds falls into
    return int(total // 300)

# NOTE: the name of the generated shell script is left empty here; fill it in before running
with open('', 'w') as g:
    for path in storage.values():
        separate = path.split('/')
        startpoint = datetime.strptime(separate[3][7:23], '%Y-%m-%d-%H-%M')
        head = './rr/' + separate[3][0:7]
        tail = '.rr'
        # split the RR interval file into five-minute windows
        with open(path, 'r') as h:
            temp = {}
            for interval in h.readlines():
                timestamp = float(interval.split('\t')[0])
                if f2i(timestamp) in temp.keys():
                    temp[f2i(timestamp)].append(interval)
                else:
                    temp[f2i(timestamp)] = [interval]
            # the last window is skipped (it may hold less than five minutes)
            for filenum in list(temp.keys())[:-1]:
                filename = head + startpoint.strftime('%Y-%m-%d-%H-%M') + tail
                with open(filename, 'w') as k:
                    for row in temp[filenum]:
                        k.write(row)
                g.write('get_hrv -R ' + filename + ' -L >> output\n')
                # hack: startpoint was defined above
                startpoint = startpoint + timedelta(minutes = 5)
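
The lead preference from step 1 (lead II > lead I or aVF > lead III or aVR > other leads) can also be written down explicitly as a ranking table. The sketch below is only an illustration: it assumes the lead of each file is known as a plain string such as "II" or "aVF", which is not how the script above identifies the lead.

```python
# Lower rank means more preferred; leads not listed share the last rank.
LEAD_RANK = {"II": 0, "I": 1, "aVF": 1, "III": 2, "aVR": 2}
OTHER_RANK = 3

def best_lead(leads):
    """Pick the preferred lead; on a tie, the first one listed wins."""
    return min(leads, key=lambda lead: LEAD_RANK.get(lead, OTHER_RANK))

print(best_lead(["V", "aVR", "I"]))
print(best_lead(["III", "II"]))
```

Keeping the order in one table makes it easy to change the preference later without touching the file handling code.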

The final shell script appends the output of get_hrv to a file named output. I wrote another small Python script to convert it into CSV (comma-separated values):

import sys

# usage: python <this script> <get_hrv output file> <csv file to write>
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as g:
        for line in f.readlines():
            # skip lines that contain 'bad' or 'na'
            if 'bad' in line or 'na' in line:
                continue
            newline = line.replace(':', '').replace('\n', '').replace('  ',' ').replace(' ', ',')
            # the first field is the file path: patient id, then the window start time
            sid = newline.split(',')[0].split('/')[2][0:6]
            timestamp = newline.split(',')[0].split('/')[2][7:23]
            g.write(sid + ',' + timestamp + ',' + newline[32:] + '\n')

Notes: Statistics and Modeling of the Data

After all the steps described above, we have:

  1. The clinical data, including electrolytes, renal and liver function, hemodynamic and cardiac function parameters, and the APACHE III score.
  2. The waveform data, split into five-minute HRV analysis results in CSV format.

Then we joined the tables together and got a huge table like this:

id    age  gender  timestamp         APA-III  HRV-LF  HRV-HF
0001  55   M       2175-06-23-14-27  63       0.004   0.006
0001  55   M       2175-06-23-14-32  63       0.003   0.008
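
The join itself is a plain lookup by patient id: every five-minute HRV row gets the clinical columns of its patient attached. A minimal Python sketch, using the example rows from the table above (the function name `join_on_id` is made up for this illustration):

```python
def join_on_id(clinical_rows, hrv_rows):
    """Attach each patient's clinical columns to every HRV row of that patient.
    HRV rows whose id has no clinical record are dropped."""
    clinical = {row["id"]: row for row in clinical_rows}
    return [
        {**clinical[h["id"]], **h}
        for h in hrv_rows
        if h["id"] in clinical
    ]

clinical_rows = [{"id": "0001", "age": "55", "gender": "M", "APA-III": "63"}]
hrv_rows = [
    {"id": "0001", "timestamp": "2175-06-23-14-27", "HRV-LF": "0.004", "HRV-HF": "0.006"},
    {"id": "0001", "timestamp": "2175-06-23-14-32", "HRV-LF": "0.003", "HRV-HF": "0.008"},
]
joined = join_on_id(clinical_rows, hrv_rows)
for row in joined:
    print(row)
```

The result has one row per patient per five-minute window, which is why the table gets huge so quickly.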

What we want to do is compare the predictive ability of pure HRV data with that of APACHE-III. On a daily basis, APACHE-III would be more powerful, but over a 30-minute window, HRV could be more effective at predicting what will happen in the next hour. We choose

Thoughts on This Datathon

This section is translated from my article After ANZICS datathon.

  1. Use what people have done: The MIT-LCP repositories on GitHub contain lots of useful toolkits developed by MIT engineers and people all over the world. Many of the tools are suitable not only for the MIMIC dataset but also for other scenarios. For example, gqrs, which detects QRS complexes in EKG waveforms, can be used in other research; besides, the algorithm behind gqrs is accepted by the academic community. I would like to master and even contribute to these tools after my MD license examination in June 2017. Younger physicians who have no resources to hire an engineer to write their own programs could use these tools to advance clinical studies.
  2. The importance of remote processing: It’s embarrassing to show “still running” on the last slide in front of a bunch of intensive care specialists and data scientists. Because datasets such as ANZICS core and MIMIC-III were released publicly, it’s possible to store them on a remote, more powerful server. We can write code, do statistics and build models on these machines, not on our notebooks. Some of the tools, such as get_hrv, require a Unix-like environment (Mac, Linux) to be compiled and run. Though one of our teammates had the latest MacBook Pro, he hadn’t installed the Xcode command line tools. If Taipei Medical University wants to promote biomedical data science research, building a cluster for computation is necessary.
  3. Call for help, earlier: The tutorials for ann2rr and gqrs were too complicated for me to understand how to use them. We downloaded the data from PhysioNet ATM and tried to build our own wave recognition tool on the morning of the first day of the datathon. But one of our teammates, Owen Hsu, asked the core engineer of MIMIC-III, Alistair, to help us with the coding. Alistair provided a shell script (described above) that converts waveform data to annotation data to RR intervals smoothly. This script boosted our progress. I think our job is to fit these tools to our study, so if you get stuck at a step, call for help and don’t be shy.
  4. Write down the workflow: Another notable thing I did in this datathon was trying to write down the workflow. Because there are a lot of data, scripts and programs, it’s easy to get confused after hours of work. The workflow was a simple list describing why and how to execute each script.
  5. Find my position: I played the role of medical advisor in our team. But one of TMU’s professors, Shabbir, who has both an MD license and real research experience, participated in this datathon. Therefore, all the work of research design and explanation was done by him. Instead of literature review, I spent most of my time writing waveform transformation scripts.