BigQuery
BigQuery is a columnar, distributed relational database management system. BigQuery accesses only the columns specified in the query, making it ideal for data analysis workflows. BigQuery can be used to query a cloud based instance of MIMIC-III through the web browser. To access MIMIC-III on BigQuery, see the cloud data access guide.
MIMIC-III on BigQuery
MIMIC-III is organized into three “datasets” on BigQuery. BigQuery uses datasets to organize data into subgroups. If you are familiar with the database concept of a schema, then you can understand BigQuery datasets as the logical equivalent of a schema. The three datasets on BigQuery are:
mimiciii_clinical
- Almost all of MIMIC-III is stored here.mimiciii_notes
- The NOTEEVENTS table is stored here.mimiciii_derived
- Useful derivations which have been extracted from MIMIC-III are stored here for reuse by the community. All code to generate these views are open source and publicly available on the mimic-code repository.
This tutorial will focus on querying the bulk of MIMIC-III data stored in mimiciii_clinical
.
1. Overview
This tutorial provides an introduction to the structure and content of the MIMIC-III database. By the end of this tutorial you will be able to:
- Obtain metadata from the various database objects (tables, views, etc);
- Perform basic queries on a single table;
- Perform basic ‘joins’ to combine tables and extract useful information;
- Use database ‘views’ to extract high-level information.
First, navigate to BigQuery: https://console.cloud.google.com/bigquery
2. Database metadata
Metadata for a particular table can be obtained by clicking the table on the left sidebar on BigQuery, as below:
If you do not see the mimiciii_clinical
dataset on BigQuery, you may need to request access to MIMIC-III on BigQuery, or pin the project to your sidebar.
Try looking at the metadata for other tables such as patients
and chartevents
.
3. Patient numbers
SQL queries can be entered in the top “Query editor” panel and the results will be displayed at the bottom when the ‘Run’ button is pressed. Enter the following SQL in the box and press the ‘Execute query’ button.
SELECT *
FROM `physionet-data.mimiciii_clinical.patients`
You should see the following:
50 records are retrieved at a time and you can page through the results using the controls at the bottom of the screen.
Obtain the number of patients with the following query:
SELECT COUNT(*)
FROM `physionet-data.mimiciii_clinical.patients`;
The ‘gender’ column identifies the gender of the patient. We can obtain the distinct values used to indicate gender using the following query:
SELECT DISTINCT(gender)
FROM `physionet-data.mimiciii_clinical.patients`;
We can see that ‘M’ and ‘F’ are the two characters used to indicate patient gender. We can use this information to obtain the number of female patients by adding a condition to select rows where the gender is ‘F’:
SELECT COUNT(*)
FROM `physionet-data.mimiciii_clinical.patients`
WHERE gender = 'F';
And the numbers of male and female patients can be obtained using the following query:
SELECT gender, COUNT(*)
FROM `physionet-data.mimiciii_clinical.patients`
GROUP BY gender;
4. Mortality and admissions
A flag which records whether or not a patient died in the hospital is stored in the patients table. Count the number of patients who died using the following query:
SELECT expire_flag, COUNT(*)
FROM `physionet-data.mimiciii_clinical.patients`
GROUP BY expire_flag;
The database also contains date of death for patients who died inside the hospital in the column ‘dod_hosp’ and the date of death found in social security death records in ‘dod_ssn’. This information from both columns is merged in the ‘dod’ column with priority given to ‘dod_hosp’. Note that this database contains adult and neonatal patients which will affect the mortality statistics. Categorizing patients into different age groups is carried out in the next section.
5. Patient age and mortality
To determine the adult mortality rate we must first select the adult patients. We define adults as those patients who are 15 years old or above at the date of their first admission. To perform this query, we must combine the patients and admissions tables to find patient admission dates and dates of birth. We have denoted ‘admissions’ with the alias ‘a’ and ‘patients’ with alias ‘p’:
SELECT p.subject_id, p.dob, a.hadm_id,
a.admittime, p.expire_flag
FROM `physionet-data.mimiciii_clinical.admissions` a
INNER JOIN `physionet-data.mimiciii_clinical.patients` p
ON p.subject_id = a.subject_id;
Next we find the earliest admission date for each patient. This requires the use of two functions: the ‘MIN’ function, which obtains the minimum value, and the ‘PARTITION BY’ function, which determines the groups over which the minimum value is obtained. To determine the earliest admission time for each patient:
SELECT p.subject_id, p.dob, a.hadm_id,
a.admittime, p.expire_flag,
MIN (a.admittime) OVER (PARTITION BY p.subject_id) AS first_admittime
FROM `physionet-data.mimiciii_clinical.admissions` a
INNER JOIN `physionet-data.mimiciii_clinical.patients` p
ON p.subject_id = a.subject_id
ORDER BY a.hadm_id, p.subject_id;
A patient’s age is given by the difference between their date of birth and the date of their first admission.
In BigQuery, we must use special functions to do mathematical operations on datetime columns (note: as most columns in MIMIC-III have dates and times, they are often called datetime
columns to differentiate them from date
columns, which only store the date).
The reference page for datetime operations is available on the BigQuery documentation website.
Let’s calculate the age of patients at their time of admission:
SELECT p.subject_id, p.dob, a.hadm_id,
a.admittime, p.expire_flag,
DATETIME_DIFF(admittime, dob, YEAR) as age
FROM `physionet-data.mimiciii_clinical.admissions` a
INNER JOIN `physionet-data.mimiciii_clinical.patients` p
ON p.subject_id = a.subject_id
ORDER BY p.subject_id, a.hadm_id;
You should see the following:
Note at the bottom we have calculated the age for subject_id
= 17 twice, once for each of their hospital admissions (the hadm_id
is different between the rows).
If we examine the same patient more than once when calculating a statistic such as mortality, then our estimate will contain “repeated measures”. Unless we handle this phenomenon explicitly, our calculation will be biased.
A simple solution is to only examine the first hospitalization for each patient, which we can do with a GROUP BY
clause.
SELECT
p.subject_id, p.dob, p.gender
, MIN(a.admittime) AS first_admittime
, MIN( DATETIME_DIFF(admittime, dob, YEAR) )
AS first_admit_age
FROM `physionet-data.mimiciii_clinical.patients` p
INNER JOIN `physionet-data.mimiciii_clinical.admissions` a
ON p.subject_id = a.subject_id
GROUP BY p.subject_id, p.dob, p.gender
ORDER BY p.subject_id
Note in the above:
- Our group is defined by columns listed after
GROUP BY
, in this case:p.subject_id
,p.dob
,p.gender
. BigQuery will return a row for each unique combination of these three columns. Since we know that each patient only has onedob
and onegender
in the PATIENTS table, we know that this will return one row persubject_id
. - We have removed
hadm_id
, as we will collapse multiplehadm_id
into a single row. - We have wrapped
admittime
in aMIN()
aggregate function. For dates,MIN()
returns the earliest date. - We have wrapped age in a
MIN()
aggregate function. This tells BigQuery to extract the minimum age across all hospital admissions.
Now that we have a set of unique patients with their age, we can group them into sensible categories based upon age and calculate the mortality rate in these categories. Patients with an age >= 15 years old are adults and the rest are assigned to other categories. Note the use of the WITH
clause, which allows us to make a temporary view which we can query against in subsequent lines.
WITH first_admission_time AS
(
SELECT
p.subject_id, p.dob, p.gender
, MIN (a.admittime) AS first_admittime
, MIN( DATETIME_DIFF(admittime, dob, YEAR) )
AS first_admit_age
FROM `physionet-data.mimiciii_clinical.patients` p
INNER JOIN `physionet-data.mimiciii_clinical.admissions` a
ON p.subject_id = a.subject_id
GROUP BY p.subject_id, p.dob, p.gender
ORDER BY p.subject_id
)
SELECT
subject_id, dob, gender
, first_admittime, first_admit_age
, CASE
-- all ages > 89 in the database were replaced with 300
WHEN first_admit_age > 89
then '>89'
WHEN first_admit_age >= 15
THEN 'adult'
WHEN first_admit_age <= 1
THEN 'neonate'
ELSE 'middle'
END AS age_group
FROM first_admission_time
ORDER BY subject_id
The above query can now be combined with the WHERE and COUNT functions described earlier to determine the number of adult patients, whether or not they died, and therefore, their mortality rate.
WITH first_admission_time AS
(
SELECT
p.subject_id, p.dob, p.gender
, MIN (a.admittime) AS first_admittime
, MIN( DATETIME_DIFF(admittime, dob, YEAR) )
AS first_admit_age
FROM `physionet-data.mimiciii_clinical.patients` p
INNER JOIN `physionet-data.mimiciii_clinical.admissions` a
ON p.subject_id = a.subject_id
GROUP BY p.subject_id, p.dob, p.gender
ORDER BY p.subject_id
)
, age as
(
SELECT
subject_id, dob, gender
, first_admittime, first_admit_age
, CASE
-- all ages > 89 in the database were replaced with 300
WHEN first_admit_age > 89
then '>89'
WHEN first_admit_age >= 14
THEN 'adult'
WHEN first_admit_age <= 1
THEN 'neonate'
ELSE 'middle'
END AS age_group
FROM first_admission_time
)
select age_group, gender
, count(subject_id) as NumberOfPatients
from age
group by age_group, gender
Note that no ‘middle’ patients show up - this reflects the fact that MIMIC-III does not contain data from paediatric patients.
6. ICU stays
In the MIMIC-III database, we define an ICU stay to be continuous if a patient is returned to an ICU room within 24 hours of being moved to a ward. Patient ICU movements are recorded in the transfers table:
SELECT *
FROM `physionet-data.mimiciii_clinical.transfers`;
The columns should be fairly self explanatory. Click on the transfers table on the left hand side if you need more information about the columns and the data they contain. The ‘prev_careunit’ and ‘curr_careunit’ contain the names of the previous and current careunit respectively. The transfers table also includes ‘prev_wardid’ and ‘curr_wardid’ columns, which contain the IDs of the previous and current careunit respectively. Ward IDs, which specify the room within a ward, have no corresponding key in order to protect patient health information.
The transfers table may have multiple entries per patient to provide detail of all movement between various careunits of the hospital. The first entry in the transfers table for a patient who comes into the ICU will have nothing in the ‘prev_careunit’ column. Similarly, the last entry for a patient will have nothing in the ‘curr_careunit’. Entries that have nothing in both previous and current careunit columns indicate that the patient has been transfered between non intensive care units. An example query for one patient and result from the transfers table is shown below. Note that columns ‘intime’, ‘outtime’, and ‘los’ have been truncated.
SELECT *
FROM `physionet-data.mimiciii_clinical.transfers`
WHERE HADM_ID = 112213;
row_id | subject_id | hadm_id | icustay_id |
---|---|---|---|
54 | 12 | 112213 | |
55 | 12 | 112213 | |
56 | 12 | 112213 | 232669 |
57 | 12 | 112213 | |
58 | 12 | 112213 | 232669 |
59 | 12 | 112213 | |
60 | 12 | 112213 |
7. Services
Services is a newly added table in MIMIC-III which contains information about the transfers from being under one service to another during a patient’s stay. The services table contains columns including ‘prev_service’ and ‘curr_service’ which contain the names of previous and current services respectively. ‘transfertime’ is the time at which the patient was moved from ‘prev_service’ to ‘curr_service’.
8. Tutorial problem
How would you gather useful information about patients admitted to the ICU? The problem can be broken down into several parts:
Step 1
First start with retrieving ‘subject_id’, ‘hadm_id’, ‘icustay_id’, ‘intime’, and ‘outtime’ from the ‘icustays’ table.
Step 2
Using the patients table retrieve the calculated age of patients.
Step 3
Separate neonates from adult patients.
Step 4
By incorporating the admissions table, find how long each stay was BEFORE the patients were admitted to the ICU
Step 5
Next find the date of the patient’s death if applicable.
Step 6
Then find those deaths that occurred while the patients were in the hospital.
Step 7
Find how many of those deaths occurred within the ICU.
Solutions to the problems in section 8
Solution to step 1
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime
FROM `physionet-data.mimiciii_clinical.icustays` ie;
Solution to step 2
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
DATETIME_DIFF(admittime, dob, YEAR) AS age
-- we use 'ie' as an alias for the icustays table
FROM `physionet-data.mimiciii_clinical.icustays` ie
-- we use 'pat' as an alias for the patients table
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
-- since subject_id is unique for every row in patients, we will get
-- one row for every row in icustays (ie)
ON ie.subject_id = pat.subject_id;
Solution to step 3
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
DATETIME_DIFF(admittime, dob, YEAR) AS age,
CASE
WHEN DATETIME_DIFF(admittime, dob, YEAR) <= 1
THEN 'neonate'
WHEN DATETIME_DIFF(admittime, dob, YEAR) <= 14
THEN 'middle'
-- all ages > 89 in the database were replaced with 300
WHEN DATETIME_DIFF(admittime, dob, YEAR) > 89
then '>89'
ELSE 'adult'
END AS ICUSTAY_AGE_GROUP
FROM `physionet-data.mimiciii_clinical.icustays` ie
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
ON ie.subject_id = pat.subject_id;
Solution to step 4
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
DATETIME_DIFF(adm.admittime, pat.dob, YEAR) as age,
DATETIME_DIFF(ie.intime, adm.admittime, DAY) as preiculos,
CASE
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 1
THEN 'neonate'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 14
THEN 'middle'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) > 89
THEN '>89'
ELSE 'adult'
END AS ICUSTAY_AGE_GROUP
FROM `physionet-data.mimiciii_clinical.icustays` ie
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
ON ie.subject_id = pat.subject_id
INNER JOIN `physionet-data.mimiciii_clinical.admissions` adm
ON ie.hadm_id = adm.hadm_id;
Solution to step 5
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
-- patient death in hospital is stored in the admissions table
adm.deathtime,
DATETIME_DIFF(adm.admittime, pat.dob, YEAR) as age,
DATETIME_DIFF(ie.intime, adm.admittime, DAY) as preiculos,
CASE
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 1
THEN 'neonate'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 14
THEN 'middle'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) > 89
THEN '>89'
ELSE 'adult'
END AS ICUSTAY_AGE_GROUP
FROM `physionet-data.mimiciii_clinical.icustays` ie
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
ON ie.subject_id = pat.subject_id
INNER JOIN `physionet-data.mimiciii_clinical.admissions` adm
ON ie.hadm_id = adm.hadm_id;
Solution to step 6
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
-- patient death in hospital is stored in the admissions table
adm.deathtime,
DATETIME_DIFF(adm.admittime, pat.dob, YEAR) as age,
DATETIME_DIFF(ie.intime, adm.admittime, DAY) as preiculos,
CASE
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 1
THEN 'neonate'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 14
THEN 'middle'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) > 89
THEN '>89'
ELSE 'adult'
END AS ICUSTAY_AGE_GROUP,
-- the "hospital_expire_flag" field in the admissions table indicates if a patient died in-hospital
CASE
WHEN adm.hospital_expire_flag = 1 then 'Y'
ELSE 'N'
END AS hospital_expire_flag
FROM `physionet-data.mimiciii_clinical.icustays` ie
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
ON ie.subject_id = pat.subject_id
INNER JOIN `physionet-data.mimiciii_clinical.admissions` adm
ON ie.hadm_id = adm.hadm_id;
Solution to step 7
SELECT ie.subject_id, ie.hadm_id, ie.icustay_id,
ie.intime, ie.outtime,
-- patient death in hospital is stored in the admissions table
adm.deathtime,
DATETIME_DIFF(adm.admittime, pat.dob, YEAR) as age,
DATETIME_DIFF(ie.intime, adm.admittime, DAY) as preiculos,
CASE
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 1
THEN 'neonate'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) <= 14
THEN 'middle'
WHEN DATETIME_DIFF(adm.admittime, pat.dob, YEAR) > 89
THEN '>89'
ELSE 'adult'
END AS ICUSTAY_AGE_GROUP,
-- the "hospital_expire_flag" field in the admissions table indicates if a patient died in-hospital
CASE
WHEN adm.hospital_expire_flag = 1 then 'Y'
ELSE 'N'
END AS hospital_expire_flag,
-- note also that hospital_expire_flag is equivalent to "Is adm.deathtime not null?"
CASE
WHEN adm.deathtime BETWEEN ie.intime and ie.outtime
THEN 'Y'
-- sometimes there are typographical errors in the death date, so check before intime
WHEN adm.deathtime <= ie.intime
THEN 'Y'
WHEN adm.dischtime <= ie.outtime
AND adm.discharge_location = 'DEAD/EXPIRED'
THEN 'Y'
ELSE 'N'
END AS ICUSTAY_EXPIRE_FLAG
FROM `physionet-data.mimiciii_clinical.icustays` ie
INNER JOIN `physionet-data.mimiciii_clinical.patients` pat
ON ie.subject_id = pat.subject_id
INNER JOIN `physionet-data.mimiciii_clinical.admissions` adm
ON ie.hadm_id = adm.hadm_id;
Feedback
Was this page helpful?
Glad to hear it! Please raise an issue here to tell us how we can improve.
Sorry to hear that. Please raise an issue here to tell us how we can improve.