A brief tour of cleanEHR

David Perez Suarez & Sinan Shi

2017-01-31

Load data

The data usually come as a single RData file that stores the whole dataset. A sample RData file ships with the package at /doc/sample_ccd.RData.

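library(cleanEHR)
# build the path to the sample file shipped with the package
data.path <- file.path(find.package("cleanEHR"), "doc", "sample_ccd.RData")
# load() restores the objects stored in the file (here: ccd)
load(data.path)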
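
Because load() restores objects under the names they were saved with, it can silently overwrite variables already in the workspace. The following is a minimal sketch of loading into a separate environment first. It uses a throwaway file and a hypothetical object name, demo_obj, rather than the real sample data.

```r
# Save a toy object to a temporary RData file (stand-in for a real dataset).
demo_obj <- data.frame(episode_id = 1:3)
tmp <- tempfile(fileext = ".RData")
save(demo_obj, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten.
env <- new.env()
loaded <- load(tmp, envir = env)   # load() returns the names it restored

print(loaded)                      # names of the objects stored in the file
restored <- get(loaded[1], envir = env)
```

Applied to the sample file, the same pattern should list ccd, the object used in the rest of this tour.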

Data overview

For a quick overview of the data, inspect the infotb slot. In the sample dataset, sensitive variables such as NHS number and admission time have been removed or altered.

print(head(ccd@infotb))
##        site_id episode_id nhs_number pas_number         t_admission
## 1: pseudo_site          1         NA         NA 1970-01-01 01:00:00
## 2: pseudo_site          2         NA         NA 1970-01-01 01:00:00
## 3: pseudo_site          3         NA         NA 1970-01-01 01:00:00
## 4: pseudo_site          4         NA         NA 1970-01-01 01:00:00
## 5: pseudo_site          5         NA         NA 1970-01-01 01:00:00
## 6: pseudo_site          6         NA         NA 1970-01-01 01:00:00
##    t_discharge parse_file parse_time pid index
## 1:        <NA>         NA       <NA>   1     1
## 2:        <NA>         NA       <NA>   2     2
## 3:        <NA>         NA       <NA>   3     3
## 4:        <NA>         NA       <NA>   4     4
## 5:        <NA>         NA       <NA>   5     5
## 6:        <NA>         NA       <NA>   6     6

The basic unit of the data is the episode, which represents one admission to a site. Together, episode_id and site_id identify a unique admission. pid is a unique patient identifier.

# quickly check how many episodes there are in the dataset
ccd@nepisodes
## [1] 30
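
The relationship between the three identifiers can be checked directly. The sketch below uses a made-up table, not the sample data, and assumes that episode numbering restarts at each site.

```r
# Toy episode table: episode_id repeats across sites, and pid repeats when
# one patient has several admissions. All values here are made up.
info <- data.frame(
  site_id    = c("site_a", "site_a", "site_b", "site_b"),
  episode_id = c(1, 2, 1, 2),
  pid        = c(1, 1, 2, 3)
)

# episode_id alone is ambiguous across sites ...
any(duplicated(info$episode_id))                       # TRUE
# ... but the (site_id, episode_id) pair identifies each admission.
any(duplicated(info[, c("site_id", "episode_id")]))    # FALSE

# Number of admissions per patient.
admissions_per_patient <- table(info$pid)
```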

There are 263 fields, which cover patient demographics, physiology, laboratory, and medication information. Each field has two labels: an NHIC code and a short name. The lookup.items() function finds the fields you need; it is case-insensitive and supports fuzzy search.

# searching for heart rate
lookup.items('heart') # fuzzy search

+-------------------+--------------+--------------+--------+-------------+
|     NHIC.Code     |  Short.Name  |  Long.Name   |  Unit  |  Data.type  |
+===================+==============+==============+========+=============+
| NIHR_HIC_ICU_0108 |    h_rate    |  Heart rate  |  bpm   |   numeric   |
+-------------------+--------------+--------------+--------+-------------+
| NIHR_HIC_ICU_0109 |   h_rhythm   | Heart rhythm |  N/A   |    list     |
+-------------------+--------------+--------------+--------+-------------+
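
To make the idea of a case-insensitive lookup concrete, here is a small sketch in base R. It is not how lookup.items() is implemented. The helper find_items() and the toy dictionary are hypothetical, and the sketch covers only case-insensitive substring matching, not fuzzy matching.

```r
# Toy dictionary built from fields mentioned in this tour.
dict <- data.frame(
  code  = c("NIHR_HIC_ICU_0108", "NIHR_HIC_ICU_0109", "NIHR_HIC_ICU_0112"),
  short = c("h_rate", "h_rhythm", "bp_sys_a"),
  long  = c("Heart rate", "Heart rhythm",
            "Systolic Arterial blood pressure - Art BPSystolic Arterial blood pressure"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows in which the keyword occurs in any
# column, ignoring case. The keyword is treated as a regular expression.
find_items <- function(keyword, dictionary = dict) {
  per_column <- lapply(dictionary, function(column) {
    grepl(keyword, column, ignore.case = TRUE)
  })
  hit <- Reduce(`|`, per_column)   # TRUE if any column matched
  dictionary[hit, ]
}

heart_fields <- find_items("HEART")      # matches the two heart fields
code_hit     <- find_items("icu_0112")   # matches on the NHIC code
```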

Inspect individual episode

# check the heart rate, bilirubin, fluid balance, and drugs of episode_id = 7.
# NOTE: because of anonymisation, the data of some episodes cannot be
# displayed properly.
episode.graph(ccd, 7, c("h_rate",  "bilirubin", "fluid_balance_d"))

Non-longitudinal Data

sql.demographic.table() generates a data.table that contains all the non-longitudinal variables. The example below shows how to work with a subset of the data.

# contains all the 1D fields i.e. non-longitudinal
tb1 <- sql.demographic.table(ccd)

# keep only the patients who died (in this sample dataset, all patients died)
tb1 <- tb1[DIS=="D"]

# subset variables we want (ARSD = Advanced respiratory support days,
# apache_prob = APACHE II probability)
tb <- tb1[, c("SEX", "ARSD", "apache_prob"), with=FALSE]
tb <- tb[!is.na(apache_prob)]

# plot
library(ggplot2)
ggplot(tb, aes(x=apache_prob, y=ARSD, color=SEX)) + geom_point()
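
A numeric summary can complement the scatter plot. The sketch below repeats the filter, select, and drop-NA steps in base R on made-up rows and then averages ARSD by sex. The values, and the status code "A" for a surviving patient, are assumptions made for illustration only.

```r
# Made-up rows with the same column names as the example above.
toy <- data.frame(
  SEX         = c("F", "F", "M", "M", "M"),
  ARSD        = c(2, 4, 1, 3, 8),
  apache_prob = c(0.2, NA, 0.1, 0.5, 0.9),
  DIS         = c("D", "D", "D", "A", "D"),   # "A" is a hypothetical code
  stringsAsFactors = FALSE
)

# Same three steps: filter on discharge status, select columns, drop NAs.
sub <- toy[toy$DIS == "D", c("SEX", "ARSD", "apache_prob")]
sub <- sub[!is.na(sub$apache_prob), ]

# Mean days of advanced respiratory support per sex.
arsd_by_sex <- aggregate(ARSD ~ SEX, data = sub, FUN = mean)
```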

Longitudinal data

To work with longitudinal data, we first need to transform it into a long table format.

Create a cctable

# Prepare a YAML configuration like the one below. In practice, this text
# would be stored in a YAML file.
conf <- "
NIHR_HIC_ICU_0108:
  shortName: hrate
NIHR_HIC_ICU_0112:
  shortName: bp_sys_a
  dataItem: Systolic Arterial blood pressure - Art BPSystolic Arterial blood pressure
NIHR_HIC_ICU_0093:
  shortName: sex
"
library(yaml)
tb <- create.cctable(ccd, yaml.load(conf), freq=1)

# a shortcut: pass the configuration as a nested list instead of YAML
tb <- create.cctable(ccd, list(NIHR_HIC_ICU_0108=list(), 
                         NIHR_HIC_ICU_0112=list(), 
                         NIHR_HIC_ICU_0093=list()), 
                     freq=1)
print(tb$tclean)
##       time NIHR_HIC_ICU_0108 NIHR_HIC_ICU_0112 NIHR_HIC_ICU_0093
##    1:    0                64                NA                 F
##    2:    1                71                NA                 F
##    3:    2                71                NA                 F
##    4:    3                80                NA                 F
##    5:    4                NA                NA                 F
##   ---                                                           
## 7932:  690                NA                NA                 M
## 7933:  691                NA                NA                 M
## 7934:  692                NA                NA                 M
## 7935:  693                NA                NA                 M
## 7936:  694                NA                NA                 M
##              site episode_id NIHR_HIC_ICU_0112.meta
##    1: pseudo_site          1                     NA
##    2: pseudo_site          1                     NA
##    3: pseudo_site          1                     NA
##    4: pseudo_site          1                     NA
##    5: pseudo_site          1                     NA
##   ---                                              
## 7932: pseudo_site          9                     NA
## 7933: pseudo_site          9                     NA
## 7934: pseudo_site          9                     NA
## 7935: pseudo_site          9                     NA
## 7936: pseudo_site          9                     NA
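
The table above has one row per time step and NA wherever no value is available. How cleanEHR aligns raw readings is not described in this tour, so the following is only a sketch of one way to get such a regular grid. It assumes the time unit is hours, and the helper to_grid() and the readings are hypothetical.

```r
# Irregular, made-up heart-rate readings for one episode (time in hours).
obs <- data.frame(time  = c(0.0, 1.2, 2.9, 6.1),
                  hrate = c(64, 71, 80, 77))

# Hypothetical helper: snap readings to a grid with step `freq` and
# leave NA where no reading falls on a grid point.
to_grid <- function(obs, freq = 1) {
  obs$time <- round(obs$time / freq) * freq
  obs <- obs[!duplicated(obs$time), ]            # keep first reading per slot
  grid <- data.frame(time = seq(0, max(obs$time), by = freq))
  merge(grid, obs, by = "time", all.x = TRUE)    # left join keeps empty slots
}

hourly <- to_grid(obs, freq = 1)
print(hourly)
```

Hours without a reading stay in the result as NA rather than being dropped, which is what makes missingness measurable later on.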

Manipulating a cctable

# mean heart rate per episode; tclean is a data.table
tb$tclean[, mean(NIHR_HIC_ICU_0108, na.rm=TRUE), by=c("site", "episode_id")]
##            site episode_id        V1
##  1: pseudo_site          1  73.00000
##  2: pseudo_site         10  80.70370
##  3: pseudo_site         11  87.57143
##  4: pseudo_site         12  95.61667
##  5: pseudo_site         13 130.09091
##  6: pseudo_site         14       NaN
##  7: pseudo_site         15 117.50000
##  8: pseudo_site         16       NaN
##  9: pseudo_site         17  88.40719
## 10: pseudo_site         18       NaN
## 11: pseudo_site         19  89.50845
## 12: pseudo_site          2 103.14615
## 13: pseudo_site         20  72.02439
## 14: pseudo_site         21       NaN
## 15: pseudo_site         22       NaN
## 16: pseudo_site         23  98.48810
## 17: pseudo_site         24 123.16566
## 18: pseudo_site         25       NaN
## 19: pseudo_site         26  87.63636
## 20: pseudo_site         27 121.37143
## 21: pseudo_site         28 111.96195
## 22: pseudo_site         29  75.40000
## 23: pseudo_site          3       NaN
## 24: pseudo_site         30  53.00000
## 25: pseudo_site          4 114.60386
## 26: pseudo_site          5 102.20000
## 27: pseudo_site          6 107.87650
## 28: pseudo_site          7  70.40595
## 29: pseudo_site          8       NaN
## 30: pseudo_site          9       NaN
##            site episode_id        V1
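
Two features of this output are worth a note. NaN appears for episodes with no heart rate values at all, and the episode order (1, 10, 11, ..., 2, 20, ...) suggests that episode_id is stored as character. The base R sketch below reproduces both effects on made-up values and shows one way to tidy them.

```r
# With na.rm = TRUE, an all-NA vector has no values left to average.
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
is.nan(all_missing)                      # TRUE

# Character ids sort lexicographically; integer ids sort by value.
ids <- c("1", "10", "11", "2", "3")
sorted_as_integer <- sort(as.integer(ids))

# Toy per-episode means: replace NaN with NA and order numerically.
res <- data.frame(episode_id = c("1", "10", "2"),
                  mean_hr    = c(73, NaN, 103),
                  stringsAsFactors = FALSE)
res$mean_hr[is.nan(res$mean_hr)] <- NA
res <- res[order(as.integer(res$episode_id)), ]
```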

Data cleaning

To clean the data, one needs to write the specification in the YAML configuration file.

conf <- "
NIHR_HIC_ICU_0108:
  shortName: hrate
  dataItem: Heart rate
  distribution: normal
  decimal_places: 0
  range:
    labels:
      red: (0, 300)
      amber: (11, 150)
    apply: drop_entry
  missingness: # remove the episode if missingness exceeds 70% in any 24-hour interval
    labels:
      yellow: 24
    accept_2d:
      yellow: 70 
    apply: drop_episode
"

ctb <- create.cctable(ccd, yaml.load(conf), freq=1)
ctb$filter.ranges("amber") # evaluate the range filter at the amber level
ctb$filter.missingness()   # evaluate the missingness filter
ctb$apply.filters()        # apply the evaluated filters to tclean

cptb <- rbind(cbind(ctb$torigin, data="origin"), 
              cbind(ctb$tclean, data="clean"))


ggplot(cptb, aes(x=time, y=NIHR_HIC_ICU_0108, color=data)) + 
  geom_point(size=1.5) + facet_wrap(~episode_id, scales="free_x")
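
To illustrate what the two filters in the configuration express, here is a minimal sketch in base R. It is not the cleanEHR implementation. The helpers filter_range() and missing_pct() are hypothetical, the range is assumed to be an open interval, out-of-range entries are assumed to become NA, and a 5-hour window stands in for the 24-hour one to keep the example short.

```r
# Made-up hourly heart-rate series; NA means no reading in that hour.
hr <- c(80, 5, 95, 400, NA, 120, NA, NA, 70, 160)

# Hypothetical helper: blank out values outside the open interval (lo, hi).
filter_range <- function(x, lo, hi) {
  x[!is.na(x) & (x <= lo | x >= hi)] <- NA
  x
}

# Hypothetical helper: percentage of missing values in consecutive,
# non-overlapping windows of `window` time steps.
missing_pct <- function(x, window) {
  block <- ceiling(seq_along(x) / window)
  100 * tapply(is.na(x), block, mean)
}

amber_ok <- filter_range(hr, 11, 150)    # bounds taken from the amber label
pct      <- missing_pct(amber_ok, window = 5)
print(pct)
```

Under the rule stated in the configuration comment, an episode would be dropped once the missingness in any window exceeds the chosen threshold.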