The data is usually supplied as an RData file that stores the whole dataset. A sample RData file can be found at /doc/sample_ccd.RData inside the installed package.
library(cleanEHR)
data.path <- paste0(find.package("cleanEHR"), "/doc/sample_ccd.RData")
load(data.path)
You can get a quick overview of the data by checking infotb. In the sample dataset, sensitive variables such as NHS number and admission time have been removed or altered.
print(head(ccd@infotb))
##        site_id episode_id nhs_number pas_number         t_admission
## 1: pseudo_site          1         NA         NA 1970-01-01 01:00:00
## 2: pseudo_site          2         NA         NA 1970-01-01 01:00:00
## 3: pseudo_site          3         NA         NA 1970-01-01 01:00:00
## 4: pseudo_site          4         NA         NA 1970-01-01 01:00:00
## 5: pseudo_site          5         NA         NA 1970-01-01 01:00:00
## 6: pseudo_site          6         NA         NA 1970-01-01 01:00:00
##    t_discharge parse_file parse_time pid index
## 1:        <NA>         NA       <NA>   1     1
## 2:        <NA>         NA       <NA>   2     2
## 3:        <NA>         NA       <NA>   3     3
## 4:        <NA>         NA       <NA>   4     4
## 5:        <NA>         NA       <NA>   5     5
## 6:        <NA>         NA       <NA>   6     6
The basic unit of the data is the episode, which represents one admission to a site. Together, episode_id and site_id identify a unique admission. pid is a unique patient identifier.
# Quickly check how many episodes there are in the dataset.
ccd@nepisodes
## [1] 30
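As a minimal sketch of why both identifiers are needed, the base R snippet below uses a small made-up table (the site_a and site_b rows are hypothetical and not taken from the sample data). It shows that episode_id alone can repeat across sites, while the pair (site_id, episode_id) stays unique, and that one pid can own several episodes.

```r
# Hypothetical admission table; all values are made up for illustration.
info <- data.frame(
  site_id    = c("site_a", "site_a", "site_b", "site_b"),
  episode_id = c("1", "2", "1", "2"),
  pid        = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

# episode_id on its own is duplicated across sites ...
episode_only_unique <- !any(duplicated(info$episode_id))

# ... but the (site_id, episode_id) pair identifies each admission.
pair_unique <- !any(duplicated(info[, c("site_id", "episode_id")]))

# pid identifies the patient, so one patient can have several episodes.
episodes_per_patient <- table(info$pid)

print(episode_only_unique)
print(pair_unique)
print(episodes_per_patient)
```

The same check could be run on ccd@infotb, assuming its columns are named as in the printed overview.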
There are 263 fields, which cover patient demographics, physiology, laboratory, and medication information. Each field has two labels: an NHIC code and a short name. The function lookup.items() looks up the fields you need. It is case insensitive and allows fuzzy search.
# searching for heart rate
lookup.items('heart') # fuzzy search
+-------------------+--------------+--------------+--------+-------------+
| NHIC.Code | Short.Name | Long.Name | Unit | Data.type |
+===================+==============+==============+========+=============+
| NIHR_HIC_ICU_0108 | h_rate | Heart rate | bpm | numeric |
+-------------------+--------------+--------------+--------+-------------+
| NIHR_HIC_ICU_0109 | h_rhythm | Heart rhythm | N/A | list |
+-------------------+--------------+--------------+--------+-------------+
# check the heart rate, bilirubin, fluid balance, and drugs of episode_id = 7.
# NOTE: because of anonymisation, the data of some episodes cannot be
# displayed properly.
episode.graph(ccd, 7, c("h_rate", "bilirubin", "fluid_balance_d"))
sql.demographic.table() generates a data.table that contains all the non-longitudinal variables. The following demonstrates how to work with a subset of the data.
# contains all the 1D fields i.e. non-longitudinal
tb1 <- sql.demographic.table(ccd)
# keep only the patients who died (all patients in the sample dataset are dead)
tb1 <- tb1[DIS=="D"]
# subset the variables we want (ARSD = Advanced respiratory support days,
# apache_prob = APACHE II probability)
tb <- tb1[, c("SEX", "ARSD", "apache_prob"), with=FALSE]
tb <- tb[!is.na(apache_prob)]
# plot
library(ggplot2)
ggplot(tb, aes(x=apache_prob, y=ARSD, color=SEX)) + geom_point()
To deal with longitudinal data, we first need to transform it into a long table format, the cctable.
# Prepare a YAML configuration like the one below. Normally you would write
# this text in a YAML file.
conf <- "
NIHR_HIC_ICU_0108:
  shortName: hrate
NIHR_HIC_ICU_0112:
  shortName: bp_sys_a
  dataItem: Systolic Arterial blood pressure - Art BPSystolic Arterial blood pressure
NIHR_HIC_ICU_0093:
  shortName: sex
"
library(yaml)
tb <- create.cctable(ccd, yaml.load(conf), freq=1)
# A lazy way to do the same: pass the item codes in a plain list.
tb <- create.cctable(ccd, list(NIHR_HIC_ICU_0108=list(),
                               NIHR_HIC_ICU_0112=list(),
                               NIHR_HIC_ICU_0093=list()),
                     freq=1)
print(tb$tclean)
##       time NIHR_HIC_ICU_0108 NIHR_HIC_ICU_0112 NIHR_HIC_ICU_0093
##    1:    0                64                NA                 F
##    2:    1                71                NA                 F
##    3:    2                71                NA                 F
##    4:    3                80                NA                 F
##    5:    4                NA                NA                 F
##   ---
## 7932:  690                NA                NA                 M
## 7933:  691                NA                NA                 M
## 7934:  692                NA                NA                 M
## 7935:  693                NA                NA                 M
## 7936:  694                NA                NA                 M
##              site episode_id NIHR_HIC_ICU_0112.meta
##    1: pseudo_site          1                     NA
##    2: pseudo_site          1                     NA
##    3: pseudo_site          1                     NA
##    4: pseudo_site          1                     NA
##    5: pseudo_site          1                     NA
##   ---
## 7932: pseudo_site          9                     NA
## 7933: pseudo_site          9                     NA
## 7934: pseudo_site          9                     NA
## 7935: pseudo_site          9                     NA
## 7936: pseudo_site          9                     NA
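The tclean table of a cctable is a data.table, so it can be summarised directly. For example, the mean heart rate of each episode:
tb$tclean[, mean(NIHR_HIC_ICU_0108, na.rm=TRUE), by=c("site", "episode_id")]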
##            site episode_id        V1
##  1: pseudo_site          1  73.00000
##  2: pseudo_site         10  80.70370
##  3: pseudo_site         11  87.57143
##  4: pseudo_site         12  95.61667
##  5: pseudo_site         13 130.09091
##  6: pseudo_site         14       NaN
##  7: pseudo_site         15 117.50000
##  8: pseudo_site         16       NaN
##  9: pseudo_site         17  88.40719
## 10: pseudo_site         18       NaN
## 11: pseudo_site         19  89.50845
## 12: pseudo_site          2 103.14615
## 13: pseudo_site         20  72.02439
## 14: pseudo_site         21       NaN
## 15: pseudo_site         22       NaN
## 16: pseudo_site         23  98.48810
## 17: pseudo_site         24 123.16566
## 18: pseudo_site         25       NaN
## 19: pseudo_site         26  87.63636
## 20: pseudo_site         27 121.37143
## 21: pseudo_site         28 111.96195
## 22: pseudo_site         29  75.40000
## 23: pseudo_site          3       NaN
## 24: pseudo_site         30  53.00000
## 25: pseudo_site          4 114.60386
## 26: pseudo_site          5 102.20000
## 27: pseudo_site          6 107.87650
## 28: pseudo_site          7  70.40595
## 29: pseudo_site          8       NaN
## 30: pseudo_site          9       NaN
##            site episode_id        V1
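The NaN rows in a per-episode mean like the one above are what R returns when every value in a group is missing: after na.rm drops the NA values nothing is left, and the mean of an empty numeric vector is NaN. The base R sketch below reproduces this on made-up readings for two hypothetical episodes ("a" and "b"); it does not use the sample data.

```r
# Made-up heart-rate readings for two hypothetical episodes.
hr <- data.frame(
  episode_id = c("a", "a", "a", "b", "b"),
  h_rate     = c(64, 71, NA, NA, NA)
)

# Mean per episode, ignoring missing values.
means <- tapply(hr$h_rate, hr$episode_id, mean, na.rm = TRUE)

# Episode "b" has no recorded value at all, so its mean is NaN.
print(means)

# Keep only the episodes that have at least one recorded value.
means_ok <- means[!is.nan(means)]
print(means_ok)
```

Filtering with is.nan() is one simple way to separate episodes without any heart rate record from the rest before further analysis.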
To clean the data, one needs to write the specification in the YAML configuration file.
conf <- "
NIHR_HIC_ICU_0108:
  shortName: hrate
  dataItem: Heart rate
  distribution: normal
  decimal_places: 0
  range:
    labels:
      red: (0, 300)
      amber: (11, 150)
    apply: drop_entry
  missingness: # remove episode if missingness is higher than 70% in any 24 hours interval
    labels:
      yellow: 24
    accept_2d:
      yellow: 70
    apply: drop_episode
"
ctb <- create.cctable(ccd, yaml.load(conf), freq=1)
ctb$filter.ranges("amber") # apply range filters
ctb$filter.missingness()
ctb$apply.filters()
cptb <- rbind(cbind(ctb$torigin, data="origin"),
              cbind(ctb$tclean, data="clean"))
ggplot(cptb, aes(x=time, y=NIHR_HIC_ICU_0108, color=data)) +
  geom_point(size=1.5) + facet_wrap(~episode_id, scales="free_x")
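To make the two rules in the configuration more concrete, here is a base R sketch of what they describe. It is not the cleanEHR implementation. It assumes that the amber range (11, 150) excludes its bounds, that out-of-range entries become missing, and that the 24-hour intervals are consecutive, non-overlapping blocks. The hourly series is made up.

```r
# Made-up hourly heart-rate series for one hypothetical episode (48 hours).
time   <- 0:47
h_rate <- rep(80, 48)
h_rate[5]     <- 250   # implausible spike, outside the amber range
h_rate[25:46] <- NA    # long gap on the second day

# Range rule (assumed semantics): values outside (11, 150) become missing.
amber    <- c(11, 150)
in_range <- !is.na(h_rate) & h_rate > amber[1] & h_rate < amber[2]
h_clean  <- ifelse(in_range, h_rate, NA)

# Missingness rule (assumed semantics): percentage of missing values in
# each consecutive 24-hour block.
block       <- time %/% 24
pct_missing <- tapply(is.na(h_clean), block, mean) * 100

# Flag the episode for removal if any block exceeds the 70% threshold.
drop_episode <- any(pct_missing > 70)

print(pct_missing)
print(drop_episode)
```

In this made-up series the first day loses only the spike, while the second day is almost entirely missing, so the episode would be flagged for removal under the stated assumptions.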