Introduction to DaQAPO


Niels Martin¹, Greg Van Houdt², and Gert Janssenswillen³

2022-07-14

Introduction

Process mining techniques generate valuable insights into business processes using automatically generated process execution data. However, despite the extensive opportunities that process mining techniques provide, the garbage in, garbage out principle still applies. Data quality issues are widespread in real-life data and can generate misleading results when used for analysis purposes. Currently, there is no systematic way to perform data quality assessment on process-oriented data. To fill this gap, we introduce DaQAPO - Data Quality Assessment for Process-Oriented data. It provides a set of assessment functions to identify a wide array of quality issues.

We identify two stages in the data quality assessment process:

Reading and preparing data;
Assessing the data quality - running quality tests.

Users who wish to remove the anomalies detected by the quality tests can do so.

Data Sources

Before we can perform the first stage - reading data - we must have access to the appropriate data sources and have knowledge of the expected data structure. Our package supports two input data formats:

An activity log: each line in the log represents an activity instance, i.e. the execution of an activity for a specific case (e.g. a client, a patient, a file,…) by a specific resource. Hence, an activity instance has a duration.
An event log: each line in the log represents an event recorded for a specific activity instance, expressing for instance its start or its completion. Therefore, an event has no duration.
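The difference between the two formats can be sketched in base R. The data frame below is a hypothetical miniature event log (the column names `case`, `instance` and `lifecycle` are illustrative, not the package's), and the pivot is only one possible way to pair start and complete events. It is not how daqapo or bupaR perform the conversion.

```r
# Hypothetical event log: one row per event (start or complete).
events <- data.frame(
  case      = c(1, 1, 1, 1),
  activity  = c("Registration", "Registration", "Triage", "Triage"),
  instance  = c(1, 1, 2, 2),
  lifecycle = c("start", "complete", "start", "complete"),
  timestamp = as.POSIXct(
    c("2017-11-20 10:18:17", "2017-11-20 10:20:06",
      "2017-11-20 10:34:08", "2017-11-20 10:41:48"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Pivot to an activity log: one row per activity instance,
# with the start and complete timestamps side by side.
starts    <- events[events$lifecycle == "start",
                    c("case", "activity", "instance", "timestamp")]
completes <- events[events$lifecycle == "complete",
                    c("instance", "timestamp")]
names(starts)[4]    <- "start"
names(completes)[2] <- "complete"
activities <- merge(starts, completes, by = "instance")

# Only an activity instance has a duration; a single event does not.
activities$duration_mins <- as.numeric(
  difftime(activities$complete, activities$start, units = "mins")
)
print(activities)
```

Four events collapse into two activity instances, each with a duration derived from its two timestamps.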

Two example datasets are included in daqapo. These are hospital and hospital_events. Below, you can find their respective structures.

library(daqapo)  # provides the hospital and hospital_events example datasets
str(hospital)
#> tibble [53 x 7] (S3: tbl_df/tbl/data.frame)
#>  $ patient_visit_nr: num [1:53] 510 512 510 512 512 510 517 518 518 518 ...
#>  $ activity        : chr [1:53] "registration" "Registration" "Triage" "Triage" ...
#>  $ originator      : chr [1:53] "Clerk 9" "Clerk 12" "Nurse 27" "Nurse 27" ...
#>  $ start_ts        : chr [1:53] "20/11/2017 10:18:17" "20/11/2017 10:33:14" "20/11/2017 10:34:08" "20/11/2017 10:44:12" ...
#>  $ complete_ts     : chr [1:53] "20/11/2017 10:20:06" "20/11/2017 10:37:00" "20/11/2017 10:41:48" "20/11/2017 10:50:17" ...
#>  $ triagecode      : num [1:53] 3 3 3 3 3 NA 3 4 4 4 ...
#>  $ specialization  : chr [1:53] "TRAU" "URG" "TRAU" "URG" ...

str(hospital_events)
#> tibble [106 x 8] (S3: tbl_df/tbl/data.frame)
#>  $ patient_visit_nr     : num [1:106] 510 510 510 510 510 510 512 512 512 512 ...
#>  $ activity             : chr [1:106] "registration" "registration" "Triage" "Triage" ...
#>  $ originator           : chr [1:106] "Clerk 9" "Clerk 9" "Nurse 27" "Nurse 27" ...
#>  $ event_lifecycle_state: chr [1:106] "start" "complete" "start" "complete" ...
#>  $ timestamp            : chr [1:106] "20/11/2017 10:18:17" "20/11/2017 10:20:06" "20/11/2017 10:34:08" "20/11/2017 10:41:48" ...
#>  $ triagecode           : num [1:106] 3 3 3 3 NA NA 3 3 3 3 ...
#>  $ specialization       : chr [1:106] "TRAU" "TRAU" "TRAU" "TRAU" ...
#>  $ event_matching       : num [1:106] 1 1 1 1 1 1 1 1 1 1 ...

Both datasets were artificially created merely to illustrate the package’s functionalities.

Stage 1 - Read in data

First of all, data must be read and prepared such that the quality assessment tests can be executed. Data preparation requires transforming the dataset to a standardised activity log format. However, earlier we mentioned two input data formats: an activity log and an event log. When an event log is available, it needs to be converted to an activity log. daqapo provides a set of functions, with the aid of bupaR, to assist the user in this process.

Preparing an Activity Log

As mentioned earlier, the goal of reading and preparing data is to obtain a standardised activity log format. When your source data is already in this format, preparations come down to the following elements:

Providing appropriate names for timestamp columns
Applying the POSIXct timestamp format
Creating the activity log object.

For this section, the dataset hospital will be used to illustrate data preparations. Three main functions help users prepare their own dataset:

rename
convert_timestamps
activitylog

Rename

The activity log object adds a mapping to the data frame to link each column with its specific meaning. In this regard, the timestamp columns each represent a different lifecycle state. daqapo must know which column is which, requiring standardised timestamp names. The accepted timestamp values are:

schedule
assign
reassign
start
suspend
resume
abort_activity
abort_case
complete
manualskip
autoskip

The two timestamps required by daqapo are start and complete.

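library(dplyr)  # provides %>% and rename()

# Map the source timestamp columns onto the standard lifecycle names.
hospital %>%
  rename(start = start_ts,
         complete = complete_ts) -> hospital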

Convert timestamp format

Each timestamp must also be in the POSIXct format.

hospital %>%
  convert_timestamps(c("start","complete"), format = dmy_hms) -> hospital
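As a rough base R illustration of what this conversion amounts to, the sketch below parses two day/month/year strings taken from the str() output above into POSIXct values. The explicit format string and the UTC time zone are assumptions made for this example. This is not the implementation of convert_timestamps.

```r
# Two raw timestamps in the day/month/year layout of the hospital dataset.
raw <- c("20/11/2017 10:18:17", "20/11/2017 10:20:06")

# Parse with an explicit format; the time zone is an assumption for this sketch.
parsed <- as.POSIXct(raw, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

class(parsed)

# Once parsed, timestamps support arithmetic, e.g. the elapsed time in seconds.
elapsed_secs <- as.numeric(difftime(parsed[2], parsed[1], units = "secs"))
elapsed_secs
```

While the columns are still character strings, durations and orderings cannot be computed reliably, which is why the conversion has to happen before the activity log is created.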

Create activitylog

Once the timestamps are in the desired format, the activity log object can be created along with the required mapping.

hospital %>%
  activitylog(case_id = "patient_visit_nr",
              activity_id = "activity",
              resource_id = "originator",
              timestamps = c("start", "complete")) -> hospital

Preparing an Event Log

With event logs, things are a bit more complex. In an event log, each row represents only a part of an activity instance. Therefore, more complex data transformations must be executed and several problems could arise. In this section, we will use an event log variant of the activity log used earlier, named hospital_events.

hospital_events
#> # A tibble: 106 x 8
#>    patient_visit_nr activity    originator event_lifecycle~ timestamp triagecode
#>               <dbl> <chr>       <chr>      <chr>            <chr>          <dbl>
#>  1              510 registrati~ Clerk 9    start            20/11/20~          3
#>  2              510 registrati~ Clerk 9    complete         20/11/20~          3
#>  3              510 Triage      Nurse 27   start            20/11/20~          3
#>  4              510 Triage      Nurse 27   complete         20/11/20~          3
#>  5              510 Clinical e~ Doctor 7   start            20/11/20~         NA
#>  6              510 Clinical e~ Doctor 4   complete         20/11/20~         NA
#>  7              512 Registrati~ Clerk 12   start            20/11/20~          3
#>  8              512 Registrati~ Clerk 12   complete         20/11/20~          3
#>  9              512 Triage      Nurse 27   start            20/11/20~          3
#> 10              512 Triage      Nurse 27   complete         20/11/20~          3
#> # ... with 96 more rows, and 2 more variables: specialization <chr>,
#> #   event_matching <dbl>

The same principle regarding the timestamps applies. Therefore, the POSIXct format must be applied in advance. Additionally, the event log object also requires an activity instance id. If needed, one can be created manually as illustrated below.

The following functions form the building blocks of the required data preparation, although not every function is needed in every situation to obtain a fully prepared activity log:

convert_timestamps
assign_instance_id
check/fix_resource_inconsistencies
standardize_lifecycle
eventlog
to_activitylog

hospital_events %>%
  bupaR::convert_timestamps(c("timestamp"), format = dmy_hms) %>%
  bupaR::mutate(event_matching = paste(patient_visit_nr, activity, event_matching)) %>%
  bupaR::eventlog(case_id = "patient_visit_nr",
                  activity_id = "activity",
                  activity_instance_id = "event_matching",
                  timestamp = "timestamp",
                  resource_id = "originator",
                  lifecycle_id = "event_lifecycle_state") %>%
  fix_resource_inconsistencies() %>%
  bupaR::to_activitylog() -> hospital_events
#> Warning in validate_eventlog(eventlog): The following activity instances are
#> connected to more than one resource: 510 Clinical exam 1,518 Registration 1,518
#> Registration 2,518 Registration 3
#> *** OUTPUT ***
#> A total of 4 activity executions in the event log are classified as inconsistencies.
#> They are spread over the following cases and activities:
#> # A tibble: 4 x 5
#>   patient_visit_nr activity      event_matching      complete start   
#>              <dbl> <chr>         <chr>               <chr>    <chr>   
#> 1              510 Clinical exam 510 Clinical exam 1 Doctor 4 Doctor 7
#> 2              518 Registration  518 Registration 1  Clerk 9  Clerk 6 
#> 3              518 Registration  518 Registration 2  Clerk 12 Clerk 9 
#> 4              518 Registration  518 Registration 3  Clerk 3  Clerk 12
#> Inconsistencies solved succesfully.

Stage 2 - Data Quality Assessment

The table below summarizes the different data quality assessment tests available in daqapo, after which each test will be briefly demonstrated.

An overview of data quality assessment tests in daqapo.
Function name	Description	Output
detect_activity_frequency_violations	Function that detects activity frequency anomalies per case	Summary in console + Returns activities in cases which are executed too many times
detect_activity_order_violations	Function detecting violations in activity order	Summary in console + Returns detected orders which violate the specified order
detect_attribute_dependencies	Function detecting violations of dependencies between attributes (i.e. condition(s) that should hold when (an)other condition(s) hold(s))	Summary in console + Returns rows with dependency violations
detect_case_id_sequence_gaps	Function detecting gaps in the sequence of case identifiers	Summary in console + Returns case IDs which should be expected to be present
detect_conditional_activity_presence	Function detecting violations of conditional activity presence (i.e. activity/activities that should be present when (a) particular condition(s) hold(s))	Summary in console + Returns cases violating conditional activity presence
detect_duration_outliers	Function detecting duration outliers for a particular activity	Summary in console + Returns rows with outliers
detect_inactive_periods	Function detecting inactive periods, i.e. periods of time in which no activity executions/arrivals are recorded	Summary in console + Returns periods of inactivity
detect_incomplete_cases	Function detecting incomplete cases in terms of the activities that need to be recorded for a case	Summary in console + Returns traces in which the mentioned activities are not present
detect_incorrect_activity_names	Function returning the incorrect activity labels in the log	Summary in console + Returns rows with incorrect activities
detect_missing_values	Function detecting missing values at different levels of aggregation	Summary in console + Returns rows with NAs
detect_multiregistration	Function detecting the registration of a series of events in a short time period for the same case or by the same resource	Summary in console + Returns rows with multiregistration on resource or case level
detect_overlaps	Checks if a resource has performed two activities in parallel	Data frame containing the activities, the number of overlaps and average overlap in minutes
detect_related_activities	Function detecting missing related activities, i.e. activities that should be registered because another activity is registered for a case	Summary in console + Returns cases violating related activities
detect_similar_labels	Function detecting potential spelling mistakes	Table showing similarities for each label
detect_time_anomalies	Function detecting activity executions with negative or zero duration	Summary in console + Returns rows with negative or zero durations
detect_unique_values	Function listing all distinct combinations of the given log attributes	Summary in console + Returns all unique combinations of values in given columns
detect_value_range_violations	Function detecting violations of the range of acceptable values	Summary in console + Returns rows with value range infringements

Detect Activity Frequency Violations

hospital %>%
  detect_activity_frequency_violations("Registration" = 1,
                                       "Clinical exam" = 1)
#> *** OUTPUT ***
#> For 3 cases in the activity log (13.6363636363636%) an anomaly is detected.
#> The anomalies are spread over the following cases:
#> # A tibble: 3 x 3
#>   patient_visit_nr activity          n
#>              <dbl> <chr>         <int>
#> 1              518 Registration      3
#> 2              512 Clinical exam     2
#> 3              535 Registration      2
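The idea behind this check can be approximated in base R by counting how often each activity occurs per case and comparing the count with a maximum. The miniature log and the name `max_allowed` below are hypothetical, and the sketch assumes the supplied numbers act as upper limits. It is not the package's implementation.

```r
# Hypothetical miniature activity log.
log <- data.frame(
  case     = c(518, 518, 518, 512, 512, 510),
  activity = c("Registration", "Registration", "Registration",
               "Clinical exam", "Clinical exam", "Registration"),
  stringsAsFactors = FALSE
)

# Assumed maximum number of executions per activity and case.
max_allowed <- c("Registration" = 1, "Clinical exam" = 1)

# Count executions per case and activity.
counts <- aggregate(n ~ case + activity,
                    data = transform(log, n = 1),
                    FUN = sum)

# Keep the combinations that exceed their limit.
limit      <- max_allowed[as.character(counts$activity)]
violations <- counts[counts$n > limit, ]
print(violations)
```

Case 510 registers only once and is therefore not reported, while cases 512 and 518 exceed their limits.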

Detect Activity Order Violations

hospital %>%
  detect_activity_order_violations(activity_order = c("Registration", "Triage", "Clinical exam",
                                                      "Treatment", "Treatment evaluation"))
#> Warning in detect_activity_order_violations.activitylog(., activity_order =
#> c("Registration", : Some activity instances within the same case overlap. Use
#> detect_overlaps to investigate further.
#> Warning in detect_activity_order_violations.activitylog(., activity_order
#> = c("Registration", : Not all specified activities occur in each case. Use
#> detect_incomplete_cases to investigate further.
#> Selected timestamp parameter value: both
#> *** OUTPUT ***
#> It was checked whether the activity order Registration - Triage - Clinical exam - Treatment - Treatment evaluation is respected.
#> This activity order is respected for 22 (100%) of the cases and not for0 (0%) of the cases.

Detect Attribute Dependencies

hospital %>% 
  detect_attribute_dependencies(antecedent = activity == "Registration",
                                consequent = startsWith(originator,"Clerk"))
#> *** OUTPUT ***
#> The following statement was checked: if condition(s) ~activity == "Registration" hold(s), then ~startsWith(originator, "Clerk") should also hold.
#> This statement holds for 12 (85.71%) of the rows in the activity log for which the first condition(s) hold and does not hold for 2 (14.29%) of these rows.
#> For the following rows, the first condition(s) hold(s), but the second condition does not:
#> # Log of 10 events consisting of:
#> 2 traces 
#> 4 cases 
#> 5 instances of 1 activity 
#> 5 resources 
#> Events occurred from 2017-11-21 18:10:17 until 2017-11-22 18:37:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 5 x 8
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              528 Registrat~ Nurse 6    2017-11-21 18:10:17 2017-11-21 18:15:04
#> 2              535 Registrat~ Clerk 3    2017-11-22 10:04:57 2017-11-22 10:06:46
#> 3              536 Registrat~ Clerk 9    2017-11-22 10:26:41 2017-11-22 10:32:56
#> 4              535 Registrat~ Clerk 6    2017-11-22 11:05:42 2017-11-22 11:11:11
#> 5              534 Registrat~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>

Detect Case ID Sequence Gaps

hospital %>%
  detect_case_id_sequence_gaps()
#> *** OUTPUT ***
#> It was checked whether there are gaps in the sequence of case IDs
#> From the 27 expected cases in the activity log, ranging from 510 to 536, 5 (18.52%) are missing.
#> These missing case numbers are:
#> # A tibble: 2 x 3
#>    from    to n_missing
#>   <dbl> <dbl>     <dbl>
#> 1   511   511         1
#> 2   513   516         4
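A minimal base R sketch of gap detection on numeric case identifiers follows. It uses a hypothetical vector of IDs, and the grouping of consecutive missing IDs into from/to ranges only mimics the shape of the output above. It makes no claim about how detect_case_id_sequence_gaps works internally.

```r
# Hypothetical case identifiers observed in a log.
case_ids <- c(510, 512, 517, 518, 520)

# Every identifier between the smallest and largest is expected to exist.
expected <- seq(min(case_ids), max(case_ids))
missing  <- setdiff(expected, case_ids)

# Group consecutive missing identifiers into runs.
run <- cumsum(c(1, diff(missing) != 1))

gaps <- data.frame(
  from      = as.vector(tapply(missing, run, min)),
  to        = as.vector(tapply(missing, run, max)),
  n_missing = as.vector(table(run))
)
print(gaps)
```

A gap does not necessarily indicate an error; identifiers may be skipped legitimately, so the result is best read as a list of cases to investigate.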

Detect Conditional Activity Presence

hospital %>%
  detect_conditional_activity_presence(condition = specialization == "TRAU",
                                       activities = "Clinical exam")
#> *** OUTPUT ***
#> The following statement was checked: if condition(s) ~specialization == "TRAU" hold(s), then activity/activities Clinical exam should be recorded
#> The condition(s) hold(s) for 2 cases. From these cases:
#> - the specified activity/activities is/are recorded for 2 case(s) (100%)
#> - the specified activity/activities is/are not recorded for 0 case(s) (0%)

Detect Duration Outliers

hospital %>%
  detect_duration_outliers(Treatment = duration_within(bound_sd = 1))
#> *** OUTPUT ***
#> Outliers are detected for following activities
#> Treatment     Lower bound: 5.06   Upper bound: 22.2
#> A total of 1 is detected (1.89% of the activity executions)
#> For the following activity instances, outliers are detected:
#> # Log of 2 events consisting of:
#> 1 trace 
#> 1 case 
#> 1 instance of 1 activity 
#> 1 resource 
#> Events occurred from 2017-11-21 18:26:04 until 2017-11-21 18:55:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 1 x 14
#>   patient_visit_nr activity  originator start               complete           
#>              <dbl> <chr>     <chr>      <dttm>              <dttm>             
#> 1              523 Treatment Nurse 17   2017-11-21 18:26:04 2017-11-21 18:55:00
#> # ... with 9 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>, duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>,
#> #   lower_bound <dbl>, upper_bound <dbl>

hospital %>%
  detect_duration_outliers(Treatment = duration_within(lower_bound = 0, upper_bound = 15))
#> *** OUTPUT ***
#> Outliers are detected for following activities
#> Treatment     Lower bound: 0      Upper bound: 15
#> A total of 1 is detected (1.89% of the activity executions)
#> For the following activity instances, outliers are detected:
#> # Log of 2 events consisting of:
#> 1 trace 
#> 1 case 
#> 1 instance of 1 activity 
#> 1 resource 
#> Events occurred from 2017-11-21 18:26:04 until 2017-11-21 18:55:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 1 x 14
#>   patient_visit_nr activity  originator start               complete           
#>              <dbl> <chr>     <chr>      <dttm>              <dttm>             
#> 1              523 Treatment Nurse 17   2017-11-21 18:26:04 2017-11-21 18:55:00
#> # ... with 9 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>, duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>,
#> #   lower_bound <dbl>, upper_bound <dbl>
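The output columns mean, sd, bound_sd, lower_bound and upper_bound suggest that, when bound_sd is given, the bounds are derived from the mean and standard deviation of the activity durations. The base R sketch below works under that assumption (bounds = mean ± bound_sd × sd) on hypothetical durations. The source does not state the exact formula.

```r
# Hypothetical durations (in minutes) of one activity.
durations <- c(10, 12, 14, 11, 13, 30)
bound_sd  <- 1

# Assumed rule: bounds lie bound_sd standard deviations around the mean.
m     <- mean(durations)
s     <- sd(durations)
lower <- m - bound_sd * s
upper <- m + bound_sd * s

# Durations outside the interval are flagged as outliers.
outliers <- durations[durations < lower | durations > upper]
outliers
```

Passing lower_bound and upper_bound explicitly, as in the second call above, replaces the data-driven interval with fixed limits chosen by the analyst.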

Detect Inactive Periods

hospital %>%
  detect_inactive_periods(threshold = 30)
#> Selected timestamp parameter value: both
#> Selected inactivity type:arrivals
#> *** OUTPUT ***
#> Specified threshold of 30 minutes is violated 9 times.
#> Threshold is violated in the following periods:
#>          period_start          period_end   time_gap
#> 1 2017-11-20 10:20:06 2017-11-21 11:35:16 1515.16667
#> 2 2017-11-21 11:22:16 2017-11-21 11:59:41   37.41667
#> 3 2017-11-21 12:05:52 2017-11-21 13:43:16   97.40000
#> 4 2017-11-21 14:06:09 2017-11-21 15:12:17   66.13333
#> 5 2017-11-21 15:18:19 2017-11-21 16:42:08   83.81667
#> 6 2017-11-21 17:06:10 2017-11-21 18:02:10   56.00000
#> 7 2017-11-21 18:15:04 2017-11-22 10:04:57  949.88333
#> 8 2017-11-22 10:32:56 2017-11-22 16:30:00  357.06667
#> 9 2017-11-22 17:00:00 2017-11-22 18:00:00   60.00000
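Conceptually, such a check sorts the relevant timestamps and reports consecutive pairs that lie further apart than the threshold. The base R sketch below does this on a hypothetical timestamp vector, with the threshold in minutes as in the output above. Which timestamps the package compares for each inactivity type is not covered here.

```r
# Hypothetical timestamps, e.g. case arrivals.
ts <- as.POSIXct(
  c("2017-11-21 11:00:00", "2017-11-21 11:10:00", "2017-11-21 12:00:00",
    "2017-11-21 12:20:00", "2017-11-21 13:30:00"),
  tz = "UTC"
)
threshold <- 30  # minutes

# Time between each pair of consecutive timestamps.
ts  <- sort(ts)
gap <- as.numeric(difftime(ts[-1], ts[-length(ts)], units = "mins"))

inactive <- data.frame(
  period_start = ts[-length(ts)],
  period_end   = ts[-1],
  time_gap     = gap
)

# Keep only the periods that exceed the threshold.
inactive <- inactive[inactive$time_gap > threshold, ]
print(inactive)
```

Long gaps such as nights are expected in many processes, so flagged periods need to be interpreted against the opening hours of the process at hand.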

Detect Incomplete Cases

hospital %>%
  detect_incomplete_cases(activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
#> *** OUTPUT ***
#> It was checked whether the activities Clinical exam, Registration, Treatment, Treatment evaluation, Triage are present for cases.
#> These activities are present for 4 (39.62%) of the cases and are not present for 18 (60.38%) of the cases.
#> Note: this function only checks the presence of activities for a particular case, not the completeness of these entries in the activity log or the order of activities.
#> For cases for which the aforementioned activities are not all present, the following activities are recorded (ordered by decreasing frequeny of occurrence):
#> # A tibble: 9 x 3
#>   activity                 n case_ids                                           
#>   <chr>                <int> <chr>                                              
#> 1 Triage                  11 510 - 512 - 517 - 521 - 524 - 525 - 526 - 527 - 52~
#> 2 Registration             9 512 - 518 - 518 - 518 - 521 - 522 - 527 - 528 - 534
#> 3 Clinical exam            5 512 - 510 - 527 - 528 - 512                        
#> 4 Treatment evaluation     2 529 - 532                                          
#> 5 0                        1 533                                                
#> 6 Trage                    1 520                                                
#> 7 Treatment                1 532                                                
#> 8 Triaga                   1 522                                                
#> 9 registration             1 510

Detect Incorrect Activity Names

hospital %>%
  detect_incorrect_activity_names(allowed_activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
#> *** OUTPUT ***
#> 4 out of 9 (44.44% ) activity labels are identified to be incorrect.
#> These activity labels are:
#> registration - Trage - Triaga - 0
#> Given this information, 4 of 53 (7.55%) rows in the activity log are incorrect. These are the following:
#> # Log of 8 events consisting of:
#> 4 traces 
#> 4 cases 
#> 4 instances of 4 activities 
#> 4 resources 
#> Events occurred from 2017-11-20 10:18:17 until 2017-11-22 18:37:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 4 x 8
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              510 registrat~ Clerk 9    2017-11-20 10:18:17 2017-11-20 10:20:06
#> 2              520 Trage      Nurse 17   2017-11-21 13:43:16 2017-11-21 13:39:00
#> 3              522 Triaga     Nurse 5    2017-11-21 15:15:25 2017-11-21 15:18:04
#> 4              533 0          <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>

Detect Missing Values

hospital %>%
  detect_missing_values(column = "activity")
#> Selected level of aggregation:overview
#> Warning in detect_missing_values.activitylog(., column = "activity"): Ignoring
#> provided column argument at overview level.
#> *** OUTPUT ***
#> Absolute number of missing values per column:
#>                   
#> patient_visit_nr 0
#> activity         0
#> originator       2
#> start            1
#> complete         0
#> triagecode       1
#> specialization   0
#> .order           0
#> Relative number of missing values per column (expressed as percentage):
#>                          
#> patient_visit_nr 0.000000
#> activity         0.000000
#> originator       3.773585
#> start            1.886792
#> complete         0.000000
#> triagecode       1.886792
#> specialization   0.000000
#> .order           0.000000
#> Overview of activity log rows which are incomplete:
#> # Log of 7 events consisting of:
#> 3 traces 
#> 4 cases 
#> 4 instances of 3 activities 
#> 2 resources 
#> Events occurred from NA until NA 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 4 x 8
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              510 Clinical ~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
#> 2              533 0          <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> 3              534 Registrat~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> 4              512 Clinical ~ Doctor 7   NA                  2017-11-20 11:33:57
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>
## Note: the column argument has no effect at the overview level (see the warning above).

hospital %>% 
  detect_missing_values(level_of_aggregation = "activity")
#> Selected level of aggregation:activity
#> *** OUTPUT ***
#> Absolute number of missing values per column (per activity):
#> # A tibble: 9 x 8
#>   activity  patient_visit_nr originator start complete triagecode specialization
#>   <chr>                <int>      <int> <int>    <int>      <int>          <int>
#> 1 0                        0          1     0        0          0              0
#> 2 Clinical~                0          0     1        0          1              0
#> 3 Registra~                0          1     0        0          0              0
#> 4 Trage                    0          0     0        0          0              0
#> 5 Treatment                0          0     0        0          0              0
#> 6 Treatmen~                0          0     0        0          0              0
#> 7 Triaga                   0          0     0        0          0              0
#> 8 Triage                   0          0     0        0          0              0
#> 9 registra~                0          0     0        0          0              0
#> # ... with 1 more variable: .order <int>
#> Relative number of missing values per column (per activity, expressed as percentage):
#> # A tibble: 9 x 8
#>   activity  patient_visit_nr originator start complete triagecode specialization
#>   <chr>                <dbl>      <dbl> <dbl>    <dbl>      <dbl>          <dbl>
#> 1 0                        0     1      0            0      0                  0
#> 2 Clinical~                0     0      0.111        0      0.111              0
#> 3 Registra~                0     0.0714 0            0      0                  0
#> 4 Trage                    0     0      0            0      0                  0
#> 5 Treatment                0     0      0            0      0                  0
#> 6 Treatmen~                0     0      0            0      0                  0
#> 7 Triaga                   0     0      0            0      0                  0
#> 8 Triage                   0     0      0            0      0                  0
#> 9 registra~                0     0      0            0      0                  0
#> # ... with 1 more variable: .order <dbl>
#> Overview of activity log rows which are incomplete:
#> # Log of 7 events consisting of:
#> 3 traces 
#> 4 cases 
#> 4 instances of 3 activities 
#> 2 resources 
#> Events occurred from NA until NA 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 4 x 8
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              510 Clinical ~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
#> 2              533 0          <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> 3              534 Registrat~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
#> 4              512 Clinical ~ Doctor 7   NA                  2017-11-20 11:33:57
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>

hospital %>% 
  detect_missing_values(
  level_of_aggregation = "column",
  column = "triagecode")
#> Selected level of aggregation:column
#> *** OUTPUT ***
#> Absolute number of missing values in columntriagecode:1
#> Relative number of missing values in columntriagecode(expressed as percentage):1.88679245283019
#> 
#> Overview of activity log rows in whichtriagecodeis missing:
#> # Log of 2 events consisting of:
#> 1 trace 
#> 1 case 
#> 1 instance of 1 activity 
#> 1 resource 
#> Events occurred from 2017-11-20 11:35:01 until 2017-11-20 11:36:09 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 1 x 8
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              510 Clinical ~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>

Detect Multiregistration

hospital %>%
  detect_multiregistration(threshold_in_seconds = 10)
#> Selected level of aggregation: resource
#> Selected timestamp parameter value: complete
#> *** OUTPUT ***
#> Multi-registration is detected for 4 of the 12 resources (33.33%). These resources are:
#> Doctor 7 - Nurse 27 - Nurse 5 - NA
#> For the following rows in the activity log, multi-registration is detected:
#> # Log of 17 events consisting of:
#> 5 traces 
#> 7 cases 
#> 9 instances of 5 activities 
#> 4 resources 
#> Events occurred from NA until NA 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 9 x 8
#>   originator patient_visit_nr activity   start               complete           
#>   <chr>                 <dbl> <chr>      <dttm>              <dttm>             
#> 1 Doctor 7                512 Clinical ~ 2017-11-20 11:27:12 2017-11-20 11:33:57
#> 2 Doctor 7                512 Clinical ~ NA                  2017-11-20 11:33:57
#> 3 Nurse 27                536 Triage     2017-11-22 15:15:39 2017-11-22 15:25:01
#> 4 Nurse 27                536 Treatment  2017-11-22 15:15:41 2017-11-22 15:25:03
#> 5 Nurse 5                 524 Triage     2017-11-21 17:04:03 2017-11-21 17:06:05
#> 6 Nurse 5                 525 Triage     2017-11-21 17:04:13 2017-11-21 17:06:08
#> 7 Nurse 5                 526 Triage     2017-11-21 17:04:15 2017-11-21 17:06:10
#> 8 <NA>                    533 0          2017-11-22 18:35:00 2017-11-22 18:37:00
#> 9 <NA>                    534 Registrat~ 2017-11-22 18:35:00 2017-11-22 18:37:00
#> # ... with 3 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>

Detect Overlaps

hospital %>%
  detect_overlaps()
#> # A tibble: 7 x 4
#>   activity_a    activity_b        n avg_overlap_mins
#>   <chr>         <chr>         <int>            <dbl>
#> 1 Clinical exam Treatment         2            8.17 
#> 2 Registration  Clinical exam     1            1.9  
#> 3 Registration  Triaga            1            2.65 
#> 4 Registration  Triage            1            1.93 
#> 5 Triage        Clinical exam     2            5.63 
#> 6 Triage        Registration      1            0.817
#> 7 Triage        Treatment         1            9.33

Detect Related Activities

hospital %>%
  detect_related_activities(antecedent = "Treatment evaluation", 
                            consequent = "Treatment")
#> *** OUTPUT ***
#> The following statement was checked: if Treatment evaluation is recorded for a case, then Treatment should also be recorded.
#> This statement holds for 5 (83.33%) of the cases in which Treatment evaluation was recorded and does not hold for 1 (16.67%) of the cases in which Treatment evaluation was recorded.
#> For the following cases, only Treatment evaluation is recorded:
#> [1] 529
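
The statement checked above (if the antecedent is recorded for a case, the consequent should be recorded as well) can be illustrated with set operations on case identifiers. This is a hedged base-R sketch on a hypothetical toy log, not the hospital data and not daqapo's internal code. The data frame and its column names are assumptions made for the example.

```r
# Hypothetical toy log: one row per activity instance
toy_log <- data.frame(
  case     = c(1, 1, 2, 3, 3),
  activity = c("Treatment", "Treatment evaluation",
               "Treatment evaluation",
               "Treatment", "Treatment evaluation"),
  stringsAsFactors = FALSE
)

antecedent <- "Treatment evaluation"
consequent <- "Treatment"

# Cases in which each activity is recorded at least once
cases_with_antecedent <- unique(toy_log$case[toy_log$activity == antecedent])
cases_with_consequent <- unique(toy_log$case[toy_log$activity == consequent])

# Violations: the antecedent is present but the consequent is missing
violating_cases <- setdiff(cases_with_antecedent, cases_with_consequent)
violating_cases
```

In this toy log only case 2 has a Treatment evaluation without a Treatment.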

Detect Similar Labels

hospital %>%
  detect_similar_labels(column_labels = "activity", max_edit_distance = 3)
#> Warning in detect_similar_labels.activitylog(., column_labels = "activity", :
#> Not all provided columns are of type character or factor and will be ignored:
#> patient_visit_nr,start,complete,.order
#> # A tibble: 16 x 3
#>    column_labels labels       similar_to                   
#>    <chr>         <chr>        <chr>                        
#>  1 activity      registration Registration                 
#>  2 activity      Registration registration                 
#>  3 activity      Triage       Trage - Triaga               
#>  4 activity      Trage        Triage - Triaga              
#>  5 activity      Triaga       Triage - Trage               
#>  6 originator    Clerk 9      Clerk 12 - Clerk 6 - Clerk 3 
#>  7 originator    Clerk 12     Clerk 9 - Clerk 6 - Clerk 3  
#>  8 originator    Nurse 27     Nurse 17 - Nurse 5 - Nurse 6 
#>  9 originator    Doctor 7     Doctor 4 - Doctor 1          
#> 10 originator    Nurse 17     Nurse 27 - Nurse 5 - Nurse 6 
#> 11 originator    Clerk 6      Clerk 9 - Clerk 12 - Clerk 3 
#> 12 originator    Doctor 4     Doctor 7 - Doctor 1          
#> 13 originator    Clerk 3      Clerk 9 - Clerk 12 - Clerk 6 
#> 14 originator    Nurse 5      Nurse 27 - Nurse 17 - Nurse 6
#> 15 originator    Nurse 6      Nurse 27 - Nurse 17 - Nurse 5
#> 16 originator    Doctor 1     Doctor 7 - Doctor 4
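
To see why labels such as Trage and Triaga are reported as similar to Triage, the edit distances can be computed directly. The sketch below uses base R's adist(), which returns generalized Levenshtein distances. Whether daqapo relies on the same function internally is not stated in this vignette, so treat it as an illustration only. The label vector is a subset of the activity labels shown above.

```r
labels <- c("Triage", "Trage", "Triaga", "Registration", "registration")

# Pairwise edit distances between all labels (utils::adist, base R)
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# Keep each pair of distinct labels once, if it is within the threshold
max_edit_distance <- 3
near <- which(d <= max_edit_distance & upper.tri(d), arr.ind = TRUE)

data.frame(label      = labels[near[, "row"]],
           similar_to = labels[near[, "col"]],
           distance   = d[near])
```

A lower max_edit_distance makes the check stricter. With short labels such as the resource names above, a threshold of 3 flags nearly every pair.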

Detect Time Anomalies

hospital %>%
  detect_time_anomalies()
#> Selected anomaly type: both
#> *** OUTPUT ***
#> For 5 rows in the activity log (9.43%), an anomaly is detected.
#> The anomalies are spread over the activities as follows:
#> # A tibble: 3 x 3
#>   activity      type                  n
#>   <chr>         <chr>             <int>
#> 1 Registration  negative duration     3
#> 2 Clinical exam zero duration         1
#> 3 Trage         negative duration     1
#> Anomalies are found in the following rows:
#> # Log of 10 events consisting of:
#> 3 traces 
#> 3 cases 
#> 5 instances of 3 activities 
#> 5 resources 
#> Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 5 x 10
#>   patient_visit_nr activity   originator start               complete           
#>              <dbl> <chr>      <chr>      <dttm>              <dttm>             
#> 1              518 Registrat~ Clerk 12   2017-11-21 11:45:16 2017-11-21 11:22:16
#> 2              518 Registrat~ Clerk 6    2017-11-21 11:45:16 2017-11-21 11:22:16
#> 3              518 Registrat~ Clerk 9    2017-11-21 11:45:16 2017-11-21 11:22:16
#> 4              520 Trage      Nurse 17   2017-11-21 13:43:16 2017-11-21 13:39:00
#> 5              528 Clinical ~ Doctor 1   2017-11-21 19:00:00 2017-11-21 19:00:00
#> # ... with 5 more variables: triagecode <dbl>, specialization <chr>,
#> #   .order <int>, duration <dbl>, type <chr>
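
The two anomaly types follow from the sign of the duration, that is, complete minus start. The following base-R sketch reproduces that classification for three rows copied from the output above plus one regular Triage row. It illustrates the idea and is not the package's implementation. The data frame name is an assumption.

```r
# Three anomalous rows from the output above and one regular row for contrast
act_log <- data.frame(
  activity = c("Registration", "Trage", "Clinical exam", "Triage"),
  start    = as.POSIXct(c("2017-11-21 11:45:16", "2017-11-21 13:43:16",
                          "2017-11-21 19:00:00", "2017-11-20 10:34:08"), tz = "UTC"),
  complete = as.POSIXct(c("2017-11-21 11:22:16", "2017-11-21 13:39:00",
                          "2017-11-21 19:00:00", "2017-11-20 10:41:48"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Duration in minutes; negative and zero durations are flagged, others stay NA
act_log$duration <- as.numeric(difftime(act_log$complete, act_log$start, units = "mins"))
act_log$type <- ifelse(act_log$duration < 0, "negative duration",
                ifelse(act_log$duration == 0, "zero duration", NA))

act_log[!is.na(act_log$type), c("activity", "duration", "type")]
```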

Detect Unique Values

hospital %>%
  detect_unique_values(column_labels = "activity")
#> *** OUTPUT ***
#> Distinct entries are computed for the following columns: 
#> activity
#> # Log of 105 events consisting of:
#> 14 traces 
#> 22 cases 
#> 53 instances of 9 activities 
#> 12 resources 
#> Events occurred from NA until NA 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 53 x 6
#>    activity  patient_visit_nr originator start               complete           
#>    <chr>                <dbl> <chr>      <dttm>              <dttm>             
#>  1 registra~              510 Clerk 9    2017-11-20 10:18:17 2017-11-20 10:20:06
#>  2 Registra~              512 Clerk 12   2017-11-20 10:33:14 2017-11-20 10:37:00
#>  3 Triage                 510 Nurse 27   2017-11-20 10:34:08 2017-11-20 10:41:48
#>  4 Triage                 512 Nurse 27   2017-11-20 10:44:12 2017-11-20 10:50:17
#>  5 Clinical~              512 Doctor 7   2017-11-20 11:27:12 2017-11-20 11:33:57
#>  6 Clinical~              510 Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
#>  7 Triage                 517 Nurse 17   2017-11-21 11:35:16 2017-11-21 11:39:00
#>  8 Registra~              518 Clerk 12   2017-11-21 11:45:16 2017-11-21 11:22:16
#>  9 Registra~              518 Clerk 6    2017-11-21 11:45:16 2017-11-21 11:22:16
#> 10 Registra~              518 Clerk 9    2017-11-21 11:45:16 2017-11-21 11:22:16
#> # ... with 43 more rows, and 1 more variable: .order <int>

hospital %>%
  detect_unique_values(column_labels = c("activity", "originator"))
#> *** OUTPUT ***
#> Distinct entries are computed for the following columns: 
#> activity - originator
#> # Log of 105 events consisting of:
#> 14 traces 
#> 22 cases 
#> 53 instances of 9 activities 
#> 12 resources 
#> Events occurred from NA until NA 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 53 x 6
#>    activity  originator patient_visit_nr start               complete           
#>    <chr>     <chr>                 <dbl> <dttm>              <dttm>             
#>  1 registra~ Clerk 9                 510 2017-11-20 10:18:17 2017-11-20 10:20:06
#>  2 Registra~ Clerk 12                512 2017-11-20 10:33:14 2017-11-20 10:37:00
#>  3 Triage    Nurse 27                510 2017-11-20 10:34:08 2017-11-20 10:41:48
#>  4 Triage    Nurse 27                512 2017-11-20 10:44:12 2017-11-20 10:50:17
#>  5 Clinical~ Doctor 7                512 2017-11-20 11:27:12 2017-11-20 11:33:57
#>  6 Clinical~ Doctor 7                510 2017-11-20 11:35:01 2017-11-20 11:36:09
#>  7 Triage    Nurse 17                517 2017-11-21 11:35:16 2017-11-21 11:39:00
#>  8 Registra~ Clerk 12                518 2017-11-21 11:45:16 2017-11-21 11:22:16
#>  9 Registra~ Clerk 6                 518 2017-11-21 11:45:16 2017-11-21 11:22:16
#> 10 Registra~ Clerk 9                 518 2017-11-21 11:45:16 2017-11-21 11:22:16
#> # ... with 43 more rows, and 1 more variable: .order <int>

Detect Value Range Violations

hospital %>%
  detect_value_range_violations(triagecode = domain_numeric(from = 0, to = 5))
#> $triagecode
#> $type
#> [1] "numeric"
#> 
#> $from
#> [1] 0
#> 
#> $to
#> [1] 5
#> 
#> attr(,"class")
#> [1] "value_range" "list"
#> *** OUTPUT ***
#> The domain range for column triagecode is checked.
#> Values allowed between 0 and 5
#> The values fall within the specified domain range for 46 (86.79%) of the rows in the activity log and outside the domain range for 7 (13.21%) of these rows.
#> 
#> The following rows fall outside the specified domain range for indicated column:
#> # Log of 14 events consisting of:
#> 5 traces 
#> 6 cases 
#> 7 instances of 5 activities 
#> 4 resources 
#> Events occurred from 2017-11-20 11:35:01 until 2017-11-23 18:33:00 
#>  
#> # Variables were mapped as follows:
#> Case identifier:     patient_visit_nr 
#> Activity identifier:     activity 
#> Resource identifier:     originator 
#> Timestamps:      start, complete 
#> 
#> # A tibble: 7 x 9
#>   column_checked patient_visit_nr activity        originator start              
#>   <chr>                     <dbl> <chr>           <chr>      <dttm>             
#> 1 triagecode                  510 Clinical exam   Doctor 7   2017-11-20 11:35:01
#> 2 triagecode                  529 Treatment eval~ Doctor 1   2017-11-22 16:30:00
#> 3 triagecode                  530 Triage          Nurse 17   2017-11-22 18:00:00
#> 4 triagecode                  531 Triage          Nurse 17   2017-11-22 18:05:00
#> 5 triagecode                  532 Treatment       Nurse 17   2017-11-22 18:15:00
#> 6 triagecode                  532 Treatment eval~ Doctor 7   2017-11-22 18:27:00
#> 7 triagecode                  533 0               <NA>       2017-11-22 18:35:00
#> # ... with 4 more variables: complete <dttm>, triagecode <dbl>,
#> #   specialization <chr>, .order <int>
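
A numeric domain check of this kind reduces to a comparison against the two bounds. The sketch below uses hypothetical triage codes, not the hospital data, and it counts missing values as violations. That treatment of NA is an assumption of the sketch, not a documented property of detect_value_range_violations().

```r
# Hypothetical triage codes; NA stands for a missing value
triagecode <- c(3, 5, 0, 6, NA, 2)

from <- 0
to   <- 5

# TRUE where the value is present and lies inside [from, to]
in_range <- !is.na(triagecode) & triagecode >= from & triagecode <= to

# Percentage of rows inside the domain, and positions of the violations
round(100 * mean(in_range), 2)
which(!in_range)
```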

¹ Hasselt University, Research group Business Informatics | Research Foundation Flanders (FWO). niels.martin@uhasselt.be
² Hasselt University, Research group Business Informatics. greg.vanhoudt@uhasselt.be
³ Hasselt University, Research group Business Informatics. gert.janssenswillen@uhasselt.be
