The “covid19.analytics” R package allows users to obtain live* worldwide data on the novel Coronavirus Disease originally reported in 2019, CoViD-19, as published by the JHU CSSE repository [1], and provides basic analysis tools and functions to investigate these datasets.
The goal of this package is to make the latest data promptly available to researchers and the scientific community.
argument | description |
---|---|
`aggregated` | latest number of cases aggregated by country |
Time Series data | |
`ts-confirmed` | time series data of confirmed cases |
`ts-deaths` | time series data of fatal cases |
`ts-recovered` | time series data of recovered cases |
`ts-ALL` | all time series data combined |
Deprecated data formats | |
`ts-dep-confirmed` | time series data of confirmed cases as originally reported (deprecated) |
`ts-dep-deaths` | time series data of deaths as originally reported (deprecated) |
`ts-dep-recovered` | time series data of recovered cases as originally reported (deprecated) |
Combined | |
`ALL` | all of the above |
Time Series data for specific locations | |
`ts-Toronto` | time series data of confirmed cases for the city of Toronto, ON - Canada |
`ts-confirmed-US` | time series data of confirmed cases for the US detailed per state |
`ts-deaths-US` | time series data of fatal cases for the US detailed per state |
The TimeSeries data is organized in a specific manner, with a given set of fields or columns that follows this structure:
“Province.State” | “Country.Region” | “Lat” | “Long” | … | seq of dates | … |
If you have data structured in a data.frame organized as described above, then most of the functions provided by the “covid19.analytics” package for analyzing TimeSeries data will work with your data. In this way it is possible to add new data sets to the ones that can be loaded using the repositories predefined in this package and extend the analysis capabilities to these new datasets.
Be sure also to check the compatibility of these datasets using the Data Integrity and Consistency Checks functions described in the following section.
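As a minimal sketch of the layout described above, the following base R snippet builds a small data.frame with the four leading columns followed by one column per date. All values and location names are made up, and the date label format used for the column names is an assumption; the package may expect a different one.

```r
# Synthetic, made-up numbers purely for illustration
dates <- format(seq(as.Date("2020-04-01"), by = "day", length.out = 5))

# One row per location, one column per date (cumulative counts)
counts <- rbind(c(10, 12, 15, 21, 30),
                c(3, 3, 4, 6, 9))
colnames(counts) <- dates

my.ts <- data.frame(
  Province.State = c("Region A", "Region B"),    # hypothetical names
  Country.Region = c("Exampleland", "Exampleland"),
  Lat  = c(45.0, 46.5),
  Long = c(-75.0, -80.2),
  counts,
  check.names = FALSE,        # keep the date labels unchanged as column names
  stringsAsFactors = FALSE
)

str(my.ts)
```

A data.frame built this way can then be passed through the checking functions described in the next section before any analysis.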
Due to the ongoing and rapidly changing situation with the CoViD-19 pandemic, the reported data has sometimes been found to change its internal format or even show some “anomalies” or “inconsistencies” (see https://github.com/CSSEGISandData/COVID-19/issues/).
For instance, some cumulative quantities reported in the time series datasets have been observed to decrease at times instead of increasing continuously, which should not happen (see, for instance, https://github.com/CSSEGISandData/COVID-19/issues/2165). We refer to this as an inconsistency of “type II”.
Some negative values have also been reported in the data, which is not valid either; we call this an inconsistency of “type I”.
When this occurs, it happens at the origin of the dataset, in our case the one obtained from the JHU CSSE repository [1]. To make the user aware of this, we implemented two consistency and integrity checking functions:
- `consistency.check()`: attempts to determine whether there are consistency issues within the data, such as negative reported values (inconsistency of “type I”) or anomalies in the cumulative quantities of the data (inconsistency of “type II”)
- `integrity.check()`: determines whether there are integrity issues within the datasets or changes to the structure of the data
Alternatively, we provide a `data.checks()` function that runs both functions on a specified dataset.
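To make the two kinds of inconsistency concrete, here is a minimal base R sketch that flags them on a toy dataset. The helper `find.inconsistencies()` and its `first.date.col` argument are hypothetical names introduced for illustration; this is not how the package's `consistency.check()` is implemented.

```r
# Hypothetical helper, NOT the package's consistency.check() implementation
find.inconsistencies <- function(ts.df, first.date.col = 5) {
  vals <- as.matrix(ts.df[, first.date.col:ncol(ts.df), drop = FALSE])
  # type I: at least one negative reported value in the row
  type.I <- apply(vals, 1, function(x) any(x < 0, na.rm = TRUE))
  # type II: a cumulative series that decreases between consecutive dates
  type.II <- apply(vals, 1, function(x) any(diff(x) < 0, na.rm = TRUE))
  data.frame(location = ts.df$Country.Region,
             type.I = type.I, type.II = type.II)
}

# Toy data: A is clean, B decreases once, C contains a negative value
toy <- data.frame(
  Province.State = c("", "", ""),
  Country.Region = c("A", "B", "C"),
  Lat = c(0, 0, 0), Long = c(0, 0, 0),
  d1 = c(1, 5, 2), d2 = c(3, 4, -1), d3 = c(6, 7, 4)
)

res <- find.inconsistencies(toy)
print(res)
```

Note that a negative value in a cumulative series usually also shows up as a decrease, so location C is flagged under both types.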
It is highly unlikely that you will face a situation where the internal structure of the data or its actual integrity is compromised, but if you think this is the case, or the `integrity.check()` function reports it, we urge you to contact the developer of this package (https://github.com/mponce0/covid19.analytics/issues).
Data consistency issues and/or anomalies in the data have been reported several times, see https://github.com/CSSEGISandData/COVID-19/issues/.
These are claimed, in most cases, to be misreported data and usually amount to an insignificant fraction of the total cases.
Having said that, we believe that the user should be aware of these situations, and we recommend using the `consistency.check()` function to verify the dataset you will be working with.
The `covid19.genomic.data()` function allows users to obtain genomic sequencing data for the covid19 virus from NCBI [3].
In addition to the access and retrieval of the data, the package includes some basic functions to estimate totals per region/country/city, growth rates and daily changes in the reported number of cases.
Function | Description | Main Type of Output |
---|---|---|
Data Acquisition | | |
`covid19.data` | obtain live* worldwide data for covid19 virus, from the JHU’s CSSE repository [1] | return dataframes/list with the collected data |
`covid19.Toronto.data` | obtain live* data for covid19 cases in the city of Toronto, ON Canada, from the City of Toronto reports [2] | return dataframe/list with the collected data |
`covid19.US.data` | obtain live* US specific data for covid19 virus, from the JHU’s CSSE repository [1] | return dataframe with the collected data |
`covid19.genomic.data` | obtain covid19’s genomic sequencing data from NCBI [3] | list, with the RNA seq data in the "$NC_045512.2" entry |
Data Quality Assessment | | |
`data.checks` | run integrity and consistency checks on a given dataset | diagnostics about the dataset integrity and consistency |
`consistency.check` | run consistency checks on a given dataset | diagnostics about the dataset consistency |
`integrity.check` | run integrity checks on a given dataset | diagnostics about the dataset integrity |
Analysis | | |
`report.summary` | summarize the current situation, will download the latest data and summarize different quantities | on screen table and static plots (pie and bar plots) with reported information, can also output the tables into a text file |
`tots.per.location` | compute totals per region and plot time series for that specific region/country | static plots: data + models (exp/linear, Poisson, Gamma), mosaic and histograms when more than one location is selected |
`growth.rate` | compute changes and growth rates per region and plot time series for that specific region/country | static plots: data + models (linear, Poisson, Exp), mosaic and histograms when more than one location is selected |
`single.trend`, `mtrends` | visualize different indicators of the “trends” in daily changes for a single or multiple locations | composition of static plots: total number of cases vs time, daily changes vs total changes in different representations |
Graphics and Visualization | | |
`total.plts` | plots the total number of cases per day in a static and interactive plot, the user can specify multiple locations or global totals | static and interactive plot |
`itrends` | generates an interactive plot of daily changes vs total changes in a log-log plot, for the indicated regions | interactive plot |
`live.map` | generates an interactive map displaying cases around the world | static and interactive plot |
Modelling | | |
`generate.SIR.model` | generates a SIR (Susceptible-Infected-Recovered) model | list containing the fits for the SIR model |
`plt.SIR.model` | plot the results from the SIR model | static and interactive plots |
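As a rough illustration of the daily changes and growth rates that the analysis functions in the table above deal with, the following base R sketch works on made-up cumulative counts. Taking the growth rate as the ratio between consecutive daily changes is an assumption made here for illustration; the package's `growth.rate()` function may define or normalize it differently.

```r
# Synthetic cumulative counts for a single location (made-up values)
cumulative <- c(100, 120, 150, 195, 240)

# Daily changes: difference between consecutive cumulative values
daily.changes <- diff(cumulative)

# Growth rate (assumed definition): ratio of each daily change to the previous one
growth.rate.sketch <- daily.changes[-1] / daily.changes[-length(daily.changes)]

print(daily.changes)
print(growth.rate.sketch)
```

Under this definition, values above 1 indicate that daily changes are accelerating, and values below 1 that they are slowing down.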
The `report.summary()` function generates an overall report summarizing the different datasets.
It can summarize the “Time Series” data (`cases.to.process="TS"`), the “aggregated” data (`cases.to.process="AGG"`) or both (`cases.to.process="ALL"`).
It will display the top 10 entries in each category, or the number indicated in the `Nentries` argument; to display all the records, set `Nentries=0`.
The function can also target specific geographical location(s) using the `geo.loc` argument.
When a geographical location is indicated, the report will include an additional “Rel.Perc” column for the confirmed cases indicating the relative percentage among the locations indicated.
Similarly the totals displayed at the end of the report will be for the selected locations.
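A small base R sketch of what a relative percentage among selected locations could look like, using the confirmed totals for the US and Spain from the sample report shown later in this section. The formula (each location's share of the sum over the selected locations) is an assumption based on the description above, not taken from the package source.

```r
# Confirmed totals for two selected locations (from the sample report, 2020-04-12)
selected <- c(US = 555313, Spain = 166831)

# Assumed definition of "Rel.Perc": share of each location within the selection
rel.perc <- round(100 * selected / sum(selected), 2)
print(rel.perc)
```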
In each case (“TS” and/or “AGG”), the report presents tables ordered by the different types of cases included, i.e. confirmed infected, deaths, recovered and active cases.
The date when the report is generated and the date of the recorded data are included at the beginning of each table.
It will also compute the totals, averages, standard deviations and percentages of various quantities:
it will compute the total number of cases per case
Percentages: percentages are computed as follows:
For “Time Series” data:
Typical structure of a `report.summary()` output for the Time Series data:

    ################################################################################
    ##### TS-CONFIRMED Cases -- Data dated: 2020-04-12 :: 2020-04-13 12:02:27
    ################################################################################
    Number of Countries/Regions reported: 185
    Number of Cities/Provinces reported: 83
    Unique number of geographical locations combined: 264
    --------------------------------------------------------------------------------
    Worldwide ts-confirmed Totals: 1846679
    --------------------------------------------------------------------------------
    Country.Region Province.State Totals GlobalPerc LastDayChange t-2 t-3 t-7 t-14 t-30
    1 US 555313 30.07 28917 29861 35098 29595 20922 548
    2 Spain 166831 9.03 3804 4754 5051 5029 7846 1159
    3 Italy 156363 8.47 4092 4694 3951 3599 4050 3497
    4 France 132591 7.18 2937 4785 7120 5171 4376 808
    5 Germany 127854 6.92 2946 2737 3990 3251 4790 910
    .
    .
    .
    --------------------------------------------------------------------------------
    Global Perc. Average: 0.38 (sd: 2.13)
    Global Perc. Average in top 10 : 7.85 (sd: 8.18)
    --------------------------------------------------------------------------------
    ********************************************************************************
    ******************************** OVERALL SUMMARY********************************
    ********************************************************************************
    **** Time Series TOTS ****
    ts-confirmed ts-deaths ts-recovered
    1846679 114091 421722
    6.18% 22.84%
    **** Time Series AVGS ****
    ts-confirmed ts-deaths ts-recovered
    6995 432.16 1686.89
    6.18% 24.12%
    **** Time Series SDS ****
    ts-confirmed ts-deaths ts-recovered
    39320.05 2399.5 8088.55
    6.1% 20.57%
    * Statistical estimators computed considering 250 independent reported entries
    ********************************************************************************
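The percentages in the sample output above can be cross-checked with a few lines of base R. The figures below are copied from that output; reading `GlobalPerc` as the location total divided by the worldwide total, and the percentages under the totals as ratios to the confirmed total, is an inference from the numbers rather than a statement about the package's code.

```r
# Figures copied from the sample Time Series report (data dated 2020-04-12)
world.total <- 1846679
top <- c(US = 555313, Spain = 166831, Italy = 156363)

# Inferred relation: GlobalPerc = 100 * location total / worldwide total
global.perc <- round(100 * top / world.total, 2)
print(global.perc)    # matches the GlobalPerc column: 30.07, 9.03, 8.47

# Overall summary: deaths and recovered as a percentage of confirmed cases
totals <- c(confirmed = 1846679, deaths = 114091, recovered = 421722)
summary.perc <- round(100 * totals[c("deaths", "recovered")] / totals["confirmed"], 2)
print(summary.perc)   # matches the 6.18% and 22.84% shown under the totals
```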
Typical structure of a `report.summary()` output for the Aggregated data:

    #################################################################################################################################
    ##### AGGREGATED Data -- ORDERED BY CONFIRMED Cases -- Data dated: 2020-04-12 :: 2020-04-13 12:02:29
    #################################################################################################################################
    Number of Countries/Regions reported: 185
    Number of Cities/Provinces reported: 138
    Unique number of geographical locations combined: 2989
    ---------------------------------------------------------------------------------------------------------------------------------
    Location Confirmed Perc.Confirmed Deaths Perc.Deaths Recovered Perc.Recovered Active Perc.Active
    1 Spain 166831 9.03 17209 10.32 62391 37.40 87231 52.29
    2 Italy 156363 8.47 19899 12.73 34211 21.88 102253 65.39
    3 France 132591 7.18 14393 10.86 27186 20.50 91012 68.64
    4 Germany 127854 6.92 3022 2.36 60300 47.16 64532 50.47
    5 New York City, New York, US 103208 5.59 6898 6.68 0 0.00 96310 93.32
    .
    .
    .
    =================================================================================================================================
    Confirmed Deaths Recovered Active
    Totals
    1846680 114090 421722 1310868
    Average
    617.83 38.17. 141.09 438.56
    Standard Deviation
    6426.31 613.69 2381.22 4272.19
    * Statistical estimators computed considering 2989 independent reported entries
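The per-location columns of the aggregated report above are consistent with simple relations between the counts, which the following base R sketch reproduces for the first two rows. The relations (active cases as confirmed minus deaths minus recovered, and each percentage taken relative to the location's confirmed cases) are inferred from the sample figures and are not confirmed to be the package's implementation.

```r
# First two rows of the sample aggregated report (data dated 2020-04-12)
agg <- data.frame(
  Location  = c("Spain", "Italy"),
  Confirmed = c(166831, 156363),
  Deaths    = c(17209, 19899),
  Recovered = c(62391, 34211)
)

# Inferred relation: active cases are what remains of the confirmed cases
agg$Active <- agg$Confirmed - agg$Deaths - agg$Recovered

# Inferred relation: percentages are relative to each location's confirmed cases
agg$Perc.Deaths    <- round(100 * agg$Deaths    / agg$Confirmed, 2)
agg$Perc.Recovered <- round(100 * agg$Recovered / agg$Confirmed, 2)
agg$Perc.Active    <- round(100 * agg$Active    / agg$Confirmed, 2)

print(agg)
```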
In both cases an overall summary of the reported cases is presented at the end, displaying totals, averages and standard deviations of the computed quantities.
A full example of this report for today can be seen here