The package was developed to support the most common data shapes and file formats used in bioinformatics. This guide describes how each file used by the portal should be structured.
Expression matrix
The expected format for matrices has sample identifiers in columns and symbols in rows. The best format for the portal is an R object of matrix type saved as an .rds file, because it loads faster, although the matrix can also be saved as a CSV or TSV file. If using CSV or TSV, ensure that the first row (with the exception of the first column) contains the sample identifiers and that the first column contains the transcript identifiers.
Saving an RDS file: to export a matrix as an .rds file, run:
saveRDS(matrix_object, "matrix_object.rds")
The following is a valid example of an expression matrix:
 | S1_01 | S1_02 | S2_01 | S2_02 | S3_01 | S3_02 |
---|---|---|---|---|---|---|
ABC | -0.9638307 | 0.9305799 | -0.1634020 | 0.3869199 | 0.1323101 | -1.1476565 |
BCD | 0.5915099 | -0.7732724 | 1.3275503 | 0.1250687 | 0.6158247 | 2.7267535 |
CDE | 0.7670504 | -0.4685030 | -0.8342650 | 0.7302969 | -0.2097238 | -0.7743177 |
DEF | -0.1043932 | -0.0204701 | 0.6837000 | -0.1098506 | -0.0400066 | -1.4444510 |
EFG | 0.2271225 | -0.5235521 | -0.8528140 | -0.8013274 | 1.6425337 | 0.7103904 |
FGH | -1.2320077 | -1.2217801 | 1.0746905 | 0.2613401 | 1.1482638 | 0.9314086 |
GHI | -0.9559715 | 0.3681694 | 1.8851831 | -0.5059663 | -0.5620557 | 0.2334435 |
HIJ | -0.1059386 | 0.0058964 | 0.7196340 | 1.2818882 | -0.2737815 | -2.3557326 |
IJK | -1.1363132 | 1.3379408 | -1.1267577 | -0.0046435 | -1.2029404 | 0.7046288 |
JKL | 0.9970512 | 0.2802331 | -0.4905672 | 0.4724376 | 0.2732428 | 2.7696893 |
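For the CSV route, the layout described above (sample identifiers in the first row, row identifiers in the first column) is what base R's write.csv() produces for a matrix with row names. The following is a minimal sketch using a small made-up matrix and temporary file paths; the object and file names are illustrative only.

```r
# Small illustrative matrix: symbols in rows, sample identifiers in columns.
set.seed(1)
expr <- matrix(
  rnorm(6),
  nrow = 3,
  dimnames = list(c("ABC", "BCD", "CDE"), c("S1_01", "S1_02"))
)

# write.csv() writes the row names as the first column, so the first row
# holds the sample identifiers and the first column the row identifiers.
csv_path <- file.path(tempdir(), "matrix_object.csv")
write.csv(expr, csv_path)

# Reading the file back: row.names = 1 restores the row identifiers and
# check.names = FALSE keeps the sample names exactly as written.
expr_csv <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

# The .rds route keeps the matrix class without any conversion step.
rds_path <- file.path(tempdir(), "matrix_object.rds")
saveRDS(expr, rds_path)
expr_rds <- readRDS(rds_path)
```

Reading the CSV back and comparing its row and column names with the original is a quick way to confirm that the identifiers ended up where the portal expects them.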
Measures table
The measures table has one row per subject (even if a subject has more than one sample collected), with the measures across columns. A data.frame or tibble object can be saved in an .rds file; CSV and TSV files are also supported.
Measures collected over time should be represented in separate columns, with the convention (enforced by default) of a time code as a suffix for measure names, separated by an underscore (_). This means that underscores cannot be used elsewhere in measure names. For example, for disease activity collected over four time points, the expected names are: diseaseActivity_Baseline, diseaseActivity_Week1, diseaseActivity_Week2 and diseaseActivity_Week3. By default, it is invalid to use a name such as Disease_Activity_Baseline. The time separator can be modified in the configuration file by setting timesep to the desired separator.
The following is a valid example of a measures table:
Patient_ID | Platelets_m01 | Platelets_m02 | Age | drugNaive |
---|---|---|---|---|
p01 | 205.8735 | 213.6218 | 39 | Yes |
p02 | 151.0424 | 245.5823 | 72 | Yes |
p03 | 214.0426 | 151.8249 | 78 | No |
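To check column names against the naming convention before loading a file, the names can be split on the separator. This is a rough sketch, not the package's own validation: the helper check_measure_names() is hypothetical, and it assumes that the subject identifier column (Patient_ID in the example) is exempt from the rule.

```r
# Example measures table, matching the one shown above.
measures <- data.frame(
  Patient_ID = c("p01", "p02", "p03"),
  Platelets_m01 = c(205.8735, 151.0424, 214.0426),
  Platelets_m02 = c(213.6218, 245.5823, 151.8249),
  Age = c(39, 72, 78),
  drugNaive = c("Yes", "Yes", "No")
)

# Hypothetical helper: split each name on the time separator and flag names
# that contain the separator more than once.
check_measure_names <- function(cols, timesep = "_") {
  parts <- strsplit(cols, timesep, fixed = TRUE)
  data.frame(
    column = cols,
    measure = vapply(parts, `[`, character(1), 1),
    time = vapply(
      parts,
      function(p) if (length(p) == 2) p[2] else NA_character_,
      character(1)
    ),
    valid = lengths(parts) <= 2
  )
}

# Assumption: the subject identifier column is not treated as a measure.
measure_cols <- setdiff(colnames(measures), "Patient_ID")
name_check <- check_measure_names(measure_cols)

# A name with an extra underscore is flagged as invalid.
bad_check <- check_measure_names("Disease_Activity_Baseline")
```

Here name_check lists Platelets with time codes m01 and m02, while Age and drugNaive have no time code; bad_check marks Disease_Activity_Baseline as invalid.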
Lookup table
For datasets where a subject has more than one sample (e.g. samples over time, from different tissues or combinations thereof), a lookup table should be constructed as a data.frame or tibble and saved as an .rds, CSV or TSV file.
This table will map subject identifiers to sample identifiers, with the expression matrix containing data for all samples in the dataset. The table should also contain metadata that allows selection of groups of samples. For example, if subjects and samples vary on time, drug groups and tissues, the lookup table should have one column for each category. The following is an example of such a table:
Sample_ID | Time | Tissue | Drug | Patient_ID |
---|---|---|---|---|
S1_01 | m01 | A | d1 | p01 |
S1_02 | m02 | A | d1 | p01 |
S2_01 | m01 | A | d1 | p02 |
S2_02 | m02 | A | d1 | p02 |
S3_01 | m01 | A | d2 | p03 |
S3_02 | m02 | A | d2 | p03 |
In the case above, patients p01 and p02 belong to drug group d1, while patient p03 belongs to group d2. All patients have samples collected at months 1 and 2 (encoded as m01 and m02), and all samples are from the same tissue (A). Note that, in this example, drug group differentiates samples from different subjects, but not necessarily samples from the same subject.
The lookup table can also be enriched with other characteristics of subjects that can be used to partition them, such as age, gender, or others. Outputs of methods such as clustering can also be added to the table: this enables exploring, for example, correlations in different clusters, or comparing trajectories across different clusters.
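One way to add subject-level information such as cluster assignments is a merge on the subject identifier. The sketch below uses base R; the clusters table and its Cluster column are made up for illustration.

```r
# Lookup table matching the example above.
lookup <- data.frame(
  Sample_ID = c("S1_01", "S1_02", "S2_01", "S2_02", "S3_01", "S3_02"),
  Time = rep(c("m01", "m02"), times = 3),
  Tissue = "A",
  Drug = c("d1", "d1", "d1", "d1", "d2", "d2"),
  Patient_ID = rep(c("p01", "p02", "p03"), each = 2)
)

# Hypothetical clustering output: one row per subject.
clusters <- data.frame(
  Patient_ID = c("p01", "p02", "p03"),
  Cluster = c("c1", "c2", "c1")
)

# all.x = TRUE keeps every sample, even if a subject has no cluster assigned.
lookup_enriched <- merge(lookup, clusters, by = "Patient_ID", all.x = TRUE)

saveRDS(lookup_enriched, file.path(tempdir(), "lookup.rds"))
```

Each sample inherits the cluster of its subject, so the new column can be used to select groups of samples in the same way as Time, Tissue or Drug.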
The package does a very lightweight validation of the loaded files, only checking whether subjects and samples match. It does not ensure that the correct transformations have been applied to the expression data, nor does it warn about or modify missing data. In modules where expression is paired with a measure, a subject whose measure is missing is excluded from plots and calculations that include that measure.
The following checks ARE made:
Matching samples and subjects: the package will confirm that every sample in the expression matrix is matched to at least one subject in the lookup table. It will also ensure that all subjects in the measures table match to at least one sample in the lookup table. That is, there can be no excess of samples or subjects in each table.
Matrix format: if using an .rds file, the package will check that the expression matrix was indeed saved as a matrix object in the rds file. This is to ensure that the rownames are read properly.
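The same checks can be run by hand before loading files into the portal. The function below is an illustrative approximation, not the package's actual validation code; validate_inputs() is a hypothetical name, and the default column names Sample_ID and Patient_ID are taken from the example tables in this guide.

```r
# Hypothetical helper that mirrors the checks described above.
validate_inputs <- function(expr, lookup, measures,
                            sample_col = "Sample_ID",
                            subject_col = "Patient_ID") {
  problems <- character(0)

  # Matrix format: the expression data should be a matrix object.
  if (!is.matrix(expr)) {
    problems <- c(problems, "expression data is not a matrix")
  }

  # Every sample in the expression matrix should appear in the lookup table.
  extra_samples <- setdiff(colnames(expr), lookup[[sample_col]])
  if (length(extra_samples) > 0) {
    problems <- c(problems, paste(
      "samples missing from lookup table:",
      paste(extra_samples, collapse = ", ")
    ))
  }

  # Every subject in the measures table should appear in the lookup table.
  extra_subjects <- setdiff(measures[[subject_col]], lookup[[subject_col]])
  if (length(extra_subjects) > 0) {
    problems <- c(problems, paste(
      "subjects missing from lookup table:",
      paste(extra_subjects, collapse = ", ")
    ))
  }

  problems
}

# Minimal example inputs.
expr <- matrix(0, nrow = 2, ncol = 2,
               dimnames = list(c("ABC", "BCD"), c("S1_01", "S1_02")))
lookup <- data.frame(Sample_ID = c("S1_01", "S1_02"),
                     Patient_ID = c("p01", "p01"))
measures <- data.frame(Patient_ID = "p01", Age = 39)

ok <- validate_inputs(expr, lookup, measures)

# Adding a sample that the lookup table does not know about.
expr_extra <- cbind(expr, S9_01 = c(0, 0))
not_ok <- validate_inputs(expr_extra, lookup, measures)
```

An empty result means all three checks passed; otherwise each element describes one mismatch.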
The package includes two modules to showcase results of differential expression analysis (see config for more details). These modules will read files created using limma, edgeR or deseq2. All files should be saved with column names, and the column names must not be changed. The only exception is when you want to mix models from different packages: in that case, rename the columns so that all results have the same column names (e.g. p-values are identified in the same way across all files).
These modules require a table that lists all model results, and they support additional columns in the table to organize results from different types of models or subsets of samples. All model results files should be placed in a models folder within the project folder.
The table should look like the following and be saved as a CSV or TSV file:
Model | Time | Drug | File |
---|---|---|---|
Linear | m01 | d1 | Model_1.txt |
Linear | m02 | d2 | Model_2.txt |
Nonlinear | m01 | d1 | Model_3.txt |
Nonlinear | m02 | d2 | Model_4.txt |
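A table like this can be assembled and written from R. In the sketch below, the project folder path and the file name models.csv are placeholders, since this guide does not specify them; the last step lists any results files named in the table that are not yet present in the models folder.

```r
# Table of model results, matching the example above.
models_table <- data.frame(
  Model = c("Linear", "Linear", "Nonlinear", "Nonlinear"),
  Time = c("m01", "m02", "m01", "m02"),
  Drug = c("d1", "d2", "d1", "d2"),
  File = paste0("Model_", 1:4, ".txt")
)

# Placeholder project folder; replace with the real project location.
project_dir <- file.path(tempdir(), "project")
models_dir <- file.path(project_dir, "models")
dir.create(models_dir, recursive = TRUE, showWarnings = FALSE)

# row.names = FALSE avoids writing an extra unnamed index column.
# The file name models.csv is an assumption made for this example.
table_path <- file.path(project_dir, "models.csv")
write.csv(models_table, table_path, row.names = FALSE)

# Results files listed in the table but not found in the models folder.
missing_files <- models_table$File[
  !file.exists(file.path(models_dir, models_table$File))
]
```

In this example the models folder is empty, so missing_files contains all four file names; with real results in place it should be empty.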
The heatmap module requires a table containing lists of names such as gene symbols (see config for more details). In this table, each row has a column that contains a gene list, with symbols separated by commas. If you have a table with one column of list identifiers and one column of symbols (one symbol per row), you can use a group-by operation with paste-collapse to create the required lists.
If you have any issues with data preparation, please open an issue on the package's GitHub repository.
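The following is a minimal sketch of that group-by and paste-collapse step in base R. The column names List, Symbol and Genes are illustrative assumptions; the names the heatmap module actually expects are described in the config documentation.

```r
# Long-format table: one row per (list identifier, symbol) pair.
long_table <- data.frame(
  List = c("listA", "listA", "listA", "listB", "listB"),
  Symbol = c("ABC", "BCD", "CDE", "DEF", "EFG")
)

# Group by list identifier and collapse the symbols of each group into a
# single comma-separated string.
gene_lists <- aggregate(
  Symbol ~ List,
  data = long_table,
  FUN = function(x) paste(x, collapse = ",")
)
colnames(gene_lists) <- c("List", "Genes")
```

The result has one row per list, with listA holding "ABC,BCD,CDE". With dplyr, the equivalent is group_by(List) followed by summarise(Genes = paste(Symbol, collapse = ",")).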