The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
In this vignette we will deconvolve a simple dataset using “dtangle2.” This vignette will follow our previous vignette on “dtangle” closely.
In this vignette we will work through a simple example of deconvolving cell type proportions from DNA microarray data. We work with a data set created from rats and introduced by Shen-Orr et al. This is available on GEO with accession GSE19830. The data set we will work with is a subset of the Shen-Orr data and is included in the dtangle
package under the name shen_orr_ex
. Alternatively, we can access this data set and other data sets through the supplementary dtangle.data
package we have made available here. More information about the data set is available as part of the R
help, ?shen_orr_ex
. First, load up the data set.
library('dtangle')
names(shen_orr_ex)
## [1] "data" "annotation" "name"
In this data set, rat brain, liver and lung cells have been mixed together in various proportions. The resulting mixtures were analyzed with DNA microarrays. The mixing proportions are encoded in the mixture matrix
head(shen_orr_ex$annotation$mixture)
## Liver Brain Lung
## GSM495209 1 0 0
## GSM495210 1 0 0
## GSM495211 1 0 0
## GSM495212 0 1 0
## GSM495213 0 1 0
## GSM495214 0 1 0
Each row of this matrix is a sample and each column gives the mixing proportions of the cell types in each sample.
The RMA-summarized gene expression data generated as part of the Shen-Orr experiment is accessible under data$log
:
Y <- shen_orr_ex$data$log
Y[1:4,1:4]
## X1367566_at X1367568_a_at X1367570_at X1367584_at
## GSM495209 3.396192 7.685769 5.722330 6.628653
## GSM495210 2.882626 7.759002 6.005583 6.771917
## GSM495211 3.072980 7.598871 5.741630 6.564820
## GSM495212 3.168440 7.209959 6.396841 7.040779
Each row is a different individual and each column is a particular gene. The values of the matrix are \(\log_2\) RMA-summarized gene expressions.
The arguments to dtangle2 can be grouped as follows:
Other arguments to dtangle2 may be found in the help documentation
?dtangle2
In order to deconvolve gene expression data from mixture samples dtangle2 requires references of the cell-types to be deconvolved. The mixture gene expressions and reference gene expressions are given to dtangle2 using the arguments Y, references, and pure_samples.
Consider, again, our data from Shen-Orr et al as an example. For speed, we’re going to subset to a very small number of genes:
library('dtangle')
data = shen_orr_ex$data$log[,c(1:10,201:210,401:410)]
mixture_proportions = shen_orr_ex$annotation$mixture
looking at the mixture proportions we can see that the first nine samples are pure reference samples of the three cell types and the remaining samples are mixture samples of the cell types
mixture_proportions
## Liver Brain Lung
## GSM495209 1.00 0.00 0.00
## GSM495210 1.00 0.00 0.00
## GSM495211 1.00 0.00 0.00
## GSM495212 0.00 1.00 0.00
## GSM495213 0.00 1.00 0.00
## GSM495214 0.00 1.00 0.00
## GSM495215 0.00 0.00 1.00
## GSM495216 0.00 0.00 1.00
## GSM495217 0.00 0.00 1.00
## GSM495218 0.05 0.25 0.70
## GSM495219 0.05 0.25 0.70
## GSM495220 0.05 0.25 0.70
## GSM495221 0.70 0.05 0.25
## GSM495222 0.70 0.05 0.25
## GSM495223 0.70 0.05 0.25
## GSM495224 0.25 0.70 0.05
## GSM495225 0.25 0.70 0.05
## GSM495226 0.25 0.70 0.05
## GSM495227 0.70 0.25 0.05
## GSM495228 0.70 0.25 0.05
## GSM495229 0.70 0.25 0.05
## GSM495230 0.45 0.45 0.10
## GSM495231 0.45 0.45 0.10
## GSM495232 0.45 0.45 0.10
## GSM495233 0.55 0.20 0.25
## GSM495234 0.55 0.20 0.25
## GSM495235 0.55 0.20 0.25
## GSM495236 0.50 0.30 0.20
## GSM495237 0.50 0.30 0.20
## GSM495238 0.50 0.30 0.20
## GSM495239 0.55 0.30 0.15
## GSM495240 0.55 0.30 0.15
## GSM495241 0.55 0.30 0.15
## GSM495242 0.50 0.40 0.10
## GSM495243 0.50 0.40 0.10
## GSM495244 0.50 0.40 0.10
## GSM495245 0.60 0.35 0.05
## GSM495246 0.60 0.35 0.05
## GSM495247 0.60 0.35 0.05
## GSM495248 0.65 0.34 0.01
## GSM495249 0.65 0.34 0.01
## GSM495250 0.65 0.34 0.01
We want to use these reference samples to deconvolve the remaining mixture samples. This can be done in a couple of ways:
pure_samples = list(Liver=c(1,2,3),Brain=c(4,5,6),Lung=c(7,8,9))
dt_out = dtangle2(Y=data, pure_samples = pure_samples)
We can get the estiamtes by accessing the “estimates” property of the result, and plot them:
true_proportions = mixture_proportions[-(1:9),]
matplot(true_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
mixture_samples = data[-(1:9),]
reference_samples = data[1:9,]
dt_out = dtangle2(Y=mixture_samples, reference=reference_samples,pure_samples = pure_samples)
matplot(true_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
Now the variable pure_samples tells dtangle2 to which cell type each of the the rows of the references matrix corresponds to.
In this example we still needed the variable pure_samples because our reference expression matrix contained multiple reference profiles for each cell type. Often one only has a reference expression matrix with one (typically average) expression profile per cell type. In this case we don’t need the pure_samples argument:
ref_reduced = t(sapply(pure_samples,function(x)colMeans(reference_samples[x,,drop=FALSE])))
dt_out = dtangle2(Y=mixture_samples, reference=ref_reduced)
matplot(true_proportions,dt_out$estimates, xlim = c(0,1),ylim=c(0,1),xlab="Truth",ylab="Estimates")
Central to dtangle2 is finding marker genes for each cell type. Markers may either be given explicitly to dtangle2 by the user or they may be left up to dtangle2 itself to determine the marker genes automatically.
If we do not specify the argument markers then dtangle2 automatically determines marker genes:
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced)
we can change the way that dtangle2 finds marker genes using the marker_method argument:
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,marker_method = "diff")
the default is to use “ratio”. More options may be found in the R help documentation.
The argument n_markers specifies how many marker genes to use. If unspecified then dtangle2 uses the top 10% of genes (as ranked according to marker_method) as markers.
dt_out$n_markers
## Liver Brain Lung
## 1 1 1
The number of marker genes can be explicitly specified by setting n_markers:
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=3)
dt_out$n_markers
## [1] 3 3 3
if just a single integer is specified then all genes us that number of marker genes. Alternatively we can specify a vector of integers to specify a number of marker genes individually for each cell type:
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=c(1,2,3))
dt_out$n_markers
## [1] 1 2 3
we can also, in a similar fashion, pass a percentage (or vector of percentages) to n_markers which will then use that percentage of the ranked marker genes for each cell type:
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=.075)
dt_out$n_markers
## [1] 1 1 1
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,marker_method = "diff",n_markers=c(.1,.15,.05))
dt_out$n_markers
## [1] 1 1 1
Instead of letting dtangle2 determine the marker genes we can instead explicitly pass a list of markers to dtangle2 specifying the marker genes,
marker_genes = list(c(1,2,3),
c(4,5,6),
c(7,8,9))
dt_out = dtangle2(Y=mixture_samples, references = ref_reduced,markers=marker_genes)
dt_out$n_markers
## [1] 3 3 3
the format of the list is precisely the same format as returned in the markers element of the output of dtangle2, that is, a list of vectors of column indicies of \(Y\) that are markers of each of the cell types. The elements of the list correspond one to each cell type in the same order specified either in elements of pure_samples or by the rows of references. The argument of n_markers can be used in the same way to subset the markers if desired.
dtangle2 finds the marker genes by using the find_markers function.
mrkrs = find_markers(Y=mixture_samples, references = ref_reduced)
names(mrkrs)
## [1] "L" "V" "M" "sM"
which returns a list with four elements L which contains all genes putatively assigned to a cell type they mark, V which contains the ranking values by which the elements of L are ordered, M and sM which are the matrix and sorted matrix used to create V and L.
We can pass either the entire list or just the L list to dtangle2 as markers and re-create how dtangle2 automatically chooses markers:
dt_out = dtangle2(Y = mixture_samples,references = ref_reduced,markers=mrkrs,n_markers=.1)
Notice that we have been working with log2-scale gene expressions. If not set by the user, dtangle2 assumes log2 transformed expressions. However we may choose to work with other transformations of gene expressions. In this case, one needs to tell dtangle2 how to transform this data back to the linear-scale. This is provided to the argument “inv_scale”.
Let us artificially create linear-scale gene expression data by un-doing the log2 transformation applied:
lin_scale_mix = 2^mixture_samples
lin_scale_ref = 2^ref_reduced
Since this data is already on the linear scale, the transformation to get this data to the linear scale is simply the identity transformation. Thus we can deconvolve this data using dtangle2 passing the identity function to the inv_scale argument
dt_out = dtangle2(Y=lin_scale_mix,references = lin_scale_ref,inv_scale = base::identity,
seed=1234,markers = mrkrs$L)
head(dt_out$estimates)
## Liver Brain Lung
## GSM495218 0.04283465 0.28579446 0.6713709
## GSM495219 0.04427364 0.29073474 0.6649916
## GSM495220 0.04517094 0.28855899 0.6662701
## GSM495221 0.63614969 0.06267572 0.3011746
## GSM495222 0.62915127 0.06803422 0.3028145
## GSM495223 0.63480652 0.06094075 0.3042527
Alternatively, for example, we could work with arc-hyperbolic-sine transformed data with dtangle2, telling dtangle that the transformation of this data back to the linear scale is the hyperbolic-sine function:
ahs_scale_mix = asinh(lin_scale_mix)
ahs_scale_ref = asinh(lin_scale_ref)
dt_out = dtangle2(Y=ahs_scale_mix,references = ahs_scale_ref,inv_scale = base::sinh,
seed=1234,markers = mrkrs$L)
head(dt_out$estimates)
## Liver Brain Lung
## GSM495218 0.04283464 0.28579417 0.6713712
## GSM495219 0.04427361 0.29073485 0.6649915
## GSM495220 0.04517091 0.28855883 0.6662703
## GSM495221 0.63614966 0.06267569 0.3011747
## GSM495222 0.62915122 0.06803428 0.3028145
## GSM495223 0.63480651 0.06094076 0.3042527
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.