Calculate p-values for enrichment of set

gsEasy provides the function gset for calculating p-values of enrichment for sets (of genes) in ranked/scored lists (of genes) by permutation (see ‘Gene Set Enrichment Analysis’, described by Subramanian et al., 2005). gset requires two arguments: N, the total number of genes, and S, the ranks of the genes in the test set amongst the N. An optional vector r of length N, containing scores (e.g. gene expression correlations) in order of rank, can also be passed; if unspecified, the score of the i-th gene defaults to 1-(i-1)/N. Finally, a numeric value p, used to weight the scores given by r when computing the enrichment score, can be passed (for more details, see Subramanian et al., 2005); its default value is 1.

#highly enriched... the ranks in the set are all near the top of the 1000
gset(S=1:5 * 2, N=1000)
## [1] 1e-05
#random sets...
replicate(n=10, expr=gset(S=sample.int(n=1000, size=5), N=1000))
##  [1] 0.06602839 0.62686567 0.86069652 0.92537313 0.18905473 0.28855721
##  [7] 0.11940299 0.19402985 0.17412935 0.03287000
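
The permutation procedure behind these p-values can be sketched in base R. The following is a simplified illustration of the weighted running-sum statistic of Subramanian et al. (2005), not gsEasy's actual implementation: the helper names enrichment_score and perm_p_value, the fixed number of permutations n_perm and the one-sided maximum are all assumptions made for this example.

```r
# Hypothetical helper (not part of gsEasy): one-sided, weighted running-sum
# enrichment score for a set of ranks S among N ranked genes.
enrichment_score <- function(S, N, r = 1 - (seq_len(N) - 1) / N, p = 1) {
  in_set <- seq_len(N) %in% S
  # Running sum over set members, weighted by |r|^p and scaled to end at 1
  hit <- cumsum(ifelse(in_set, abs(r)^p, 0))
  hit <- hit / hit[N]
  # Running fraction of non-members encountered so far
  miss <- cumsum(!in_set) / (N - length(S))
  max(hit - miss)
}

# Hypothetical helper: compare the observed score with scores of random sets
# of the same size, using a fixed number of permutations.
perm_p_value <- function(S, N, n_perm = 1000, ...) {
  observed <- enrichment_score(S, N, ...)
  null <- replicate(n_perm, enrichment_score(sample.int(N, length(S)), N, ...))
  # Add one to numerator and denominator so the estimate is never exactly 0
  (sum(null >= observed) + 1) / (n_perm + 1)
}

set.seed(1)
perm_p_value(S = 1:5 * 2, N = 1000)
```

With S=1:5 * 2 the observed score is close to its maximum, so essentially no random set matches it and the estimate sits at its lower bound of 1/(n_perm + 1).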

Alternatively, you can pass gene names as S, together with either a sorted character vector of gene names as r (in which case the scores default to the ranks in the list) or a numeric vector of scores named by gene as r.

gset(S=c("gene 1", "gene 5", "gene 40"), r=paste("gene", 1:100))
## [1] 0.08935361
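
To see how the name-based form relates to the rank-based form, the ranks can be recovered from the names with base R. This is only an illustration of the correspondence, not a description of what gset does internally; the scores vector, its gene names and the convention that a higher score means a better rank are assumptions made for this example.

```r
# Ranked list of gene names, best-ranked first (same toy data as above)
ranked_genes <- paste("gene", 1:100)
gene_set <- c("gene 1", "gene 5", "gene 40")

# Position of each set member in the ranked list: 1, 5 and 40
ranks <- match(gene_set, ranked_genes)
ranks

# Hypothetical scores named by gene; sorting in decreasing order
# (assumed here: higher score = better rank) gives a ranked list of names
scores <- c("gene B" = 0.2, "gene A" = 0.9, "gene C" = 0.5)
ranked_by_score <- names(sort(scores, decreasing = TRUE))
ranked_by_score
```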

Multiple gene sets can thus be tested for enrichment with a single call to a higher-order function such as sapply (or, if you have many sets to test and multiple cores available, mclapply from the parallel package), for instance:

gene_sets <- c(list(1:5), replicate(n=10, simplify=FALSE, expr=sample.int(n=1000, size=5)))
names(gene_sets) <- c("enriched set", paste("unenriched set", 1:10))
gene_sets
## $`enriched set`
## [1] 1 2 3 4 5
## 
## $`unenriched set 1`
## [1] 533 428 519 988 457
## 
## $`unenriched set 2`
## [1]  34 494 137 467 330
## 
## $`unenriched set 3`
## [1] 454  21 623 262 794
## 
## $`unenriched set 4`
## [1] 353 205 361 985 492
## 
## $`unenriched set 5`
## [1] 420 454  49 738 911
## 
## $`unenriched set 6`
## [1] 649 926 803 396 446
## 
## $`unenriched set 7`
## [1] 965 916 582 247 605
## 
## $`unenriched set 8`
## [1] 349 416 960 204 694
## 
## $`unenriched set 9`
## [1] 499 263  17 575 325
## 
## $`unenriched set 10`
## [1] 186 369 390 641 352
sapply(gene_sets, function(set) gset(S=set, N=1000))
##      enriched set  unenriched set 1  unenriched set 2  unenriched set 3 
##         0.0000100         0.4477612         0.3233831         0.7263682 
##  unenriched set 4  unenriched set 5  unenriched set 6  unenriched set 7 
##         0.2437811         0.6069652         0.9353234         0.8308458 
##  unenriched set 8  unenriched set 9 unenriched set 10 
##         0.5124378         0.4179104         0.3432836
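
The parallel variant mentioned above might look as follows. This is a sketch that assumes gsEasy is installed; the choice of two cores and the variable names n_cores and p_values are arbitrary, and the p-values will differ from run to run because both the sets and the permutations are random.

```r
library(gsEasy)
library(parallel)

set.seed(1)
gene_sets <- c(list(1:5), replicate(n=10, simplify=FALSE, expr=sample.int(n=1000, size=5)))
names(gene_sets) <- c("enriched set", paste("unenriched set", 1:10))

# mclapply relies on forking, so more than one core is not supported on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply returns a list; unlist turns it into a named numeric vector,
# matching the shape of the sapply result
p_values <- unlist(mclapply(gene_sets, function(set) gset(S=set, N=1000), mc.cores=n_cores))
p_values
```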

Ontological annotations

gsEasy provides the function get_ontological_gene_sets for creating lists of gene sets defined by annotation with ontological terms, such that ontological is-a relations are propagated: a gene annotated with a term is also included in the sets of all of that term's ancestors. get_ontological_gene_sets accepts an ontology_index object (as provided by the ontologyIndex package) and two character vectors, corresponding to genes and terms respectively, where the n-th elements of the two vectors together form one annotation pair. The result, a list of character vectors of gene names, can then be used as an argument of gset.

library(ontologyIndex)
data(hpo)
df <- data.frame(
    gene=c("gene 1", "gene 2"), 
    term=c("HP:0000598", "HP:0000118"), 
    name=hpo$name[c("HP:0000598", "HP:0000118")], 
    stringsAsFactors=FALSE,
    row.names=NULL)
df
##     gene       term                   name
## 1 gene 1 HP:0000598 Abnormality of the ear
## 2 gene 2 HP:0000118 Phenotypic abnormality
get_ontological_gene_sets(hpo, gene=df$gene, term=df$term)
## $`HP:0000001`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000118`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000598`
## [1] "gene 1"
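
The propagation of is-a relations shown in this output can be illustrated with a toy ontology in base R. This is a sketch of the idea only, not the code used by get_ontological_gene_sets; the term names root, organ and ear, the parents list and the helper ancestors_of are all hypothetical.

```r
# Toy ontology (hypothetical): each term maps to its direct is-a parents
parents <- list(root = character(0), organ = "root", ear = "organ")

# Hypothetical helper: a term together with all of its ancestors
ancestors_of <- function(term) {
  result <- term
  frontier <- term
  while (length(frontier) > 0) {
    # Parents of the current frontier that have not been visited yet
    frontier <- setdiff(unlist(parents[frontier], use.names = FALSE), result)
    result <- c(result, frontier)
  }
  result
}

annotations <- data.frame(
    gene=c("gene 1", "gene 2"),
    term=c("ear", "organ"),
    stringsAsFactors=FALSE)

# Expand every annotation pair to include the ancestors of its term
expanded <- do.call(rbind, lapply(seq_len(nrow(annotations)), function(i) {
  data.frame(
      gene=annotations$gene[i],
      term=ancestors_of(annotations$term[i]),
      stringsAsFactors=FALSE)
}))

# Group genes by term: one character vector of gene names per term
gene_sets <- lapply(split(expanded$gene, expanded$term), unique)
gene_sets
```

As in the HPO example, the gene annotated with the most specific term appears in every set, while the gene annotated with the intermediate term appears only in that term's set and in the root's.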

GO

get_GO_gene_sets is a specialisation of get_ontological_gene_sets for the Gene Ontology (GO). It can be called with just the path to the annotation file (the official version is available at http://geneontology.org/gene-associations/gene_association.goa_human.gz).