1 Introduction

The Biological Entity Dictionary (BED) has been developed to address three main challenges. The first one is related to the completeness of identifier mappings. Indeed, the direct mapping information provided by the different systems is not always complete and can be enriched by mappings provided by other resources. More interestingly, direct mappings not identified by any of these resources can be indirectly inferred by using mappings to a third reference. For example, many human Ensembl gene IDs are not directly mapped to any Entrez gene ID, but such mappings can be inferred using their respective mappings to HGNC IDs. The second challenge is related to the mapping of deprecated identifiers. Indeed, entity identifiers can change from one resource release to another. The identifier history is provided by some resources, such as Ensembl or the NCBI, but it is generally not used by mapping tools. The third challenge is related to the automation of the mapping process according to the relationships between the biological entities of interest. Indeed, mapping between gene and protein ID scopes should not be done in the same way as mapping between two gene ID scopes. Also, converting identifiers from different organisms should be possible using gene ortholog information.
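The indirect inference mentioned for the first challenge can be illustrated with a minimal base R sketch. All identifiers below are made up for illustration, and the join with merge() is only a stand-in for the idea: BED itself records these relationships in a Neo4j database.

```r
# Toy mappings to a shared reference (all identifiers are hypothetical)
ensToHgnc <- data.frame(
    ensembl=c("ENSG_A", "ENSG_B", "ENSG_C"),
    hgnc=c("HGNC:1", "HGNC:2", "HGNC:3"),
    stringsAsFactors=FALSE
)
entrezToHgnc <- data.frame(
    entrez=c("101", "102"),
    hgnc=c("HGNC:1", "HGNC:3"),
    stringsAsFactors=FALSE
)

# Join on the shared HGNC ID to infer Ensembl <-> Entrez mappings
inferred <- merge(ensToHgnc, entrezToHgnc, by="hgnc")
inferred <- inferred[, c("ensembl", "entrez", "hgnc")]
print(inferred)
```

In this toy case ENSG_B stays unmapped because no Entrez ID shares its HGNC reference.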

This document shows how to use the BED (Biological Entity Dictionary) R package to get and explore mappings between identifiers of biological entities (BE). This package provides a way to connect to a BED Neo4j database in which the relationships between the identifiers from different sources are recorded.

1.2 Installation

1.2.1 Dependencies

This BED package depends on the following packages available in the CRAN repository:

  • neo2R
  • visNetwork
  • dplyr
  • readr
  • stringr
  • utils
  • shiny
  • DT
  • miniUI
  • rstudioapi

All these packages must be installed before installing BED.

1.2.2 Installation from GitHub

devtools::install_github("patzaw/BED")

1.2.3 Possible issue when updating from releases <= 1.3.0

If you get an error like the following…

Error: package or namespace load failed for ‘BED’:
 .onLoad failed in loadNamespace() for 'BED', details:
  call: connections[[connection]][["cache"]]
  error: subscript out of bounds

… remove the BED folder located here:

file.exists(file.path(Sys.getenv("HOME"), "R", "BED"))
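The call above only checks whether the folder exists. A minimal sketch of the removal itself is given below, using base R only; removeFolder() is a hypothetical helper written for this example (it is not part of BED), and it is demonstrated on a temporary directory so that nothing is deleted by accident.

```r
# Hypothetical helper: remove a directory only if it exists.
# Returns TRUE when the directory was present and has been removed.
removeFolder <- function(path) {
    if (!dir.exists(path)) {
        return(FALSE)
    }
    unlink(path, recursive=TRUE)
    !dir.exists(path)
}

# Demonstration on a temporary directory standing in for the BED folder
demoDir <- file.path(tempdir(), "BED_demo")
dir.create(demoDir, showWarnings=FALSE)
removed <- removeFolder(demoDir)

# For the real folder (uncomment after checking the path):
# removeFolder(file.path(Sys.getenv("HOME"), "R", "BED"))
```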

1.3 Connection

Before using BED, the connection needs to be established with the underlying Neo4j DB. url, username and password should be adapted.

library(BED)
connectToBed(url="localhost:5454", remember=FALSE, useCache=FALSE)

The remember parameter can be set to TRUE in order to save connection information that will be automatically used the next time the connectToBed() function is called. By default, this parameter is set to FALSE to comply with CRAN policies. Saved connections can be managed with the lsBedConnections() and forgetBedConnection() functions.

The useCache parameter is by default set to FALSE to comply with CRAN policies. However, it is recommended to set it to TRUE to improve the speed of recurrent queries: the results of some large queries are saved locally in a file.

The connection can be checked the following way.

checkBedConn(verbose=TRUE)
## http://bel040344:5454
## BED
## UCB-Human-Internal
## 2023.02.11
## Cache ON
## [1] TRUE
## attr(,"dbVersion")
##   name           instance    version
## 1  BED UCB-Human-Internal 2023.02.11

If the verbose parameter is set to TRUE, the URL and the content version are displayed as messages.

lsBedConnections()
## [[1]]
## [[1]]$url
## [1] "bel040344:5454"
## 
## [[1]]$username
## [1] NA
## 
## [[1]]$password
## [1] NA
## 
## [[1]]$cache
## [1] TRUE
## 
## [[1]]$.opts
## list()
## 
## [[1]]$name
## [1] "BED"
## 
## [[1]]$instance
## [1] "UCB-Human-Internal"
## 
## [[1]]$version
## [1] "2023.02.11"

The connection parameter of the connectToBed() function can be used to connect to a saved connection other than the last one.

1.4 Data model

The BED underlying data model can be shown at any time using the following command.

showBedDataModel()
[Figure: BED data model]

1.5 Direct calls

Cypher queries can be run directly on the Neo4j database using the cypher function from the neo2R package through the bedCall function.

results <- bedCall(
    cypher,
    query=prepCql(
       'MATCH (n:BEID)',
       'WHERE n.value IN $values',
       'RETURN DISTINCT n.value AS value, labels(n), n.database'
    ),
    parameters=list(values=c("10", "100"))
)
results
##   value        labels(n)       n.database
## 1    10   BEID || GeneID       EntrezGene
## 2    10 BEID || ObjectID  MetaBase_object
## 3    10 BEID || ObjectID Cortellis_target
## 4   100   BEID || GeneID       EntrezGene
## 5   100   BEID || GeneID             HGNC
## 6   100 BEID || ObjectID  MetaBase_object

1.6 Feeding the database

Many functions are provided within the package to build your own BED database instance. These functions are not exported in order to avoid their use when interacting with BED normally. Information about how to get an instance of the BED neo4j database is provided here:

It can be adapted to user needs.

1.7 Caching

This part is relevant if the useCache parameter is set to TRUE when calling connectToBed().

Functions of the BED package that retrieve thousands of identifiers can take some time (generally a few seconds) before returning a result. For this kind of query, the query is therefore run once for all the relevant IDs in the database and the result is cached, so that subsequent calls of the same query with different filters should be much faster.

By default, the cache is flushed when the system detects inconsistencies with the BED database. However, it can also be flushed manually if needed using the clearBedCache() function.

Queries already in the cache can be listed using the lsBedCache() function, which also returns the occupied disk space.
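The caching principle described in this section can be sketched in base R as follows. This is only an illustration under simplifying assumptions: the cache lives in memory (BED saves results in local files), and slowQuery() and cachedQuery() are hypothetical names that do not exist in the package.

```r
# In-memory cache keyed by query name (hypothetical, for illustration only)
queryCache <- new.env()
stats <- new.env()
stats$runs <- 0

# Stand-in for an expensive database query returning all the IDs
slowQuery <- function() {
    stats$runs <- stats$runs + 1
    data.frame(
        id=c("1", "2", "3", "4"),
        organism=c("human", "human", "mouse", "rat"),
        stringsAsFactors=FALSE
    )
}

# Run the full query once, then apply filters on the cached result
cachedQuery <- function(name, organism) {
    if (!exists(name, envir=queryCache, inherits=FALSE)) {
        assign(name, slowQuery(), envir=queryCache)
    }
    full <- get(name, envir=queryCache)
    full[full$organism == organism, , drop=FALSE]
}

human <- cachedQuery("allIds", "human")
mouse <- cachedQuery("allIds", "mouse")  # served from the cache
```

The second call uses a different filter but does not trigger the expensive query again, which is the behavior the useCache parameter is meant to provide.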

2 Exploring available data

2.1 Biological entities

BED is organized around the central concept of Biological Entity (BE). All supported types of BE can be listed.

listBe()
## [1] "Gene"       "Transcript" "Peptide"    "Object"

These BE are organized according to how they are related to each other. For example, a Gene is_expressed_as a Transcript. This organization makes it possible to find the first upstream BE common to a set of BE.

firstCommonUpstreamBe(c("Object", "Transcript"))
## [1] "Gene"
firstCommonUpstreamBe(c("Peptide", "Transcript"))
## [1] "Transcript"
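The logic behind this lookup can be sketched with a toy hierarchy in base R. The parent table below is a simplified assumption (Gene upstream of Transcript and Object, Transcript upstream of Peptide), and firstCommonUpstream() is a hypothetical function written for this example, not the implementation used by BED.

```r
# Parent of each BE type in a simplified hierarchy (Gene is the root)
parent <- c(Gene=NA, Transcript="Gene", Peptide="Transcript", Object="Gene")

# A BE type followed by all its upstream types, nearest first
upstreamChain <- function(be) {
    chain <- character(0)
    while (!is.na(be)) {
        chain <- c(chain, be)
        be <- parent[[be]]
    }
    chain
}

# First type shared by the upstream chains of all the provided BE types
firstCommonUpstream <- function(beList) {
    chains <- lapply(beList, upstreamChain)
    common <- Reduce(intersect, chains)
    common[1]
}

firstCommonUpstream(c("Object", "Transcript"))
firstCommonUpstream(c("Peptide", "Transcript"))
```

With this toy hierarchy the two calls give "Gene" and "Transcript", in line with the results shown above.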

2.2 Organisms

Several organisms can be supported by the BED underlying database. They can be listed the following way.

listOrganisms()
## [1] "Danio rerio"       "Homo sapiens"      "Sus scrofa"       
## [4] "Mus musculus"      "Rattus norvegicus"

Common names are also supported and the corresponding taxonomic identifiers can be retrieved. Conversely the organism names corresponding to a taxonomic ID can be listed.

getOrgNames(getTaxId("human"))
##   taxID                        name           nameClass
## 1  9606 Homo sapiens Linnaeus, 1758           authority
## 2  9606                       human genbank common name
## 3  9606                Homo sapiens     scientific name

2.3 Identifiers of biological entities

The main aim of BED is to allow the mapping of identifiers from different sources such as Ensembl or Entrez. Supported sources can be listed the following way for each supported organism.

listBeIdSources(be="Transcript", organism="human")
##             database   nbBe   nbId         be
## 1 BEDTech_transcript 116349 116349 Transcript
## 2             RefSeq 207080 218246 Transcript
## 3     Ens_transcript 273594 284769 Transcript

The database gathering the largest number of BE of a specific type can also be identified.

largestBeSource(be="Transcript", organism="human", restricted=TRUE)
## [1] "Ens_transcript"

Finally, the getAllBeIdSources() function returns all the source databases of BE identifiers whatever the BE type.

2.4 Experimental platforms and probes

BED also supports experimental platforms and provides mappings between probes and BE identifiers (BEID).

The supported platforms can be listed the following way. The getTargetedBe() function returns the type of BE on which a specific platform focuses.

head(listPlatforms())
##              name                                        description focus
## GPL6101   GPL6101        Illumina ratRef-12 v1.0 expression beadchip  Gene
## GPL6887   GPL6887        Illumina MouseWG-6 v2.0 expression beadchip  Gene
## GPL6947   GPL6947       Illumina HumanHT-12 V3.0 expression beadchip  Gene
## GPL10558 GPL10558       Illumina HumanHT-12 V4.0 expression beadchip  Gene
## GPL1355   GPL1355     [Rat230_2] Affymetrix Rat Genome 230 2.0 Array  Gene
## GPL1261   GPL1261 [Mouse430_2] Affymetrix Mouse Genome 430 2.0 Array  Gene
getTargetedBe("GPL570")
## [1] "Gene"

3 Managing identifiers

3.1 Retrieving all identifiers from a source

All identifiers of an organism's BE from one source can be retrieved.

beids <- getBeIds(
    be="Gene", source="EntrezGene", organism="human",
    restricted=FALSE
)
dim(beids)
## [1] 164035      5
head(beids)
##     id preferred    Gene db.version db.deprecated
## 1 4535     FALSE 1120733   20230210         FALSE
## 2 4536     FALSE 1120737   20230210         FALSE
## 3 4512     FALSE 1120743   20230210         FALSE
## 4 4513     FALSE 1120746   20230210         FALSE
## 5 4509     FALSE 1120748   20230210         FALSE
## 6 4508     FALSE 1120749   20230210         FALSE

The first column, id, corresponds to the identifiers of the BE in the source. The column named according to the BE type (in this case Gene) corresponds to the internal identifier of the related BE. BE CAREFUL, THIS INTERNAL ID IS NOT STABLE AND CANNOT BE USED AS A REFERENCE. This internal identifier is useful to identify BEIDS corresponding to the same BE. The following code can be used to have an overview of such redundancy.

sort(table(table(beids$Gene)), decreasing = TRUE)
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## 127784   9245   3092   1017    389    150     89     44     24     11      9 
##     12     13     14     16     30 
##      6      2      2      1      1
ambId <- sum(table(table(beids$Gene)[which(table(beids$Gene)>=10)]))

In the example above we can see that most Gene BE are identified by only one EntrezGene ID. However, many of them are identified by two or more IDs; 32 BE are even identified by 10 or more EntrezGene IDs. In this case, most of these redundancies come from the ID history extracted from Entrez. Legacy IDs can be excluded from the retrieved IDs by setting the restricted parameter to TRUE.

beids <- getBeIds(
    be="Gene", source="EntrezGene", organism="human",
    restricted = TRUE
)
dim(beids)
## [1] 142080      5

The same code as above can be used to identify remaining redundancies.

sort(table(table(beids$Gene)), decreasing = TRUE)
## 
##      1      2      3 
## 141658    202      6

In the example above we can see that almost all Gene BE are identified by only one EntrezGene ID. However, some of them are identified by two or more IDs. This result comes from how the BED database is constructed according to the ID mappings provided by the different source databases. The graph below shows how the mapping was done for such a BE with redundant EntrezGene IDs.

This issue has been mainly solved by not taking into account ambiguous mappings between NCBI Entrez gene identifiers and Ensembl gene identifiers provided by Ensembl. This has been achieved using the cleanDubiousXRef() function from the 2019.10.11 version of the BED-UCB-Human database.

eid <- beids$id[which(beids$Gene %in% names(which(table(beids$Gene)>=3)))][1]
print(eid)
## [1] "84773"
exploreBe(id=eid, source="EntrezGene", be="Gene") %>%
   visPhysics(solver="repulsion") %>% 
   vn_as_png()
[Figure: visNetwork graph of the identifiers related to EntrezGene ID 84773]


The way the ID correspondences are reported in the different source databases leads to this mapping ambiguity, which has to be taken into account when comparing identifiers from different databases.

The getBeIds() function returns other columns providing additional information about the ID. The same function can be used to retrieve symbols or probe identifiers.

3.1.1 Preferred identifier

The BED database is constructed according to the relationships between identifiers provided by the different sources. Biological entities (BE) are identified as clusters of identifiers which correspond to each other directly or indirectly (corresponds_to relationship). Because of this design, a BE can be identified by multiple identifiers (BEID) from the same database, as shown above. These BEID are often related to alternate versions of an entity.

For example, Ensembl provides different versions (alternative sequences) of some chromosome parts, and genes are also annotated on these alternative sequences. In Uniprot, some unreviewed identifiers can correspond to reviewed proteins.

When available, this kind of information is associated with an Attribute node through a has relationship providing the value of the attribute for the BEID. This information can also be used to define whether a BEID is a preferred identifier for a BE.
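The clustering of identifiers into BE described at the beginning of this section can be sketched in base R as a search for connected groups of identifiers. The identifiers and the edge list below are hypothetical, and the loop is a deliberately naive illustration rather than the procedure used to build the BED database.

```r
# Toy corresponds_to edges between identifiers (hypothetical IDs;
# the leading letter stands for the source database)
edges <- data.frame(
    from=c("A1", "B1", "A2", "C1"),
    to=c("B1", "A3", "B2", "C2"),
    stringsAsFactors=FALSE
)
ids <- unique(c(edges$from, edges$to))

# Start with one cluster per identifier, then merge clusters along edges
cluster <- setNames(seq_along(ids), ids)
for (i in seq_len(nrow(edges))) {
    a <- cluster[[edges$from[i]]]
    b <- cluster[[edges$to[i]]]
    cluster[cluster == b] <- a
}

# Identifiers sharing a cluster label belong to the same toy BE
clusters <- split(names(cluster), cluster)
clusters
```

In this toy example A1 and A3 come from the same source and end up in the same cluster because both correspond to B1, which is the situation a preferred identifier helps to disambiguate.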

The example below shows the case of the MAPT gene annotated on different versions of human chromosome 17.

mapt <- convBeIds(
   "MAPT", from="Gene", from.source="Symbol", from.org="human",
   to.source="Ens_gene", restricted=TRUE
)
exploreBe(
   mapt[1, "to"],
   source="Ens_gene",
   be="Gene"
) %>% 
   vn_as_png()
[Figure: visNetwork graph of the identifiers related to the MAPT gene]
getBeIds(
   be="Gene", source="Ens_gene", organism="human",
   restricted=TRUE,
   attributes=listDBAttributes("Ens_gene"),
   filter=mapt$to
)
##                id preferred    Gene db.version db.deprecated
## 1 ENSG00000186868      TRUE 6459127        109         FALSE
## 2 ENSG00000276155     FALSE 6459127        109         FALSE
## 3 ENSG00000277956     FALSE 6459127        109         FALSE
## 4         LRG_660     FALSE 6459127        109         FALSE
##                             seq_region
## 1                 GRCh38 chromosome 17
## 2 GRCh38 chromosome CHR_HSCHR17_1_CTG5
## 3 GRCh38 chromosome CHR_HSCHR17_2_CTG5
## 4                          lrg LRG_660

3.2 Checking identifiers

The origin of identifiers can be guessed as follows.

oriId <- c(
    "17237", "105886298", "76429", "80985", "230514", "66459",
    "93696", "72514", "20352", "13347", "100462961", "100043346",
    "12400", "106582", "19062", "245607", "79196", "16878", "320727",
    "230649", "66880", "66245", "103742", "320145", "140795"
)
idOrigin <- guessIdScope(oriId)
print(idOrigin$be)
## [1] "Gene"
print(idOrigin$source)
## [1] "EntrezGene"
print(idOrigin$organism)
## [1] "Mus musculus"

The best guess is returned as a list but other possible origins are listed in the details attribute.

print(attr(idOrigin, "details"))
##     be     source     organism nb proportion
## 1 Gene EntrezGene Mus musculus 25       1.00
## 2 Gene       HGNC Homo sapiens  3       0.12
## 3 Gene        MGI Mus musculus  2       0.08
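In the details above, the proportion column appears to be the number of matching IDs divided by the number of submitted IDs (for example 3 / 25 = 0.12). The base R sketch below reproduces that reading on toy data; the reference sets are hypothetical and the ranking logic is an assumption, not the code of guessIdScope().

```r
# Toy reference sets of known IDs per scope (hypothetical content)
knownIds <- list(
    "Gene|EntrezGene|Mus musculus"=c("10", "20", "30", "40"),
    "Gene|HGNC|Homo sapiens"=c("10", "99")
)
query <- c("10", "20", "30", "40")

# Count the query IDs found in each scope and derive the proportion
nb <- vapply(knownIds, function(ref) sum(query %in% ref), integer(1))
details <- data.frame(
    scope=names(nb),
    nb=unname(nb),
    proportion=unname(nb) / length(query),
    stringsAsFactors=FALSE
)

# Best guess first
details <- details[order(details$nb, decreasing=TRUE), ]
details
```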

If the origin of identifiers is already known, it can also be tested.

checkBeIds(ids=oriId, be="Gene", source="EntrezGene", organism="mouse")
checkBeIds(ids=oriId, be="Gene", source="HGNC", organism="human")
## Warning in checkBeIds(ids = oriId, be = "Gene", source = "HGNC", organism =
## "human"): Could not find 22 IDs among 25!

3.3 Identifier annotation

Identifiers can be annotated with symbols and names according to available information. The following code returns the most relevant symbol and the most relevant name for each ID. Source URL can also be generated with the getBeIdURL() function.

toShow <- getBeIdDescription(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse"
)
toShow$id <- paste0(
    sprintf(
        '<a href="%s" target="_blank">',
        getBeIdURL(toShow$id, "EntrezGene")
    ),
    toShow$id,
    '</a>'
)
kable(toShow, escape=FALSE, row.names=FALSE)
id | symbol | name | preferred | db.version | db.deprecated
17237 | Mgrn1 | mahogunin, ring finger 1 | TRUE | 20230210 | FALSE
105886298 | Cmc4 | C-x(9)-C motif containing 4 | TRUE | 20230210 | FALSE
76429 | Lhpp | phospholysine phosphohistidine inorganic pyrophosphate phosphatase | TRUE | 20230210 | FALSE
80985 | Trim44 | tripartite motif-containing 44 | TRUE | 20230210 | FALSE
230514 | Leprot | leptin receptor overlapping transcript | TRUE | 20230210 | FALSE
66459 | Pyurf | Pigy upstream reading frame | TRUE | 20230210 | FALSE
93696 | Chrac1 | chromatin accessibility complex 1 | TRUE | 20230210 | FALSE
72514 | Fgfbp3 | fibroblast growth factor binding protein 3 | TRUE | 20230210 | FALSE
20352 | Sema4b | sema domain, immunoglobulin domain (Ig), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 4B | TRUE | 20230210 | FALSE
13347 | Dffa | DNA fragmentation factor, alpha subunit | TRUE | 20230210 | FALSE
100462961 | Gm16149 | predicted gene 16149 | TRUE | 20230210 | FALSE
100043346 | Rpl10-ps3 | ribosomal protein L10, pseudogene 3 | TRUE | 20230210 | FALSE
12400 | Cbfb | core binding factor beta | TRUE | 20230210 | FALSE
106582 | Nrm | nurim (nuclear envelope membrane protein) | TRUE | 20230210 | FALSE
19062 | Inpp5k | inositol polyphosphate 5-phosphatase K | TRUE | 20230210 | FALSE
245607 | Gprasp2 | G protein-coupled receptor associated sorting protein 2 | TRUE | 20230210 | FALSE
79196 | Osbpl5 | oxysterol binding protein-like 5 | TRUE | 20230210 | FALSE
16878 | Lif | leukemia inhibitory factor | TRUE | 20230210 | FALSE
320727 | Ipo8 | importin 8 | TRUE | 20230210 | FALSE
230649 | Atpaf1 | ATP synthase mitochondrial F1 complex assembly factor 1 | TRUE | 20230210 | FALSE
66880 | Rsrc1 | arginine/serine-rich coiled-coil 1 | TRUE | 20230210 | FALSE
66245 | Hspbp1 | HSPA (heat shock 70kDa) binding protein, cytoplasmic cochaperone 1 | TRUE | 20230210 | FALSE
103742 | Mien1 | migration and invasion enhancer 1 | TRUE | 20230210 | FALSE
320145 | Sp8 | trans-acting transcription factor 8 | TRUE | 20230210 | FALSE
140795 | P2ry14 | purinergic receptor P2Y, G-protein coupled, 14 | TRUE | 20230210 | FALSE

All possible symbols and all possible names for each ID can also be retrieved using the following functions.

res <- getBeIdSymbols(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse",
    restricted=FALSE
)
head(res)
##      id symbol canonical direct preferred  entity
## 1 12400 PEBP2b     FALSE   TRUE      TRUE 3342359
## 2 12400 Pebpb2     FALSE   TRUE      TRUE 3342359
## 3 12400   PEA2     FALSE   TRUE      TRUE 3342359
## 4 12400  Pebp2     FALSE   TRUE      TRUE 3342359
## 5 12400   Cbfb      TRUE   TRUE      TRUE 3342359
## 6 80985   Dipb     FALSE   TRUE      TRUE 3354714
res <- getBeIdNames(
    ids=oriId, be="Gene", source="EntrezGene", organism="mouse",
    restricted=FALSE
)
head(res)
##          id                                                               name
## 1     12400                                           core binding factor beta
## 2     80985                                     tripartite motif-containing 44
## 3 105886298                                        C-x(9)-C motif containing 4
## 4     66245 HSPA (heat shock 70kDa) binding protein, cytoplasmic cochaperone 1
## 5     16878                                         leukemia inhibitory factor
## 6    320727                                                         importin 8
##   direct preferred  entity
## 1   TRUE      TRUE 3342359
## 2   TRUE      TRUE 3354714
## 3   TRUE      TRUE 3326440
## 4   TRUE      TRUE 3343177
## 5   TRUE      TRUE 3352402
## 6   TRUE      TRUE 3326441

Also, probes and some biological entities do not have directly associated symbols or names. These elements can also be annotated according to information related to relevant genes.

someProbes <- c(
    "238834_at", "1569297_at", "213021_at", "225480_at",
    "216016_at", "35685_at", "217969_at", "211359_s_at"
)
toShow <- getGeneDescription(
    ids=someProbes, be="Probe", source="GPL570", organism="human"
)
kable(toShow, escape=FALSE, row.names=FALSE)
id | EntrezGene | symbol | name
238834_at | 91807 | MYLK3 | myosin light chain kinase 3
1569297_at | 731779 | LINC01300 | long intergenic non-protein coding RNA 1300
213021_at | 9527 | GOSR1 | golgi SNAP receptor complex member 1
225480_at | 127687 | C1orf122 | chromosome 1 open reading frame 122
216016_at | 114548 | NLRP3 | NLR family pyrin domain containing 3
35685_at | 6015 | RING1 | ring finger protein 1
217969_at | 738 | VPS51 | VPS51 subunit of GARP complex
211359_s_at | 4988 | OPRM1 | opioid receptor mu 1

3.4 Products of molecular biology processes

The BED data model has been built to reflect molecular biology processes:

  • is_expressed_as relationships correspond to the transcription process.
  • is_translated_in relationships correspond to the translation process.
  • codes_for is a fuzzy relationship allowing the mapping of genes to objects not necessarily corresponding to the same kind of biological molecule.

These processes are described in different databases with different levels of granularity. For example, Ensembl provides the possible transcripts for each gene, specifying which one of them is canonical.

The following functions are used to retrieve direct products or direct origins of molecular biology processes.

getDirectProduct("ENSG00000145335", process="is_expressed_as")
##             origin  osource         product        psource canonical
## 1  ENSG00000145335 Ens_gene ENST00000394986 Ens_transcript     FALSE
## 2  ENSG00000145335 Ens_gene ENST00000673902 Ens_transcript     FALSE
## 3  ENSG00000145335 Ens_gene ENST00000611107 Ens_transcript     FALSE
## 4  ENSG00000145335 Ens_gene ENST00000674129 Ens_transcript     FALSE
## 5  ENSG00000145335 Ens_gene ENST00000673718 Ens_transcript     FALSE
## 6  ENSG00000145335 Ens_gene ENST00000502987 Ens_transcript     FALSE
## 7  ENSG00000145335 Ens_gene ENST00000420646 Ens_transcript     FALSE
## 8  ENSG00000145335 Ens_gene ENST00000394989 Ens_transcript     FALSE
## 9  ENSG00000145335 Ens_gene ENST00000505199 Ens_transcript     FALSE
## 10 ENSG00000145335 Ens_gene ENST00000618500 Ens_transcript     FALSE
## 11 ENSG00000145335 Ens_gene ENST00000345009 Ens_transcript     FALSE
## 12 ENSG00000145335 Ens_gene ENST00000673766 Ens_transcript     FALSE
## 13 ENSG00000145335 Ens_gene ENST00000506691 Ens_transcript     FALSE
## 14 ENSG00000145335 Ens_gene ENST00000336904 Ens_transcript     FALSE
## 15 ENSG00000145335 Ens_gene ENST00000508895 Ens_transcript     FALSE
## 16 ENSG00000145335 Ens_gene ENST00000506244 Ens_transcript     FALSE
## 17 ENSG00000145335 Ens_gene ENST00000394991 Ens_transcript      TRUE
getDirectProduct("ENST00000336904", process="is_translated_in")
##            origin        osource         product         psource canonical
## 1 ENST00000336904 Ens_transcript ENSP00000338345 Ens_translation      TRUE
getDirectOrigin("NM_001146055", process="is_expressed_as")
##   origin    osource      product psource canonical
## 1   6622 EntrezGene NM_001146055  RefSeq     FALSE
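Direct products can be chained to follow a gene through transcription and translation. The base R sketch below does this by hand on a small subset of the results shown above; the column names are chosen for this example, and the merge is only an illustration of the chaining idea.

```r
# Subset of the is_expressed_as products shown above
expressedAs <- data.frame(
    gene=c("ENSG00000145335", "ENSG00000145335"),
    transcript=c("ENST00000336904", "ENST00000394991"),
    stringsAsFactors=FALSE
)
# Only the is_translated_in product shown above is included here
translatedIn <- data.frame(
    transcript="ENST00000336904",
    peptide="ENSP00000338345",
    stringsAsFactors=FALSE
)

# Chain the two processes: gene -> transcript -> peptide
geneToPeptide <- merge(expressedAs, translatedIn, by="transcript")
geneToPeptide <- geneToPeptide[, c("gene", "transcript", "peptide")]
geneToPeptide
```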

4 Converting identifiers

4.1 Same entity and same organism: from one source to another

res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to.source="Ens_gene",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
##        from                 to to.preferred to.entity
## 1     19062 ENSMUSG00000006127         TRUE   3313651
## 2     13347 ENSMUSG00000028974         TRUE   3314551
## 3    103742 ENSMUSG00000002580         TRUE   3315858
## 4     72514 ENSMUSG00000047632         TRUE   3316967
## 5    320145 ENSMUSG00000048562         TRUE   3321363
## 6 105886298 ENSMUSG00000090110         TRUE   3326440

4.2 Same organism: from one entity to another

res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
##     from                 to to.preferred to.entity
## 1  19062 ENSMUSP00000006286         TRUE   4404170
## 2  19062 ENSMUSP00000119996         TRUE   4404177
## 3  19062 ENSMUSP00000121060         TRUE   4404181
## 4  13347 ENSMUSP00000030816         TRUE   4405226
## 6  13347 ENSMUSP00000099505         TRUE   4405226
## 8 103742 ENSMUSP00000002655         TRUE   4406736

4.3 From one organism to another

res <- convBeIds(
    ids=oriId,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    to.org="human",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
##          from              to to.preferred to.entity
## 131    106582 ENSP00000397892         TRUE   2750846
## 125     16878 ENSP00000249075         TRUE   2750948
## 126     16878 ENSP00000384450         TRUE   2750948
## 326     80985 ENSP00000299413         TRUE   2756246
## 20      72514 ENSP00000339067         TRUE   2761263
## 23  105886298 ENSP00000358491         TRUE   2761904

4.4 Converting lists of identifiers

Lists of identifiers can be converted the following way. Only converted IDs are returned in this case.

humanEnsPeptides <- convBeIdLists(
    idList=list(a=oriId[1:5], b=oriId[-c(1:5)]),
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to="Peptide",
    to.source="Ens_translation",
    to.org="human",
    restricted=TRUE,
    prefFilter=TRUE
)
unlist(lapply(humanEnsPeptides, length))
##   a   b 
##  21 117
lapply(humanEnsPeptides, head)
## $a
## [1] "ENSP00000299413" "ENSP00000358491" "ENSP00000358496" "ENSP00000497944"
## [5] "ENSP00000360104" "ENSP00000497385"
## 
## $b
## [1] "ENSP00000397892" "ENSP00000249075" "ENSP00000384450" "ENSP00000339067"
## [5] "ENSP00000256079" "ENSP00000444520"

4.4.1 BEIDList

BEIDList objects are used to manage lists of BEID with an attached explicit scope, and metadata provided in a data frame. The focusOnScope() function is used to easily convert such an object to another scope. For example, in the code below, Entrez gene identifiers are converted into Ensembl identifiers.

entrezGenes <- BEIDList(
   list(a=oriId[1:5], b=oriId[-c(1:5)]),
   scope=list(be="Gene", source="EntrezGene", organism="Mus musculus"),
   metadata=data.frame(
      .lname=c("a", "b"),
      description=c("Identifiers in a", "Identifiers in b"),
      stringsAsFactors=FALSE
   )
)
entrezGenes
## BEIDList of 2 elements gathering 25 BEIDs in total
##    - Scope: be="Gene", source="EntrezGene", organism="Mus musculus"
##    - Metadata fields: ".lname", "description"
entrezGenes$a
## [1] "17237"     "105886298" "76429"     "80985"     "230514"
ensemblGenes <- focusOnScope(entrezGenes, source="Ens_gene")
ensemblGenes$a
## [1] "ENSMUSG00000090110" "ENSMUSG00000022517" "ENSMUSG00000035212"
## [4] "ENSMUSG00000030946" "ENSMUSG00000027189"

4.5 Converting data frames

IDs in data frames can also be converted.

toConv <- data.frame(a=1:25, b=runif(25))
rownames(toConv) <- oriId
res <- convDfBeIds(
    df=toConv,
    from="Gene",
    from.source="EntrezGene",
    from.org="mouse",
    to.source="Ens_gene",
    restricted=TRUE,
    prefFilter=TRUE
)
head(res)
##   a         b conv.from            conv.to
## 1 1 0.1457449     17237 ENSMUSG00000022517
## 2 2 0.9129520 105886298 ENSMUSG00000090110
## 3 3 0.6004795     76429 ENSMUSG00000030946
## 4 4 0.4678371     80985 ENSMUSG00000027189
## 5 5 0.8215641    230514 ENSMUSG00000035212
## 6 6 0.4735447     66459 ENSMUSG00000043162
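The general idea of attaching converted identifiers to the rows of a data frame can be sketched in base R with a toy conversion table. All identifiers below are hypothetical, and the choices made here (duplicating rows for one-to-many mappings, keeping unmapped rows with NA) are assumptions of this sketch that may differ from what convDfBeIds() does.

```r
# Data frame indexed by source IDs (hypothetical values)
toConvDemo <- data.frame(a=1:3, row.names=c("g1", "g2", "g3"))

# Toy conversion table: g2 maps to two targets, g3 has no mapping
convTable <- data.frame(
    from=c("g1", "g2", "g2"),
    to=c("E1", "E2", "E3"),
    stringsAsFactors=FALSE
)

# Attach each row of the data frame to its converted identifiers
toConvDemo$conv.from <- rownames(toConvDemo)
converted <- merge(
    toConvDemo, convTable,
    by.x="conv.from", by.y="from", all.x=TRUE
)
names(converted)[names(converted) == "to"] <- "conv.to"
converted <- converted[, c("a", "conv.from", "conv.to")]
converted
```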

4.6 Explore the shortest conversion path between two identifiers

Because the conversion process takes into account several resources, it might be useful to explore the path between two identifiers which have been mapped. This can be achieved by the exploreConvPath function.

from.id <- "ILMN_1220595"
res <- convBeIds(
   ids=from.id, from="Probe", from.source="GPL6885", from.org="mouse",
   to="Peptide", to.source="Uniprot", to.org="human",
   prefFilter=TRUE
)
res
##           from     to to.preferred to.entity
## 1 ILMN_1220595 Q16552         TRUE   2743149
exploreConvPath(
   from.id=from.id, from="Probe", from.source="GPL6885",
   to.id=res$to[1], to="Peptide", to.source="Uniprot"
) %>% 
   vn_as_png()
[Figure: visNetwork graph of the conversion path from ILMN_1220595 to Q16552]

The figure above shows how the ILMN_1220595 ProbeID, targeting the mouse NM_010552 transcript, can be associated to the Q16552 human protein ID in Uniprot.
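The notion of a shortest path between two identifiers can be sketched in base R with a breadth-first search on a toy graph. The node names below are hypothetical labels, not real identifiers, and shortestPath() is written for this example only; exploreConvPath() queries the Neo4j database instead.

```r
# Toy undirected graph of identifier relationships (hypothetical nodes)
links <- data.frame(
    from=c("probe", "mouse_transcript", "mouse_gene", "human_gene",
           "human_transcript", "mouse_gene"),
    to=c("mouse_transcript", "mouse_gene", "human_gene", "human_transcript",
         "human_peptide", "mouse_symbol"),
    stringsAsFactors=FALSE
)

# Breadth-first search returning one shortest path between two nodes
shortestPath <- function(links, fromNode, toNode) {
    # previous[n] stores the node from which n was first reached
    previous <- setNames(NA_character_, fromNode)
    queue <- fromNode
    while (length(queue) > 0) {
        node <- queue[1]
        queue <- queue[-1]
        if (node == toNode) break
        neighbors <- c(
            links$to[links$from == node],
            links$from[links$to == node]
        )
        for (n in setdiff(neighbors, names(previous))) {
            previous[n] <- node
            queue <- c(queue, n)
        }
    }
    if (!toNode %in% names(previous)) return(character(0))
    # Walk back from the target to the starting node
    path <- toNode
    while (!is.na(previous[[path[1]]])) {
        path <- c(previous[[path[1]]], path)
    }
    path
}

convPath <- shortestPath(links, "probe", "human_peptide")
convPath
```

The side branch to mouse_symbol is visited but does not appear in the returned path, which only keeps the nodes linking the two identifiers of interest.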

4.7 Notes about converting from and to gene symbols

Canonical and non-canonical symbols are associated to genes. In some cases the same symbol (canonical or not) can be associated to several genes. This can lead to ambiguous mapping. The strategy to apply for such mapping depends on the aim of the user and their knowledge about the origin of the symbols to consider.

The complete mapping between Ensembl gene identifiers and symbols is retrieved by using the getBeIdSymbolTable function.

compMap <- getBeIdSymbolTable(
   be="Gene", source="Ens_gene", organism="rat",
   restricted=FALSE
)
dim(compMap)
## [1] 123161      6
head(compMap)
##                   id     symbol canonical direct preferred  entity
## 1 ENSRNOG00000055086    Gm24337      TRUE   TRUE      TRUE 8219709
## 2 ENSRNOG00000060424         U6      TRUE   TRUE      TRUE 8219234
## 3 ENSRNOG00000067185         U2      TRUE   TRUE      TRUE 8219729
## 4 ENSRNOG00000052327         U2      TRUE   TRUE      TRUE 8218801
## 5 ENSRNOG00000025892 LRRGT00147     FALSE   TRUE      TRUE 8218725
## 6 ENSRNOG00000025892  LOC296619     FALSE   TRUE      TRUE 8218725

The canonical field indicates if the symbol is canonical for the identifier. The direct field indicates if the symbol is directly associated to the identifier or indirectly through a relationship with another identifier.

As an example, let’s consider the “Snca” symbol in rat. As shown below, this symbol is associated to 2 genes; it is canonical for one gene and not for another. These 2 genes are also associated to other symbols.

sncaEid <- compMap[which(compMap$symbol=="Snca"),]
sncaEid
##                       id symbol canonical direct preferred  entity
## 50493 ENSRNOG00000008656   Snca      TRUE   TRUE      TRUE 4747702
## 93394 ENSRNOG00000029408   Snca     FALSE  FALSE      TRUE 4766278
compMap[which(compMap$id %in% sncaEid$id),]
##                       id    symbol canonical direct preferred  entity
## 50492 ENSRNOG00000008656 MGC105443     FALSE   TRUE      TRUE 4747702
## 50493 ENSRNOG00000008656      Snca      TRUE   TRUE      TRUE 4747702
## 93394 ENSRNOG00000029408      Snca     FALSE  FALSE      TRUE 4766278
## 93395 ENSRNOG00000029408   Mageb16      TRUE  FALSE      TRUE 4766278

The getBeIdDescription function, described before, reports only one symbol for each identifier. Canonical and direct symbols are prioritized.

getBeIdDescription(
   sncaEid$id,
   be="Gene", source="Ens_gene", organism="rat"
)
##                                    id  symbol                   name preferred
## ENSRNOG00000008656 ENSRNOG00000008656    Snca        synuclein alpha      TRUE
## ENSRNOG00000029408 ENSRNOG00000029408 Mageb16 MAGE family member B16      TRUE
##                    db.version db.deprecated
## ENSRNOG00000008656        109         FALSE
## ENSRNOG00000029408        109         FALSE

The convBeIds function works differently in order to provide a mapping as exhaustive as possible. If a symbol is associated to several input identifiers, non-canonical associations with this symbol are removed if a canonical association exists for any other identifier. This can lead to inconsistent results, depending on the user input, as shown below.

convBeIds(
   sncaEid$id[1],
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
##                 from        to to.preferred to.entity
## 2 ENSRNOG00000008656 MGC105443           NA   4747702
## 1 ENSRNOG00000008656      Snca           NA   4747702
convBeIds(
   sncaEid$id[2],
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
##                 from      to to.preferred to.entity
## 2 ENSRNOG00000029408 Mageb16           NA   4766278
## 1 ENSRNOG00000029408    Snca           NA   4766278
convBeIds(
   sncaEid$id,
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol"
)
##                 from        to to.preferred to.entity
## 2 ENSRNOG00000008656 MGC105443           NA   4747702
## 1 ENSRNOG00000008656      Snca           NA   4747702
## 4 ENSRNOG00000029408   Mageb16           NA   4766278

In the example above, when the query is run for each identifier independently, the association with the “Snca” symbol is reported for both. However, when the same query is run with the two identifiers at the same time, the “Snca” symbol is reported only for the gene with which it has a canonical association. An additional filter can be used to only keep canonical symbols:

convBeIds(
   sncaEid$id,
   from="Gene", from.source="Ens_gene", from.org="rat",
   to.source="Symbol",
   canonical=TRUE
)
##                 from      to to.preferred to.entity
## 1 ENSRNOG00000008656    Snca           NA   4747702
## 2 ENSRNOG00000029408 Mageb16           NA   4766278
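
The filtering rule described above can be sketched on a toy table in base R. This is only an illustration of the rule as stated in this section, not the actual implementation of convBeIds: the filterSymbols helper and the symTable data frame are hypothetical, and the non-canonical status of “MGC105443” is an assumption inferred from the output obtained with canonical=TRUE.

```r
# Toy symbol table mirroring the associations discussed above.
# Assumption: MGC105443 is treated as non-canonical because it disappears
# when canonical=TRUE is used; this is not data pulled from a BED database.
symTable <- data.frame(
   id=c(
      "ENSRNOG00000008656", "ENSRNOG00000008656",
      "ENSRNOG00000029408", "ENSRNOG00000029408"
   ),
   symbol=c("Snca", "MGC105443", "Snca", "Mageb16"),
   canonical=c(TRUE, FALSE, FALSE, TRUE),
   stringsAsFactors=FALSE
)

# Hypothetical helper (not part of BED): keep the associations of the
# queried identifiers, then drop a non-canonical association when the same
# symbol is canonical for another queried identifier.
filterSymbols <- function(ids, tab=symTable){
   sel <- tab[tab$id %in% ids, , drop=FALSE]
   canonSymbols <- unique(sel$symbol[sel$canonical])
   keep <- sel$canonical | !(sel$symbol %in% canonSymbols)
   sel[keep, c("id", "symbol"), drop=FALSE]
}

# Queried alone, the second gene keeps its non-canonical "Snca" association
alone <- filterSymbols("ENSRNOG00000029408")
# Queried together, "Snca" is kept only for the gene where it is canonical
both <- filterSymbols(c("ENSRNOG00000008656", "ENSRNOG00000029408"))
```

With this toy table, the result depends on the set of identifiers submitted together, which is the behaviour observed with the three calls above.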

Finally, as shown below, when the query is run in the opposite direction (from symbol to Ensembl gene identifier), “Snca” is only associated with the gene for which it is the canonical symbol.

convBeIds(
   "Snca",
   from="Gene", from.source="Symbol", from.org="rat",
   to.source="Ens_gene"
)
##   from                 to to.preferred to.entity
## 1 Snca ENSRNOG00000008656         TRUE   4747702

Therefore, the user should choose the function to use with care when converting from or to gene symbols.

5 An interactive dictionary: Shiny module

IDs, symbols and names can be searched without knowing the original biological entity or probe. The results can then be converted to the context of interest.

searched <- searchBeid("sv2A")
toTake <- which(searched$organism=="Homo sapiens")[1]
relIds <- geneIDsToAllScopes(
  geneids=searched$GeneID[toTake],
  source=searched$Gene_source[toTake],
  organism=searched$organism[toTake]
)
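
In the code above, which(...)[1] evaluates to NA when no hit matches the requested organism, and this NA is then passed silently to the next call. A minimal sketch of a guard is given below, assuming that the search result is a data frame with an organism column (as used above) and that it may be NULL or empty when nothing is found. The firstHitFor helper and the toySearch data frame are hypothetical and are not part of BED.

```r
# Hypothetical helper (not part of BED): return the first row of a search
# result for a given organism, or stop with an explicit message.
firstHitFor <- function(searched, organism){
   if(is.null(searched) || nrow(searched) == 0){
      stop("The search returned no result")
   }
   hits <- which(searched$organism == organism)
   if(length(hits) == 0){
      stop("No hit for organism: ", organism)
   }
   searched[hits[1], , drop=FALSE]
}

# Toy stand-in for a search result; the values are illustrative only.
toySearch <- data.frame(
   GeneID=c("toyGene1", "toyGene2"),
   Gene_source=c("Ens_gene", "Ens_gene"),
   organism=c("Mus musculus", "Homo sapiens"),
   stringsAsFactors=FALSE
)
hit <- firstHitFor(toySearch, "Homo sapiens")
```

The GeneID, Gene_source and organism values of the selected row can then be passed to geneIDsToAllScopes in the same way as above.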

A Shiny gadget integrating the searchBeid and geneIDsToAllScopes functions has been developed and is also available as an RStudio addin.

relIds <- findBeids()

It relies on a Shiny module (beidsServer() and beidsUI() functions) made to facilitate the development of applications focused on information related to biological entities. The code below shows a minimal example of such an application.

library(shiny)
library(BED)
library(DT)

ui <- fluidPage(
   beidsUI("be"),
   fluidRow(
      column(
         12,
         tags$br(),
         h3("Selected gene entities"),
         DTOutput("result")
      )
   )
)

server <- function(input, output){
   found <- beidsServer("be", toGene=TRUE, multiple=TRUE, tableHeight=250)
   output$result <- renderDT({
      req(found())
      toRet <- found()
      datatable(toRet, rownames=FALSE)
   })
}

shinyApp(ui = ui, server = server)

6 Session info

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BED_1.4.13       visNetwork_2.1.2 neo2R_2.4.1      knitr_1.42      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10       bslib_0.4.2       compiler_4.2.0    pillar_1.8.1     
##  [5] later_1.3.0       jquerylib_0.1.4   base64enc_0.1-3   tools_4.2.0      
##  [9] digest_0.6.31     jsonlite_1.8.4    evaluate_0.20     lifecycle_1.0.3  
## [13] tibble_3.1.8      pkgconfig_2.0.3   rlang_1.0.6       shiny_1.7.4      
## [17] cli_3.6.0         rstudioapi_0.14   curl_5.0.0        yaml_2.3.7       
## [21] xfun_0.37         fastmap_1.1.0     withr_2.5.0       httr_1.4.4       
## [25] dplyr_1.1.0       stringr_1.5.0     generics_0.1.3    vctrs_0.5.2      
## [29] htmlwidgets_1.6.1 sass_0.4.5        webshot_0.5.4     DT_0.27          
## [33] tidyselect_1.2.0  glue_1.6.2        R6_2.5.1          processx_3.8.0   
## [37] fansi_1.0.4       rmarkdown_2.20    callr_3.7.3       magrittr_2.0.3   
## [41] ps_1.7.2          ellipsis_0.3.2    promises_1.2.0.1  htmltools_0.5.4  
## [45] xtable_1.8-4      mime_0.12         httpuv_1.6.9      utf8_1.2.3       
## [49] stringi_1.7.12    miniUI_0.1.1.1    cachem_1.0.6