Introduction

The bcRep package helps to analyze IMGT/HighV-QUEST* output in more detail. It functions well with B cells, but can also be used for T cell data, in some cases.
Using this package IMGT/HighV-QUEST output files can be analysed, f.e. functionality, junction frames, gene usage and mutations. The package provides functions to analyze clones out of the IMGT/HighV-QUEST output, but also methods to compare sequences and clones, are provided. Further a function to do principal coordinate analysis on dissimilarity/distance measurements is implemented. Almost all results can be visualized via implemented plotting functions.

*: Alamyar, E. et al., IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors, JOBIM 2010, Paper 63 (2010). Website: http://www.imgt.org/

Package features

General information and package data

bcRep depends on following packages: vegan, gplots, ineq, parallel, doParallel, foreach, ape, stringdist.

Parallel processing

Parallel processing is possible for functions clones(), clones.shared(), sequences.geneComb(), compare.aaDistribution(), compare.geneUsage() and compare.trueDiversity(). By default, 1 core is used.

Datasets

There are several datasets provided in the package. Most of them are examples of IMGT/HighV-QUEST output; two additional files represent clonotype files and one gene usage data:

library(bcRep)
data(summarytab) # An extract from IMGT/HighV-QUEST output file 1_Summary(...).txt
data(ntseqtab) # An extract from IMGT/HighV-QUEST output file 3_Nt-sequences(...).txt
data(aaseqtab) # An extract from IMGT/HighV-QUEST output file 5_AA-sequences(...).txt
data(mutationtab) # An extract from IMGT/HighV-QUEST output file 
                  # 7_V-REGION-mutation-and-AA-change-table(...).txt

data(clones.ind) # Clonotypes of one individual
data(clones.allind) # Clonotypes of eight individuals

data(vgenes) # V gene usage of 10 individuals (rows) and 30 genes (columns)

Processing of IMGT/HighV-QUEST data

IMGT/HighV-QUEST can process datasets with up to 500.000 sequences. For larger data sets, FASTA files have to be splitted into smaller datasets and be uploaded individually to IMGT. Afterwards the function combineIMGT() can be used to combine several folders.

# Example: combine folders IMGT1a, IMGT1b and IMGT1c to a new datset NewProject
combineIMGT(folders = c("pathTo/IMGT1a", "pathTo/IMGT1b", "pathTo/IMGT1c"), 
            name = "NewProject")

To read IMGT/HighV-QUEST output files, use readIMGT("PathTo/file.txt",filterNoResults=TRUE). You can choose between including or excluding sequences without any information (lines with a sequence ID, but “No results”). Spaces and “-” in headers will be replaced by “_“. If there is a particular IMGT table required as input for a function, this is mentioned in the corresponding help file.

Amino acid distribution

Amino acid distribution can be analyzed using aaDistribution() and visualized with plotAADistribution().

This function returns a list containing proportions of all amino acids (including stop codons “*“) for each analyzed sequence length. Optionally, also the number used for analysis can be given.

Figures can be saved as PDF in the working directory.

## 
## Attaching package: 'bcRep'
## The following object is masked _by_ '.GlobalEnv':
## 
##     sequences.mutation.base
# Example: 
data(aaseqtab)
aadistr<-aaDistribution(sequences = aaseqtab$CDR3_IMGT, numberSeq = TRUE)

# First 4 columns of Amino acid distribution table: 
        # for a sequence length of 13 AA (* = stop codon)
        # (aadistr$Amino_acid_distribution$`sequence length = 13`)
  Position1 Position2 Position3 Position4
F 0 0 0.007018 0.01053
L 0 0.003509 0.04561 0.08772
I 0 0.02105 0.02456 0.03158
M 0 0 0.01404 0.01754
V 0.003509 0.007018 0.09123 0.04561
S 0 0.07719 0.08421 0.07368
P 0.003509 0 0.02105 0.07018
T 0.02105 0.05614 0 0.03158
A 0.9614 0.003509 0.05965 0.05614
Y 0 0 0.01754 0.05263
H 0 0.01404 0.01053 0.007018
C 0 0 0.02105 0.007018
W 0 0 0.007018 0.02807
N 0 0.01404 0.007018 0.04211
D 0 0 0.2281 0.06316
G 0.003509 0.01053 0.1825 0.207
Q 0 0 0.01053 0.01404
R 0.007018 0.607 0.06316 0.0807
K 0 0.186 0.007018 0.03509
E 0 0 0.09825 0.0386
***** 0 0 0 0
# Plot example for sequence lengths of 14-17 amino acids:
aadistr.part<-list(aadistr$Amino_acid_distribution[13:16], 
                   data.frame(aadistr$Number_of_sequences_per_length[13:16,]))
names(aadistr.part)<-names(aadistr)
plotAADistribution(aaDistribution.tab=aadistr.part, plotAADistr=TRUE, 
                   plotSeqN=TRUE, PDF=NULL) 

Diversity

Diversity of sequences and clones can be analyzed using trueDiversity() or clones.giniIndex().

True diversity

Using trueDiversity() richness or diversity of sequences with the same length can be analyzed. Basically diversity of amino acids at each position is calculated. True diversity can be measured for orders q = 0, 1 or 2. plotTrueDiversity() can be used for visualization. If mean.plot = F, a figure for each sequence length including diversity per position will be returned. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned.

Order 0: Richness (in this case it represents number of different amino acids per position).

Order 1: Exponential function of Shannon entropy using the natural logarithm (weights all amino acids by their frequency).

Order 2: Inverse Simpson entropy (weights all amino acids by their frequency, but weights are given more to abundant amino acids).

These indices are very similar (Hill, 1973). For example the exponential function of Shannon index is linearly related to inverse Simpson.

Amino acid sequences can be found in IMGT output table 5_AA-sequences(...).txt.

# Example:
data(aaseqtab)
trueDiv<-trueDiversity(sequences = aaseqtab$CDR3_IMGT, order = 1) 
        # using exponent of Shannon entropy

# True diversity of order 1 for amino acid length of 5 AA 
        # (trueDiv$True_diversity$'sequence length = 5')
Position1 Position2 Position3 Position4 Position5
1 3 3 1.89 3
# True diversity for sequences of amino acid length 14-17: 
trueDiv.part<-list(trueDiv$True_diversity_order, trueDiv$True_diversity[13:16])
names(trueDiv.part)<-names(trueDiv)
plotTrueDiversity(trueDiversity.tab=trueDiv.part, mean.plot = F,color="darkblue", PDF=NULL)

plotTrueDiversity(trueDiversity.tab=trueDiv, mean.plot = T,color="darkblue", PDF=NULL)

Gini index

‘clones.giniIndex()’ calculates the Gini index of clones. Input is a vector containing clone sizes (copy number). The Gini index measures the inequality of clone size distribution. It ranges between 0 and 1. An index of 0 represents a polyclonal distribution, where all clones have same size. An index of 1 represents a perfect monoclonal distribution.

If a PDF project name is given, the Lorenz curve will be returned, as well. Lorenz curve can be interpreted as p*100 percent have L(p)*100 percent of clone size.

# Example:
data(clones.ind)
clones.giniIndex(clone.size=clones.ind$total_number_of_sequences, PDF = NULL)
# [1] 0.7473557

Gene usage

Gene usage can be analyzed using functions geneUsage() and sequences.geneComb().

It can be differentiated between subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as between relative or absolute values.

When using function geneUsage(), functionality and junction frame usage, dependent on gene usage can be studied. Please note, that input vectors have to have the same order. Single genes per line, as well as several genes per line, can be processed. IMGT nomenclature has to be used (e.g. “Homsap IGHV4-34*01“)!

# Example:
data(summarytab)
Vgu<-geneUsage(genes = summarytab$V_GENE_and_allele, level = "subgroup", 
                functionality = summarytab$Functionality)
plotGeneUsage(geneUsage.tab = Vgu, plotFunctionality = TRUE, PDF = NULL, 
              title = "IGHV usage")

Using function sequences.geneComb(), gene/gene ratios of two gene families will be analyzed. IMGT nomenclature has to be used (e.g. “Homsap IGHV4-34*01“). Results can be plotted with and without gene/gene combinations, where one or both genes are unknown.

# Example: 
data(summarytab)
VDcomb.tab<-sequences.geneComb(family1 = summarytab$V_GENE_and_allele, 
                family2 = summarytab$D_GENE_and_allele, 
                level = "subgroup", abundance = "relative")
plotGeneComb(geneComb.tab = VDcomb.tab, withNA = FALSE, PDF = NULL)

Functions for a set of sequences

Function to filter and to summarize IMGT/HighV-QUEST are provided:

Filter sequences

Sequence files can be filtered for functionality or junction frame usage:

  • sequences.getProductives(): filters for productive sequences
  • sequences.getUnproductives(): filters for unproductive sequences
  • sequences.getAnyFunctionality(): excludes sequences without any functionality information

  • sequences.getInFrames(): filters for in-frame sequences
  • sequences.getOutOfFrames(): filters for out-of-frame sequences
  • sequences.getAnyJunctionFrame(): excludes sequences without any junction frame information

The function looks for a column containing functionality or junction frame information and filters the whole table. Columns will remain, rows will be filtered. Junction frame information is only available in the IMGT summary table (1_Summary(...).txt). Functionality information is available in every table.

# Example: filter for productive sequences
data(summarytab)
ProductiveSequences<-sequences.getProductives(summarytab)
# dimension of the summary table [rows, columns]:
dim(summarytab) 
## [1] 3000   29
# dimension of the summary table, filtered for productive sequences [rows, columns]:
dim(ProductiveSequences) 
## [1] 2645   29

Functionality and junction frames

Some basic statistics about functionality and junction frames can be given by sequences.functionality() and sequences.junctionFrame(). Both functions return absolute or relative values for each kind of sequences.

# Example:
data(summarytab)
sequences.functionality(data = summarytab$Functionality, relativeValues=TRUE)

# Proportion of productive and unproductive sequences:
productive unproductive unknown
0.8817 0.1127 0.005667

Mutations

Some basic mutation analysis can be done with sequences.mutation(). Input is the IMGT output table 7_V-REGION-mutation-and-AA-change-table(...).txt and optionally 1_Summary(...).txt. Mutation analysis can be done for V-region, FR1-3 and CDR1-2 sequences. Functionality, as well as junction frame analysis can be included. Further the R/S ratio (ratio of replacement and silent mutations) can be returned.

data(mutationtab)
V.mutation<-sequences.mutation(mutationtab = mutationtab, sequence = "V", 
                               junctionFr = TRUE, rsRatio=TRUE)

# V.mutation$Number_of_mutations, first 6 lines:
  number_mutations number_replacement number_silent RS_ratio
Sequence1 1 1 0 0
Sequence2 0 0 0 0
Sequence3 1 0 1 0
Sequence4 7 4 3 1.333
Sequence5 29 21 8 2.625
Sequence6 3 1 2 0.5
# V.mutation$Junction frame:
  proportion
mutation_in_in-frame_sequences 0
mutations_in_out-of-frame_sequences 0
mutations_in_sequences_with_unknown_junction_frame 0.7387
no_mutation_in_in-frame_sequences 0
no_mutations_in_out-of-frame_sequences 0
no_mutations_in_sequences_with_unknown_junction_frame 0.2613

Proportions of nucleotides at and nearby mutation sites

Proportions of nucleotide distributions at and nearby mutation sites can be calculated using sequences.mutation.base(). This function calculates the porportion of a, c, g or t at the mutated positions and/or next to mutations from position -3 to +3. Parallel processing can be used for large datasets.

plotSequencesMutationAA() visualizes base proportions nearby mutations. Barplots are given for each base.

# Example:

data(mutationtab)
data(summarytab)
V.BaseMut<-sequences.mutation.base(mutationtab = mutationtab, summarytab = summarytab, 
                                   analyseEnvironment = TRUE, analyseMutation = TRUE, 
                                   sequence = "V", nrCores=1)
plotSequencesMutationBase(mutationBaseTab = V.BaseMut, plotEnvironment = TRUE, 
     plotMutation = TRUE, colHeatmap = c("white","darkblue"))

Proportions of amino acid changes

Proportions of amino acid changes can be calculated using sequences.mutation.AA(). This function calculates the porportion of mutations from amino acid x to amino acid y. The original amino acids are in rows, the mutated one in columns.

plotSequencesMutationAA() visualizes the 20 x 20 data frame in some kind of heatmap, showing groups of proportions. Furthermore changes of amino acid characteristics, like hydropathy, volume or chemical can be shown (orange dots). Figures can be saved as PDF.

# Example:

data(mutationtab)
V.AAmut<-sequences.mutation.AA(mutationtab = mutationtab, sequence = "V")
plotSequencesMutationAA(mutationAAtab = V.AAmut, showChange = "hydropathy")

Functions for a set of clones

bcRep provides some functions to study clone features. Some columns of the clonotype file can simply be plotted using boxplot() or barplot().

Defining clones and shared clones

Clones can be defined from a set of sequences, using clones(). Criteria for sequences, belonging to the same clones are:

  • same CDR3 (amino acid) length and a identity of a given treshold (it’s possible, to look for same CDR3 sequences or for a CDR3 identity of e.g. 85%)
  • same V gene
  • same J gene (optional)

Output of the clone table can be individually specified (see help), but following columns are mandatory:

  • Shared CDR3 amino acid sequence(s)
  • CDR3 amino acid sequence length
  • Number of unique CDR3 sequences (each CDR3 sequence is mentioned once)
  • Number of all CDR3 sequences (CDR3 sequences are listed with duplicates)
  • Sequence count per CDR3
  • V gene (as used for analysis)
  • V gene and allele (IMGT nomenclature)
  • J gene (as used for analysis)
  • J gene and allele (IMGT nomenclature)

Optionally columns like D gene, CDR3 amino acid sequences, CDR3 nucleotide sequences, functionality, junction frames and so on, can be specified.

# Example:
data(aaseqtab)
data(summarytab)
clones.tab<-clones(aaseqtab = aaseqtab, summarytab = summarytab, ntseqtab = NULL, 
                   identity = 0.85, useJ = TRUE, dispD = FALSE, dispSeqID = FALSE, 
                   dispCDR3aa = FALSE, dispCDR3nt = FALSE,  
                   dispJunctionFr.ratio = FALSE, dispJunctionFr.list = FALSE, 
                   dispFunctionality.ratio = FALSE, dispFunctionality.list = FALSE, 
                   dispTotalSeq = FALSE, nrCores=1)
# Example of a clonotype file (not from the command above)
data(clones.ind)
str(clones.ind, strict.width="cut", width=85)
## 'data.frame':    1000 obs. of  14 variables:
##  $ unique_CDR3_sequences_AA    : chr  "ARERADY, AKERADY, AGERADY" "AKDSDPTV, AKDSD"..
##  $ CDR3_length_AA              : num  7 8 8 8 8 8 8 9 10 10 ...
##  $ number_of_unique_sequences  : num  3 2 2 2 2 3 2 2 2 2 ...
##  $ total_number_of_sequences   : num  45 12 243 130 14 244 59 68 44 296 ...
##  $ sequence_count_per_CDR3     : chr  "42, 2, 1" "11, 1" "242, 1" "129, 1" ...
##  $ V_gene                      : chr  "IGHV3-30-3" "IGHV3-30-3" "IGHV3-30-3" "IGHV"..
##  $ V_gene_and_allele           : chr  "Homsap IGHV3-30-3*01 F, Homsap IGHV3-30*04 "..
##  $ J_gene                      : chr  "IGHJ4" "IGHJ4" "IGHJ4" "IGHJ4" ...
##  $ J_gene_and_allele           : chr  "Homsap IGHJ4*02 F" "Homsap IGHJ4*02 F" "Hom"..
##  $ D_gene                      : chr  "Homsap IGHD1-1*01 F, Homsap IGHD4-17*01 F, "..
##  $ all_CDR3_sequences_AA       : chr  "ARERADY, ARERADY, ARERADY, ARERADY, ARERADY"..
##  $ all_CDR3_sequences_nt       : chr  "gcgagagaacgggctgactac, gcgagagaacgggctgacta"..
##  $ Functionality_all_sequences : chr  "productive, productive, unproductive, produ"..
##  $ Junction_frame_all_sequences: chr  "in-frame, in-frame, in-frame, in-frame, in-"..

Filter clones for their size

Using clones.filterSize() you can filter a set of clonotypes by their size. Three methods for filtering are available.

You can set tresholds for the parameters…

  • number: filters the table for a given number of sequences; e.g. 20 smallest clones
  • propOfClones: filters the table for the proportion of the total number of clones, e.g. the 10% biggest clones
  • propOfSequences: filters the table for the proportion of all sequences, belonging to all clones; e.g. clones, that include 1% of all sequences

Additionally, filtering can be done for both ends (biggest and smallest clones; two.tailed), or only for biggest (upper.tail) or smallest (lower.tail) clones. Biggest clones refers to clones with the biggest sequence number (or largest size/copy number) (same for smallest clones).

# Example 1: Getting the 3 smallest clones (clones with 85% CDR3 identity)
clones.filtered1<-clones.filterSize(clones.tab=clones.ind, 
                                    column="total_number_of_sequences", 
                                    number=3, method="lower.tail")

    # Output of clones.filtered1[,c("total_number_of_sequences",
            # "unique_CDR3_sequences_AA", "CDR3_length_AA", "V_gene")]:
total_number_of_sequences unique_CDR3_sequences_AA CDR3_length_AA V_gene
2 GALITMVRGVISWRFDP, GALIAMVRGVISWRFDP 17 IGHV4-4
2 ARGLQRLVL, VRGLQRLVL 10 IGHV4-4
2 AKDRYGDYALY, ARDRYGDYALY 11 IGHV4-4
# Example 2: Getting 10% biggest and smallest clones 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences")
clones.filtered2<-clones.filterSize(clones.tab=clones.ind, column=4, propOfClones=0.1,
                                    method="two.tailed")
names(clones.filtered2) # a list with biggest and smallest 10% clones
## [1] "upper.tail" "lower.tail"
dim(clones.filtered2$upper.tail) # dimension of table with biggest clones [rows, columns]
## [1] 100  14
dim(clones.filtered2$lower.tail) # dimension of table with smallest clones [rows, columns]
## [1] 100  14

# Example 3: Getting clones, that include 0.5% of all sequences 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences"")
clones.filtered3<-clones.filterSize(clones.tab=clones.ind, column=4,
                                    propOfSequences=0.005, method="two.tailed")
names(clones.filtered3) # a list with biggest and smallest 10% clones
## [1] "upper.tail" "lower.tail"
dim(clones.ind) ## dimension of clones.ind table [rows, columns]
## [1] 1000   14
## >> number of rows is equal to number of rows of biggest and smallest clones
dim(clones.filtered3$upper.tail) # dimension of table with biggest clones [rows, columns]
## [1] 48 14
dim(clones.filtered3$lower.tail) # dimension of table with smallest clones [rows, columns]
## [1] 952  14

Filter clones for functionality or junction frame usage

clones.filterFunctionality() and clones.filterJunctionFrame() filter set of clones for functionality or junction frame usage.

  • Filtering for functionality: filter clones for only productive or unproductive sequences
# Example
data(clones.ind)
productiveClones<-clones.filterFunctionality(clones.tab = clones.ind, 
                                             filter = "productive")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
## [1] 1000   14
# dimension of clones.ind, filtered for productive sequences [rows, columns]:
dim(productiveClones) 
## [1] 1000   14
  • Filtering for junction frame usage: filter for only in-frame or out-of-frame sequences
# Example
data(clones.ind)
inFrameClones<-clones.filterJunctionFrame(clones.tab = clones.ind, filter = "in-frame")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
## [1] 1000   14
# dimension of clones.ind, filtered for in-frame sequences [rows, columns]:
dim(inFrameClones) 
## [1] 873  14

Clone features

Functions to analyse clone features, like CDR3 length distribution or clone copy number, are provided.

CDR3 length distribution of clones

Using clones.CDR3Length() CDR3 length distribution of clones can be estimated. The corresponding visualization method is plotClonesCDR3Length(). Input for both functions are vectors containing nucleotide or amino acid sequences. Further functionality and junction frame usage, depending on CDR3 length can be studied. Absolute or relative values can be returned.

# Example:
data(clones.ind)
CDR3length<-clones.CDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                  functionality = clones.ind$Functionality_all_sequences)

# output of some CDR3 length proportions (CDR3length$CDR3_length[1:3]): 
CDR3_length_7 CDR3_length_8 CDR3_length_9
0.018 0.013 0.007
# output of functionality ratios for some CDR3 lengths 
    # (CDR3length$CDR3_length_vs_functionality[,1:3)])
  CDR3_length_7 CDR3_length_8 CDR3_length_9
productive 0.8889 0.8462 1
unproductive 0.1111 0.1538 0
unknown 0 0 0
plotClonesCDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                     functionality = clones.ind$Functionality_all_sequences, 
                     title = "CDR3 length distribution", PDF = NULL)

Clone copy number distribution

plotClonesCopyNumber() can be used to plot clone sizes as an boxplot.

# Example: Clone copy number distribution with and without outliers
plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=TRUE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)

plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=FALSE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)

Further functions like geneUsage()or trueDiversity() can be used for further analysis of clone sets.

Comparison of different samples

If more than one sample should be analyzed, bcRep provides some functions to compare different samples. Gene usage, amino acid distribution, diversity and shared clones can be compared and visualized.

Comparison of gene usage

Gene usage can be compared using compare.geneUsage(). Input is a list including sequence vectors of each individual. If no names for the samples are specified, samples will be called Sample1, Sample2, … in input order. Analysis of subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as of relative or absolute values is possible.

Results can be visualized using plotCompareGeneUsage() (optional saved as PDF in working directory). A heatmap will be returned, with samples as rows and genes as columns. Proportions of genes are color coded.

data(aaseqtab)
data(aaseqtab2)
V.comp<-compare.geneUsage(gene.list = list(aaseqtab$V_GENE_and_allele, 
                                           aaseqtab2$V_GENE_and_allele), 
                               level = "subgroup", abundance = "relative", 
                          names = c("Individual1", "Individual2"), 
                          nrCores = 1)
  IGHV1 IGHV2 IGHV3 IGHV4 IGHV5 IGHV6 IGHV7
Individual1 0.2477 0.004005 0.507 0.2096 0.02236 0.006008 0.003338
Individual2 0.1863 0.007667 0.5523 0.2077 0.03 0.007 0.009
plotCompareGeneUsage(comp.tab = V.comp, color = c("gray97", "darkblue"), PDF = NULL)

Comparison of amino acid distribution

The amino acid distribution between two or more individuals can be compared using compare.aaDistribution(). Input is a list including sequence vectors for each individual. Sequences of the same length will be analyzed together. Optional the number of sequences used for analysis can be returned. If no names for the samples are specified, samples will be called Sample1, Sample2, … in input order.

The output is a list, containing amino acid distributions for each sequence length and each individual.

plotCompareAADistribution() returns a plot (optional saved as PDF in working directory), where columns represent the samples, and rows represent different sequence lengths. In each field, one bar represents one position of the sequence.

data(aaseqtab)
data(aaseqtab2)
AAdistr.comp<-compare.aaDistribution(sequence.list = list(aaseqtab$CDR3_IMGT,
                                                          aaseqtab2$CDR3_IMGT), 
                                     names = c("Individual1", "Individual2"), 
                                     numberSeq = TRUE, nrCores = 1)

# Comparison of sequence length of 14-16 amino acids: 
## Individual 1: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual1)) 
        ## >> 13:15
## Individual 2: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual2)) 
        ## >> 12:14
AAdistr.comp.part<-list(list(AAdistr.comp$Amino_acid_distribution$Individual1[13:15], 
                             AAdistr.comp$Amino_acid_distribution$Individual2[12:14]),
                        list(AAdistr.comp$Number_of_sequences_per_length$Individual1[13:15],
                             AAdistr.comp$Number_of_sequences_per_length$Individual2[12:14]))
names(AAdistr.comp.part)<-names(AAdistr.comp)
names(AAdistr.comp.part$Amino_acid_distribution)<-
  names(AAdistr.comp$Amino_acid_distribution)
names(AAdistr.comp.part$Number_of_sequences_per_length)<-
  names(AAdistr.comp$Number_of_sequences_per_length)

plotCompareAADistribution(comp.tab = AAdistr.comp.part, plotSeqN = TRUE, 
                          colors=c("darkblue","darkred"), PDF = NULL)

Comparison of richness and diversity

Richness and diversity can be compared using compare.trueDiversity(). Input can be a list of sequences or the output of compare.aaDistribution(). Diversity can be analyzed for order 0 (richness), 1 (exponent of Shannon entropy) or 2 (inverse Simpson) (see diversity analysis above). If no names for the samples are specified, samples will be called Sample1, Sample2, … in the input order.

The output is a list, containing richness or diversity indices for each sequence length and each individual.

Results can be visualized using plotCompareTrueDiversity() (optional saved as PDF in working directory). A plot will be returned, with one figure for each sequence length, if mean.plot = F. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned. Each figure contains the sequence position on the x-axis and the richness/diversity index on the y-axis. If no colors are specified, rainbow() will be used.

data(aaseqtab)
data(aaseqtab2)
trueDiv.comp<-compare.trueDiversity(sequence.list = list(aaseqtab$CDR3_IMGT, 
                                                         aaseqtab2$CDR3_IMGT), 
                                    names = c("Individual1", "Individual2"), 
                                    order = 1, nrCores = 1)

# Comparison of sequence length of 14-16 amino acids: 
grepindex1<-grep("14|15|16",names(trueDiv.comp$Individual1))
grepindex2<-grep("14|15|16",names(trueDiv.comp$Individual2))
trueDiv.comp.part<-list(trueDiv.comp$True_diversity_order, 
                        trueDiv.comp$Individual1[grepindex1],
                        trueDiv.comp$Individual2[grepindex2])
names(trueDiv.comp.part)<-names(trueDiv.comp)
plotCompareTrueDiversity(comp.tab = trueDiv.comp.part, mean.plot = F, colors=c("darkblue","darkred"), 
                         PDF = NULL)
plotCompareTrueDiversity(comp.tab = trueDiv.comp, mean.plot = T, colors = c("darkblue","darkred"), 
                         PDF = NULL)

Looking for clones, that are shared between several samples

To compare clones of different samples, the function clones.shared() can be used. The criteria are the same than in function clones():

  • same CDR3 (amino acid) length and a identity of a given treshold (so it’s possible, to look for same CDR3 sequences or for a CDR3 identity of 85%)
  • same V gene
  • same J gene (optional)

The input for the function has to be prepared as following: combine all individual clone files to one using rbind() and add a first column with sample ID’s. An example you can find in data(clones.allind).

# Example:
data(clones.allind) # includes 2948 clones for 7 individuals
colnames(clones.allind) # colnames of clones.allind
##  [1] "samples"                      "unique_CDR3_sequences_AA"    
##  [3] "CDR3_length_AA"               "number_of_unique_sequences"  
##  [5] "total_number_of_sequences"    "sequence_count_per_CDR3"     
##  [7] "V_gene"                       "V_gene_and_allele"           
##  [9] "J_gene"                       "J_gene_and_allele"           
## [11] "D_gene"                       "all_CDR3_sequences_AA"       
## [13] "all_CDR3_sequences_nt"        "Functionality_all_sequences" 
## [15] "Junction_frame_all_sequences"

Output of ‘clones.shared()’ is a table, which can have the same columns than the ones from clones(). Individual data is seperated by “;”. Columns like * Number of samples sharing a clone * ID’s of samples sharing a clone * copy number of sequences per sample are provided.

# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind) 
              # with default parameters: identity = 0.85, useJ = TRUE, ....

The output of ‘clones.shared()’ can be summarized using clones.shared.summary(). This function returns a data frame, which contains the number of individual clones (optional, only if clone table is given) and the number of shared clones. The number of individual clones is equivalent to the total number of clones per individual minus the number of shared clones.

# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind, identity = 0.85, useJ = TRUE)
sharedclones.summary<-clones.shared.summary(shared.tab = sharedclones, 
                                            clones.tab = clones.allind)
# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind, identity = 0.85, useJ = TRUE)
sharedclones.summary<-clones.shared.summary(shared.tab = sharedclones, 
                                            clones.tab = clones.allind)

# sharedclones.summary:
group number_clones
only in individual1 435
only in individual3 360
only in individual4 440
only in individual5 422
only in individual6 460
only in individual7 389
only in individual9 433
individual3; individual5; individual9 1
individual3; individual4; individual5 1
individual1; individual3; individual6 1

Dissimilarity and distance measurements on gene usage data

The function ‘geneUsage.distance()’ provides several dissimilarity and distance measurements for gene usage data. Bray-Curtis, Jaccard and cosine indices can be calculated on data (taken from vegan::vegdist and proxy::dist).frames with genes in columns and samples in rows, see Example below:

# Example:
data(vgenes) 
vgenes[1:5,1:5]
  IGHV4-34 IGHV1-8 IGHV1-3 IGHV3-23 IGHV3-23D
Sample1 0.06276 0.03591 0.01268 0.08919 0.08858
Sample2 0.05292 0.04631 0.01998 0.08522 0.08389
Sample3 0.02476 0.05528 0.008978 0.04073 0.03966
Sample4 0.05124 2.6e-05 0.02297 0.01556 0.01522
Sample5 0.01844 0.02616 0.03021 0.04346 0.04256

‘geneUsage.distance()’ then calculates a distance matrix:

# Example for 5 samples and 5 genes:
data(vgenes) 
geneUsage.distance(geneUsage.tab = vgenes[1:5,1:5], method = "bc")
  Sample1 Sample2 Sample3 Sample4 Sample5
Sample1 0 0.06271 0.3456 0.5193 0.363
Sample2 0.06271 0 0.2989 0.4812 0.3294
Sample3 0.3456 0.2989 0 0.5297 0.1886
Sample4 0.5193 0.4812 0.5297 0 0.4568
Sample5 0.363 0.3294 0.1886 0.4568 0

Dissimilarity and distance measurements on sequences data

Using ‘sequences.distance()’ distance matrices, based on sequence data, e.g. CDR3 sequences, can be calculated. Following dissmilarity/distance measurements can be used (taken from stringdist::stringdist-metric):

Further it can be distinguished wheather all sequences or subsets of sequences of the same length shall be used for distance estimation.

# Example 1:
## Divide sequences into subsets of same length and then apply Levenshtein distance:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA==7) #take only a subset of clones.ind (CDR3 sequences of 7 AA)
dist1<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=TRUE)
# Subset of output for 7AA sequences:
dist1[[1]][1:5,1:5]
  ARERADY AKERADY AGERADY ARERADY AREQADY
ARERADY 0 1 1 0 1
AKERADY 1 0 1 1 2
AGERADY 1 1 0 1 2
ARERADY 0 1 1 0 1
AREQADY 1 2 2 1 0
# Example 2: 
## Use all sequences with a length of 7 or 8 AA for cosine distance matrix:
data(clones.allind)
clones.allind<-subset(clones.allind, CDR3_length_AA %in% c(9,31)) #take only a subset of clones.ind (CDR3 sequences of 9 or 31 AA)
dist2<-sequences.distance(sequences = clones.allind$unique_CDR3_sequences_AA, 
     groups = clones.allind$individuals, method = "cosine", divLength=FALSE)
# Subset of output:
dist2[[1]][1:4,1:4]
  AGGGSYYHA AGGGSYYHS ARGPYLFDY ARGPYLFDH
AGGGSYYHA 0 0.05263 0.3775 0.3882
AGGGSYYHS 0.05263 0 0.4466 0.4647
ARGPYLFDY 0.3775 0.4466 0 0.09547
ARGPYLFDH 0.3882 0.4647 0.09547 0

Principal coordinate analysis on distance matrices

Principal coordinate analysis can be performed on distance matrices. The function dist.PCoA() provides principal coordinate objects, like in ‘ape::pcoa()’, with an output as follows (taken from ape::pcoa):

The results can also be plotted, using ‘plotDistPCoA()’. Input has to be the PCoA table from ‘dist.PCoA()’, a ‘groups’ data.frame and a pair of axes to be plotted. Optional arguments are names, plotCorrection (default: FALSE), title, plotLegend (default: FALSE) and PDF.

# Example for gene usage data:
data(vgenes) 
vgenes<-vgenes[,1:10] # include only genes 1-10
gu.dist<-geneUsage.distance(geneUsage.tab = vgenes, method = "bc")
gu.pcoa<-dist.PCoA(dist.tab = gu.dist, correction = "cailliez")
str(gu.pcoa) # a PCoA object
## List of 1
##  $ :List of 7
##   ..$ correction : chr [1:2] "cailliez" "3"
##   ..$ note       : chr "Cailliez correction applied to negative eigenvalues: D' = -0.5*(D + 0.0607817555282937 )^2, except diagonal elements"
##   ..$ values     :'data.frame':  10 obs. of  6 variables:
##   .. ..$ Eigenvalues : num [1:10] 0.1688 0.0608 0.026 0.0196 0.0126 ...
##   .. ..$ Corr_eig    : num [1:10] 0.2148 0.0877 0.0417 0.0349 0.0253 ...
##   .. ..$ Rel_corr_eig: num [1:10] 0.4908 0.2003 0.0953 0.0798 0.0578 ...
##   .. ..$ Broken_stick: num [1:10] 0.3397 0.2147 0.1522 0.1106 0.0793 ...
##   .. ..$ Cum_corr_eig: num [1:10] 0.491 0.691 0.786 0.866 0.924 ...
##   .. ..$ Cum_br_stick: num [1:10] 0.34 0.554 0.707 0.817 0.897 ...
##   ..$ vectors    : num [1:10, 1:7] -0.1721 -0.1274 0.0216 0.2608 0.1213 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "Sample1" "Sample2" "Sample3" "Sample4" ...
##   .. .. ..$ : chr [1:7] "Axis.1" "Axis.2" "Axis.3" "Axis.4" ...
##   ..$ trace      : num 0.29
##   ..$ vectors.cor: num [1:10, 1:8] -0.1945 -0.1477 0.0271 0.2896 0.1399 ...
##   ..$ trace.cor  : num 0.438
##   ..- attr(*, "class")= chr "pcoa"
# Example of a 'groups' data.frame: in the case, there are two groups:
groups.vec<-cbind(rownames(vgenes),c(rep("group1",5),rep("group2",5)))
colnames(groups.vec)<-c("Sample","Group")
pandoc.table(groups.vec)
Sample Group
Sample1 group1
Sample2 group1
Sample3 group1
Sample4 group1
Sample5 group1
Sample6 group2
Sample7 group2
Sample8 group2
Sample9 group2
Sample10 group2
plotDistPCoA(pcoa.tab = gu.pcoa, groups= groups.vec, axes = c(1,2), 
             names = rownames(vgenes), plotLegend = TRUE)

# Example for sequence data, divided into subsets of the same length:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA %in% c(7:10)) # include only CDR3 sequences of 7-10 AA
seq.dist<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=T)
seq.pcoa<-dist.PCoA(dist.tab = seq.dist, correction = "none")

# Example of a 'groups' data.frame: 
## in the case, there are no groups (all samples have the same group):
groups.vec<-unlist(apply(data.frame(clones.ind$unique_CDR3_sequences_AA),1,
                         function(x){strsplit(x,split=", ")[[1]]}))
groups.vec<-cbind(groups.vec, 1)

plotDistPCoA(pcoa.tab = seq.pcoa, groups=groups.vec, axes = c(1,2), 
             plotCorrection = FALSE, 
             title = 'PCoA for sequences of length 7-10 AA', plotLegend=FALSE)