BIGDAWG v1.2.1

‘Bridging ImmunoGenomic Data-Analysis Workflow Gaps’ (‘BIGDAWG’) is an integrated analysis system that automates the manual data-manipulation and trafficking steps (the gaps in an analysis workflow) normally required for analyses of highly polymorphic genetic systems (e.g., the immunological human leukocyte antigen (HLA) and killer-cell immunoglobulin-like receptor (KIR) genes) and their respective genomic (immunogenomic) data. Starting with unambiguous genotype data for case-control groups, ‘BIGDAWG’ performs tests of Hardy-Weinberg equilibrium and carries out case-control association analyses for haplotypes, individual loci, and HLA amino acid positions.

Input Data

Data for BIGDAWG should be in a tab-delimited text format. The first row must be a header line and must include column names for genotype data. The first two columns must contain subject IDs and phenotypes (0 = control, 1 = case), respectively. Data for each genotype pair must be located in adjacent columns. Both columns for a given locus must have the same name; do not use ’_1’, ‘.1’, etc. For HLA alleles, names (with or without a locus prefix) can include from a single field up to the full-length name for a given allele.

Missing Information

When information is missing, either for lack of genotyping information or absence of genotyped loci, BIGDAWG allows for conventions to differentiate the type of data that is missing.

Data missing due to lack of a molecular genotyping result is considered not available (NA). Acceptable NA strings include: NA, ****, -, na and Na. Empty data cells will be considered NA.

Data missing due to genomic structural variation (i.e., no locus present) is considered absence. Acceptable absence strings include: Absent, absent, Abs, ABS, ab, Ab, AB, ^. The last symbol is the unicode caret symbol. For HLA data, BIGDAWG allows for a special allele name that indicates absence: 00, 00:00, 00:00:00 and 00:00:00:00 are all acceptable indicators of HLA locus absence. When choosing to use 0’s (zeros) to populate allele name fields, use similar or higher levels of resolution (http://hla.alleles.org/nomenclature/naming.html). When using HLA data, the 00:00 naming convention is preferred and will allow for the amino acid analysis to test a phenotype association for locus absence (see below).
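
The two kinds of missing data described above can be told apart with a simple lookup. The following is a minimal sketch, not BIGDAWG's own code: `classify_cell` is a hypothetical helper, and it assumes allele names are given without a locus prefix.

```r
# Strings listed in this section (copied from the text above).
na_strings     <- c("NA", "****", "-", "na", "Na")
absent_strings <- c("Absent", "absent", "Abs", "ABS", "ab", "Ab", "AB", "^")
hla_absent     <- c("00", "00:00", "00:00:00", "00:00:00:00")

# Hypothetical helper: label each cell as "NA", "absent", or "allele".
classify_cell <- function(x) {
  x <- as.character(x)
  out <- rep("allele", length(x))
  # Empty cells and R's own NA are treated as not available.
  out[is.na(x) | x == "" | x %in% na_strings] <- "NA"
  out[x %in% c(absent_strings, hla_absent)] <- "absent"
  out
}

classify_cell(c("01:01", "NA", "", "00:00", "^", NA))
```

A check like this may be useful before running an analysis, to confirm that every missing cell uses one of the accepted strings.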

Example of data set architecture and acceptable values:

subjectID Disease A A B B DRB1 DRB1 DRB3 DRB3
subject1 0 01:01 02:01 08:01 44:02 01:01 03:01 01:01 00:00
subject2 1 02:01 24:02 51:01 51:01 11:01 14:01 02:02 02:11
subject3 0 03:01 26:02 NA NA 13:01 15:01 00:00 00:00
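
As a sketch of how such a file might be prepared in R, the example below writes the table above to a tab-delimited file and reads it back. The temporary file path and the object names are assumptions made for illustration; only base R functions are used.

```r
# check.names = FALSE keeps the duplicated locus names (A, A, B, B, ...)
# instead of renaming them to A.1, B.1, etc.
geno <- data.frame(
  subjectID = c("subject1", "subject2", "subject3"),
  Disease   = c(0, 1, 0),
  A    = c("01:01", "02:01", "03:01"), A    = c("02:01", "24:02", "26:02"),
  B    = c("08:01", "51:01", NA),      B    = c("44:02", "51:01", NA),
  DRB1 = c("01:01", "11:01", "13:01"), DRB1 = c("03:01", "14:01", "15:01"),
  DRB3 = c("01:01", "02:02", "00:00"), DRB3 = c("00:00", "02:11", "00:00"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no row names or quotes.
path <- tempfile(fileext = ".txt")
write.table(geno, file = path, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")

# Read it back as text so that values such as 00:00 are not altered.
back <- read.table(path, header = TRUE, sep = "\t",
                   check.names = FALSE, colClasses = "character")
names(back)

# The file could then be passed to the package (not run here):
# BIGDAWG(Data = path)
```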

Data Output

After the package is run, BIGDAWG will create a new folder entitled ‘output hhmmss ddmmyy’ in the working directory (unless otherwise specified by the Results.Dir parameter, see below). Within the output folder will be a precheck file (‘PreCheck.txt’) detailing the summary statistics of the dataset and a file with the results of the Hardy-Weinberg equilibrium test (‘HWE.txt’). If no locus subsets are specified (see the Parameters section), a single subfolder entitled ‘set1’ will contain the outputs of each association analysis optioned. If multiple locus subsets are optioned, a subfolder will be written for each locus set, each containing the analytic results for that subset. Within each set subfolder, a parameter file will detail the parameters that are relevant to that subset, as well as BIGDAWG version numbers, for user reference.
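
The folder-name pattern can be mimicked with `format()`. This is only a sketch: it assumes that ‘hhmmss ddmmyy’ means hours, minutes, seconds followed by day, month and two-digit year, and it is not taken from the BIGDAWG source.

```r
# Build a name of the form 'output hhmmss ddmmyy' from the current time.
out_name <- format(Sys.time(), "output %H%M%S %d%m%y")
out_name
```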

Parameters

BIGDAWG(Data, HLA=T, Run.Tests, Loci.Set, All.Pairwise=F, EVS.rm=F, Trim=F, Res=2, Missing=0, Cores.Lim=1L, Results.Dir)

Data

Class: String. No Default.

e.g., Data="HLA_data" or Data="foo.txt"

Specifies the genotype data file name. May be a file name within the working directory or a full file path. See the Input Data section for details about file formatting.

HLA

Class: Logical. Default = T.

Indicates whether or not your data is specific for HLA loci. If your data is not HLA, is a mix of HLA and data for other loci, or includes non-standard HLA allele names, you should set HLA = F. This will override the Trim and EVS.rm arguments, and will skip various tests and checks. Set HLA = T if and only if the HLA allele names in the dataset are consistent with IMGT/HLA Database release 3.0.0 or later (https://www.ebi.ac.uk/ipd/imgt/hla).
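
A rough format check can help decide between HLA = T and HLA = F. The sketch below is a heuristic only: `looks_like_hla` is a hypothetical helper, the regular expression is an assumption about colon-delimited names, and it does not verify that an allele exists in any IMGT/HLA release.

```r
# Hypothetical helper: TRUE when a name looks like a colon-delimited HLA allele,
# with an optional locus prefix (e.g. A*) and an optional expression suffix.
looks_like_hla <- function(x) {
  grepl("^([A-Za-z0-9]+\\*)?[0-9]{2,3}(:[0-9]{2,3}){0,3}[NLSCAQ]?$", x)
}

alleles <- c("01:01", "A*02:01:01:01", "A*01:11N", "A2", "0101")
looks_like_hla(alleles)

# If any allele name fails the check, HLA = F may be the safer choice.
all(looks_like_hla(alleles))
```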

Run.Tests

Class: String or Character vector. Default = Run all tests.

e.g., Run.Tests = c("L","A")

Specifies which tests to run in the analysis. “HWE” will call the Hardy-Weinberg equilibrium test, “H” will call the haplotype association test, “L” will call the locus association test, and “A” will call the amino acid test. Any single test or combination of tests is permitted, e.g., Run.Tests="HWE" or Run.Tests=c("HWE","H","L").

Currently, the amino acid analysis is limited to the HLA-A, -B, -C, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1 loci.
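
Before a run, the requested test codes and the loci eligible for the amino acid test can be checked by hand. This is a minimal sketch with assumed object names; the locus list is copied from the sentence above and `data_loci` stands in for the loci of the example data set.

```r
valid_tests <- c("HWE", "H", "L", "A")
aa_loci     <- c("A", "B", "C", "DRB1", "DQA1", "DQB1", "DPA1", "DPB1")

# Requested tests must all be recognised codes.
run_tests <- c("L", "A")
stopifnot(all(run_tests %in% valid_tests))

# Loci present in the example data set.
data_loci <- c("A", "B", "DRB1", "DRB3")

aa_eligible <- intersect(data_loci, aa_loci)  # covered by the amino acid test
aa_skipped  <- setdiff(data_loci, aa_loci)    # not covered (here: DRB3)
aa_eligible
aa_skipped
```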

Loci.Set

Class: List. Default = Use all loci.

e.g., Loci.Set=list(c("DRB1","DQB1"),c("A","DRB1","DPB1"))

Input list defining which loci to use for analyses. Combinations are permitted. The pair of alleles for a locus must be in adjacent columns in the analyzed data set. Running multiple sets is only relevant for the haplotype analysis. For all other analyses, loci are treated independently. Consider running the haplotype analysis on its own when optioning multi-locus sets, to avoid repeating the other analyses for every set. Each set output will be contained within a separate folder (see Data Output section).
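
The adjacency requirement can be verified against a file header before a run. The following is a sketch under stated assumptions: `check_set` is a hypothetical helper, and `header` holds the column names of the example data set.

```r
header   <- c("subjectID", "Disease", "A", "A", "B", "B",
              "DRB1", "DRB1", "DRB3", "DRB3")
loci_set <- list(c("DRB1", "DQB1"), c("A", "DRB1"))

# Hypothetical helper: for each locus in a set, TRUE when the header contains
# exactly two columns with that name and they sit next to each other.
check_set <- function(set, header) {
  vapply(set, function(locus) {
    idx <- which(header == locus)
    length(idx) == 2 && diff(idx) == 1
  }, logical(1))
}

checks <- lapply(loci_set, check_set, header = header)
checks
```

In this example DQB1 is flagged because it is not present in the header at all.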

All.Pairwise

Class: Logical. Default = F.

Should pairwise combinations of loci be run in the haplotype analysis? Only relevant to haplotype analysis.
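
To see how many pairwise combinations a set of loci produces, base R's `combn()` can enumerate them. This only illustrates what "pairwise combinations" means for the example loci; it is not a description of how BIGDAWG builds them internally.

```r
loci  <- c("A", "B", "DRB1", "DRB3")

# All unordered pairs of loci, one character vector per pair.
pairs <- combn(loci, 2, simplify = FALSE)
length(pairs)

# Readable labels for each pair.
vapply(pairs, paste, character(1), collapse = "~")
```

The number of pairs grows as n(n-1)/2, so a data set with many loci can produce a large number of haplotype analyses.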

EVS.rm

Class: Logical. Default = F. (HLA=T specific).

Flags whether or not to strip expression variant suffixes from HLA alleles. Example: A*01:11N will convert to A*01:11. Should not be optioned for data that does not conform to HLA naming conventions.
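
The effect of stripping can be sketched with a regular expression. `strip_evs` is a hypothetical helper, not the package's implementation, and it assumes the standard expression suffixes N, L, S, C, A and Q.

```r
# Hypothetical helper: drop a trailing expression variant suffix, if any.
# Apply only to allele names: missing-data strings such as "NA" also end in
# a suffix letter and would be mangled.
strip_evs <- function(allele) sub("[NLSCAQ]$", "", allele)

strip_evs(c("A*01:11N", "A*02:01", "24:02:01:02L"))
```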

Trim

Class: Logical. Default = F. (HLA=T specific).

Flags whether or not to trim HLA alleles to a specified resolution. Should not be optioned for data that does not conform to HLA naming conventions.

Res

Class: Numeric. Default = 2. (HLA=T specific).

Sets the desired resolution when trimming HLA alleles. Used only when Trim = T. Fields for HLA formatting must follow current colon-delimited nomenclature conventions. Currently, amino acid analysis will automatically truncate to 2-field resolution. Trimming is automatic and need not be optioned for amino acid analysis to run. Should not be optioned for data that does not conform to HLA naming conventions.
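
Trimming to a given number of fields can be sketched as follows. `trim_allele` is a hypothetical helper written for illustration, not BIGDAWG's own routine, and it assumes non-missing, colon-delimited allele names.

```r
# Hypothetical helper: keep the first `res` colon-delimited fields of each allele.
trim_allele <- function(allele, res = 2) {
  fields <- strsplit(allele, ":", fixed = TRUE)
  vapply(fields, function(f) paste(head(f, res), collapse = ":"), character(1))
}

trim_allele(c("01:01:01:01", "A*02:01:01", "03"), res = 2)
trim_allele(c("01:01:01:01", "A*02:01:01", "03"), res = 1)
```

Names that already have fewer fields than the requested resolution are returned unchanged.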

Missing

Class: String/Numeric. Default = 0.

Sets the allowable amount of missing data per subject for running the haplotype analysis. Large values of Missing can greatly increase processing time. Missing may be set as a number or as “ignore” to skip removal and retain all subjects.
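
The filtering idea can be sketched as below. This is an illustration only: it assumes that missing data is counted as the number of NA allele cells per subject, which is an interpretation rather than something stated here, and `keep_subject` is a hypothetical helper.

```r
# Toy genotype matrix: one row per subject, two allele columns for one locus.
alleles <- rbind(s1 = c("01:01", "02:01"),
                 s2 = c(NA,      "24:02"),
                 s3 = c(NA,      NA))

# Assumed unit: number of missing allele cells per subject.
n_missing <- rowSums(is.na(alleles))

# Hypothetical helper: which subjects are retained for a given Missing value.
keep_subject <- function(n_missing, Missing) {
  if (identical(Missing, "ignore")) return(rep(TRUE, length(n_missing)))
  unname(n_missing <= Missing)
}

keep_subject(n_missing, 0)
keep_subject(n_missing, 1)
keep_subject(n_missing, "ignore")
```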

Cores.Lim

Class: Integer. Default = 1 Core.

Specifies the number of cores accessible by BIGDAWG in the amino acid analysis. Not relevant to Windows operating systems, which will use only a single core. Using more than one core works best when R is run from the command line. Not recommended for GUIs, e.g., RStudio.
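
A sensible value for Cores.Lim can be chosen from the machine's core count. The sketch below uses `parallel::detectCores()` from base R's parallel package; the variable names and the requested value of 4 are assumptions for illustration.

```r
requested <- 4L

# detectCores() can return NA on some platforms; fall back to one core.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Windows is limited to a single core, as noted above.
cores <- if (.Platform$OS.type == "windows") 1L else min(requested, available)
cores

# The value could then be passed on (not run here):
# BIGDAWG(Data = "data.txt", Run.Tests = "A", Cores.Lim = cores)
```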

Results.Dir

Class: String. Default = see Data Output section.

String name of a folder for BIGDAWG output. A subfolder for each set will be generated within any output folder specified.

Examples

These are examples only and need not be run as defined below.

#Run the full analysis using the example set bundled with BIGDAWG
BIGDAWG(Data="HLA_data")

#Run the haplotype analysis with all pairwise combinations on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", All.Pairwise=T)

#Run the Hardy-Weinberg and Locus analysis with non-HLA data while ignoring any missing data on a file called 'data.txt'
BIGDAWG(Data="data.txt", HLA=F, Run.Tests=c("HWE","L"), Missing="ignore")

#Run the amino acid analysis trimming data to 2-Field resolution on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="A", Trim=T, Res=2)

#Run the haplotype analysis with subsets of loci on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", Loci.Set=list(c("DRB1","DQB1","DPB1"),c("DRB1","DQB1")))

Updating the bundled IMGT/HLA protein alignment

The bundled HLA protein alignment used in the amino acid analysis can be updated to the most recent release (https://www.ebi.ac.uk/ipd/imgt/hla). ‘BIGDAWG v 1.2.1’ was bundled using IMGT/HLA database release 3.22.0, 2015-10-10. Compatibility of the updating tool with future database releases is not guaranteed.

# Update to the most recent IMGT/HLA database release
UpdateRelease()

# Force update
UpdateRelease(Force=T)

# Restore the originally bundled release
UpdateRelease(Restore=T)

End of vignette.