Overview

‘Bridging ImmunoGenomic Data-Analysis Workflow Gaps’ (‘BIGDAWG’) is an integrated analysis system that automates the manual data-manipulation and trafficking steps (the gaps in an analysis workflow) normally required for analyses of highly polymorphic genetic systems (e.g., the immunological human leukocyte antigen (HLA) and killer-cell Immunoglobulin-like receptor (KIR) genes) and their respective genomic data (immunogenomic) (Pappas DJ, Marin W, Hollenbach JA, Mack SJ. 2016. ‘Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline.’ Human Immunology. 77:283-287). Starting with unambiguous genotype data for case-control groups, ‘BIGDAWG’ performs tests of Hardy-Weinberg equilibrium, and carries out case-control association analyses for haplotypes, individual loci, and HLA amino acid positions.

Input Data

Data for BIGDAWG should be in a tab-delimited text format. The first row must be a header line and must include column names for genotype data. The first two columns must contain subject IDs and phenotypes (0 = control, 1 = case). While a phenotype may include disease status, non-status phenotypes (onset, severity, ancestry, etc.) may also be used. However, phenotype designations in the dataset are restricted to the use of 0s and 1s. Genotype pairs must be located in adjacent columns. Column names for a given locus must have the same name; do not use ‘_1’, ‘.1’, etc.

For HLA alleles, you may choose to format your genotype calls with or without the locus prefix. For example, for HLA-A, a given genotype call may be 01:01:01:01 or A*01:01:01:01. Allele names can include any level of resolution, from a single field up to the full-length name. For HLA-DRB3, -DRB4, -DRB5 genotype calls, you may choose to represent those in a single pair of columns or as separate pairs of columns for each locus. However, when submitted as a single pair of columns, genotypes must be formatted as Locus*Allele (including non-DRB loci). The column names must be either DRB345 or DRB3/4/5. Homozygous or hemizygous status for DRB3, DRB4 and DRB5 genotypes is determined in reference to DRB1 genotypes based on the haplotypes as defined by Andersson, 1998 (Andersson G. 1998. Evolution of the HLA-DR region. Front Biosci. 3:d739-45.). If you wish to define your own zygosity, it is suggested you split them into separate pairs of columns for each locus.
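To illustrate the two accepted allele name formats, here is a minimal base R sketch that adds or removes a locus prefix. The helpers `strip_prefix` and `add_prefix` are hypothetical names used only for this illustration; they are not BIGDAWG functions, and BIGDAWG does not require you to convert your data this way.

```r
# Hypothetical helpers (not part of BIGDAWG) for converting between
# prefixed ("A*01:01") and unprefixed ("01:01") allele names.
strip_prefix <- function(alleles) {
  # Remove everything up to and including the first "*", if present
  sub("^[^*]*\\*", "", alleles)
}

add_prefix <- function(alleles, locus) {
  # Add "locus*" only to names that do not already carry a prefix
  has_prefix <- grepl("*", alleles, fixed = TRUE)
  ifelse(has_prefix, alleles, paste0(locus, "*", alleles))
}

calls <- c("01:01:01:01", "A*02:01")
unprefixed <- strip_prefix(calls)
prefixed <- add_prefix(calls, "A")
```

Whichever format you choose, the point is to apply it consistently within a dataset. Note that this sketch does not special-case missing-data strings, so they should be handled before any prefix is added.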

Missing Information

When there is missing information, either for lack of genotyping information or absence of genotyped loci, BIGDAWG allows for conventions to differentiate the type of data that is missing.

Data missing due to lack of a molecular genotyping result is considered not available (NA). Acceptable NA strings include: NA, ****, -, na and Na. Empty data cells will be considered NA.

Data missing due to genomic structural variation (i.e., no locus present) is considered absence. Acceptable absence strings include: Absent, absent, Abs, ABS, ab, Ab, AB, ^. The last symbol is the Unicode caret symbol. For HLA data, BIGDAWG allows for a special allele name that indicates absence: 00, 00:00, 00:00:00 and 00:00:00:00 are all acceptable indicators of HLA locus absence. When choosing to use 0s (zeros) to populate allele name fields, use similar or higher levels of resolution (http://hla.alleles.org/nomenclature/naming.html) and follow the same naming convention as with other genotype calls (either with or without locus prefix). If using a single column pair for DRB3/4/5 and the “00” absence indicator, then do not affix a locus prefix for the absent calls. Only include the locus prefix for known DRB345 genotypes (i.e., DRB345*00:00 is NOT an acceptable name). For HLA data, the 00:00 naming convention is preferred and will allow for the amino acid analysis to test a phenotype association for locus absence (see below).
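The distinction between NA and absence can be sketched in base R as follows. The string lists are taken from the text above; the function `classify_call` is a hypothetical name for this illustration and does not reflect how BIGDAWG implements the check internally.

```r
# Strings listed in the text above; an empty cell also counts as NA.
na_strings     <- c("NA", "****", "-", "na", "Na", "")
absent_strings <- c("Absent", "absent", "Abs", "ABS", "ab", "Ab", "AB", "^")
zero_strings   <- c("00", "00:00", "00:00:00", "00:00:00:00")

# Hypothetical helper: label each genotype call as "NA", "absent" or "allele".
classify_call <- function(x) {
  x <- trimws(as.character(x))
  x[is.na(x)] <- ""                      # treat R's NA like an empty cell
  # Drop a leading locus prefix such as "A*" before testing for zero names
  stripped <- sub("^[A-Za-z0-9]+\\*", "", x)
  out <- rep("allele", length(x))
  out[x %in% na_strings] <- "NA"
  out[x %in% absent_strings | stripped %in% zero_strings] <- "absent"
  out
}

status <- classify_call(c("01:01", "NA", "", "00:00", "^", "A*00:00", "A*02:01"))
```

Running a check like this over your genotype columns before analysis is one way to confirm that every missing value falls into the category you intended.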

Novel Alleles

BIGDAWG will accept any name for novel alleles. However, it is suggested you follow the same naming convention for novel alleles as with other genotype calls in your data, either with or without the locus prefix. For example, new alleles could be submitted as follows: New, 01:New, or A*01:New. The BIGDAWG analysis for amino acids cannot accept new allele designations. If you would like to run the amino acid analysis, it is suggested you replace a new allele with NA or omit the subject entirely.

Example of data architecture and acceptable values:

subjectID Disease A A B B DRB1 DRB1 DRB3 DRB3
subject1 0 01:01 02:01 08:01 44:02 01:01 03:01 NA NA
subject2 1 02:01 24:02 51:01 51:01 11:01 14:01 02:02 02:11
subject3 0 03:01 26:02 NA NA 10:01 08:01 00:00 00:00
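As a rough check of the layout rules described in this section, the example rows above can be written to a tab-delimited file and read back in base R. This is only a sketch of a pre-flight check you might run yourself; the variable names are illustrative, and BIGDAWG performs its own checks when it loads the data.

```r
# Example rows from the table above, one character vector per line
rows <- list(
  c("subjectID", "Disease", "A", "A", "B", "B", "DRB1", "DRB1", "DRB3", "DRB3"),
  c("subject1", "0", "01:01", "02:01", "08:01", "44:02", "01:01", "03:01", "NA", "NA"),
  c("subject2", "1", "02:01", "24:02", "51:01", "51:01", "11:01", "14:01", "02:02", "02:11"),
  c("subject3", "0", "03:01", "26:02", "NA", "NA", "10:01", "08:01", "00:00", "00:00")
)

# Write a temporary tab-delimited file
path <- tempfile(fileext = ".txt")
writeLines(vapply(rows, paste, character(1), collapse = "\t"), path)

# check.names = FALSE keeps the duplicated locus column names intact;
# colClasses = "character" stops "01:01" or "0" from being converted
dat <- read.table(path, header = TRUE, sep = "\t",
                  check.names = FALSE, colClasses = "character")

# Locus columns must come in adjacent pairs that share one name
loci   <- names(dat)[-(1:2)]
first  <- loci[seq(1, length(loci), by = 2)]
second <- loci[seq(2, length(loci), by = 2)]
pairs_ok <- length(loci) %% 2 == 0 && all(first == second)

# Phenotypes (second column) must be coded as 0 or 1
pheno_ok <- all(dat[[2]] %in% c("0", "1"))
```

If either `pairs_ok` or `pheno_ok` is `FALSE` for your own file, the column layout or phenotype coding needs to be corrected before analysis.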

Data Output

After the package is run, BIGDAWG will create a new folder entitled ‘output hhmmss ddmmyy’ in the working directory (unless otherwise specified by Results.Dir parameter, see below). Within the output folder will be a precheck file (‘PreCheck.txt’) detailing the summary statistics of the dataset and the results of the Hardy-Weinberg equilibrium test (‘HWE.txt’). If no locus subsets are specified (see parameters section), a single subfolder entitled ‘set1’ will contain the outputs of each association analysis run. If multiple locus subsets are optioned, subfolders for each locus set will be created, each containing the respective analytic results for that subset. Within each set subfolder, a parameter file will detail the parameters that are relevant to that subset.
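The folder layout described above can be pictured with a short base R sketch. The exact timestamp format is an assumption inferred from the ‘output hhmmss ddmmyy’ pattern, and the sketch writes to a temporary directory rather than the working directory; BIGDAWG creates its own folders and this code is not needed to run it.

```r
# Assumed timestamp pattern: hours-minutes-seconds, then day-month-year
stamp <- format(Sys.time(), "%H%M%S %d%m%y")

# Top-level output folder plus the default 'set1' subfolder
out_dir <- file.path(tempdir(), paste("output", stamp))
set_dir <- file.path(out_dir, "set1")
dir.create(set_dir, recursive = TRUE, showWarnings = FALSE)

# Files named in the text would sit directly under the output folder
precheck_path <- file.path(out_dir, "PreCheck.txt")
hwe_path      <- file.path(out_dir, "HWE.txt")
```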

Error Messages and Codes

BIGDAWG has a few built-in checks to ensure data format consistency and compatibility, especially for HLA data. BIGDAWG also does a parameter review before performing chi-squared tests and returns ‘NCalc’ (not calculated) when all genotypes have expected counts < 5 or the degrees of freedom do not allow for a test (e.g., dof < 1).
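The ‘NCalc’ rule can be illustrated with a minimal base R sketch. It assumes a contingency table of counts with phenotype groups in rows and categories in columns, and computes expected counts under independence. The function `chisq_or_ncalc` is a hypothetical name; BIGDAWG's own binning and test logic may differ in detail.

```r
# Hypothetical helper: return a chi-squared p-value, or "NCalc" when all
# expected counts are below 5 or the degrees of freedom are below 1.
chisq_or_ncalc <- function(counts) {
  expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
  dof <- (nrow(counts) - 1) * (ncol(counts) - 1)
  if (dof < 1 || all(expected < 5)) {
    return("NCalc")
  }
  chisq.test(counts, correct = FALSE)$p.value
}

sparse  <- matrix(c(1, 2, 2, 1), nrow = 2)      # every expected count is 1.5
dense   <- matrix(c(30, 20, 10, 40), nrow = 2)  # expected counts of 20 and 30
one_col <- matrix(c(10, 20), nrow = 2)          # zero degrees of freedom

res_sparse  <- chisq_or_ncalc(sparse)
res_dense   <- chisq_or_ncalc(dense)
res_one_col <- chisq_or_ncalc(one_col)
```

Here the sparse and single-column tables are reported as not calculated, while the dense table yields a numeric p-value.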

Parameters

BIGDAWG(Data, HLA=T, Run.Tests, Loci.Set, All.Pairwise=F, Trim=F, Res=2, EVS.rm=F, Missing=0, Cores.Lim=1L, Results.Dir)

Data

Class: String. Required. No Default.

e.g., Data="HLA_data" -or- Data="foo.txt"

Specifies the genotype data file name. May be a file name within the working directory or a full file path. See the Input Data section for details about file formatting.

HLA

Class: Logical. Optional. Default = T.

Indicates whether or not your data is specific for HLA loci. If your data is not HLA, is a mix of HLA and data for other loci, or includes non-standard HLA allele names, you should set HLA = F. This will override the Trim and EVS.rm arguments, and will skip various tests and checks. Set HLA = T if and only if the HLA allele names in the dataset are consistent with the most recent IMGT/HLA Database release (https://www.ebi.ac.uk/ipd/imgt/hla).

Run.Tests

Class: String or Character vector. Optional. Default = Run all tests.

e.g., Run.Tests = c("L","A") -or- Run.Tests = "HWE"

Specifies which tests to run in the analysis. “HWE” will call the Hardy-Weinberg equilibrium test, “H” will call the haplotype association test, “L” will call the locus association test, and “A” will call the amino acid test. Combinations of the tests are permitted.

Currently, the amino acid analysis is limited to the HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1 and -DPB1 loci.

Loci.Set

Class: List. Optional. Default = Use all loci.

e.g., Loci.Set=list(c("DRB1","DQB1"),c("A","DRB1","DPB1"), c("DRB1","DRB3"))

Input list defining which loci to use for analyses. If you included HLA-DRB3, -DRB4, -DRB5 as a collapsed column pair (‘DRB345’), you must specify the single locus in Loci.Set if you wish it to be included in the analysis (i.e., ‘DRB3’). Combinations are permitted.

The pair of alleles for a locus must be in adjacent columns in the analyzed data set. Running multiple sets is ONLY relevant for the haplotype analysis. For all other analyses, loci are treated independently. Consider running haplotype analysis independently when optioning multi-locus sets to avoid redundancy of the other analyses. Each set output will be contained within a separate folder (see Data Output section).

All.Pairwise

Class: Logical. Optional. Default = F.

Should pairwise combinations of loci be run in the haplotype analysis? Only relevant to haplotype analysis.
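To show what “pairwise combinations of loci” means, the base R function `combn` can enumerate every two-locus set from a list of loci. This is only an illustration of the resulting sets, using made-up input; it is not a description of BIGDAWG's internal code.

```r
# Loci present in a hypothetical dataset
loci <- c("A", "B", "DRB1")

# Every unordered pair of loci, returned as a list of character vectors
pairwise_sets <- combn(loci, 2, simplify = FALSE)
# pairwise_sets holds c("A","B"), c("A","DRB1") and c("B","DRB1")
```

The result has the same shape as a Loci.Set list, and the number of pairs grows quadratically with the number of loci, which is worth keeping in mind for run time.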

EVS.rm

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to strip expression variant suffixes from HLA alleles. Example: A*01:11N will convert to A*01:11. Should not be optioned for data that does not conform to HLA naming conventions.
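A minimal sketch of suffix stripping in base R is shown here. It assumes the expression variant suffixes defined by HLA nomenclature (N, L, S, C, A, Q) and only removes a suffix that directly follows a digit, so strings such as NA or ABS are left alone. The function `strip_evs` is a hypothetical name, not the routine BIGDAWG uses.

```r
# Hypothetical helper: drop a trailing expression variant suffix
# (N, L, S, C, A or Q) when it follows a digit.
strip_evs <- function(alleles) {
  sub("([0-9])[NLSCAQ]$", "\\1", alleles)
}

stripped <- strip_evs(c("A*01:11N", "A*01:01", "B*44:02:01:02S", "NA", "ABS"))
```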

Trim

Class: Logical. Optional. Default = F. (HLA=T specific).

Flags whether or not to Trim HLA alleles to a specified resolution. Should not be optioned for data that does not conform to HLA naming conventions.

Res

Class: Numeric. Optional. Default = 2. (HLA=T specific).

Sets the desired resolution when trimming HLA alleles. Used only when Trim = T. Fields for HLA formatting must follow current colon-delimited nomenclature conventions. Currently, the amino acid analysis will automatically truncate to 2-field resolution. Trimming is automatic and need not be optioned for amino acid analysis to run. This test will not run for data that does not conform to HLA naming conventions.
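Trimming to a given number of colon-delimited fields can be sketched in base R as follows. The function `trim_allele` is a hypothetical name for this illustration; it simply keeps the first `res` fields and leaves shorter names unchanged, which may differ from BIGDAWG's handling of edge cases.

```r
# Hypothetical helper: keep only the first `res` colon-delimited fields.
trim_allele <- function(alleles, res = 2) {
  vapply(strsplit(alleles, ":", fixed = TRUE), function(fields) {
    paste(head(fields, res), collapse = ":")
  }, character(1), USE.NAMES = FALSE)
}

trimmed <- trim_allele(c("A*01:01:01:01", "02:01", "03"), res = 2)
```

Because trimming removes trailing fields, any expression variant suffix attached to a removed field is dropped along with it.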

Missing

Class: String/Numeric. Optional. Default = 0.

Sets the allowable per-subject threshold for missing alleles. Relevant to running the haplotype analysis. Large values of Missing can drastically increase processing time. Missing may be set as a number or as “ignore” to skip removal and retain all subjects.
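The effect of a per-subject threshold can be sketched in base R. This assumes missing alleles are counted across all genotype columns for each subject and that subjects exceeding the threshold are dropped; the function `filter_missing` and the small genotype matrix are hypothetical, and BIGDAWG's own counting may differ.

```r
# Small illustrative genotype matrix; NA marks a missing allele
geno <- rbind(
  subject1 = c("01:01", "02:01", "08:01", "44:02"),
  subject2 = c("02:01", NA,      NA,      "51:01"),
  subject3 = c("03:01", "26:02", NA,      NA),
  subject4 = c(NA,      NA,      NA,      "07:02")
)

# Hypothetical helper: keep subjects with at most `missing` missing alleles,
# or keep everyone when missing = "ignore".
filter_missing <- function(geno, missing = 0) {
  if (identical(missing, "ignore")) return(geno)
  n_missing <- rowSums(is.na(geno))
  geno[n_missing <= missing, , drop = FALSE]
}

kept_strict  <- rownames(filter_missing(geno, 0))
kept_relaxed <- rownames(filter_missing(geno, 2))
kept_all     <- rownames(filter_missing(geno, "ignore"))
```

With a threshold of 0 only subject1 remains, with 2 the first three subjects remain, and with “ignore” all four are retained.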

Cores.Lim

Class: Integer. Optional. Default = 1 Core.

Specifies the number of cores accessible by BIGDAWG in the amino acid analysis. Not relevant to Windows operating systems, which will use only a single core. Using more than one core works best when R is run from the command line. Not recommended for GUIs, e.g., RStudio.
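One way to choose a sensible value for Cores.Lim is sketched here, using `detectCores` from the `parallel` package that ships with R. The function `choose_cores` is a hypothetical helper, not part of BIGDAWG; it mirrors the single-core behaviour on Windows described above and caps the request at the number of detected cores.

```r
# Hypothetical helper: cap the requested core count at what is available,
# and fall back to a single core on Windows.
choose_cores <- function(requested = 1L) {
  if (.Platform$OS.type == "windows") return(1L)
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L   # detectCores() can return NA
  as.integer(max(1L, min(requested, available)))
}

n_cores <- choose_cores(4L)
```

The resulting value could then be passed as `Cores.Lim = n_cores`.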

Results.Dir

Class: String. Optional. Default = see Data Output section.

String name of a folder for BIGDAWG output. A subfolder for each set will be generated within any output folder specified.

Examples

These are examples only and need not be run as defined below.

#Run the full analysis using the example set bundled with BIGDAWG
BIGDAWG(Data="HLA_data")

#Run the haplotype analysis with all pairwise combinations on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", All.Pairwise=T)

#Run the Hardy-Weinberg and Locus analysis with non-HLA data while ignoring any missing data on a file called 'data.txt'
BIGDAWG(Data="data.txt", HLA=F, Run.Tests=c("HWE","L"), Missing="ignore")

#Run the amino acid analysis trimming data to 2-Field resolution on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="A", Trim=T, Res=2)

#Run the haplotype analysis with subsets of loci on a file called 'data.txt'
BIGDAWG(Data="data.txt", Run.Tests="H", Loci.Set=list(c("DRB1","DQB1","DPB1"),c("DRB1","DQB1")))

Updating the bundled IMGT/HLA protein alignment

The bundled HLA protein alignment used in the amino acid analysis can be updated to the most recent release (https://www.ebi.ac.uk/ipd/imgt/hla). This version of BIGDAWG was bundled using the indicated release (see above). Future database updates do not guarantee compatibility with the updating tool.

# Update to the most recent IMGT/HLA database release
UpdateRelease()

# Force update
UpdateRelease(Force=T)

# Restore the originally bundled release
UpdateRelease(Restore=T)

End of vignette.