Challenges GWAS Faces from High-Dimensional SNP Data and Incomplete Genotype Values: LD-Based Methods and Pipelines for SNP Data Preprocessing

Genome-Wide Association Studies (GWASs) have revolutionized the field of quantitative genetics through their success in identifying the causal SNPs underlying complex phenotypic traits and diseases. In statistical genetics, increasing the marker density together with the sample size can increase QTL mapping resolution. GWAS applications therefore favor large-scale sample data with high-dimensional SNP markers. However, high-dimensional SNP markers mean high sequencing costs and long processing times for very large data sets. In epistatic GWAS analysis, the units of analysis are not single SNP markers but marker pairs, so the number of tests grows quadratically with the number of SNPs, and even a few thousand SNPs become a heavy computational burden.
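To make the scale of the pairwise problem concrete, the short sketch below counts the unordered marker pairs, n(n-1)/2, that an exhaustive epistatic scan would have to test. The function name n_marker_pairs and the SNP counts are illustrative assumptions, not values from the pipelines described here.

```python
from math import comb


def n_marker_pairs(n_snps: int) -> int:
    """Number of unordered SNP pairs an exhaustive pairwise scan must test."""
    return comb(n_snps, 2)


# Illustrative marker counts only (not taken from any data set in this text).
for n in (1_000, 5_000, 100_000):
    print(f"{n:>7} SNPs -> {n_marker_pairs(n):>13,} marker pairs")
```

Reducing the marker count by merging SNPs into LD bins shrinks this number quadratically: halving the markers removes roughly three quarters of the pairs.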

Next Generation Sequencing (NGS) technology provides cheap, high-throughput sequencing, which in theory makes it possible to call and genotype highly dense SNP data and thereby achieve higher GWAS resolution. However, most NGS users aim to lower costs and choose low-coverage sequencing, which makes efficient alignment more difficult, reduces the accuracy of SNP calling, and raises the proportion of missing values after genotype calling.

In short, GWAS faces two challenges: one is high-dimensional SNP data, and the other is the incompleteness of genotype data. Methods and tools are needed to address both.

Linkage Disequilibrium (LD) is the non-random association of alleles at different loci in a given population. A group of SNPs in high LD tends to be inherited together, which means that these SNPs carry redundant information. Genotypes in high LD are therefore informationally correlated within a local region, and a missing genotype can be inferred from the known genotypes that fall in the same LD bin/haplotype block. Additionally, a group of SNPs from the same LD bin/haplotype block can be integrated into one marker, or represented by a single tag SNP, which removes redundant genetic information, reduces the dimension of the SNP data, and makes an epistatic GWAS analysis feasible.
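As a minimal sketch of how the correlation between two SNP markers could be measured in the presence of missing calls, the example below computes a Pearson correlation R over the samples where both genotypes are observed. The 0/1/2 allele-count coding, the use of None for a missing call, and the function name snp_correlation are assumptions made for illustration; the pipelines may use a different measure.

```python
from math import sqrt


def snp_correlation(a, b):
    """Pearson R between two genotype vectors coded 0/1/2.

    None marks a missing call; only samples observed at both SNPs are used.
    Returns NaN when fewer than two complete pairs exist or a SNP is constant.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


# Toy genotypes: the two SNPs agree wherever both calls are observed.
snp1 = [0, 1, 2, 2, 0, None, 1]
snp2 = [0, 1, 2, 2, 0, 1, None]
r = snp_correlation(snp1, snp2)
print(f"R = {r:.3f}, R^2 = {r * r:.3f}")
```

A high R between neighboring markers is what justifies treating them as members of one LD bin and borrowing information across them.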

Determining the LD bins/haplotype blocks, imputing the missing genotypes, and calculating an integrated marker or finding a representative tag SNP for each LD bin are necessary steps in SNP data preprocessing. To address the two challenges GWAS faces, we developed two SNP data preprocessing pipelines, PIP_SNP_Venue1 and PIP_SNP_Venue2, both of which accept high-dimensional SNP data with moderately incomplete genotype values as input. PIP_SNP_Venue1 analyzes the correlation R values between every pair of neighboring SNP markers, then automatically detects the LD bins/haplotype blocks and maps them to the whole genome. PIP_SNP_Venue2 skips LD bin detection and takes an existing LD bin/haplotype block mapping file as input. Within each LD bin, both PIP_SNP_Venue1 and PIP_SNP_Venue2 impute the missing genotypes and either synthesize an integrated marker or find a representative tag SNP.
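The steps described above (neighbor correlation, LD bin detection, imputation within a bin, and tag SNP selection) can be illustrated with the toy sketch below. It is not the implementation of PIP_SNP_Venue1 or PIP_SNP_Venue2. The threshold of 0.8, the rule of opening a new bin whenever the neighbor R drops below it, the rounded-mean imputation, the choice of the most complete SNP as tag, and all function names are assumptions for illustration. Genotypes are assumed to be coded 0/1/2 with None for missing calls, and alleles are assumed to be coded so that linked SNPs correlate positively.

```python
from collections import Counter
from math import sqrt


def pearson_r(a, b):
    """Pearson R over samples observed at both SNPs (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def detect_ld_bins(snps, r_threshold=0.8):
    """Group consecutive SNPs; start a new bin when neighbor R < threshold."""
    if not snps:
        return []
    bins = [[0]]
    for i in range(1, len(snps)):
        r = pearson_r(snps[i - 1], snps[i])
        if r == r and r >= r_threshold:  # r == r is False for NaN
            bins[-1].append(i)
        else:
            bins.append([i])
    return bins


def impute_bin(snps, bin_idx):
    """Fill each missing call with the rounded mean of the other calls that
    the same sample has inside the bin; fall back to the SNP's most common
    genotype when the sample has no observed call in the bin."""
    imputed = {i: list(snps[i]) for i in bin_idx}
    for j in range(len(snps[bin_idx[0]])):
        observed = [snps[i][j] for i in bin_idx if snps[i][j] is not None]
        for i in bin_idx:
            if snps[i][j] is not None:
                continue
            if observed:
                imputed[i][j] = round(sum(observed) / len(observed))
            else:
                known = [g for g in snps[i] if g is not None]
                imputed[i][j] = Counter(known).most_common(1)[0][0] if known else None
    return imputed


def pick_tag_snp(snps, bin_idx):
    """Pick the SNP with the fewest missing calls as the bin's tag SNP."""
    return min(bin_idx, key=lambda i: sum(g is None for g in snps[i]))


# Toy data: rows are SNPs in genome order, columns are samples.
snps = [
    [0, 1, 2, 2, 0, 1],
    [0, 1, 2, None, 0, 1],
    [0, 1, 2, 2, None, 1],
    [2, 0, 1, 0, 2, 0],
    [2, 0, 1, 0, 2, None],
]
bins = detect_ld_bins(snps)
tags = [pick_tag_snp(snps, b) for b in bins]
filled = [impute_bin(snps, b) for b in bins]
print("LD bins:", bins)
print("Tag SNPs:", tags)
```

A pipeline that starts from an existing LD bin mapping file would simply supply the bins list directly and skip detect_ld_bins.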

Users can choose either PIP_SNP_Venue1 or PIP_SNP_Venue2 to preprocess their SNP data. The user interface allows users to select among different methods to detect the LD bins/haplotype blocks, calculate the correlation R value of two SNP markers, and synthesize the LD bin marker.

