Recently, high-throughput sequencing, particularly NGS technology make it capable to sequence and discover great huge number of SNPs and furtherly explore the within-species diversity via constructing haplotype maps and conducting genome-wide association studies(GWAS). GWAS application essentially favor to the large scale data with high dimensional SNP markers and furtherly achieve a higher GWAS analysis resolution. However, most of the NGS technology users usually targeted for a lower cost and chose for low-coverage sequencing, which consequentially increase the difficulty in efficient alignment, and the inaccuracy in SNP calling and the higher ratios of missing values after genotype calling. In epistatic GWAS analysis, it's not direct SNP markers but the marker pairs are involved, which make it unendurable to handle only several thousands of SNPs. To conduct epistatic GWAS analysis, the high dimensional (million scales) SNP markers must be reduced to an acceptable scale. Therefore, there are two obvious challenges that GWAS technology suffers: one is high dimensional SNP data and the other is the incompleteness of genotype data.

The biological concept of Linkage Disequilibrium (LD) means that there is information redundancy among SNP markers, and a group SNP markers from the same LD bin/haplotype block can be integrated as one marker, or selected as a representative tag SNP. Even we do not want to reduce the SNP marker dimension, the missing genotype values can be inferred based on the known genotypes that fall in the same LD bin.

We are targeting for solving the two challenges of GWAS analysis, and developed some methods and pipelines-PIP_SNP for SNP data preprocessing. Fig. 1 illustrate the basic concept of the two challenges and its solutions.

Fig. 1 The concept of two challenges that GWAS suffer and its solutions


PIP_SNP is designed to start from the raw SNP data which can be of high dimension and contain missing genotypes. The raw SNP data should be stored as an mxn matrix, where the m is the marker (row) number and n is the individual (column) number. The meaningful known genotype values should be coded as 0,1,2 while the missing genotype value should be specifically coded as -1. PIP_SNP can proceed the successive SNP data processing, including the LD bin mapping, missing value imputing and representative marker synthesizing. Fig.2 illustrate the PIP_SNP's main procedures.

Fig. 2 The PIP_SNP' Main Preprocessing Modules including LD bin Mapping, Missing Genotype Imputing and Representative Marker Synthesizing

The measurement of two neighbor SNP markers can be calculated by the Pearson Correlation or the LD D values. The more correlative of the two markers, the higher the correlation R value, and two identical SNP marker has the maximum correlation R value of 1.0. If we set a correlation R threshold, such as 0.7, we can map all of the SNP markers into successive blocks. The boundary breakthrough point of each block has a correlation R value that is lower than the preset correlation R threshold. Each block can be considered as a LD bin, containing multiple high correlative SNP markers. If we calculate all of pairwise correlation R values, a 2D scatter plot can be generated. In the scatter plot, a LD bin can be recognized as triangular pattern. Therefore, we can define several detection methods to detect a valid LD bin by calculating the left and/or right of SNP of the LD bin with the breakpoint. Fig.3 illustrate the concept to detect a valid LD bin. Fig.3 A-B are the whole and zoomed 1D plot of all the neighbor two SNP markers' Pearson correlation values, while Fig.3 C is the 2D scatter plot.

Fig. 3 1D Line and 2D Scatter Plots of the Pearson Correlation R values, and the Illustration of the Pattern using for Detecting a Valid LD Bin

Once detect a valid LD bin, the missing genotype can be inferred by its neighbor known genotypes, and k Nearest Neighbor (kNN) method is utilized for the missing genotype imputing in PIP_SNP. Further, a valid LD bin can be directly output or synthesized by integrating all the markers or finding an optimal SNP marker as representative tag SNP marker. Fig.4 A illustrate the three options for synthesizing a valid LD bin, and Fig.4 C show the meaning of the method 2 to find the optimal SNP marker as the tag SNP marker of a LD bin.

Fig. 4 Illustration of the Methods to Synthesize a valid LD Bin

.The LD bin mapping can be generated by other method, or we want to use the existing LD bin mapping results. To be more flexible, we design two versions: PIP_SNP_Venue1 take the raw SNP data as the only input and proceed all of the above three processing procedures, while PIP_SNP_Venue2 need the raw SNP data and the existing LD bin mapping result data. Fig.5 illustrate the data flowcharts and processing modules for the two version of the PIP_SNP pipelines.

Fig. 5 Illustration of Data Flowcharts and Processing Modules for the Two Applications of PIP_SNP Pipelines

Deep Synthesizing: To Solve the Challenging SNP data from Random Population

Due to the LD existing, there are information redundancy among the neighbor SNPs, and the whole genome can be partitioned into LD bins (Haplotype blocks). The number of constructed LD bins should be greatly reduced, when compared with the original raw SNPs. To achieve this, the most important step is to define the relatedness measurement of local SNPs and the algorithm that can widely apply to all of the SNP data.

We investigated two distinct SNP data from RIL(Recombinant Inbred Lines) population and random HapMap population. We found that the correlation characteristic measured as the two neighbor SNPs and one fixed SNP against it right shift SNPs are very different. In general, the RIL SNP data display us as conservatively stable and modestly decreasing correlation R values for the above defined method, however, the random HapMap SNP data show us violent vibration and many spikes for the two correlation methods. Fig.6 illustrate the Pearson correlation R curves for the two different SNP data. As a such, it's relative easy to process the RIL SNP data by mapping the whole SNPs to LD Bins and achieving a higher compression ratio. However, it's a challenge to apply the same method to random HapMap SNP data.

Fig. 6 The Pearson Correlation Characteristics of SNP data from RIL population and Random HapMap Population

From Fig. 6, we oversee the spike characteristics of correlation among local neighbor SNPs from random population, and sense that the information redundancy from SNPs can be skipped, e.g., the 1st and 3rd SNPs are with high correlation and can be grouped together, while the 2nd and 4th SNPs are with high correlation and can be grouped together. To handle this scenario, we develop another program called Deep Synthesising that is compared with the shallow synthesising in PIP_SNP. Fig.7 illustrate the distinction and the cascade of the two function modules.

Fig. 7 The Distinction and Cascading Relationship between Shallow Synthesising and Deep Synthesising of SNPs

According the criteria of correlation, an algorithm for Deep Synthesising to group the neighbor continuous and or skipped SNPs has been proposed. Essentially, the detection of a breakthrough boundary of marker bin move on until there is two more SNPs that can not be grouped at any cases. During a the detection proceed, all the SNPs are sifted and moved into a continuous and an un-continuous(skipped) marker buffers, which will be furthered synthesised into several bins. Fig.8 illustrate the whole procedure of function module- Deep Synthesising.

Fig. 8 The Procedure of Deep Synthesising

The processing in module Deep Synthesising can be cascaded to the previous initial processing in PIP_SNP. PIP_SNP has two representative application cases(see below) as PIP_SNP_Venue1 and PIP_SNP_Venue2 to handle the raw SNP data with and without previous LD Bin mapping as input respectively. Fig.9 illustrate the relationship of data flow among the three function modules.

Fig. 9 Illustration of the Data flow among three function modules: PIP_SNP_Venue1, PIP_SNP_Venue2 and Deep Synthesising

Representative Application of PIP_SNP

The user can upload their raw SNP data for preprocessing. Fig.10 and Fig.11 illustrate two typical application of PIP_SNP_Venue1 and PIP_SNP_Venue2 respectively. PIP_SNP_Venue1 only require to upload the raw SNP data while the PIP_SNP_Venue2 require to upload raw SNP data and an existing LD bin mapping result file. Additionally, the two files can be stored in cloud server and provided as unique URL.

Fig. 10 The User-Interface for a Typical Application of PIP_SNP_Venue1

Fig. 11 The User-Interface for a Typical Application of PIP_SNP_Venue2


1. Shizhong Xu, Genetic Mapping and Genomic Selection Using Recombination Breakpoint Data, Genetics, 2013, 195, 1103-1115..

2. https://en.wikipedia.org/wiki/Tag_SNP

3. Rasmus Nielsen, Joshua S. Paul, Anders Albrechtsen, Yun S. Song, Genotype and SNP calling from next-generation sequencing data, Nat Rev Genet., 2011, 12(6): 443-451.

4.Zhang W., Dai X., Wang Q., Xu S., Zhao P.X., "PEPIS: A Pipeline for Estimating Epistatic Effects in Quantitative Trait Locus Mapping and Genome-Wide Association Studies", 2016. PLoS Comput Biol, 12(5).