PIP_SNP: Pipelines for Preprocessing Genotyped SNP Data Featured as LD Bin Mapping, Missing Genotype Imputing, and LD Bin Marker Synthesizing

Module Description

PIP_SNP is a set of SNP data preprocessing pipelines developed to address two challenges of SNP data: the high dimensionality of SNP markers and the large proportion of missing genotypes. The pipelines rely on the biological concept of linkage disequilibrium (LD): a group of SNP markers within a local region tends to be inherited together. The whole genome can therefore be mapped into LD bins, each containing several correlated SNP markers. A missing genotype can be inferred from the known genotypes in the same LD bin, and a representative marker for a bin can be generated by integrating all SNP markers in that bin. Functionally, PIP_SNP includes LD mapping across the whole genome, missing genotype imputation for each LD bin, and representative marker synthesis for each LD bin.
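As a rough illustration of the LD bin idea, the following C++ sketch groups consecutive markers into bins whenever the squared correlation (r^2) between adjacent markers stays above a threshold. It is not the PIP_SNP implementation: the genotype coding (0, 1, 2, with -1 for missing), the use of r^2 as the LD measure, the threshold parameter, and the names `r_squared` and `map_ld_bins` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed coding for this sketch: 0, 1, 2 = genotype code, -1 = missing.
constexpr int kMissing = -1;

// Squared Pearson correlation between two markers, computed only over
// samples where both genotypes are observed. Returns 0 when the
// correlation is undefined (fewer than 2 shared samples or no variance).
double r_squared(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    double n = 0, sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == kMissing || b[i] == kMissing) continue;
        const double x = a[i], y = b[i];
        n += 1; sa += x; sb += y;
        saa += x * x; sbb += y * y; sab += x * y;
    }
    if (n < 2) return 0.0;
    const double cov = sab - sa * sb / n;
    const double va = saa - sa * sa / n;
    const double vb = sbb - sb * sb / n;
    if (va <= 0.0 || vb <= 0.0) return 0.0;
    return cov * cov / (va * vb);
}

// Assigns a bin index to every marker (markers[m][s] = genotype of sample s
// at marker m, markers ordered by genome position). A new bin starts when
// r^2 with the previous marker drops below the threshold.
std::vector<int> map_ld_bins(const std::vector<std::vector<int>>& markers,
                             double threshold) {
    std::vector<int> bin(markers.size(), 0);
    int current = 0;
    for (std::size_t m = 1; m < markers.size(); ++m) {
        if (r_squared(markers[m - 1], markers[m]) < threshold) ++current;
        bin[m] = current;
    }
    return bin;
}
```

Calling `map_ld_bins(markers, 0.8)` on position-ordered markers returns one bin index per marker; markers sharing an index would be treated as one LD bin in the later imputation and synthesis steps.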
To be flexible, we developed two related but distinct pipelines, PIP_SNP_Venue1 and PIP_SNP_Venue2. PIP_SNP_Venue1 starts from the numerical genotyped SNP data alone and includes three modules: LD Bin Detecting and Mapping, LD Based Missing Genotype Imputing, and LD Based Marker Synthesizing. PIP_SNP_Venue2 starts from an existing LD mapping result together with the numerical genotyped SNP data and includes two modules: LD Based Missing Genotype Imputing and LD Based Marker Synthesizing.
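The two modules shared by both pipelines can be pictured with a small C++ sketch that works on one LD bin at a time. The rules used here are assumptions chosen for simplicity, not the rules PIP_SNP applies: genotypes are coded 0, 1, 2 with -1 for missing, all markers in a bin are assumed to share the same allele orientation, and both imputation and synthesis use a per-sample majority vote. The names `bin_consensus` and `impute_bin` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed coding for this sketch: 0, 1, 2 = genotype code, -1 = missing.
constexpr int kMissing = -1;
using Marker = std::vector<int>;  // one genotype per sample

// Per-sample consensus of a bin: the most frequent observed code among the
// bin's markers (ties go to the smaller code). Stays kMissing when no
// marker in the bin is observed for that sample.
Marker bin_consensus(const std::vector<Marker>& bin) {
    assert(!bin.empty());
    const std::size_t samples = bin.front().size();
    Marker consensus(samples, kMissing);
    for (std::size_t s = 0; s < samples; ++s) {
        std::array<int, 3> count{0, 0, 0};
        for (const Marker& m : bin) {
            assert(m.size() == samples);
            if (m[s] >= 0 && m[s] <= 2) ++count[static_cast<std::size_t>(m[s])];
        }
        int best = kMissing;
        int best_count = 0;
        for (int g = 0; g < 3; ++g) {
            if (count[static_cast<std::size_t>(g)] > best_count) {
                best = g;
                best_count = count[static_cast<std::size_t>(g)];
            }
        }
        consensus[s] = best;
    }
    return consensus;
}

// Imputation step: fill each missing genotype with the consensus of the
// known genotypes in the same bin for that sample.
void impute_bin(std::vector<Marker>& bin) {
    const Marker consensus = bin_consensus(bin);
    for (Marker& m : bin) {
        for (std::size_t s = 0; s < m.size(); ++s) {
            if (m[s] == kMissing) m[s] = consensus[s];
        }
    }
}
```

In this sketch the synthesis step is simply `bin_consensus(bin)` applied after `impute_bin(bin)`, which yields one representative marker per bin. A real pipeline might weight markers or use a different integration rule.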
To synthesize SNP data from random populations, such as HapMap, efficiently and achieve a higher synthesizing ratio, we consider the correlation between directly neighboring SNPs as well as between SNPs separated by skipped markers. A separate program, Deep Synthesising, has been developed for this purpose; it can be selected as a synthesizing option to further synthesize the markers output by PIP_SNP.
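One way to picture grouping over both neighboring and skipped SNPs is the C++ sketch below. It is only an interpretation of the idea, not the Deep Synthesising algorithm: each marker looks back at its direct neighbor first and then at markers up to `max_skip` positions further away, joining the group of the nearest one whose r^2 reaches the threshold. The input is assumed to be complete (already imputed) numeric markers, and `r_squared`, `group_with_skips`, `threshold`, and `max_skip` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Marker = std::vector<double>;  // one numeric value per sample, no missing data

// Squared Pearson correlation of two complete markers; 0 when undefined.
double r_squared(const Marker& a, const Marker& b) {
    assert(a.size() == b.size());
    const double n = static_cast<double>(a.size());
    if (n < 2) return 0.0;
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sa += a[i]; sb += b[i];
        saa += a[i] * a[i]; sbb += b[i] * b[i]; sab += a[i] * b[i];
    }
    const double cov = sab - sa * sb / n;
    const double va = saa - sa * sa / n;
    const double vb = sbb - sb * sb / n;
    if (va <= 0.0 || vb <= 0.0) return 0.0;
    return cov * cov / (va * vb);
}

// Returns a group id per marker. With max_skip = 0 only adjacent markers
// can be merged; larger values also let a marker join a group across up to
// max_skip intervening markers.
std::vector<int> group_with_skips(const std::vector<Marker>& markers,
                                  double threshold, std::size_t max_skip) {
    std::vector<int> group(markers.size(), -1);
    int next_group = 0;
    for (std::size_t m = 0; m < markers.size(); ++m) {
        for (std::size_t back = 1; back <= max_skip + 1 && back <= m; ++back) {
            if (r_squared(markers[m - back], markers[m]) >= threshold) {
                group[m] = group[m - back];  // join the nearest correlated marker
                break;
            }
        }
        if (group[m] < 0) group[m] = next_group++;  // no match: start a new group
    }
    return group;
}
```

Allowing skips can only reduce or preserve the number of groups relative to `max_skip = 0`, which is the sense in which such a scheme could raise the synthesizing ratio on populations where LD between direct neighbors is weak.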
PIP_SNP is developed in C++ and compiled on Linux with the open-source IDE Code::Blocks and on Windows with Visual Studio 2015. Users can download the appropriate version, compile it locally, and run it from the command line.
Deep Synthesising is also developed in C++ and compiled on Linux with the open-source IDE Code::Blocks. Users can download the appropriate version of the source code, compile it locally, and run it from the command line.
In this web interface version, we aim to provide a user-friendly experience. Because the original genotyped SNP data can be very large, we developed a module that runs in an HTML5 browser and implements resumable, multithreaded, chunked data uploading. In addition, when the original genotyped SNP data is stored on a remote cloud server, such as Google Drive, PIP_SNP lets the user provide the shared URL instead.
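The bookkeeping behind a resumable chunked upload can be sketched in a few lines. The actual upload module runs in the browser and is not shown here; the C++ sketch below only illustrates the general idea that a file is split into fixed-size chunks and, after an interruption, only chunks the server has not yet confirmed are sent again. The `Chunk` struct, the `pending_chunks` function, and the notion of a set of received chunk indices are assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// One piece of the file: its index, byte offset, and length in bytes.
struct Chunk {
    std::uint64_t index;
    std::uint64_t offset;
    std::uint64_t length;
};

// Lists the chunks that still need to be uploaded, given the file size,
// the chunk size, and the indices the server already holds. The last
// chunk may be shorter than chunk_size.
std::vector<Chunk> pending_chunks(std::uint64_t file_size,
                                  std::uint64_t chunk_size,
                                  const std::set<std::uint64_t>& received) {
    assert(chunk_size > 0);
    std::vector<Chunk> todo;
    std::uint64_t index = 0;
    for (std::uint64_t offset = 0; offset < file_size;
         offset += chunk_size, ++index) {
        if (received.count(index) != 0) continue;  // already uploaded
        const std::uint64_t length = std::min(chunk_size, file_size - offset);
        todo.push_back({index, offset, length});
    }
    return todo;
}
```

Because each pending chunk is independent, the entries returned by `pending_chunks` could be handed to several upload workers in parallel, which matches the multithreaded aspect described above.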

Download

User Manual

Test Dataset

PIP_SNP_Venue1 needs only the genotyped SNP Data File as input.
PIP_SNP_Venue2 needs both the genotyped SNP Data File and the existing LD Bin Mapping File as inputs.

Performance Evaluation

We use RIL SNP and HapMap SNP data, together with phenotype traits, to evaluate association performance. Both types of SNP data are synthesized and compressed to a series of dimensions. We then run TASSEL and our own 2D GWAS platform, PATOWAS, on the results and generate the corresponding p-values.
RIL_SNP for Evaluation: RIL Raw SNP and Trait; evaluation results: RIL SNP Evaluation Results.
Hapmap_SNP for Evaluation: Hapmap Raw SNP and Trait; evaluation results: Hapmap SNP Evaluation Results.

Source Code

Linux CodeBlocks Beta Version: Linux_CB_PIP_SNP_Venue1, Linux_CB_PIP_SNP_Venue2, and Linux_CB_Deep_Synthesis.
Linux CodeBlocks Alpha Version: Linux_CB_PIP_SNP_Venue1 and Linux_CB_PIP_SNP_Venue2.
Windows Visual Studio Alpha Version: Windows_VS_PIP_SNP_Venue1 and Windows_VS_PIP_SNP_Venue2.

Development Information

Language:	C/C++
Current Version:	V1.0
Platform:	Linux (Code::Blocks) or Windows (Visual Studio 2015)
Licence:	GPL 3.0
Status:	Active
Last Update:	04/18/2020
Contact:	Wenchao Zhang, wezhang AT noble.org