Recently, high-throughput sequencing, particularly **NGS** technology, has made it possible to sequence and discover huge numbers of **SNPs** and to further explore within-species diversity by constructing haplotype maps and conducting **genome-wide association studies (GWAS)**. To conduct a GWAS, a **linear mixed model (LMM)** is commonly applied, and calculating the **kinship matrix** is usually the first step in solving the LMM. The kinship matrix **K** is a 2D matrix of dimension **n x n**, and each entry **K(i,j)** is a coefficient that measures the genetic resemblance between individual **i** and individual **j**. In principle, the more individuals and the more genotypic markers there are, the more expensive the kinship matrix calculation becomes. A typical **GWAS** may call **millions of SNPs**, with genotypic markers across several hundred to several thousand accessions/lines. A typical genotypic data set can therefore amount to **>100 GB**. Loading genotypic data at this **GB** scale and carrying out the kinship matrix calculation is a challenge.

In recent years, **GPUs (Graphics Processing Units)** with many hardware processor cores (>1,000) have become a standard **HPC (High Performance Computing)** solution for large-scale computing. GPU-powered HPC platforms are especially well suited to **large-scale matrix operations**.

We analyzed the mathematical principle and the complexity of the marker-assisted kinship matrix, and developed this **GPU**-powered pipeline for main-effect kinship matrix calculation: **KMC1D**, which can readily achieve speedups of several hundred times compared with conventional single-process computation.

In genetic analysis, there are two kinds of **main effects: additive and dominance**. If genotype data are available, the additive and dominance genotypic numeric values can be clearly defined. The simplest way to calculate the kinship matrix is to load the whole genotype matrix, take its **transpose**, and then **multiply** the two matrices, which in principle requires the CPU to allocate memory of the same size as the genotype data file. This is clumsy, and may even be impossible, if the original genotype data amount to **100 GB**. Mathematically, the kinship matrix has a **linearity** property: it can be obtained by summing a set of **sub-kinship matrices**, each calculated from **a block of successive markers**. Therefore, loading only one block of genotypic markers at a time, such as 10,000 markers, greatly reduces the memory requirement. **Fig. 1** depicts the mathematical rationale for generating the two types of genotypic marker values and partitioning all the markers into successive blocks for sub-kinship matrix calculation.

**Fig. 1 The mathematical rationale for generating the two types of main-effect genotypic values and partitioning all the markers into successive blocks for sub-kinship matrix calculation**
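
The block-wise summation described above can be illustrated with a minimal pure-Python sketch. It is not the KMC1D implementation: the function names (`sub_kinship`, `blockwise_kinship`), the toy coded values, and the block size are all assumptions made for illustration, and normalization is left out. The sketch only checks that summing the sub-kinship matrices of successive marker blocks gives the same unnormalized result as processing all markers at once.

```python
def sub_kinship(block):
    """Unnormalized sub-kinship matrix for one block of markers.

    `block` is a list of marker rows, each holding one coded value per
    individual (markers x individuals). The result is the n x n matrix
    block^T * block, accumulated marker by marker.
    """
    n = len(block[0])
    k = [[0.0] * n for _ in range(n)]
    for row in block:
        for i in range(n):
            for j in range(n):
                k[i][j] += row[i] * row[j]
    return k


def blockwise_kinship(markers, block_size):
    """Sum the sub-kinship matrices of successive marker blocks."""
    n = len(markers[0])
    total = [[0.0] * n for _ in range(n)]
    for start in range(0, len(markers), block_size):
        part = sub_kinship(markers[start:start + block_size])
        for i in range(n):
            for j in range(n):
                total[i][j] += part[i][j]
    return total


# Toy data (hypothetical coded values): 5 markers (rows) x 3 individuals (columns).
Z = [
    [1, 0, -1],
    [0, 1, 1],
    [-1, -1, 0],
    [1, 1, 0],
    [0, -1, 1],
]

full = sub_kinship(Z)                         # all markers in one pass
blocked = blockwise_kinship(Z, block_size=2)  # blocks of 2, 2 and 1 markers
print(blocked == full)
```

In this sketch only one block of markers needs to be held in memory at a time, while the running total stays at a fixed n x n size regardless of the number of markers.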

**GPU**-based parallel computing is well suited to large-scale matrix operations. Ideally, each matrix entry is computed by **one thread**, which corresponds to **one GPU core**. Every GPU program includes **two parts**: a **host** part running on the **CPU**, and **kernel** code running on the **device (the GPU cores)**. **GPU kernel functions** are executed in parallel and are marked by the specifier **\_\_global\_\_**. We analyzed the mathematical principle and the matrix operations needed to calculate the kinship matrix, and wrote **4 parallel GPU kernel** functions: the **transpose** of a matrix, the **multiplication** of two matrices, the **sum** of two matrices, and the **normalization** of the summed kinship matrix. **Fig. 2** depicts the GPU-powered parallel pipeline architecture for main-effect kinship matrix calculation: the coded genotypic markers are partitioned into successive blocks, a sub-kinship matrix is calculated for each block, and the sub-kinship matrices are merged into the whole.

**Fig. 2 GPU-powered parallel pipeline architecture for main-effect kinship matrix calculation**
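
To make the roles of the four operations concrete, the following is a CPU-only Python sketch of the transpose, multiply, sum, and normalize steps chained in the order the pipeline describes. It is a hedged illustration, not the actual GPU kernels: all function names are hypothetical, each list-comprehension entry stands in for the work one GPU thread would do, and the normalization factor (the mean of the diagonal of the summed matrix) is an assumption, since the text does not state which factor KMC1D uses.

```python
def transpose(a):
    """Return the transpose of matrix `a` (one output entry per 'thread')."""
    return [[a[r][c] for r in range(len(a))] for c in range(len(a[0]))]


def multiply(a, b):
    """Return the matrix product a * b; each entry is one dot product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Return the entry-wise sum of two matrices of equal shape."""
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))] for i in range(len(a))]


def normalize(k):
    """Scale `k` by the mean of its diagonal (ASSUMED normalization factor)."""
    n = len(k)
    scale = sum(k[i][i] for i in range(n)) / n
    return [[k[i][j] / scale for j in range(n)] for i in range(n)]


def kinship_pipeline(blocks):
    """Chain the four steps over successive marker blocks."""
    total = None
    for block in blocks:  # block: markers x individuals
        sub = multiply(transpose(block), block)  # n x n sub-kinship matrix
        total = sub if total is None else add(total, sub)
    return normalize(total)


# Two toy blocks (hypothetical coded values), 2 markers x 3 individuals each.
blocks = [
    [[1, 0, -1], [0, 1, 1]],
    [[-1, -1, 0], [1, 1, 0]],
]
K = kinship_pipeline(blocks)
for row in K:
    print(row)
```

With the assumed normalization, the diagonal of `K` averages to 1. A different factor would change only the `normalize` step, not the block-wise structure.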

To load the **huge genotypic data** efficiently, we developed a module that runs in an **HTML5 browser** and performs **resumable, multithreaded, chunked** data uploading.

Equipped with **GPU** parallel computing, **KMC1D** can calculate the **main-effect kinship matrix** for the given coded genotype data. It needs either the **additive effect matrix (Z_matrix)** or the **dominance effect matrix (W_matrix)** file. **Fig. 3** provides a snapshot of the user interface for additive-effect kinship matrix calculation. **Note: The input genotypic matrix file must be a comma- or Tab-delimited text file, stored as m (rows, markers) x n (columns, individuals/accessions); each row corresponds to one marker.**

**Fig. 3 The user interface for additive-effect kinship matrix calculation**
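
As an illustration of the input layout in the note above, here is a minimal Python sketch that reads a comma- or Tab-delimited marker-by-individual file one block of rows at a time. The function name `read_marker_blocks` and the sample data are hypothetical, and the sketch assumes the file contains only numeric coded values, with no header line or ID column. The actual KMC1D parser may differ.

```python
import io
import re


def read_marker_blocks(handle, block_size):
    """Yield successive blocks of marker rows from a delimited text stream.

    Each non-empty line is one marker; its fields (split on comma or Tab)
    are the coded values of the individuals. Assumes purely numeric fields.
    """
    block = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        block.append([float(value) for value in re.split(r"[,\t]", line)])
        if len(block) == block_size:
            yield block
            block = []
    if block:  # final, possibly shorter block
        yield block


# In-memory stand-in for a small file: 5 markers (rows) x 3 individuals (columns).
sample = io.StringIO("1,0,-1\n0,1,1\n-1,-1,0\n1,1,0\n0,-1,1\n")
blocks = list(read_marker_blocks(sample, block_size=2))
block_sizes = [len(b) for b in blocks]
print(block_sizes)
```

Because the function is a generator, a real file opened with `open(path)` would be consumed one block at a time rather than loaded whole.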

**1.** Xu, S., "Mapping Quantitative Trait Loci by Controlling Polygenic Background Effects". Genetics, 2013. 195(4):p.1709-23.

**2.** Zhang W., Dai X., Wang Q., Xu S., Zhao P.X., "PEPIS: A Pipeline for Estimating Epistatic Effects in Quantitative Trait Locus Mapping and Genome-Wide Association Studies", 2016. PLoS Comput Biol, 12(5)

**3.** Cecilia J. M., García J. M., and Ujaldon M., "The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions", in Parallel Computing: From Multicores and GPU's to Petascale, B. Chapman et al., Eds. Advances in Parallel Computing, vol. 19, pp. 331-340, 2010.

**4.** Dobravec T., Bulic P., "Comparing CPU and GPU Implementations of a Simple Matrix Multiplication Algorithm", IJCEE, vol 9, 430-438, 2017.