Incremental Haplotype Inference

Peter Damaschke

What the project is about

We address the combinatorial problem of inferring the unknown haplotypes in a population, given a sample of genotypes, under the assumption that the population forms a perfect phylogeny. This is important because physical haplotyping by DNA sequencing is expensive, whereas genotypes are easier to obtain. Perfect phylogenies appear naturally and quite frequently. We developed a simple algorithm that identifies all sufficiently frequent haplotypes in a random sample of genotypes. It builds the solution incrementally, that is, loci are added one-by-one in a given ordering. Missing data can also be recovered if they are not too prevalent. Moreover, the idea of the algorithm would easily extend to more general population structures.

Paper

The algorithm is described in a paper presented at the 2nd RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, 2004. Here is an extended version. (It has never appeared as a regular publication later on. Ask me if you want to know the details of the distressing and hair-raising story.)

Implementation

Michael Ewald has implemented the algorithm as a Java application (Final Year Individual Project for the Imperial College London). William Garner has implemented an improved version within his master thesis work.

In order to increase the rate of recognized haplotypes from smaller samples, the algorithm is applied to several random orderings of loci, and resulting partial haplotypes are combined. More inference rules than the two local rules from the paper have further increased the performance.

Extensive testing has been done. In a large trial to decode 30 haplotypes from 100 genotypes with 10% missing entries (simulated data), no incorrect haplotypes were found in any run.

Download

Get the version of February 2005 as a zip file. Unzip and compile as usual. A folder called Suite2 is created, with all Java files.

The Suite contains the genotype decoder (the algorithm) along with a data set generator that can produce simulated genotype data for demonstration purposes. In order to work with real data, load your preprocessed genotype files and use the genotype decoder only. Note that you must provide input (the genotype sample) as a file of equally long words with characters 0,1,2 which have the usual meaning as described in the aforementioned papers. Any other symbol is considered as a missing item.