Incremental Haplotype Inference
Peter Damaschke
What the project is about
We address the combinatorial problem of inferring the unknown haplotypes in a
population, given a sample of genotypes, under the assumption that the
population forms a perfect phylogeny. This is important because physical
haplotyping by DNA sequencing is expensive, whereas genotypes are easier to
obtain. Perfect phylogenies appear naturally and quite frequently.
We developed a simple algorithm that identifies all sufficiently frequent
haplotypes in a random sample of genotypes. It builds the solution
incrementally, that is, loci are added one-by-one in a given ordering. Missing
data can also be recovered if they are not too prevalent. Moreover, the idea of the algorithm would easily extend to more general population structures.
Paper
The algorithm is described in a paper presented at the
2nd RECOMB Satellite Workshop on Computational Methods for SNPs and
Haplotypes, 2004. Here is an extended version.
(It has never appeared as a regular publication later on. Ask me if you want
to know the details of the distressing and hair-raising story.)
Implementation
Michael Ewald has implemented the algorithm as a Java application
(Final Year Individual Project for the Imperial College London). William Garner
has implemented an improved version within his master thesis work.
In order to increase the rate of recognized haplotypes from smaller samples,
the algorithm is applied to several random orderings of loci, and resulting
partial haplotypes are combined. More inference rules than the two local
rules from the paper have further increased the performance.
Extensive testing has been done. In a large trial to decode 30 haplotypes from
100 genotypes with 10% missing entries (simulated data), no incorrect haplotypes
were found in any run.
Download
Get the version of February 2005 as a zip
file. Unzip and compile as usual. A folder called Suite2 is created,
with all Java files.
The Suite contains the genotype decoder (the algorithm) along with a data set
generator that can produce simulated genotype data for demonstration purposes.
In order to work with real data, load your preprocessed genotype files and use
the genotype decoder only. Note that you must provide input (the genotype
sample) as a file of equally long words with characters 0,1,2 which have the
usual meaning as described in the aforementioned papers. Any other symbol is
considered as a missing item.