Hello,

A few days ago I posted a message asking for help with a program that would estimate or determine individual haplotypes (i.e., gametic phase) from a set of direct sequences of heterozygote individuals. Thank you so much to those of you who took the time to reply, particularly to those friends who went out of their way to highlight potential problems and alternatives. BTW, I will reply personally :^)#. I have compiled excerpts for the suggested programs below for the benefit of the community.

Thanx again and cheers!

Axa

ABOUT PHASE AND fastPHASE

From: John C Garza [mailto:carlosjg@ucsc.edu]

PHASE by Matt Stephens (University of Chicago) is the best program for determining the population frequencies of haplotypes and therefore determining the most likely phase of heterozygous sequences. However, it (and all other phasing algorithms) can run into trouble with MHC, since there is so much gene conversion/recombination. In what species are you studying MHC? We minimize sequencing by using SSCP with SYBR Gold staining to determine mobility patterns for select sequenced alleles, and then do population genetic screening with SSCP.

From: Darren Obbard [mailto:dobbard@staffmail.ed.ac.uk]

PHASE and fastPHASE from Matthew Stephens (http://stephenslab.uchicago.edu/software.html) are both widely used and trusted.

From: Magdalena Zarowiecki [mailto:m.zarowiecki@nhm.ac.uk]

I've been spending a lot of time with this problem. The most commonly used software is PHASE, but it doesn't work for my dataset, as it is too big (300 taxa, 80 variable characters). Instead I've been using fastPHASE; it is available for Mac, PC and Linux, but I only ever got the Linux version to work. I have also used the ELB algorithm in Arlequin, and a new program called BEAGLE, all of which are relatively hassle-free.
You can convert your data to PHASE/fastPHASE input on this website: http://www.mnhn.fr/jfflot/seqphase/

There is a nice paper that came out last year that compares several different phase-solving algorithms, and also suggests a method for solving a problem just like yours, complete with some useful Perl scripts.

From: Michael Sorenson [mailto:msoren@bu.edu]

Not sure how well the software will work on MHC, but please see the following paper for a favorable evaluation of PHASE:

Harrigan, R.J., M.E. Mazza & M.D. Sorenson. 2008. Computation versus cloning: evaluation of two methods for haplotype determination. Molecular Ecology Resources 8: 1239-1248.

ABOUT DnaSP/PHASE

From: Mark Chapman [mailto:mchapman@plantbio.uga.edu]

The PHASE algorithm in DnaSP will do this. You make a FASTA file of your (aligned) genotypes and open it as "unphase/genotype data file". I have come across some alignments that it can't handle for whatever reason and the program crashes, but this happens only maybe 5% of the time and is probably due to very few homozygous genotypes. Each file typically takes 1 minute to 1 hour to run, so it can be frustrating to do this several times. A significant advantage is that once this has been done in DnaSP you can export several file types, like NEXUS, to use in other programs. See our Plant Cell paper for a little more info (http://markachapman.googlepages.com/ChapmanPlantCell2008.pdf).

From: John Wares [mailto:jpwares@uga.edu]

You can use software like PHASE - it is currently fully implemented in DnaSP, which makes life MUCH easier than only a few months ago, when you had to know or use a little Perl to get data in and out for analysis. It is a likelihood-based analysis that considers the frequency of homozygous alleles at each site as well as the combinations of alleles to infer haplotypes directly from sequences. So, that is the easy answer to your question.
The thing I'm worried about with doing this for MHC data is that probably every individual is a heterozygote, so there will be no non-statistical inference of haplotypes, and they are all heterozygotes because MHC harbors so many alleles - maybe hundreds, right? Depends on what you are working on these days but even devil's hole pupfish carries a ton of diversity. PHASE software works great when you can count the number of segregating sites on your hands, but I think you are going to have a great deal of uncertainty once you cross that threshold.

Plus, there is the issue of what is a heterozygous site - we all know that sequencing is usually pretty clean, but probably a few sites - even in haploid mtDNA - that are ambiguous. You would want to sequence in both directions to eliminate most of the sequencing/coding errors, and maybe multiple sequences per individual, because each of those chemistry errors (not actual heterozygous sites) will be used by PHASE to infer yet another segregating site and thus another haplotype. You might also consider using high fidelity polymerase. But, of course, these things cause the expense and/or time to go up, which is what you're trying to avoid.

The other thing that you may be able to do - depending on your question - is consider any very-low-frequency variant (as in, only a single inferred haplotype in your data set) an 'unknown'. Singletons are unlikely to affect your end analysis much (again, depends on question) and that helps eliminate things that only popped up because of sequencing error (but this is just a somewhat messy solution).

ABOUT HAPLOTYPER

From: Adkins, Ron [mailto:radkins1@utmem.edu]

Probably the most popular program (and easy to use) is Phase by Matthew Stephens.
http://stephenslab.uchicago.edu/software.html

Other programs are haplotyper
http://www.people.fas.harvard.edu/~junliu/Haplo/docMain.htm
and plem. There are many more, but it would probably be simplest for you to just go with Phase.
They all have a similar accuracy. Given that you directly sequenced what I assume are PCR products, I think you will end up having to treat your data as if you collected SNP genotypic data at several sites directly.

ABOUT ARLEQUIN

From: Alexander Weigand [mailto:WeigandA@gmx.net]

Perhaps the program Arlequin will be of some use to you?
http://cmpg.unibe.ch/software/arlequin3/#Implemented%20methods

ABOUT HAPSTAT

From: Steve McKechnie [mailto:stephen.mckechnie@sci.monash.edu.au]

HAPSTAT should do the job - you would need to feed in the diploid state of all variable sites, in sequential order.

ABOUT RSCA

From: Kraus, Robert [mailto:robert.kraus@wur.nl]

If times are tough you might look at an alternative method called RSCA:

Worley Ka., Gillingham M, Jensen P, et al. (2008) Single locus typing of MHC class I and class IIB loci in a population of red jungle fowl. Immunogenetics 60, 233-247.

ABOUT CHAMPURU

From: Jean-François Flot [mailto:jflot@uni-goettingen.de]

If your MHC alleles are of variable lengths, my program Champuru (http://www.mnhn.fr/jfflot/champuru) may be exactly what you are looking for, since it extracts the gametic phases of length-variant heterozygotes directly from the patterns of double peaks in the corresponding forward and reverse chromatograms. To find out the phases of heterozygotes whose two haplotypes have the same length, you could use PHASE (see http://www.mnhn.fr/jfflot/seqphase for a web tool that generates PHASE input files from FASTA alignments and converts PHASE output files back into FASTA) or resequence your PCR products with allele-specific primers.

ABOUT GENEIOUS PLUGIN

From: Eamonn Mallon [mailto:ebm3@leicester.ac.uk]

I know Geneious has a heterozygote plugin, but I haven't used it myself.
http://www.biomatters.com/default,390,plugins.sm

ABOUT CODON CODE ALIGNER

From: Nicholas Crawford [mailto:ngcrawfo@bu.edu]

I think CodonCode Aligner will do it.
http://www.codoncode.com/aligner/

ABOUT CVHAPLOT

From: Julie B. Hebert [mailto:byrdie@umd.edu]

I've been working on the same problem (different gene) and have found the following program useful: CVHaplot.
http://www.ipm.ioz.ac.cn/them_zhangdexing/CVhaplot.htm

I'm attaching the reference where I learned about it. Apparently a new version is coming out next week, so I would wait until then to check it out. Also, in the paper, they compare many different phase programs, and have websites for all of them at the end, so it is a great reference!

OTHER RELEVANT COMMENTS

From: Mark McMullan [mailto:M.Mc-Mullan@biosci.hull.ac.uk]

I too work on MHC and would be keen to hear a little more detail about the specifics of your project. I use a protocol to estimate the number of clones I require to find all alleles with 95% certainty. I have seen software that claims to be able to separate heterozygous sequences into their constituent versions, but the example shown to me was used to separate indel sequences. While this is very useful, indel sequences have certain properties that I can imagine would be used to better separate such sequences. With MHC alleles sharing motifs between alleles, I can't see how separation of sequences could be done reliably. But I remain hopeful; if you receive anything useful on this I'd be grateful if you could share it with me. Nevertheless, it would still be good to hear a bit more about your project, as I might be able to make some suggestions.

From: Edward Grant [mailto:edpgrant@gmail.com]

See the paper by Huang et al. in Mol Ecol:

Haplotype reconstruction for scnp DNA: a consensus vote approach with extensive sequence data from populations of the migratory locust (Locusta migratoria). Mol Ecol. 2008 Apr;17(8):1930-47.
http://www.ncbi.nlm.nih.gov/pubmed/18346127

From: Taylor, Jerry F. [mailto:taylorjerr@missouri.edu]

We use fastPHASE for this kind of analysis.
However the trouble with the MHC is that it is very heterozygous and the accuracy of fastPHASE can be limited if you do not have a large number of individuals genotyped.

Axayacatl Rocha-Olivares, Ph.D.
Biological Oceanography Department
CICESE
P. O. Box 434844
San Diego, CA, 92143-4844

DOMESTIC:
Apartado Postal 360
Ensenada, Baja California, CP 22830
Mexico

COURIER:
Km 107 Carretera Tijuana-Ensenada
Ensenada, Baja California, CP 22860
Mexico

Office: +52(646)175-0500 (ext. 24240)
Lab: +52(646)175-0500 (ext. 24318)
Fax: +52(646)175-0587
Email: arocha@cicese.mx
http://dob.cicese.mx/labs/ecolmolecular/index.html
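
P.S. Several replies above suggest treating direct sequences as SNP genotypes at the variable sites (e.g. feeding a program "the diploid state of all variable sites, in sequential order"). The following is only a rough sketch of that preprocessing step, not the input routine of any of the programs mentioned. It assumes the aligned sequences code heterozygous positions with the standard two-base IUPAC ambiguity letters (R, Y, S, W, K, M); the function names and the sample data are hypothetical, and anything else (gaps, N, three-base codes) is simply treated as missing.

```python
# Assumption: heterozygous sites are written as two-base IUPAC ambiguity codes.
IUPAC_HET = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
PLAIN_BASES = {"A", "C", "G", "T"}


def site_genotype(base):
    """Return the unphased diploid genotype at one site as a pair of alleles."""
    base = base.upper()
    if base in IUPAC_HET:
        return IUPAC_HET[base]      # heterozygote: two different alleles
    if base in PLAIN_BASES:
        return (base, base)         # homozygote
    return ("?", "?")               # gap, N, etc.: treated as missing


def variable_site_genotypes(aligned):
    """Find the variable columns of an alignment and list genotypes at them.

    aligned: dict mapping individual name -> aligned sequence (equal lengths).
    Returns (positions, genotypes): positions are 0-based column indices,
    genotypes maps each individual to its list of allele pairs, in site order.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n_sites = lengths.pop()

    positions = []
    for i in range(n_sites):
        alleles = set()
        for seq in aligned.values():
            alleles.update(a for a in site_genotype(seq[i]) if a != "?")
        if len(alleles) > 1:        # more than one allele seen: variable site
            positions.append(i)

    genotypes = {
        name: [site_genotype(seq[i]) for i in positions]
        for name, seq in aligned.items()
    }
    return positions, genotypes


# Hypothetical toy alignment, for illustration only.
sample = {"ind1": "ACGTR", "ind2": "ACGTA", "ind3": "ACYTG"}
positions, genotypes = variable_site_genotypes(sample)
```

The output still has to be written in whatever format the chosen program expects (the seqphase web tool mentioned above does this for PHASE), and sites that are ambiguous only because of sequencing noise will show up here as heterozygous, which is the concern raised in the reply about sequencing in both directions.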