Hello,

A few days ago I posted a message asking for help with a program that would
estimate or determine individual haplotypes (i.e., gametic phase) from a set
of direct sequences of heterozygote individuals.

Thank you so much to those of you who took the time to reply, particularly
to those friends who went out of their way to highlight potential problems
and alternatives. BTW, I will reply personally :^)#. I have compiled excerpts
about the suggested programs below for the benefit of the community.

Thanx again and cheers!

Axa

ABOUT PHASE AND fastPHASE

From: John C Garza [mailto:carlosjg@ucsc.edu] 

PHASE by Matt Stephens (University of Chicago) is the best program for
determining the population frequencies of haplotypes and therefore
determining the most likely phase of heterozygous sequences. However, it
(and all other phasing algorithms) can run into trouble with MHC, since
there is so much gene conversion/recombination.

In what species are you studying MHC? We minimize sequencing by using SSCP
with SYBR Gold staining to determine mobility patterns for select sequenced
alleles, then doing population genetic screening with SSCP.

 

From: Darren Obbard [mailto:dobbard@staffmail.ed.ac.uk] 

PHASE and fastPHASE from Matthew Stephens

http://stephenslab.uchicago.edu/software.html

are both widely used and trusted.

 

From: Magdalena Zarowiecki [mailto:m.zarowiecki@nhm.ac.uk] 

I've been spending a lot of time with this problem.

The most commonly used software is PHASE, but it doesn't work for my
dataset, as it is too big (300 taxa, 80 variable characters). Instead I've
been using fastPHASE. It is available for Mac, PC and Linux, but I only ever
got the Linux version to work. I have also used the ELB algorithm in
Arlequin and a new program called BEAGLE, all of which are relatively
hassle-free.

You can convert your data to PHASE/fastPHASE input on this website: 

http://www.mnhn.fr/jfflot/seqphase/

There is a nice paper that came out last year that compares several
different phase-solving algorithms, and also suggests a method for solving a
problem just like yours, complete with some useful Perl scripts.

 

From: Michael Sorenson [mailto:msoren@bu.edu] 

Not sure how well the software will work on MHC, but please see the
following paper for a favorable evaluation of PHASE:

Harrigan, R.J., M.E. Mazza & M.D. Sorenson. 2008. Computation versus
cloning: evaluation of two methods for haplotype determination. Molecular
Ecology Resources 8: 1239-1248. 

ABOUT DnaSP/PHASE

 

From: Mark Chapman [mailto:mchapman@plantbio.uga.edu] 

The PHASE algorithm in DnaSP will do this. You make a FASTA file of your
(aligned) genotypes and open it as "unphase/genotype data file". I have come
across some alignments that it can't handle for whatever reason and the
program crashes, but this is only maybe 5% of the time and is probably due
to very few homozygous genotypes. Each file typically takes 1 minute to 1
hour to run, so it can be frustrating to do this several times. A significant
advantage is that once this has been done in DnaSP you can export several
file types, like NEXUS, to use in other programs.

See our Plant Cell paper for a little more info
(http://markachapman.googlepages.com/ChapmanPlantCell2008.pdf).

 

From: John Wares [mailto:jpwares@uga.edu] 

You can use software like PHASE - it is currently fully implemented in
DnaSP, which makes life MUCH easier than only a few months ago when
you had to know or use a little Perl to get data in and out for
analysis.  It is a likelihood-based analysis that considers the  
frequency of homozygous alleles at each site as well as the  
combinations of alleles to infer haplotypes directly from sequences.

So, that is the easy answer to your question.  The thing I'm worried  
about with doing this for MHC data is that probably every individual  
is a heterozygote, so there will be no non-statistical inference of  
haplotypes, and they are all heterozygotes because MHC harbors so many  
alleles - maybe hundreds, right?  Depends on what you are working on
these days but even the Devils Hole pupfish carries a ton of diversity.
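
A small Python sketch of why this matters, assuming heterozygous sites in the direct sequences are coded with IUPAC two-base ambiguity codes. An individual with h heterozygous sites is compatible with 2^(h-1) haplotype pairs, so only individuals with zero or one heterozygous site can be phased by inspection. The function names and toy sequences are made up for illustration and are not part of any of the programs mentioned:

```python
# IUPAC codes that stand for exactly two bases, i.e. a heterozygous site
# in a direct sequence of a diploid individual.
HET_CODES = set("RYSWKM")


def count_het_sites(seq):
    """Number of positions coded as two-base ambiguities."""
    return sum(1 for base in seq.upper() if base in HET_CODES)


def possible_haplotype_pairs(seq):
    """Distinct haplotype pairs compatible with one direct sequence."""
    h = count_het_sites(seq)
    return 1 if h == 0 else 2 ** (h - 1)


# Toy data (hypothetical individuals, not real MHC sequences).
genotypes = {
    "ind1": "ACGTACGT",  # homozygote: phase known
    "ind2": "ACRTACGT",  # one het site: phase known
    "ind3": "ACRTAYGW",  # three het sites: several compatible pairs
}

for name, seq in genotypes.items():
    print(name, count_het_sites(seq), possible_haplotype_pairs(seq))
```

If every individual in the data set falls into the last category, the phasing program has no unambiguous haplotypes to anchor its inference on.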

PHASE software works great when you can count the number of  
segregating sites on your hands, but I think you are going to have a  
great deal of uncertainty once you cross that threshold.  Plus, there  
is the issue of what is a heterozygous site - we all know that
sequencing is usually pretty clean, but there are probably a few sites -
even in haploid mtDNA - that are ambiguous.  You would want to sequence in
both directions to eliminate most of the sequencing/coding errors, and  
maybe multiple sequences per individual, because each of those  
chemistry errors (not actual heterozygous sites) will be used by PHASE  
to infer yet another segregating site and thus another haplotype.  You  
might also consider using high fidelity polymerase.  But, of course,  
these things cause the expense and/or time to go up, which is what  
you're trying to avoid.  The other thing that you may be able to do -  
depending on your question - is consider any very-low-frequency  
variant (as in, only a single inferred haplotype in your data set) an  
'unknown'.  Singletons are unlikely to affect your end analysis much  
(again, depends on question) and that helps eliminate things that only  
popped up because of sequencing error (but this is just a somewhat  
messy solution).
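
The singleton suggestion could be sketched roughly like this in Python, assuming the inferred haplotypes are already available as one string per chromosome (e.g., exported after phasing). The function name, the 'unknown' label and the toy data are made up for illustration:

```python
from collections import Counter


def mask_singletons(haplotypes, label="unknown"):
    """Replace every haplotype seen only once in the data set with `label`."""
    counts = Counter(haplotypes)
    return [h if counts[h] > 1 else label for h in haplotypes]


# Toy inferred haplotypes, two per individual (hypothetical data).
inferred = ["ACGT", "ACGT", "ACTT", "ACGT", "GCGT", "ACTT"]
print(mask_singletons(inferred))
```

Whether masking is acceptable depends on the question being asked, as noted above; for rare-allele analyses it would throw away exactly the signal of interest.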
 

ABOUT HAPLOTYPER

 

From: Adkins, Ron [mailto:radkins1@utmem.edu] 

Probably the most popular (and easy to use) program is PHASE by
Matthew Stephens.
http://stephenslab.uchicago.edu/software.html

 

Other programs are HAPLOTYPER

http://www.people.fas.harvard.edu/~junliu/Haplo/docMain.htm

and PLEM. There are many more, but it would probably be simplest for you
to just go with PHASE. They all have similar accuracy. Given that you
directly sequenced what I assume are PCR products, I think you will end up
having to treat your data as if you had collected SNP genotype data at
several sites directly.
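
Treating direct sequences as SNP genotypes could look something like the following Python sketch. It assumes aligned sequences of equal length with heterozygous sites coded as IUPAC two-base ambiguities. The function name and toy alignment are made up, and the result is a generic table, not the input format of PHASE or any other program:

```python
# IUPAC two-base ambiguity codes mapped to the pair of alleles they represent.
IUPAC_PAIRS = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def to_snp_genotypes(alignment):
    """Return (variable_positions, genotypes) from aligned direct sequences.

    alignment: dict of name -> sequence (all the same length).
    genotypes: dict of name -> list of (allele1, allele2), one per variable site.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("sequences must be aligned to the same length")

    def alleles(base):
        # Any other character (including N or gaps) is passed through as a
        # homozygote here; real data would need proper missing-data handling.
        base = base.upper()
        return IUPAC_PAIRS.get(base, (base, base))

    # A site is variable if more than one allele is seen across individuals.
    variable = []
    for pos in range(length):
        seen = set()
        for n in names:
            seen.update(alleles(alignment[n][pos]))
        if len(seen) > 1:
            variable.append(pos)

    genotypes = {n: [alleles(alignment[n][p]) for p in variable] for n in names}
    return variable, genotypes


# Toy alignment (hypothetical individuals).
aln = {"ind1": "ACGTAC", "ind2": "ACRTAC", "ind3": "ATGTAY"}
positions, gts = to_snp_genotypes(aln)
print(positions)
for name, gt in gts.items():
    print(name, gt)
```

Positions are 0-based indices into the alignment, and each genotype is an unordered pair, since phase is exactly what is not yet known.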

 

ABOUT ARLEQUIN

From: Alexander Weigand [mailto:WeigandA@gmx.net] 
Perhaps the program Arlequin will be of use to you?

http://cmpg.unibe.ch/software/arlequin3/#Implemented%20methods

 

ABOUT HAPSTAT
 

From: Steve McKechnie [mailto:stephen.mckechnie@sci.monash.edu.au] 

HAPSTAT should do the job - you would need to feed in the diploid state 
of all variable sites, in sequential order.

 

ABOUT RSCA

 

From: Kraus, Robert [mailto:robert.kraus@wur.nl]

If times are tough you might look at an alternative method called RSCA:
Worley K, Gillingham M, Jensen P, et al. (2008) Single locus typing of MHC
class I and class IIB loci in a population of red jungle fowl.
Immunogenetics 60, 233-247.

 

 

ABOUT CHAMPURU

 

From: Jean-François Flot [mailto:jflot@uni-goettingen.de] 

If your MHC alleles are of variable lengths, my program Champuru
(http://www.mnhn.fr/jfflot/champuru) may be exactly what you are looking for
since it extracts directly the gametic phases of length-variant
heterozygotes from the patterns of double peaks in the corresponding forward
and reverse chromatograms. To find out the phases of heterozygotes whose two
haplotypes have the same length, you could use PHASE (see
http://www.mnhn.fr/jfflot/seqphase for a web tool that generates PHASE input
files from FASTA alignments and converts PHASE output files back into FASTA)
or resequence your PCR products with allele-specific primers.

 

 

ABOUT GENEIOUS PLUGIN

 

From: Eamonn Mallon [mailto:ebm3@leicester.ac.uk] 

I know Geneious has a heterozygote plugin, but I haven't used it myself.

http://www.biomatters.com/default,390,plugins.sm

 

ABOUT CODONCODE ALIGNER

 

From: Nicholas Crawford [mailto:ngcrawfo@bu.edu] 

I think CodonCode Aligner will do it.
http://www.codoncode.com/aligner/
 

 

ABOUT CVHAPLOT

 

From: Julie B. Hebert [mailto:byrdie@umd.edu] 

I've been working on the same problem (different gene) and have found the
following program useful: CVHaplot.
http://www.ipm.ioz.ac.cn/them_zhangdexing/CVhaplot.htm
I'm attaching the reference where I learned about it.  Apparently a new
version is coming out next week, so I would wait until then to check it out.
Also, in the paper, they compare many different phase programs, and have
websites for all of them at the end, so it is a great reference!

OTHER RELEVANT COMMENTS

From: Mark McMullan [mailto:M.Mc-Mullan@biosci.hull.ac.uk] 

I too work on MHC and would be keen to hear a little more detail about the
specifics of your project.  I use a protocol to estimate the number of
clones I require to find all alleles with 95% certainty.  I have seen
software that claims to be able to separate heterozygous sequences into
their constituent versions but the example shown to me was used to separate
indel sequences.  While this is very useful, indel sequences have certain
properties that I can imagine would be used to better separate such
sequences.  With MHC alleles sharing motifs between alleles I can't see how
separation of sequences could be done reliably.  But I remain hopeful. If
you receive anything useful on this I'd be grateful if you could share it
with me.  Nevertheless, it would still be good to hear a bit more about your
project; I might be able to make some suggestions.
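
The protocol itself is not described in this message, but one simple way to arrive at a number of that kind is sketched below in Python. It assumes an individual carries k distinct alleles and that every clone is an independent, equally likely draw from them (a strong assumption that ignores PCR and cloning bias). The chance of having seen all k alleles after n clones then follows from inclusion-exclusion. The function names are made up for illustration:

```python
from math import comb


def prob_all_alleles_seen(k, n):
    """P(all k equally likely alleles appear among n independent clones)."""
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** n for j in range(k + 1))


def min_clones(k, certainty=0.95):
    """Smallest number of clones giving at least `certainty` of seeing all k."""
    n = k
    while prob_all_alleles_seen(k, n) < certainty:
        n += 1
    return n


print(min_clones(2))  # single-locus heterozygote
print(min_clones(4))  # e.g. two co-amplified loci, both heterozygous
```

Under these assumptions a single-locus heterozygote needs 6 clones for 95% certainty; unequal amplification of alleles would push the number higher.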

 

 

From: Edward Grant [mailto:edpgrant@gmail.com] 

See the paper by Huang et al in Mol Ecol:

Haplotype reconstruction for scnp DNA: a consensus vote approach with
extensive sequence data from populations of the migratory locust (Locusta
migratoria).

Mol Ecol. 2008 Apr;17(8):1930-47.

http://www.ncbi.nlm.nih.gov/pubmed/18346127

 

 

From: Taylor, Jerry F. [mailto:taylorjerr@missouri.edu] 

We use fastPHASE for this kind of analysis. However, the trouble with the MHC
is that it is very heterozygous, and the accuracy of fastPHASE can be limited
if you do not have a large number of individuals genotyped.

 

 

Axayacatl Rocha-Olivares, Ph.D.
Biological Oceanography Department
CICESE
P. O. Box 434844
San Diego, CA, 92143-4844

 

DOMESTIC:
Apartado Postal 360
Ensenada, Baja California, CP 22830
Mexico

 

COURIER:
Km 107 Carretera Tijuana-Ensenada
Ensenada, Baja California, CP 22860
Mexico

 

Office: +52(646)175-0500 (ext. 24240)
Lab:    +52(646)175-0500 (ext. 24318)
Fax:     +52(646)175-0587

 

Email: arocha@cicese.mx 
http://dob.cicese.mx/labs/ecolmolecular/index.html

 

 
