Thank you for your many helpful responses to my original post.
Many people responded to request that I share the outcome.
Hence, copies follow, in no particular order, with email addresses,
salutations, signatures, and affiliations scrubbed (as a spam/
phishing prevention measure).
In hindsight, I ought to have clarified that I am working on
whole-genome alignments of the hepatitis C virus, which spans both
coding and non-coding regions, and encodes one polypeptide without
introns or rearrangements. The sequences are ~9.5K nt for ~30 taxa.
Per the moderator's instructions, any discussion should be
taken off-list to the USENET/Google groups...
Best regards,
Peter Hraber
Do you know of a multiple sequence alignment algorithm that uses an
objective goodness-of-fit criterion? I am presently working with
clustalw(x) and t-coffee. Both use penalty scores for mismatches.
However, I see no way to compare the quality of alignments obtained
with varied parameters.
I'm surprised that no one has applied an information-based
model-selection criterion (AIC, BIC, MDL) to this problem. Am I
mistaken? How do you know when an alignment is near optimal? Is this
an entirely subjective decision?
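For concreteness, the information criteria named above are penalized log-likelihoods. The sketch below is only an illustration, assuming a probabilistic alignment model that reports a maximized log-likelihood and a parameter count; the function names and all numbers are hypothetical and do not come from any of the programs discussed here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical values: two candidate alignments scored under models with
# different numbers of free parameters.
candidates = {
    "alignment_a": {"log_likelihood": -52310.4, "n_params": 12},
    "alignment_b": {"log_likelihood": -52305.9, "n_params": 20},
}
n_obs = 9500  # assumed: number of sites treated as observations

for name, c in candidates.items():
    print(name,
          aic(c["log_likelihood"], c["n_params"]),
          bic(c["log_likelihood"], c["n_params"], n_obs))
```

Lower values are preferred under either criterion. What counts as an observation (columns, sites, or sequences) is itself a modelling decision.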
>From Jeff Blanchard:
We use a partial order graph approach implemented in
POA (http://www.bioinformatics.ucla.edu/poa/) for some of our work.
However, this is mainly because it scoots the unaligned sequences to
the end for easy trimming with the initial settings we tried. I have
also heard that some of the EST alignment methods do something similar.
There is no optimal alignment. Usually the goal is to try and align
sites that share a common ancestry. Thus there is a circular problem:
one wants to use a tree to build the alignment, and then to use the
alignment to build a tree. The use of Bayesian approaches that allow
the incorporation of prior knowledge is rapidly changing this field,
but there is no easy answer.
>From Ilya Nemenman:
Another objective, and formally very sound, way to quantify the
sequence similarity is to measure the length of the shortest
program that can transform one sequence into another. This is
obviously related to Kolmogorov complexity, and Li and
Vitanyi, some of the biggest names in the theory of
algorithmic complexity in recent years, have worked on
this subject. The algorithm, described in
http://www.ams.org/mathscinet-getitem?mr=2005f%3A68028 works
quite nicely for phylogeny studies. There might have been
later works on this approach, but I am, unfortunately,
unfamiliar with them.
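A rough way to experiment with this idea is the normalized compression distance, which replaces the uncomputable Kolmogorov complexity with the output size of a real compressor. The sketch below uses zlib as a stand-in compressor; that choice, the function names, and the random test sequences are assumptions for illustration, not the method of the cited paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical test data: two unrelated random nucleotide sequences.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000))
b = "".join(rng.choice("ACGT") for _ in range(2000))

print("identical:", ncd(a, a))
print("unrelated:", ncd(a, b))
```

Values near 0 indicate that one sequence is cheap to describe given the other; values near 1 indicate little shared structure. A general-purpose compressor is a weak model of nucleotide sequences, so results from this sketch should be read as qualitative.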
>From Joe Felsenstein:
Why do you think that the issue has never been addressed? There is an
active literature on statistical approaches to multiple sequence
alignment, using models of insertion and deletion (there are major
unsolved problems, including treatment of multiple-base insertions and
deletions). The literature goes back to Bishop and Thompson 1986 and
to Thorne et al. 1991 and 1992. In addition there is a minimum
message length approach of Allison and Yee (1990). References are in
my phylogenies book (2004), chapter 29.
>From Will Fischer:
This is an excellent question, and there is distressingly little
formal work in the area. Most "alignment accuracy" studies simply
compare the alignments from algorithm A or algorithm B to a set of
hand-curated alignments, e.g. BaliBase.
One objective criterion for alignment quality is the length of the
most parsimonious tree derived from the alignment -- this is
unfortunately too time-consuming to compute for each step of an
alignment procedure, but can be used to compare alignments from
different methods.
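Scoring a given alignment on a given tree is the cheap part of this criterion. The sketch below applies Fitch's small-parsimony algorithm to a fixed rooted binary tree; it does not search for the most parsimonious tree, and treating a gap as a fifth character state is an assumption made here for simplicity. The tree, taxon names, and sequences are hypothetical.

```python
def fitch_column(tree, column):
    """Return (state_set, changes) for one alignment column.

    tree is either a taxon name (leaf) or a 2-tuple of subtrees.
    column maps taxon name -> character in this column.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    lset, lcost = fitch_column(left, column)
    rset, rcost = fitch_column(right, column)
    common = lset & rset
    if common:
        return common, lcost + rcost
    # No shared state: take the union and count one change.
    return lset | rset, lcost + rcost + 1


def parsimony_length(tree, alignment):
    """Sum of Fitch changes over all columns of an alignment on a fixed tree."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    n_cols = lengths.pop()
    total = 0
    for i in range(n_cols):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


# Hypothetical example: one tree, two competing alignments of the same data.
tree = (("A", "B"), ("C", "D"))
alignment_1 = {"A": "ACGT-A", "B": "ACGT-A", "C": "ACTTTA", "D": "ACTTTA"}
alignment_2 = {"A": "ACGT-A", "B": "ACG-TA", "C": "ACTTTA", "D": "ACTTTA"}

print(parsimony_length(tree, alignment_1))
print(parsimony_length(tree, alignment_2))
```

Under this criterion the alignment with the shorter length is preferred. A full comparison would repeat the tree search for each alignment rather than reuse one fixed tree.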
I would encourage you to explore some alternatives to the programs you
are using:
muscle (www.drive5.com/muscle) is freely available, is much faster
than either clustalw or t-coffee, and produces excellent amino-acid
alignments (certainly more accurate than clustalw's, and as good as or
better than t-coffee's according to published data; I haven't done a
side-by-side muscle/t-coffee comparison since t-coffee is so slow).
Nucleotide alignments may require more post-alignment tweaking than
amino-acid alignments. Muscle builds its initial guide tree using
common-substring (k-mer) counts.
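To illustrate the idea of a distance based on common-substring counts, the sketch below computes the fraction of k-mers shared by two unaligned sequences. The function names, the default k, and the normalization are assumptions for illustration and are not taken from the Muscle source code.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=6):
    """1 minus the fraction of k-mers shared by x and y (0 = same k-mer content)."""
    shortest = min(len(x), len(y))
    if shortest < k:
        raise ValueError("both sequences must be at least k characters long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / (shortest - k + 1)


print(kmer_distance("ACGTACGT", "ACGTACGT", k=3))
print(kmer_distance("AAAAAA", "CCCCCC", k=3))
```

Because no alignment is needed, a matrix of such distances can be computed quickly for all sequence pairs and passed to a clustering method to obtain a first guide tree.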
Muscle references:
http://nar.oxfordjournals.org/cgi/content/full/32/5/1792?ijkey=48Nmt1tta0fMg&keytype=ref
poa (partial order alignment; http://www.bioinformatics.ucla.edu/poa/)
uses a graph-theoretical representation of the alignment to cope with
divergent regions; as it runs, it generates a non-degenerate profile;
as each sequence is added, the sequence is aligned to the profile and
the profile is updated. It is quite fast and has produced good
alignments for me.
Poa references:
http://www.bioinformatics.ucla.edu/poa/Poa_Tutorial.htm (slide-show tutorial)
http://bioinformatics.oxfordjournals.org/cgi/reprint/20/10/1546
Finally, for nucleotide sequences that are divergent or gappy
(e.g. rRNAs): BlastAlign
(http://evolve.zoo.ox.ac.uk/software.html?id=blastalign) is quite
impressive.
http://bioinformatics.oxfordjournals.org/cgi/content/full/21/1/122
I have used all the above and found them useful. Other programs worth
investigating include MAFFT, MAVID, dialign, and perhaps align-m.
Also, this link is for software that will score alignments RELATIVE TO
A REFERENCE ALIGNMENT:
http://bioinformatics.vub.ac.be/software/software.html
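The usual idea behind such reference-based scores is to count how many residue pairs aligned in the reference are also aligned in the test alignment. A minimal sketch of that sum-of-pairs fraction follows; the function names and toy alignments are hypothetical, and this is not the scoring code from the linked site.

```python
def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs that share a column.

    Each residue is identified by (sequence name, index in the ungapped
    sequence), so the result does not depend on column numbering.
    """
    names = sorted(alignment)
    n_cols = len(alignment[names[0]])
    if any(len(alignment[name]) != n_cols for name in names):
        raise ValueError("aligned sequences must have equal length")
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(n_cols):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, position[name]))
                position[name] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Hypothetical toy alignments of the same two sequences.
reference = {"s1": "ACGT", "s2": "ACGT"}
test = {"s1": "ACG-T", "s2": "AC-GT"}

print(sp_score(test, reference))
print(sp_score(reference, reference))
```

A score of this kind is only as good as the reference, which is why it measures agreement with a curated alignment and not goodness of fit in an absolute sense.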
>From Sergios-Orestis Kolokotronis:
POY (http://research.amnh.org/scicomp/projects/poy.php) does
simultaneous alignment and inference of phylogeny (from a Maximum
Parsimony perspective).
Bali-PHY (http://www.biomath.medsch.ucla.edu/msuchard/bali-phy)
implements a joint Bayesian estimation of alignment and phylogeny. It
also considers near-optimal alignments when estimating the phylogeny.
I am not aware of any algorithm using Akaike's Information Criterion
to select models that would result in near-optimal alignments. Please
do let me know if you come across such a program.
>From Keith A. Crandall:
There are some methods that do joint estimates of alignment and
phylogeny and use a phylogeny optimization as a goodness-of-fit
criterion. This has been implemented in both a Bayesian (Redelings
and Suchard, 2005) and parsimony (Wheeler, 1996) framework.
Redelings, B.D., Suchard, M.A., 2005. Joint Bayesian estimation of
alignment and phylogeny. Syst. Biol. 54, 401-418.
Wheeler, W.C., 1996. Optimization alignment: The end of multiple
alignment in phylogenetics? Cladistics 12, 1-9.
>From Michael Sorenson:
Have a look at papers by Ward C. Wheeler starting with Wheeler 1996
Cladistics 12:1-9 and his software POY:
http://research.amnh.org/scicomp/projects/poy.php
>From Darren Obbard:
I think this depends on whether you have a model of sequence evolution
that you believe. My understanding is that, if you don't have an
explicit model, then goodness of fit must be subjective.
Have you seen MCAlign? (Model-based, and intended for non-coding DNA)
http://www.genome.org/cgi/doi/10.1101/gr.1571904
http://homepages.ed.ac.uk/eang33/mcalign/mcinstructions.html
>From Thomas Leitner:
Yes, alignments are sort of still in the air, and not well understood.
There has been a lot of work on them, however. Calculating trees and
the alignment at the same time is probably the real way of doing it,
but it is computationally very expensive. Jotun Hein has done some
interesting work on that problem. You may want to check his papers.
Alternatively, does there exist an algorithm that will utilize amino
acid translations to inform alignment of nucleotide sequences?
>From Jeff Blanchard:
It is common practice for protein-coding sequences to align the amino
acids and then substitute the nucleotide sequences after alignment. We
have a Perl script that does this for the aln and poa formats for our projects.
There are lots of other similar programs/scripts out there and I think
I remember using one in bioperl. NCBI has one in the SEALS package
called gap_cds...
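The substitution step described here can be sketched in a few lines: each residue in the aligned protein is replaced by its codon and each gap by three gap characters. This is an illustration only, assuming the coding sequence is in frame and starts at the first codon; the function name is hypothetical and is not from any of the scripts mentioned.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project a gapped protein alignment row onto its coding sequence.

    aligned_protein: one row of an amino-acid alignment, gaps included.
    cds: the unaligned, in-frame nucleotide sequence for the same taxon.
    Any nucleotides beyond the last residue (e.g. a stop codon) are ignored.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein implies")
    pieces = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            pieces.append(gap * 3)          # one amino-acid gap = three nt gaps
        else:
            pieces.append(cds[pos:pos + 3])  # next codon, copied verbatim
            pos += 3
    return "".join(pieces)


# Hypothetical example rows.
print(thread_codons("M-KV", "ATGAAAGTT"))
print(thread_codons("MAKV", "ATGGCTAAAGTC"))
```

Because codons are copied from the original nucleotide sequence and never back-translated from the amino acids, synonymous differences between taxa are preserved. Non-coding regions have no protein to guide them and would need to be aligned separately.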
>From Keith A. Crandall:
Computer programs such as ALIGNMENTHELPER
(http://inbio.byu.edu/faculty/dam83/cdm) and REVTRANS (Wernersson and
Pedersen, 2003) both do this. My understanding is that Mega will do
this too.
Wernersson, R., Pedersen, A.G., 2003. RevTrans: Multiple alignment of
coding DNA from aligned amino acid sequences. Nucleic Acids Res. 31,
3537-3539.
>From Max Telford:
I have written a perl prog "translatorx" that aligns nucleic acids
according to translated amino acids. Attached.
Completely unsupported at the moment although people I have sent it to
have not had problems using it. It uses Muscle/T-Coffee/ClustalW to
do the alignment. Put at least one of these three in your bin
folder (Unix compiled versions).
Make sure Translatorx3.pl is executable and in your bin folder
`chmod +x translatorx3.pl`
If you are not Unix savvy (I am barely) ask someone to tell you how to
make sure all executables are in your path.
The input format is NBRF or FASTA nucleotide seqs (with Unix line
endings, although Mac line endings will get converted).
Hopefully the functioning of the programme is obvious.
At the Unix prompt: `translatorx.pl`.
Please let me know if this works.
>From Will Fischer:
Another site of interest will align DNA sequences based on amino-acid
translations: http://www.cbs.dtu.dk/services/RevTrans/
>From Mike Barker:
With regard to your last question
about using amino acid alignments, there are a few options available. I
have used both transAlign and RevTrans to accomplish amino acid guided
DNA alignments. transAlign is a Perl script that is available from the
following website:
http://141.40.125.5:8080/WWW/Homepages/Bininda-Emonds/ProgramsMain.html
RevTrans (http://www.cbs.dtu.dk/services/RevTrans/) is another program
that I prefer to use in my own pipelines. I use it in conjunction with
MUSCLE (http://www.drive5.com/muscle/), which provides the amino acid
alignments, and then I use RevTrans to align the DNAs with these aa
alignments. I have some Unix shell scripts to accomplish this if you are
interested.
>From Darren Obbard:
tcoffee and clustal both align (and I think were originally intended to
align) amino acids. For coding sequence you may wish to align the protein,
and then translate back to the DNA. BioEdit and clustal are very useful
for this (toggle translation -> clustal -> toggle translation).
Peter Hraber