Thank you for your many helpful responses to my original post. Many people asked that I share the outcome, so copies follow, in no particular order, with email addresses, salutations, signatures, and affiliations scrubbed (as a spam/phishing prevention measure).

In hindsight, I ought to have clarified that I am working on whole-genome alignments of the hepatitis C virus, whose genome spans both coding and non-coding regions and encodes one polypeptide without introns or rearrangements. The sequences are ~9.5K nt for ~30 taxa.

Per the moderator's instructions, any discussion should be taken off-list to the USENET/Google groups...

Best regards,
Peter Hraber

Do you know of a multiple sequence alignment algorithm that uses an objective goodness-of-fit criterion? I am presently working with clustalw(x) and t-coffee. Both use penalty scores for mismatches. However, I see no way to compare the quality of alignments obtained with varied parameters. I'm surprised that no one has applied an information-based model-selection criterion (AIC, BIC, MDL) to this problem. Am I mistaken? How do you know when an alignment is near optimal? Is this an entirely subjective decision?

From Jeff Blanchard:
We use a partial order graph approach implemented in POA (http://www.bioinformatics.ucla.edu/poa/) for some of our work. However, this is mainly because, with the initial settings we tried, it scoots the unaligned sequences to the end for easy trimming. I have also heard that some of the EST alignment methods do something similar.

There is no optimal alignment. Usually the goal is to try to align sites that share a common ancestry. Thus there is a circular problem: one wants a tree to build the alignment, and then uses the alignment to build a tree. The use of Bayesian approaches that allow the incorporation of prior knowledge is rapidly changing this field, but there is no easy answer.
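For reference, the information criteria named in the original question trade goodness of fit against the number of free parameters. The sketch below only illustrates the arithmetic. The log-likelihoods, parameter counts, model names, and the choice of the number of sites as the sample size are all invented for illustration, and obtaining a likelihood for an alignment at all presupposes an explicit model of sequence evolution, as later responses point out.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two models to the same data (values are made up).
candidates = {
    "model_A": {"log_likelihood": -52000.0, "n_params": 10},
    "model_B": {"log_likelihood": -51990.0, "n_params": 25},
}

# Assumption: treat each of ~9500 sites as one observation.
N_SITES = 9500


def best_model(criterion):
    """Return the name of the candidate with the lowest criterion value."""
    return min(candidates, key=lambda name: criterion(**candidates[name]))


best_by_aic = best_model(aic)
best_by_bic = best_model(lambda log_likelihood, n_params:
                         bic(log_likelihood, n_params, N_SITES))

for name, fit in candidates.items():
    print(name, aic(**fit), round(bic(n_obs=N_SITES, **fit), 1))
```

With these made-up numbers, the 15 extra parameters of model_B buy only 10 log-likelihood units, so both criteria prefer model_A.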
From Ilya Nemenman:
Another objective, and formally very sound, way to quantify sequence similarity is to measure the length of the shortest program that can transform one sequence into another. This is obviously related to Kolmogorov complexity, and Li and Vitanyi, some of the biggest names in the theory of algorithmic complexity in recent years, have worked on this subject. The algorithm, described in http://www.ams.org/mathscinet-getitem?mr=2005f%3A68028, works quite nicely for phylogeny studies. There might have been later work on this approach, but I am, unfortunately, unfamiliar with it.

From Joe Felsenstein:
Why do you think that the issue has never been addressed? There is an active literature on statistical approaches to multiple sequence alignment, using models of insertion and deletion (there are major unsolved problems, including treatment of multiple-base insertions and deletions). The literature goes back to Bishop and Thompson 1986 and to Thorne et al. 1991 and 1992. In addition there is a minimum message length approach of Allison and Yee (1990). References are in my phylogenies book (2004), chapter 29.

From Will Fischer:
This is an excellent question, and there is distressingly little formal work in the area. Most "alignment accuracy" studies simply compare the alignments from algorithm A or algorithm B to a set of hand-curated alignments, e.g. BaliBase.

One objective criterion for alignment quality is the length of the most parsimonious tree derived from the alignment -- this is unfortunately too time-consuming to compute for each step of an alignment procedure, but can be used to compare alignments from different methods.
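To make the parsimony criterion above concrete, here is a minimal sketch of Fitch's algorithm, which counts the minimum number of state changes an alignment requires on one fixed, rooted binary tree. It does not search for the most parsimonious tree, which is the expensive part; it only scores a tree that is already given. The tree encoding (nested 2-tuples of taxon names), the function names, and the choice to treat a gap as just another character state are assumptions made for illustration.

```python
def fitch_column(tree, states):
    """Fitch pass for one alignment column.

    tree   -- a taxon name (leaf) or a 2-tuple of subtrees (internal node)
    states -- dict mapping taxon name to its character in this column
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_column(left, states)
    right_set, right_cost = fitch_column(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state: no extra change at this node.
        return shared, left_cost + right_cost
    # The children disagree: one change is needed on a descendant branch.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch cost over all columns of an alignment (dict name -> row)."""
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")
    n_columns = lengths.pop()
    total = 0
    for i in range(n_columns):
        column = {name: row[i] for name, row in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


# Toy example: the same tree scored against two candidate alignments.
tree = (("A", "B"), ("C", "D"))
alignment_1 = {"A": "ACGT", "B": "ACGT", "C": "ACGA", "D": "ACGA"}
alignment_2 = {"A": "ACGT", "B": "ACGA", "C": "ACGT", "D": "ACGA"}

print(parsimony_length(tree, alignment_1))  # 1
print(parsimony_length(tree, alignment_2))  # 2
```

On this tree the first toy alignment needs one change and the second needs two, so the first would be preferred under this criterion. In practice the tree itself would also have to be estimated for each alignment being compared.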
I would encourage you to explore some alternatives to the programs you are using:

muscle (www.drive5.com/muscle) is freely available, is much faster than either clustalw or t-coffee, and produces excellent amino-acid alignments (certainly more accurate than clustalw's, and as good as or better than t-coffee's according to published data; I haven't done a side-by-side muscle/t-coffee comparison since t-coffee is so slow). Muscle builds its initial guide tree using common-substring counts. Nucleotide alignments may require more post-alignment tweaking than amino-acid alignments.
Muscle reference:
http://nar.oxfordjournals.org/cgi/content/full/32/5/1792?ijkey=48Nmt1tta0fMg&keytype=ref

poa (partial order alignment; http://www.bioinformatics.ucla.edu/poa/) uses a graph-theoretical representation of the alignment to cope with divergent regions; as it runs, it generates a non-degenerate profile; as each sequence is added, the sequence is aligned to the profile and the profile is updated. It is quite fast and has produced good alignments for me.
Poa references:
http://www.bioinformatics.ucla.edu/poa/Poa_Tutorial.htm (slide-show tutorial)
http://bioinformatics.oxfordjournals.org/cgi/reprint/20/10/1546

Finally, for nucleotide sequences that are divergent or gappy (e.g. rRNAs), BlastAlign (http://evolve.zoo.ox.ac.uk/software.html?id=blastalign) is quite impressive.
http://bioinformatics.oxfordjournals.org/cgi/content/full/21/1/122

I have used all the above and found them useful. Other programs worth investigating include MAFFT, MAVID, dialign, and perhaps align-m.

Also, this link is for software that will score alignments RELATIVE TO A REFERENCE ALIGNMENT:
http://bioinformatics.vub.ac.be/software/software.html

From Sergios-Orestis Kolokotronis:
POY (http://research.amnh.org/scicomp/projects/poy.php) does simultaneous alignment and inference of phylogeny (from a Maximum Parsimony perspective).
Bali-PHY (http://www.biomath.medsch.ucla.edu/msuchard/bali-phy) implements a joint Bayesian estimation of alignment and phylogeny. It also considers near-optimal alignments when estimating the phylogeny.

I am not aware of any algorithm using Akaike's Information Criterion to select models that would result in near-optimal alignments. Please do let me know if you come across such a program.

From Keith A. Crandall:
There are some methods that do joint estimates of alignment and phylogeny and use a phylogeny optimization as a goodness-of-fit criterion. This has been implemented in both a Bayesian (Redelings and Suchard, 2005) and parsimony (Wheeler, 1996) framework.

Redelings, B.D., Suchard, M.A., 2005. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54, 401-418.
Wheeler, W.C., 1996. Optimization alignment: The end of multiple alignment in phylogenetics? Cladistics 12, 1-9.

From Michael Sorenson:
Have a look at papers by Ward C. Wheeler starting with Wheeler 1996 Cladistics 12:1-9 and his software POY: http://research.amnh.org/scicomp/projects/poy.php

From Darren Obbard:
I think this depends on whether you have a model of sequence evolution that you believe. My understanding is that, if you don't have an explicit model, then goodness of fit must be subjective. Have you seen MCAlign? (Model-based, and intended for non-coding DNA)
http://www.genome.org/cgi/doi/10.1101/gr.1571904
http://homepages.ed.ac.uk/eang33/mcalign/mcinstructions.html

From Thomas Leitner:
Yes, alignments are sort of still in the air, and not well understood. There has been a lot of work on them, however. Calculating trees and the alignment at the same time is probably the real way of doing it, but it is computationally very expensive. Jotun Hein has done some interesting work on that problem. You may want to check his papers.

Alternatively, does there exist an algorithm that will utilize amino acid translations to inform alignment of nucleotide sequences?
From Jeff Blanchard:
For protein-coding sequences, it is common practice to align the amino acids and then substitute the nucleotide sequences post-alignment. We have a perl script that does this for the aln and poa formats for our projects. There are lots of other similar programs/scripts out there, and I think I remember using one in bioperl. NCBI has one in the SEALS package called gap_cds...

From Keith A. Crandall:
Computer programs such as ALIGNMENTHELPER (http://inbio.byu.edu/faculty/dam83/cdm) and REVTRANS (Wernersson and Pedersen, 2003) both do this. My understanding is that Mega will do this too.

Wernersson, R., Pedersen, A.G., 2003. RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 31, 3537-3539.

From Max Telford:
I have written a perl prog "translatorx" that aligns nucleic acids according to translated amino acids. Attached. It is completely unsupported at the moment, although the people I have sent it to have not had problems using it.

It uses Muscle/T-Coffee/ClustalW to do the alignment. Put at least one of these three in your bin folder (Unix compiled versions). Make sure Translatorx3.pl is executable and in your bin folder: `chmod +x translatorx3.pl`

If you are not Unix savvy (I am barely), ask someone to tell you how to make sure all executables are in your path.

The input format is NBRF or FASTA nucleotide seqs (with Unix line endings, although Mac line endings will get converted).

Hopefully the functioning of the programme is obvious. At the Unix prompt: `translatorx.pl`. Please let me know if this works.

From Will Fischer:
Another site of interest will align DNA sequences based on amino-acid translations: http://www.cbs.dtu.dk/services/RevTrans/

From Mike Barker:
With regard to your last question about using amino acid alignments, there are a few options available. I have used both transAlign and RevTrans to accomplish amino acid guided DNA alignments.
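The idea shared by the tools named in these responses (align the translated proteins, then put the nucleotides back) can be sketched in a few lines. This is only an illustration of the general technique, not the code of RevTrans, transAlign, translatorx, or any other program mentioned here. The function name and the toy sequences are hypothetical, and the sketch assumes each coding sequence is in frame, has no stop codon included, and is exactly three times as long as its ungapped protein.

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Lay an unaligned coding sequence onto its aligned protein row.

    Each residue in the protein row is replaced by the next codon from the
    DNA; each protein gap becomes a three-column nucleotide gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(coding_dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x number of ungapped residues")
    pieces = []
    position = 0  # index of the next unused nucleotide
    for aa in aligned_protein:
        if aa == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(coding_dna[position:position + 3])
            position += 3
    return "".join(pieces)


# Toy protein alignment (as an aligner might return it) and the matching
# unaligned coding sequences. Both are made up for illustration.
protein_alignment = {"seq1": "MK-L", "seq2": "MKAL"}
coding_sequences = {"seq1": "ATGAAACTG", "seq2": "ATGAAAGCTCTG"}

dna_alignment = {
    name: thread_codons(protein_alignment[name], coding_sequences[name])
    for name in protein_alignment
}

for name, row in dna_alignment.items():
    print(name, row)
# seq1 ATGAAA---CTG
# seq2 ATGAAAGCTCTG
```

Because gaps are only ever inserted in multiples of three, the reading frame is preserved in the nucleotide alignment. Non-coding regions, such as those flanking the polyprotein in a whole viral genome, are outside the scope of this approach and would need to be aligned separately.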
transAlign is a perl script that is available from the following website:
http://141.40.125.5:8080/WWW/Homepages/Bininda-Emonds/ProgramsMain.html

RevTrans (http://www.cbs.dtu.dk/services/RevTrans/) is another program, and the one I prefer to use in my own pipelines. I use it in conjunction with MUSCLE (http://www.drive5.com/muscle/), which provides the amino acid alignments, and then I use RevTrans to align the DNAs with these aa alignments. I have some Unix shell scripts to accomplish this if you are interested.

From Darren Obbard:
tcoffee and clustal both align (and I think were originally intended to align) amino acids. For coding sequence you may wish to align the protein, and then translate back to the DNA. Bioedit and clustal are very useful for this (Toggle translation -> clustal -> toggle translation).

EOF
Peter Hraber