Version 5.0
October 2000
Copyright 1999-2000 by Heiko A. Schmidt, Korbinian Strimmer, Martin Vingron, and Arndt von Haeseler
Copyright 1995-1999 by Korbinian Strimmer and Arndt von Haeseler
Heiko A. Schmidt, email: hschmidt@molgen.mpg.de, Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany.
Korbinian Strimmer, email: strimmer@stat.uni-muenchen.de, Department of Statistics, University of Munich, Ludwigstr. 33, 80539 Munich, Germany.
Martin Vingron, email: vingron@molgen.mpg.de, Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany.
Arndt von Haeseler, email: haeseler@eva.mpg.de, Max-Planck-Institute for Evolutionary Anthropology, Inselstr. 22, D-04103 Leipzig, Germany.
The official name of the program has been changed to TREE-PUZZLE to avoid legal conflict with the Fraunhofer Gesellschaft. We are sorry for any inconvenience this may cause to you. Any reference to PUZZLE in this package is only colloquial and refers to TREE-PUZZLE.
TREE-PUZZLE is a computer program to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. TREE-PUZZLE also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. Branch lengths can also be calculated under the clock-assumption. In addition, TREE-PUZZLE offers likelihood mapping, a method to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment. TREE-PUZZLE also conducts a number of statistical tests on the data set (chi-square test for homogeneity of base composition, likelihood ratio to test the clock hypothesis, Kishino-Hasegawa test). The models of substitution provided by TREE-PUZZLE are TN, HKY, F84, SH for nucleotides, Dayhoff, JTT, mtREV24, BLOSUM 62, VT, WAG for amino acids, and F81 for two-state data. Rate heterogeneity is modeled by a discrete Gamma distribution and by allowing invariable sites. The corresponding parameters can be inferred from the data set.
TREE-PUZZLE is available free of charge from
Please read the installation section for more details.
We suggest that this documentation should be read before using TREE-PUZZLE the first time. If you do not have the time to read this manual completely please do read at least the sections Input/Output Conventions and Quick Start below. Then you should be able to use the TREE-PUZZLE program, especially if you have some experience with the PHYLIP programs. The other sections should then be read at a later time.
To find out what's new in version 5.0 please read the Version History.
tar xvf tree-puzzle-5.0.tarThe newly created directory "tree-puzzle-5.0" contains four subdirectories called "doc", "data", "bin", and "src". The "doc" directory contains this manual in HTML format. The "data" directory contains example input files. The "src" directory contains the ANSI C sources of TREE-PUZZLE. Switch to this directory by typing
cd tree-puzzle-5.0To compile we recommend the GNU gcc (or GNU egcs) compiler. If gcc is installed just type
sh ./configure
make
make installand the executable puzzle is compiled and put into the /usr/local/bin directory. If you want to have puzzle installed into another directory you can set this by setting the --prefix=/name/of/the/wanted/directory directive at the sh ./configure command line. The parallel version should have been built and installed as well, if configure found a known MPI compiler (cf. Parallel TREE-PUZZLE section). Then type
make cleanand everything will be nicely cleaned up. If your compiler is not the GNU gcc compiler and not found by configure you will have to modify that, by setting the CC variable (e.g. setenv CC cc under csh or CC=cc; export CC under sh) before running sh ./configure. If you still cannot compile properly then your compiler or its runtime library is most probably not ANSI compliant (e.g., old SUN compilers). In most cases, however, you will succeed to compile by changing some parameters in the "makefile". Ask your local Unix expert for help.
The MacOS executables have been compiled for the PowerMac using Metrowerks CodeWarrior.
Note: It is probably a good idea to install PPC Linux (or MkLinux) on your Macintosh. TREE-PUZZLE (as any other program) runs 20-50% faster under Linux compared to the same program under MacOS (on the same machine!), and the Mac does not freeze during execution because of Linux´ multitasking capabilities (maybe this changes in MacOS X).
The executable has been compiled using Microsoft Visual C++ and the "makefile.w32" (contained in "src").
If you have a Linux partition on your PC we recommend to install and use TREE-PUZZLE under Linux (see Unix section) because it runs TREE-PUZZLE significantly faster than Windows.
Get the Unix sources and install the package on your computer (ask your local VMS expert for help). Go to the subdirectory "src" and compile TREE-PUZZLE using the command file "makefile.com".
To compile and run the parallelized TREE-PUZZLE you need an implementation of the Message Passing Interface (MPI) library, a widely used message passing library standard. Implementations of the MPI libraries are available for almost all parallel platforms and computer systems, and there are free implementations for most platforms as well.
To find an MPI implementation suitable for your platform visit the following web sites:
To install the parallel version of TREE-PUZZLE you need the Unix sources for TREE-PUZZLE and install the package on your computer as described above. The configure should configure the Makefiles appropriately. If there is no known MPI compiler found on the system the parallel version is not configured. (If problems occur ask your local system administrator for help.)
Than you should be able to compile the parallel version of TREE-PUZZLE using the following commands:
sh ./configure
make
make installand the executable ppuzzle is compiled and put into the /usr/local/bin directory. If you want to have the executable installed into another directory please proceed as described in the Unix section. If your compiler is non out of mpcc (IBM), hcc (LAM), mpicc_lam (LAM under LINUX), mpicc_mpich (MPICH under LINUX), and mpicc (LAM, MPICH, HP-UX, etc.) and not found by configure you will have to modify that by setting the MPICC variable (e.g. setenv MPICC /another/mpicc under csh or MPICC=/another/mpicc; export MPICC under sh) before running sh ./configure. The way you have to start ppuzzle depends on the MPI implementation installed. So please refer to your MPI manual or ask your local MPI expert for help.
Note:
The parallelization of the tree reconstruction method follows a
master-worker-concept, i.e., a master process handles the scheduling of
the computation to the n worker processes, while the worker processes are
doing almost all the computation work of evaluating the quartets and
constructing the puzzling step trees.
Since the master process does not require a lot of CPU time,
it can be scheduled sharing one processor with a worker process.
Thus, you can run ppuzzle by assigning n+1 processes.
If you want to evaluate a usertree or perform likelihood
mapping analysis it is not recommended to do a parallel run, because all
the computation will be done by the master process. Hence a run of the
sequential version of TREE-PUZZLE is more appropriate for usertree or likelihood
mapping analysis.
The naming conventions of the output files have changed as well. As prefix of the output filenames the name of the sequence input file (or the usertree file in the usertree analysis case) is used and an extension added to denote the content of the file. If no input filename is given at the command line the default filenames of the earlier versions are used. The following extensions/default filenames are possible:
Extension | default filename | file content |
.puzzle | outfile | for the TREE-PUZZLE report |
.dist | outdist | for the ML distances |
.tree | outtree | for the final tree(s) |
.qlist | outqlist | for the list of unresolved quartets |
.ptorder | outptorder | for the list of unique puzzling step tree topologies |
.pstep | outpstep | for the list of puzzling step tree topologies in chronological order |
.eps | outlm.eps | for the EPS file generated in the likelihood mapping analysis |
The "INFILENAME.ptorder"/"outptorder" file contains the unique tree topologies in PHYLIP format preceded by PHYLIP-format comment (in parenthesis). A typical line in the ptorder file looks like this:
[ 2. 60 6.00 2 5 1000 ](chicken,((cat,(horse,(mouse,rat))),(opossum,platypus)));
The entries (separated by single blanks) in the parenthesis mean the following:The "INFILENAME.pstep"/"outpstep" file contains a log of the puzzling steps performed and the occurring tree topologies. A typical line in the pstep file contains the following entries (separated by tabstops):
6. 55 698 3 5 828
The entries in the rows mean the following:(Gibbon:0.1393, ((Human:0.0414, Chimpanzee:0.0538)99:0.0175, Gorilla:0.0577)98:0.0531, Orangutan:0.1003);
The likelihood value of each tree is added in parenthesis before the tree string (e.g. "[ lh=-1621.201605 ]"). Parenthesis mark comments in the Newick or PHYLIP tree format. In some cases the comment has to be removed before using them with other programs.
With the programs TreeView and TreeTool it is possible to view a tree both with its branch lengths and simultaneously with the support values for the internal branches (here 98% and 99%). Note, the PHYLIP programs DRAWTREE and DRAWGRAM may also be used with the CLUSTAL W treefile format. However, in the current version (3.5) they ignore the internal labels and simply print the tree topology along with branch lengths.
Then, a menu (PHYLIP "look and feel") appears with default options set. It is possible to change all available options. For example, if you want to incorporate rate heterogeneity you have to select option "w" as rate heterogeneity is switched off by default. Then type "y" at the input prompt and start the analysis. You will see a number of status messages on the screen during computation. When the analysis is finished all output files (e.g., "outfile", "outtree", "outdist", "outqlist", "outlm.eps", "outpstep", "outptlist" or "INFILENAME.puzzle", "INFILENAME.tree", "INFILENAME.dist", "INFILENAME.qlist", "INFILENAME.eps", "INFILENAME.pstep", "INFILENAME.ptorder") will be in the same directory as the input files.
To obtain a high quality picture of the output tree (including node labels) you might want to use use the TreeView program by Roderic Page. It is available free of charge and runs on MacOS and MS-Windows. It can be retrieved from http://taxonomy.zoology.gla.ac.uk/rod/treeview.html. TreeView understands the CLUSTAL W treefile conventions, reads multifurcating trees and is able to simultaneously display branch lengths and support values for each branch. Open the "INFILENAME.tree"/"outtree" file with TreeView, choose "Phylogram" to draw branch lengths, and select "Show internal edge labels".
On a Unix you can use the TreeTool program to display and manipulate TREE-PUZZLE trees (See ftp://rdp.life.uiuc.edu/pub/RDP/programs/TreeTool for precompiled Sun executables. A version that runs on Linux has been prepared by Anders Holmberg from the Dept. of Biochemistry at the Royal Institute of Technology, Stockholm).
| Q_{ij} pi_j for i != j R_{ij} = | | - Sum_m Q_{im} pi_m for i == jThe matrix Q_{ij} is symmetric with Q_{ii} == 0 (diagonals are zero). For nucleotides the most general model built into TREE-PUZZLE is the Tamura-Nei model (TN, Tamura and Nei, 1993). The matrix Q_{ij} for this model equals
| 4*t*gamma/(gamma+1) for i -> j pyrimidine transition | Q_{ij} = | 4*t/(gamma+1) for i -> j purine transition | | 1 for i -> j transversionThe parameter gamma is called the "Y/R transition parameter" whereas t is the "Transition/transversion parameter". If gamma is equal to 1 we get the HKY model (Hasegawa et al., 1985). Note, the ratio of the transition and transversion rates (without frequencies) is kappa = 2*t. There is a subtle but important difference between the transition-transversion parameter, the expected transition-transversion ratio, and the observed transition transversion ratio. The transition-transversion parameter simply is a parameter in the rate matrix. The expected transition-transversion ratio is the ratio of actually occurring transitions to actually occurring transversions taking into account nucleotide frequencies in the alignment. Due to saturation and multiple hits not all substitutions are observable. Thus, the observed transition-transversion ratio counts observable transitions and transversions only. If the base frequencies in the HKY model are homogeneous (pi_i = 0.25) HKY further reduces to the Kimura model. In this case t is identical to the expected transition/transversion ratio. If t is set to 0.5 the Jukes-Cantor model is obtained. The F84 model (as implemented in the various PHYLIP programs, Felsenstein, 1984) is a special case of the Tamura-Nei model.
For amino acids the matrix Q_{ij} is fixed and does not contain any free parameters. Depending on the type of input data four different Q_{ij} matrices are available in TREE-PUZZLE. The Dayhoff (Dayhoff et al., 1978) and JTT (Jones et al., 1992) matrices are for use with proteins encoded on nuclear DNA, the mtREV24 (Adachi and Hasegawa, 1996) matrix is for use with proteins encoded on mtDNA, and the BLOSUM 62 (Henikoff and Henikoff, 1992) and the WAG model (Whelan and Goldman) are for more distantly related amino acid sequences. The WAG matrix has been inferred from a database of 3905 globular protein sequences, forming 182 distinct gene families spanning a broad range of evolutionary distances (Whelan and Goldman). The VT model is based an new estimator for amino acid replacement rates, the resolvent method. The VT matrix has been computed from a large set alignments of varying degree of divergence. Hence VT is for use with proteins of distant relatedness as well (Müller and Vingron, 2000).
For doublets (pairs of dependent nucleotides) the SH model (Schöniger and von Haeseler, 1994) is implemented in TREE-PUZZLE. The corresponding matrix Q_{ij} reads
| 2*t for i -> j transition substitution | Q_{ij} = | 1 for i -> j transversion substitution | | 0 for i -> j two substitutionsThe SH model basically is a F81 model (Felsenstein, 1981) for single substitutions in doublets.
For invariable sites the parameter theta ("Fraction of invariable sites") determines the probability of a given site to be invariable. If a site is invariable the probability for the constant site patterns is pi_i, the frequency of each nucleotide (amino acid).
The rates r for variable sites are determined by a discrete Gamma distribution that approximates the continuous Gamma distribution
alpha alpha-1 alpha r g(r) = ------------------------ alpha r e Gamma(alpha)where the parameter alpha ranges from alpha = infinity (no rate heterogeneity) to alpha < 1 (strong heterogeneity). The mean expectation of r under this distribution is 1.
A mixed model of rate heterogeneity (Gamma plus invariable sites) is also available. In this case the total rate heterogeneity rho (as defined by Gu et al., 1995) computes as rho = (1+theta alpha)/(1+alpha).
GENERAL OPTIONS b Type of analysis? Tree reconstruction k Tree search procedure? Quartet puzzling v Approximate quartet likelihood? No u List unresolved quartets? No n Number of puzzling steps? 1000 j List puzzling step trees? No o Display as outgroup? Gibbon z Compute clocklike branch lengths? No e Parameter estimates? Approximate (faster) x Parameter estimation uses? Neighbor-joining tree SUBSTITUTION PROCESS d Type of sequence input data? Nucleotides m Model of substitution? HKY (Hasegawa et al. 1985) t Transition/transversion parameter? Estimate from data set f Nucleotide frequencies? Estimate from data set RATE HETEROGENEITY w Model of rate heterogeneity? Uniform rate Quit [q], confirm [y], or change [menu] settings:By typing the letters shown in the menu you can either change settings or enter new parameters. Some options (for example "m" and "w") can be invoked several times to switch through a number of different settings. The parameters of the models of sequence evolution can be estimated from the data by a variety of procedures based on maximum likelihood. The analysis is started by typing "y" at the input prompt. To quit the program type "q".
The following table lists in alphabetical order all TREE-PUZZLE options. Be aware, however, not all of them are accessible at the same time:
|
|
|
Gamma rate heterogeneity parameter alpha. This is the so-called shape parameter of the Gamma distribution. |
|
Type of analysis. Allows to switch between tree reconstruction by maximum likelihood and likelihood mapping. |
|
Number of rate categories (4-16) for the discrete Gamma distribution (rate heterogeneity). |
|
Data type. Specifies whether nucleotide, amino acid sequences, or two-state data serve as input. The default is automatically set by inspection of the input data. After TREE-PUZZLE has selected an appropriate data type (marked by 'Auto:') the 'd'-option changes the type in the following order: selected type -> Nucleotides -> Amino acids -> automatically selected type. |
|
Approximation option. Determines whether an approximate or the exact likelihood function is used to estimate parameters of the models of sequence evolution. The approximate likelihood function is in most cases sufficient and is faster. |
|
Base frequencies. The maximum likelihood calculation needs the frequency of each nucleotide (amino acid, doublet) as input. TREE-PUZZLE estimates these values from the sequence input data. This option allows specification of other values. |
|
Group sequences in clusters. Allows to define clusters of sequences as needed for the likelihood mapping analysis. Only available when likelihood mapping is selected ("b" option). |
|
Codon positions or definition of doublets. For nucleotide data only. If the TN or HKY model of substitution is used and the number of sites in the alignment is a multiple of three the analysis can be restricted to each of the three codon positions and to the 1st and 2nd positions. If the SH model is used this options allows to specify that the 1st and 2nd codon positions in the alignment define a doublet. |
|
Fraction of invariable sites. Probability of a site to be invariable. This parameter can be estimated from the data by TREE-PUZZLE (only if the approximation option for the likelihood function is turned off). |
|
List puzzling steps trees. Writes all intermediate trees (puzzling step trees) used to compute the quartet puzzling tree into a file, either as a list of topologies ordered by number of occurrences (*.ptorder), or as list about the chronological occurrence of the topologies (*.pstep), or both. |
|
Tree search. Determines how the overall tree is obtained. The topology is either computed with the quartet puzzling algorithm or is defined by the user. Maximum likelihood branch lengths will be computed for this tree. Alternatively, a maximum likelihood distance matrix only can also be computed (no overall tree). |
|
Location of root. Only for computation of clock-like maximum likelihood branch lengths. Allows to specify the branch where the root should be placed in an unrooted tree topology. For example, in the tree (a,b,(c,d)) l = 1 places the root at the branch leading to sequence a whereas l=5 places the root at the internal branch. |
|
Model of substitution. The following models are implemented for nucleotides:
the Tamura-Nei (TN) model,
the Hasegawa et al. (HKY) model, and
the Schöniger & von Haeseler (SH) model.
The SH model describes the evolution of
pairs of dependent nucleotides (pairs are the first and the second nucleotide,
the third and the fourth nucleotide and so on). It allows for specification
of the transition-transversion ratio. The original model
(Schöniger & von Haeseler, 1994)
is obtained by setting the transition-transversion parameter to 0.5.
The Jukes-Cantor (1969),
the Felsenstein (1981), and
the Kimura (1980) model are all special cases of
the HKY model.
For amino acid sequence data the Dayhoff et al. (Dayhoff) model, the Jones et al. (JTT) model, the Adachi and Hasegawa (mtREV24) model, the Henikoff and Henikoff (BLOSUM 62), the Müller and Vingron (VT), and the Whelan and Goldman (WAG) substitution model are implemented in TREE-PUZZLE. The mtREV24 model describes the evolution of amino acids encoded on mtDNA, and BLOSUM 62 is for distantly related amino acid sequences, as well as the VT model. After TREE-PUZZLE has selected an appropriate amino acid substitution model (marked by 'Auto:') the 'm'-option changes the model in the following order: selected model -> Dayhoff -> JTT -> mtREV24 -> BLOSUM62 -> VT -> WAG -> automatically selected model For more information please read the section in this manual about models of sequence evolution. See also option "w" (model of rate heterogeneity). |
|
If tree reconstruction is selected: number of puzzling steps. Parameter
of the quartet puzzling tree search. Generally,
the more sequences are used the more puzzling steps are advised. The default
value varies depending on the number of sequences (at least 1000). If likelihood mapping is selected: number of quartets in a likelihood mapping analysis. Equal to the number of dots in the likelihood mapping diagram. By default 10000 dots/quartets are assumed. To use all possible quartets in clustered likelihood mapping you have to specify a value of n=0. |
|
Outgroup. For displaying purposes of the unrooted quartet puzzling tree only. The default outgroup is the first sequence of the data set. |
|
Constrain the TN model to the F84 model. This option is only available for the Tamura-Nei model. With this option the expected (!) transition-transversion ratio for the F84 model have to be entered and TREE-PUZZLE computes the corresponding parameters of the TN model (this depends on base frequencies of the data). This allows to compare the results of TREE-PUZZLE and the PHYLIP maximum likelihood programs which use the F84 model. |
|
Quits analysis. |
|
Y/R transition parameter. This option is only available for the TN model. This parameter is the ratio of the rates for pyrimidine transitions and purine transitions. You do not need to specify this parameter as TREE-PUZZLE estimates it from the data. For precise definition please read the section in this manual about models of sequence evolution. |
|
Symmetrize doublet frequencies. This option is only available for the SH model. With this option the doublet frequencies are symmetrized. For example, the frequencies of "AT" and "TA" are then set to the average of both frequencies. |
|
Transition/transversion parameter. For nucleotide data only. You do not need to specify this parameter as TREE-PUZZLE estimates it from the data. The precise definition of this parameter is given in the section on models of sequence evolution in this manual. |
|
Show unresolved quartets. During the quartet puzzling tree search TREE-PUZZLE counts the number of unresolved quartet trees. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler, 1997). If this option is selected you will get a detailed list of all star-like quartets. Note, for some data sets there may be a lot of unresolved quartets. In this case a list of all unresolved quartets is probably not very useful and also needs a lot of disk space. |
|
Approximate quartet likelihood. For the quartet puzzling tree search only. Only for very small data sets it is necessary to compute an exact maximum likelihood. For larger data sets this option should always be turned on. |
|
Model of rate heterogeneity. TREE-PUZZLE provides several different models of rate heterogeneity: uniform rate over all sites (rate homogeneity), Gamma distributed rates, two rates (1 invariable + 1 variable), and a mixed model (1 invariable rate + Gamma distributed rates). All necessary parameters can be estimated by TREE-PUZZLE. Note that whenever invariable sites are taken into account the parameter estimation will invoke the "e" option to use an exact likelihood function. For more detailed information please read the section in this manual about models of sequence evolution. See also option "m" (model of substitution). |
|
Selects the methods used in the estimation of the model parameters. Neighbor-joining tree means that a NJ tree is used to estimate the parameters. Quartet sampling means that a number of random sets of four sequences are selected to estimate parameters. |
|
Starts analysis. |
|
Computation of clock-like maximum likelihood branch lengths. This option also invokes the likelihood ratio clock test. |
TREE-PUZZLE also tests with a 5% level chi-square-test whether the base composition of each sequence is identical to the average base composition of the whole alignment. All sequences with deviating composition are listed in the TREE-PUZZLE report file. It is desired that no sequence (possibly except for the outgroup) has a deviating base composition. Otherwise a basic assumption implicit in the maximum likelihood calculation is violated.
A hidden feature of TREE-PUZZLE (since version 2.5) is the employment of a weighting scheme of quartets (Strimmer, Goldman, and von Haeseler, 1997) in the quartet puzzling tree search.
TREE-PUZZLE also computes the average distance between all pairs of sequences (maximum likelihood distances). The average distances can be viewed as a rough measure for the overall sequence divergence.
If more than one input tree is provided TREE-PUZZLE uses the Kishino-Hasegawa test (1989) to check which trees are significantly worse than the best tree.
If clock-like maximum-likelihood branch lengths are computed TREE-PUZZLE checks with the help of a likelihood-ratio test (Felsenstein, 1988) whether the data set is clock-like.
TREE-PUZZLE also detects sequences that occur more than once in the data and that therefore can be removed from the data set to speed up analysis.
If rate heterogeneity is taken into account in the analysis TREE-PUZZLE also computes the most probable assignment of rate categories to sequence positions, according Felsenstein and Churchill (1996).
Parameter (Symbol) | Minimal Value | Maximal Value |
Transition/transversion parameter (t) | 0.20 | 30.00 |
Y/R transition parameter (gamma) | 0.10 | 6.00 |
Fraction of invariable sites (theta) | 0.00 | 0.99 |
Gamma rate heterogeneity parameter (alpha) | 0.01 | 99 |
puzzle << EOF t 10 y EOFAnother possibility is to pipe a parameter file into it, i.e.
puzzle < paramsor
cat params | puzzlewhere in this exaple the parameter file params contains:
t 10 yAll other parameters can also be accessed the same way. Note that the y parameter is always needed at the end.
If problems are encountered TREE-PUZZLE terminates program execution and returns a plain text error message. Depending on the severity errors can be classified into three groups:
"HALT " errors: | Very severe. You should never ever see one of these messages. If so, please contact the developers! |
"Unable to proceed" errors: | Harmless but annoying. Mostly memory errors (not enough RAM) or problems with the format of the input files. |
Other errors: | Completely uncritical. Occur mostly when options of TREE-PUZZLE are being set. |
However, this is not true. According to Hillis' findings (Hillis, 1996), quartets can be hard, but extra information helps. That is, if all you have are data on species (A, B, C, D) then it might be relatively difficult to find the correct tree for them. But if you have additional data (species E, F, G, ...) and try to find a tree for all the species, then that part of the tree relating (A, B, C, D) will more likely be correct than if you had just the data for (A, B, C, D). In Hillis' big 'model' tree, there are many examples of subsets of 4 species which in themselves might be hard to resolve correctly, but which are correctly resolved thanks to the (...large amount of...) additional data. TREE-PUZZLE (quartet puzzling) also gains advantage from extra data in the same way. It's 'understanding' or resolution of the quartet (A, B, C, D) might be incorrect, but the information on the relationships of (A, B, C, D) implicit in its treatment of (A, B, C, E), (A, B, E, D), (A, E, C, D), (E, B, C, D), (A, B, C, F), (A, B, F, D), (A, F, C, D), (F, B, C, D), (A, B, C, G), etc. etc. should overcome this problem.
The facts about how well TREE-PUZZLE actually works have been investigated in the Strimmer and von Haeseler (1996) and Strimmer, Goldman, and von Haeseler (1997) papers. Their results cannot be altered by Hillis' findings. Considered as a heuristic search for maximum likelihood trees, quartet puzzling works very well.
(This section follows N. Goldman, personal communication).
Adachi, J., and M. Hasegawa. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of Protein Sequence Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington DC, pp. 345-352.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17: 368-376.
Felsenstein, J. 1984. Distance methods for inferring phylogenies: A Justification. Evolution 38: 16-24.
Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22: 521-565.
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.
Felsenstein, J., and G.A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93-104.
Gropp, W., S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir. 1998. MPI - The Complete Reference: Volume 2, The MPI Extensions. 2nd Edition, The MIT Press, Cambridge, MA.
Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12: 546-557.
Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
Henikoff, S., J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. PNAS (USA) 89:10915-10919.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. In: Munro, H. N. (ed.) Mammalian Protein Metabolism, New York: Academic Press, pp. 21-132.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29: 170-179.
Müller, T., and M. Vingron. 2000. Modeling Amino Acid Replacement. J. Comp. Biol.7: 761-776. (preprint of the article)
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 1406-425.
Schöniger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3: 240-247.
Snir, M., S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. 1998. MPI - The Complete Reference: Volume 1, The MPI Core. 2nd Edition, The MIT Press, Cambridge, MA.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13: 964-969.
Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14: 210-211.
Strimmer, K., and A. von Haeseler. 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. PNAS (USA). 94:6815-6819.
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
Tamura K. 1994. Model selection in the estimation of the number of nucleotide substitutions. Mol. Biol. Evol. 11: 154-157.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673-4680.
Whelan, S. and Goldman, N. 2001. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol. Biol. Evol. 18:691-699.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.
5.0 | Puzzle tree reconstruction part is now parallelized using the MPI standard
(Message Passing Interface).
Possibility added to specify the names of the input file and the user tree file at the command line. Output files renamed to the form PREFIX.EXTENSION, where PREFIX is the input file name or, if used, the user tree file name. The EXTENSION could be one of the following: puzzle (PUZZLE report), tree (tree file), dist (ML distance file), eps (likelihood mapping output in eps format), qlist (bad quartets), qstep (puzzling step tree IDs as they occur in the analysis), or qtorder (sorted unique list of puzzling step trees). The tree likelihood value is added to the treefile as a leading comment ("[ lh=x.xxx ]") to the tree string. VT (variable time) matrix (Müller and Vingron, 2000) and WAG matrix (Whelan and Goldman, 2000) are added to the AA substitution models. The Data type and AA-model options in the menu now show the automatically set type/model first. These can now be changed by using the 'd' or 'm' key independently from the type/model selected. This makes it possible to select a desired AA substitution model or data type by piping letters to the standard input without knowing PUZZLE's preselection. Parameters are written to file when estimated before evaluation of the quartets. The inconsistency with respect to other programs in handling invariable sites has been fixed. Some minor bug fixes (e.g. the clockbug and another in the optimization routine have been fixed). Source code organization adopted to the GNU standards (configure, make, make install under UNIX) |
4.0.2 | Update to provide precompiled Windows 95/98/NT executables. In addition: Internal rearrangement of rate matrices. Improved BLOSUM 62 matrix. Endless input loop for input files restricted to 10 trials. Source code clean up to remove compile time warnings. Explicit quit option in menu. Changes in NJ tree code. Updates of documentation (address changes, correction of errors). |
4.0.1 | Maintenance release. Correction of mtREV matrix. Fix of the "intree bug". Removal of stringent runtime-compatibility check to allow out-of-the-box compile on alpha. More accurate gamma distribution allowing 16 instead of 8 categories and ensuring a better alpha > 1.0. Update of documentation (mainly address changes). More Unix-like file layout, and change of license to GPL. |
4.0 | Executables for Windows 95/NT and OS/2 instead of MS-DOS. Computation of clock-like branch lengths (also for amino acids and for non-binary trees). Automatic likelihood ratio clock test. Model for two-state sequences data (0,1) included. Display of most probable assignment of rates to sites. Identification of groups of identical sequences. Possibility to read multiple input trees. Kishino-Hasegawa test to check whether trees are significantly different. BLOSUM 62 model of amino acid substitution (Henikoff-Henikoff, 1992). Use of parameter alpha instead of eta = 1/(1+alpha) (for rate heterogeneity). Improvements to user interface. SH model can be applied to 1st and 2nd codon positions. Automatic check for compatible compiler settings. Workaround for severe runtime problem when the gcc compiler was used. |
3.1 | Much improved user interface to rate heterogeneity (less confusing menu, rearranged outfile, additional out-of-range check). Possibility to read rooted input trees (automatic removal of basal bifurcation). Computation of average distance between all pairs of sequences. Fix of a bug that caused PUZZLE 3.0 to crash on some systems (DEC Alpha). Cosmetic changes in program and documentation. |
3.0 | Rate heterogeneity included in all models of substitution (Gamma distribution plus invariable sites). Likelihood mapping analysis with Postscript output added. Much more sophisticated maximum likelihood parameter estimation for all model parameters including those of rate heterogeneity. Codon positions selectable. Update to mtREV24. New icon. Less verbose runtime messages. HTML documentation. Better internal error classification. More information in outfile (number of constant positions etc.). |
2.5.1 | Fix of a bug (present only in version 2.5) related to computation of the variance of the maximum likelihood branch lengths that caused occasional crashes of PUZZLE on some systems when applied to data sets containing many very similar sequences. Drop of support for non-FPU Macintosh version. Corrections in manual. |
2.5 | Improved QP algorithm (Strimmer, Goldman, and von Haeseler, 1997). Bug fixes in ML engine, computation of ML distances and ML branch lengths, optional input of a user tree, F84 model added, estimation of all TN model parameters and corresponding standard errors, CLUSTAL W treefile convention adopted to allow to show branch lengths and QP support values simultaneously, display of unresolved quartets, update of mtREV matrix, source code more compatible with some almost-ANSI compilers, more safety checks in the code. |
2.4 | Automatic data type recognition, chi-square-test on base composition, automatic selection of best amino acid model, estimation of transition-transversion parameter, ASCII plot of quartet puzzling tree into the outfile. |
2.3 | More models, many usability improvements, built-in consensus tree routines, more supported systems, bug fixes, no more dependencies of input order. First EBI distributed version. |
2.2 | Optimized internal data structure requiring much less computer memory. Bug fixes. |
2.1 | Bug fixes concerning algorithm and transition/transversion parameter. |
2.0 | Complete revision merging the maximum likelihood and the quartet puzzling routines into one user friendly program. First electronic distribution. |
1.0 | First public release, presented at the 1995 phylogenetic workshop (15-17 June 1995) at the University of Bielefeld, Germany. |