Hi all,
with your help, I was able to solve my little optimization problem and,
thereby, to save the I+Gamma (at least for me): The trick is to bound
alpha > 1. Smaller alphas should not be used. This avoids a situation
where both parameters have nearly the same effect on the distribution
shape and, as a consequence, 'fight' to explain the data. On the other
hand, this constrained I+Gamma has still the advantages it was made for:
It can produce two-peaked rate distributions as well as one-peaked ones,
ranging from homogeneity to extremely L-shaped and anything between.
However, it is better not to interpret the heterogeneity parameters.
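For anyone who wants to reproduce this constraint in their own optimizer (a sketch of the idea only, not code from any program in this thread): one convenient way to bound alpha > 1 is to optimize an unconstrained variable x and map it to alpha through a softplus transform.

```python
import math

def alpha_from_x(x):
    """Map an unconstrained optimizer variable x to alpha > 1.

    softplus(x) = log(1 + exp(x)) is smooth and strictly positive,
    so 1 + softplus(x) is always strictly greater than 1.
    """
    # softplus, written stably for large |x|
    sp = x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))
    return 1.0 + sp

def x_from_alpha(alpha):
    """Inverse map, for initializing the optimizer from an alpha > 1."""
    return math.log(math.expm1(alpha - 1.0))

a = alpha_from_x(x_from_alpha(2.5))
print(a)  # recovers 2.5 up to floating-point error
```

The optimizer then works on x with no bounds at all, and every candidate it proposes corresponds to a legal alpha.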
Thanks to all who contributed valuable information. As this might be
useful to others, I have included the most informative answers below.
Best,
Gangolf Jobb
----------------------------------------
Hi all,
just some questions.
The "I+Gamma" is a very popular model of sequence evolution with rate
heterogeneity among sites. It is a superposition of two simpler models,
which are both able to account for invariable sites in a sequence
alignment. The first distinguishes two classes of sites, of which one
may evolve while the other is invariable. The other part is Yang's
discrete gamma model, which is used for the variable sites. The
"I+Gamma" has two parameters, namely the fraction of invariable sites
(theta) and the shape parameter of the discrete gamma distribution (alpha).
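For concreteness, here is a small sketch (my own illustration, not taken from any program in this thread) of what sampling site rates under such a model looks like. The normalization used below, with variable-site rates scaled by 1/(1-theta) so the overall mean rate is 1, is one common convention among several.

```python
import random

def sample_ig_rates(theta, alpha, n, seed=42):
    """Sample n site rates under a toy I+Gamma model.

    With probability theta a site is invariable (rate 0); otherwise its
    rate is drawn from a Gamma(shape=alpha, mean=1) distribution and
    rescaled by 1/(1-theta) so the mean rate over all sites is 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n):
        if rng.random() < theta:
            rates.append(0.0)
        else:
            # gammavariate(shape, scale); scale = 1/alpha gives mean 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha) / (1.0 - theta))
    return rates

rates = sample_ig_rates(theta=0.25, alpha=0.5, n=100000)
print(sum(rates) / len(rates))  # close to 1 by construction
```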
When I simulate sequences under the "I+Gamma" and try to re-estimate the
two heterogeneity parameters by maximum likelihood optimization
afterwards, it turns out that the estimated parameters often differ
severely from the original parameters that were previously used in the
simulation. This indicates to me that the "I+Gamma" is
over-parameterized, that alpha and theta are not genuinely independent
parameters in the sense that an under-estimated theta can be easily
compensated by a smaller alpha and vice versa. It is very difficult to
distinguish invariable sites, which are described by theta, from the
slow-evolving sites in the discrete gamma distribution, which are
described by alpha. The good news is that the "I+Gamma" gives higher
likelihoods compared with each of the two simpler models and is
therefore often preferred. The bad news is that parameter estimates do
not mean much as there may exist several combinations of theta and alpha
that appear almost equally probable.
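One way to see the compensation concretely (a toy Poisson-style calculation of my own, not anyone's actual likelihood code): the expected fraction of sites showing no change over a total amount of evolution T has the closed form theta + (1 - theta) * (1 + T/alpha)^(-alpha), since the expectation of exp(-r*T) over a mean-1 gamma rate r is the gamma moment generating function at -T. Rather different (theta, alpha) pairs then predict fairly similar fractions of constant sites.

```python
def p_constant(theta, alpha, T):
    """Expected fraction of constant sites in a toy Poisson model.

    An invariable site (probability theta) never changes; a variable
    site with Gamma(shape=alpha, mean=1) rate r shows no change with
    probability exp(-r*T), whose expectation is (1 + T/alpha)**(-alpha).
    """
    return theta + (1.0 - theta) * (1.0 + T / alpha) ** (-alpha)

# Two quite different parameter pairs give similar constant-site
# fractions across a range of tree lengths T:
for T in (0.5, 1.0, 2.0):
    print(T, round(p_constant(0.30, 2.0, T), 3), round(p_constant(0.15, 0.6, T), 3))
```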
What is the best way to deal with the "I+Gamma"? Should we
(1) not use it at all?
(2) use it, but ignore the parameter estimates?
(3) optimize only the alpha and determine theta empirically,
i. e. by counting the invariable sites?
(4) introduce some constraint between alpha and theta, which
turns the "I+Gamma" effectively into a one-parameter model?
Any suggestions?
Gangolf Jobb
----------------------------------------
Hi Gangolf,
The two parameters are clearly confounded to some extent. It seems that
there is little information in the data, without a large number of
sequences, for determining whether a constant site is "invariable" or
"slowly evolving". You might want to look at these two papers by Jack
Sullivan, in case you haven't already.
Regards,
Thomas
Sullivan, J. and D. L. Swofford. 2001. Should we use model-based methods
for phylogenetic inference when we know assumptions about among-site
rate variation and nucleotide substitution pattern are violated?
Systematic Biology 50:723-729.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of
taxon sampling on estimating rate-heterogeneity parameters of
maximum-likelihood models. Molecular Biology and Evolution. 16:1347-1356.
----------------------------------------
I think this is an interesting question. I too have been concerned
about the use of I+Gamma and the interpretation of estimated parameters.
I would be interested to hear what other responses you get, and I
encourage you to post summaries to the EVOLDIR list.
> When I simulate sequences under the "I+Gamma" and try to re-estimate the
> two heterogeneity parameters by maximum likelihood optimization
> afterwards, it turns out that the estimated parameters often differ
> severely from the original parameters that were previously used in the
> simulation.
I too have observed this.
I recall Ziheng Yang telling me something similar; he performed
experiments sub-sampling sequences from a multiple alignment, and
estimated the parameter you are calling theta (he was not using +Gamma
then, I think). He found that as you included more and more of the
sequences, the parameter you are calling theta got lower and lower.
Evidently the +I part of the model has some trouble distinguishing sites
that never vary, and sites which may vary but happen not to have in the
tree under consideration. But you should check with Ziheng if you are
interested; I am simply reporting what he once told me, and may have got
it wrong.
> This indicates to me that the "I+Gamma" is
> over-parameterized, that alpha and theta are not genuinely
> independent parameters in the sense that an under-estimated theta can be
> easily compensated by a smaller alpha and vice versa.
I would not say it was over-parameterised, when your simulations also
used +I+Gamma; in that case, clearly it is 'correctly parameterised'...
> It is very difficult to distinguish invariable sites, which are
> described by theta, from the slow-evolving sites in the discrete gamma
> distribution, which are described by alpha.
...but the interaction between +I and the slowly-evolving sites under
the Gamma distribution (particularly cases where alpha is small), which
you describe nicely here, means that it is difficult to distinguish the
parameter values accurately. I would imagine (but this is not tested)
that given enough data, it would be possible to infer all parameters
accurately. If you are interested in doing this test, feel free to
contact me, as I have a couple of ideas for interesting ways to go about it!
Personally, I think +I+Gamma is a bad model to use. There are very
few people who would really argue that any site of any sequence was
truly unable to vary over evolution. (Note that this is very different
from a site where we observe no changes in our data set.) Collect
enough data, and you will always find an organism that's managed to
accept a change at a site. What success (in goodness-of-fit statistics)
the +I model has is probably because it is giving a better fit to the
observations at slowly evolving sites than any Gamma distribution can. A
better method would be to allow something other than a Gamma
distribution to account for the range of evolutionary rates exhibited
across typical sequences.
> The good news is that the "I+Gamma" gives higher likelihoods compared
> with each of the two simpler models
This is necessarily true, as either simpler model is nested within the
more complex one (either setting theta=0, or alpha=infinity). Note that
the statistical testing of significance here is slightly
non-standard---see the attached paper.
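For reference, the testing is non-standard because the simpler models sit on a boundary of the parameter space (theta = 0, or alpha at infinity); a common treatment (as in the Ota et al. papers cited later in this thread) takes the null distribution of the LRT statistic to be a 50:50 mixture of a point mass at zero and a chi-square with 1 d.f. A minimal sketch, assuming that mixture:

```python
import math

def boundary_lrt_pvalue(lrt):
    """P-value for a likelihood-ratio statistic when the null value lies
    on a boundary of the parameter space: a 50:50 mixture of a point
    mass at 0 and chi-square with 1 d.f.

    The chi-square(1) survival function is erfc(sqrt(x/2)).
    """
    if lrt <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))

# Under this mixture the 5% critical value is about 2.71, rather than
# the usual chi-square(1) value of 3.84.
print(boundary_lrt_pvalue(2.706))
```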
> and is therefore often preferred. The bad news is that parameter
> estimates do not mean much as there may exist several combinations of
> theta and alpha that appear almost equally probable.
Yes. I suppose the appropriate interpretation is exactly that: the
more complex model is preferred, suggesting that there are both
invariant sites and slowly evolving ones as described by the Gamma
distribution, but that it is difficult to tell the proportions of each,
or which site is which. This should also be apparent by looking at the
estimated std. devs. of the inferred parameters.
I don't know if any existing software allows Empirical Bayes estimation
of posterior probabilities of being an invariant site or of membership
of each of the discrete Gamma classes when doing analysis under +I+Gamma
(analogous to Empirical Bayes estimation of omega in codon models used
to detect positive selection). If so, it would be interesting to check
whether the posterior mean rate at sites was very different under +I,
+Gamma and +I+Gamma, even in cases where the corresponding parameter
estimates under these models were very different.
> What is the best way to deal with the "I+Gamma"? Should we
I don't claim to know, but here are some comments on your suggestions!
> (1) not use it at all?
That's what I've always done!
> (2) use it, but ignore the parameter estimates?
I wouldn't ignore them, just report them and emphasise their (presumably
large) std. devs.
> (3) optimize only the alpha and determine theta empirically,
> i. e. by counting the invariable sites?
No, surely this would be very very bad. If you estimate alpha and it
turns out to be small, then a lot of the sites invariant *in your data
set* are being inferred to be 'variable, but just didn't happen to
change'. It would be a very poor estimate of invariability to count
these sites.
> (4) introduce some constraint between alpha and theta, which
> turns the "I+Gamma" effectively into a one-parameter model?
If you thought of a good way to do this, it would be a bit like my
suggestion above ("A better method would be to allow...").
Nick Goldman
----------------------------------------
On the invariant sites gamma model.
Dear Gangolf,
Much of what you say is correct, but not quite. The I+Gamma ML model was
first developed as part of my thesis well before Ziheng Yang's work. The
probabilistic calculations can use the algorithms in Steel et al.
(1993). It can also use a discretised method first developed
by myself, and described to Ziheng. Ziheng acknowledges this in letters
to me from that period, but he then published the method without
permission or acknowledgement. He repeated this "omission" in his
subsequent "review" articles. An early example of its implementation is
Waddell and Penny (accepted 1993, published 1996 after many
unprofessional errors by the publishers delayed the whole book that was
supposed to appear in 1993!). In this example, the I+G optimal solution
(with constraints) has invariant sites set to zero. If there is no
constraint, then the ML model has both invariant sites (as a negative
number) and a gamma distribution (see also Waddell, Penny and Arnold
1997 and Waddell 2005 for further discussion). The boundary issue
encountered here is explained and discussed in Ota et al. (1999, 2000).
There is no doubt that invariant sites and a gamma (or many other
continuous and discrete distributions) mimic each other to a first
approximation (see Waddell et al. 1997 for a graphical example).
Consequently they are correlated, but they are still distinct
parameters, and only in very special cases are they truly
non-independent (which occurs only when the model is not identifiable;
see Waddell 1995 for examples, while Waddell and Steel 1997 prove that
this model (at least any GTR variant) is identifiable if the parameters
of the site rate distribution are known).
Now to answer your questions, as discussed in nearly all our original
work (see Waddell et al. 1997 for example) on the topic, this model is
not biologically realistic and is an approximation to the effects of a
covarion (Fitch) or other model in which higher order structure and
selection play key roles. This is the key to interpreting the results in
most situations.
So to answer your questions:
(1) Use it like any other model, and if it does not fit the true model,
be suspicious of it as the original work suggests. This is not to say
that the trees it gives are not useful (in fact often quite the
opposite), but their validity is often best checked in practice using
congruence (e.g., Waddell and Shelly 2003).
(2) Reporting the parameter estimates is often justified, if only to
give others a ballpark feeling for the deviation from an identical site
rates "neutral" model.
(3) There are other ML estimators of invariant sites (see Waddell, Cao,
Hauf and Hasegawa 1999, for example). If these differ markedly from the
tree based ML estimates, all the more evidence that there are serious
deficiencies in both, and the very concept of "invariant" sites may
ultimately be misleading.
(4) This is not feasible. These are distinct parameters. Rather, use
other types of distribution to characterise the "site rate variability".
If all infer much the same thing, then that is about as far as you can
go until structural models such as those of Bastolla et al. or Robinson
et al. (which may lead the way to better models) are amenable to ML
analysis with many species.
I hope this helps.
Sincerely,
Peter Waddell.
WADDELL, P.J. (2005). Measuring the fit of sequence data to phylogenetic
model: Allowing for missing data. Molecular Biology and Evolution 22:
295-301 (epub 2004).
WADDELL, P.J., AND SHELLEY, S. (2003). Evaluating placental
inter-ordinal phylogenies with novel sequences including RAG1,
gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino
acid, and codon models. Molecular Phylogenetics and Evolution 28: 197-224.
OTA, R., P.J. WADDELL, M. HASEGAWA, H. SHIMODAIRA, AND H. KISHINO.
(2000). Appropriate likelihood ratio tests and marginal distributions
for evolutionary tree models with constraints on parameters. Molecular
Biology and Evolution 17: 798-803.
WADDELL, P. J., Y. CAO, J. HAUF, AND M. HASEGAWA. (1999). Using novel
phylogenetic methods to evaluate mammalian mtDNA, including AA invariant
sites-LogDet plus site stripping, to detect internal conflicts in the
data, with special reference to the position of hedgehog, armadillo, and
elephant. Systematic Biology 48: 31-53.
OTA, R., P. J. WADDELL, AND H. KISHINO (1999). Statistical distribution
for testing the resolved tree against the star tree. Pp. 15-20 in
"Proceedings of the annual joint conference of the Japanese biometrics
and applied statistics societies," Sin-fonica, Minato-ku, Tokyo.
STEEL, M.A., AND WADDELL, P.J. (1999). Approximating Likelihoods Under
Low but Variable Rates Across Sites. Applied Mathematics Letters 12: 13-19.
WADDELL, P.J., AND M.A. STEEL. (1997). General time reversible distances
with unequal rates across sites: Mixing Γ and inverse Gaussian
distributions with invariant sites. Mol. Phyl. Evol. 8: 398-414.
WADDELL, P.J., D. PENNY, AND T. MOORE. (1997). Extending Hadamard
conjugations to model sequence evolution with variable rates across
sites. Molecular Phylogenetics and Evolution 8: 33-50.
WADDELL, P.J., AND D. PENNY. (1996). Evolutionary trees of apes and
humans from DNA sequences. In: "Handbook of Symbolic Evolution", pp
53-73. Ed. A.J. Lock and C.R. Peters. Clarendon Press, Oxford.
WADDELL, P.J. (1995). Statistical methods of phylogenetic analysis,
including Hadamard conjugations, LogDet transforms, and maximum
likelihood. PhD Thesis. Massey University, New Zealand.
STEEL, M.A., L. SZÉKELY, P.L. ERDÖS, AND P.J. WADDELL. (1993). A
complete family of phylogenetic invariants for any number of taxa under
Kimura's 3ST model. New Zealand Journal of Botany (Conference Issue) 31:
289-296.
----------------------------------------
I am sure someone will trash what I say, but here are some comments. I
will restrict my comments to use of models in likelihood analysis. My
position has been that the I+G model is pathological because of the
strong correlation between parameters. It is bigger than an elephant
and should not be used. The only exception I can foresee is where one
has a lot of sequences and is interested in comparing different rate
distributions, in which case I don't have an objection of using it for
comparison with other models, such as the gamma, the invariable sites
model, the lognormal etc. For most analyses such as tree reconstruction
or branch-length estimation, where rate variation among sites is a
nuisance aspect of the model assumption, which we can't ignore but are
nevertheless not terribly interested in, I don't see the point of using
an overly complex model such as I+G. Most likely these different models
of rate variation will produce similar results, but if not, I suspect we
are in deeper trouble if our conclusion depends on such intricate
differences among such models and the I+G model might not do a better
job than the simpler I or G models. I am happy with either I or G, but
not the mixture. A criticism of the I model is that the MLE of its
parameter (the proportion of invariable sites) is sensitive to
sequence sampling, whereas the parameter is supposed to reflect features
of the molecule rather than the sampling of sequences. When one adds more
sequences to the data, the estimated proportion of invariable sites
drops. The gamma parameter is affected by sequence sampling as well,
but to a lesser extent, which might be considered an advantage over the
I model.
There seems to be a trend in molecular phylogenetics towards using
overly-complex models, which I think is deplorable. There are some
papers in the statistics literature which suggest that in large data
sets the LRT and even the AIC tend to prefer complex parameter-rich
models. People implementing new models in molecular evolution have all
had the pleasant experience that one can easily get a significant
improvement in the fit to data by adding whatever frivolous parameters
one fancies, so I don't think a significant improvement in fit to data
justifies the addition of unnecessary parameters. We should also
consider whether the parameters are biologically meaningful, whether the
analysis is robust to the model assumptions etc.
Best,
Ziheng Yang
----------------------------------------
Dear Gangolf,
I haven't looked particularly hard at the alpha + I question, but some
of what you asked can be answered quickly in a general way.
(1) The fact that parameters are poorly estimated (have a high variance)
does not necessarily mean that they should not be included. Presumably
the reason that they are supported by your likelihood scores is that the
parameter space is justifiably more constrained than it was before you
started.
(2) In your simulations, was the variance in the results of greater
magnitude than the range of confidence you had in the parameters in the
first place (determined by looking at the posterior probability space,
or by bootstrapping)?
(3) It is perfectly acceptable to ignore the parameter estimates if you
have no interest in them. The question of whether the model is useful to
you may depend on your goals. For example, if you are interested in
phylogenetic reconstruction, the question is whether your phylogenetic
results are more reliable with or without the I in I + G, not whether
the proportion I or alpha parameter estimated is unacceptably highly
variable.
(4) Your option (3) is not clearly justifiable. Just because a site
didn't change doesn't mean it might not have changed. The I+G model
takes that into account (that is, such sites might have some probability
of being invariant, and some posterior probability of being in a very
slowly evolving gamma class).
(5) Before doing (4), I would check further whether there is a single
constraint that is consistent throughout the reasonable parameter space.
If a parameter truly cannot be identified or separated from another
parameter, that is the same as saying that the same maximum likelihood
value can be found for any particular parameter setting by adjusting the
other parameter. Again, looking at the posterior space would tell you if
this is precisely true, or just close to true in a qualitative way. The
likelihood ratio increase suggests that it is not precisely true.
At any rate, these are just some quick thoughts; I hope they help.
Cheers,
David Pollock
----------------------------------------
Hello,
The problem you report is genuine and the explanation you gave for the
phenomenon is certainly the right one:
"an under-estimated theta can be easily compensated by a smaller alpha
and vice versa".
I would not say that +I +gamma is "overparametrized" however. Since ML
is a consistent method, your ML estimates will always converge to the
right values as long as:
1) the generating substitution model and the model used for the inference
are the same
2) the two parameters are identifiable
(Note that this is not exactly the definition of consistency you will
see in most phylogenetic papers and my definition might be wrong/incomplete)
As you said, the problem with alpha and theta is that they are barely
identifiable separately (they probably are, but it would require a huge
number of sites) and that the likelihood is almost flat in the region
where theta and alpha can change to compensate each other.
I think the best way to deal with the +I+gamma problem is to provide a
measure of uncertainty with your estimates. I think there are some
methods in the ML framework to do that. I do not know a lot about it but
I believe something standard is to assume that the likelihood is
distributed like a multi-dimensional Gaussian around the ML point. Such
techniques would certainly detect the strong correlation between theta
and alpha (that would appear in the covariance matrix of your Gaussian).
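A minimal sketch of this idea, with made-up curvature numbers (purely illustrative, not from any real analysis): invert the 2x2 observed-information matrix (second derivatives of the negative log-likelihood at the ML point) to get an asymptotic covariance matrix, whose off-diagonal term exposes the theta/alpha correlation.

```python
import math

def corr_from_information(info):
    """Invert a 2x2 observed-information matrix [[a, b], [b, d]] to get
    the asymptotic covariance of two ML estimates, and return
    (std. dev. of theta, std. dev. of alpha, their correlation)."""
    a, b = info[0]
    _, d = info[1]
    det = a * d - b * b
    var_t, var_a, cov = d / det, a / det, -b / det
    return math.sqrt(var_t), math.sqrt(var_a), cov / math.sqrt(var_t * var_a)

# Hypothetical curvature values; a flat ridge in the (theta, alpha)
# likelihood surface shows up as a correlation near +/- 1.
sd_t, sd_a, corr = corr_from_information([[40.0, -30.0], [-30.0, 25.0]])
print(sd_t, sd_a, corr)
```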
I hope someone will provide you with better pointers but see the
literature about the likelihood ratio test because this test relies
on such an assumption. See also p311 (Interval estimates) of
Felsenstein's book.
Regards,
Vivek
----------------------------------------
Gangolf,
I'm curious: do you see this behavior with a wide range of theta/alpha
combinations? If your true alpha is small, then there would be substantial
overlap between the low-rate Gamma categories and the I category, and I
could see why this issue would arise. In this case I don't see it as a big
problem because 'alpha' is not analogous to any real biological parameter;
it's just meant to provide a convenient distribution of rates. The real
measure for me would be to map the posterior probability that each site is
in each rate category, be it the I category or one of the several categories
in the Gamma. If each site ends up having most of its posterior density in a
rate category that includes its original rate, then I think you're still OK.
The low-rate sites should have similar posterior probabilities in the
invariant category and the lowest Gamma category, which is consistent with
what you'd expect.
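This posterior mapping is easy to sketch (the numbers below are purely illustrative): the posterior that a site belongs to a given rate class is the class's prior weight times the site's likelihood under that class, normalized over all classes, with prior theta on the I class and (1 - theta)/K on each of the K gamma classes.

```python
def site_category_posteriors(theta, class_liks, invar_lik):
    """Empirical Bayes posteriors over rate classes at one site.

    class_liks: site likelihoods under each of the K discrete-gamma
    classes; invar_lik: likelihood if the site is invariable (zero
    unless the site is constant in the alignment).
    Returns (P(invariable), [P(gamma class k)]).
    """
    k = len(class_liks)
    w_inv = theta * invar_lik
    w_cls = [(1.0 - theta) / k * lik for lik in class_liks]
    z = w_inv + sum(w_cls)
    return w_inv / z, [w / z for w in w_cls]

# A constant site: decent likelihood under I and the slowest classes.
p_inv, p_cls = site_category_posteriors(0.2, [0.09, 0.05, 0.02, 0.005], 0.1)
print(p_inv, p_cls)
```

As Matt suggests, a low-rate site typically ends up with its posterior mass split between the I class and the slowest gamma class, which is exactly the degeneracy under discussion.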
Some questions:
1. Can you provide details of the simulation conditions and the range of
simulation values where the problem is seen?
2. Are you using Bayesian estimates or ML estimates? If you did this with a
program like MrBayes, I would hope that you would see a large variance on
the estimates of theta and alpha to reflect this degeneracy.
Best,
Matt Dimmic
----------------------------------------
Dear Gangolf,
You will not be surprised to know that you are not the first person to
spot this problem with the +I+G models. I believe that David Swofford and
collaborators spotted similar behaviour in a systematic biology paper a
few years back. I'm afraid I can't supply a reference at the moment
because I'm in the US and don't have access to journals; let
me know if you have trouble finding it and I'll mail it to you when I
get back.
The formal description of this kind of the problem is unidentifiability
(which you probably already know!), caused by trying to estimate too many
parameters from too few degrees of freedom in the data. This is known to
occur with certain models when examining pairs of sequences; for example
JC+G/I, K2P+G/I and REV+G/I. The models with rate variation between these
(e.g. FEL+G/I; HKY+G/I) also show some interesting properties, although
not unidentifiability.
The general case of +I+G models with multiple sequences, however, is a
little more complex, and the models are probably not unidentifiable:
consider the case when the number of sequences becomes very large and the
sequences very divergent. With respect to your questions, I would
personally favour option (1). My justification for this is related to how
DNA is thought to evolve, specifically which bases are thought never to
undergo change in evolution, and why. I don't know if I believe that there
is such a thing as an invariable site, only sites where change is highly
improbable due to selective constraints.
When the +I+G parameters are used, the models they are added to are
relatively naïve, and I believe this is why it appears useful. Perhaps,
rather than trying to include parameters that are difficult to interpret in a
biological manner, we should attempt to build more realistic models that
better describe selection and then adding an I parameter would be less
useful.
Regards,
Simon
----------------------------------------
>>
>> Hi all,
>>
>> just some questions.
>>
>> The "I+Gamma" is very popular model of sequence evolution with rate
>> heterogeneity among sites. It is a superposition of two simpler models,
>> which are both able to account for invariable sites in a sequence
>> alignment.
I suspect that Invariant plus gamma is very good for protein-coding regions.
It may also be good for structural RNA genes. I suspect the gamma alone is
better for non-coding DNA and pseudogenes.
>> The first distinguishes two classes of sites, of which one
>> may evolve while the other is invariable. The other part is Yang's
>> discrete gamma model, which is used for the variable sites. The
>> "I+Gamma" has two parameters, namely the fraction of invariable sites
>> (theta) and the shape parameter of the discrete gamma distribution
>> (alpha).
The problem with gamma alone is that it takes an alpha value of less than
one to get a significant number of invariant sites, and the shape of the
gamma distribution then has few moderately variable sites. See figure 2 in:
Leitner T, Kumar S, Albert J.
Tempo and mode of nucleotide substitutions in gag and env gene fragments in
human immunodeficiency virus type 1 populations with a known transmission
history.
J Virol. 1997 Jun;71(6):4761-70.
PMID: 9151870
When we observe coding regions of HIV-1 and other lentiviruses evolving over
time, by using iterations of tree building, calculating rates with Gary
Olsen's DNArates program (
http://geta.life.uiuc.edu/~gary/programs/DNArates.html ) then rebuilding the
tree using those site-specific rates, then recalculating the rates... We
find that the distribution of numbers of sites and their rates looks like a
gamma distribution with an alpha of about 5 to 7 plus a gamma distribution
with an alpha of 0.1 or 0.2 (i.e., a large peak of invariant sites then a
valley, then another peak (not quite a nicely shaped one like the gamma
distribution, but close) and a long tail (a few sites are EXTREMELY variable
or hypervariable in HIV-1, most likely driven by positive selection of the
host immune system, or in the case of the pol gene protease and RT genes it
can be driven by antiretroviral drug selection pressure in treated
patients).
I have not yet done a comparison with coding regions from other organisms,
but I suspect the results would be qualitatively similar.
The HIV-1 M group rates and more formal discussion of calculating
site-specific rates can be found here:
http://www.santafe.edu/~btk/science-paper/bette.html
>>
>> When I simulate sequences under the "I+Gamma" and try to re-estimate the
>> two heterogeneity parameters by maximum likelihood optimization
>> afterwards, it turns out that the estimated parameters often differ
>> severely from the original parameters that were previously used in the
>> simulation. This indicates to me that the "I+Gamma" is
>> over-parameterized, that alpha and theta are not genuinely independent
>> parameters in the sense that an under-estimated theta can be easily
>> compensated by a smaller alpha and vice versa. It is very difficult to
>> distinguish invariable sites, which are described by theta, from the
>> slow-evolving sites in the discrete gamma distribution, which are
>> described by alpha.
That depends very much on the available data sets. For example, within the
HIV-1 M group, which has apparently been evolving in humans for roughly 70
years following a single chimpanzee to human transmission event, the silent
sites are not yet fully saturated with mutations, so truly "invariable" vs
just "slow" is not proven. However, if we include HIV-1 N and O group
sequences and chimpanzee SIV sequences, then we jump out to a data set that
has apparently shared a common ancestor at least tens of thousands of years
ago and silent sites are saturated. Including all primate lentiviruses
gives us a data set which has apparently been evolving from a single common
ancestor for tens of millions of years, at a rate that is estimated to be
close to 10 million times faster than the rate of eukaryotic DNA evolution,
so this is potentially equivalent to 10 to the 14th power years or more of
eukaryotic evolution. The gag and pol gene and protein sequences are still
very alignable, because there are many absolutely invariable sites. All
RNA-dependent DNA polymerases have the same catalytic core, and despite it
being impossible to say which family of retroviruses the lentiviruses came
from, it is very evident that the primate lentiviruses shared a common
ancestor with non-primate lentiviruses (EIAVs, FIVs, Visna/CAEV, BIV/JDV,
etc) and that there was a single common ancestor for the primates (as
opposed to multiple transfers from non-primates into primates).
So anyway, within a single group of sequences, for example the complete
mitochondrial genome of fruit flies, there may not be enough evolutionary
distance to prove which sites are invariant. But including outgroups, such
as the complete mitochondrial genomes of other insects, spiders, springtails
etc, it is possible to get a data set that includes sequences of the
appropriate distances.
>> The good news is that the "I+Gamma" gives higher
>> likelihoods compared with each of the two simpler models and is
>> therefore often preferred. The bad news is that parameter estimates do
>> not mean much as there may exist several combinations of theta and alpha
>> that appear almost equally probable.
>>
>> What is the best way to deal with the "I+Gamma"? Should we
>>
>> (1) not use it at all?
Use it when appropriate and/or needed. If I just want to know if a virus
sequence is HIV-1 M group subtype B or HIV-1 M group subtype C, any method,
even a simple Hamming distance (% identity) calculation is good enough, if
the sequence is long enough (and if it is too short, I suspect no
mathematical correction can save me, it can only give me a better estimate
of the probability of the subtype guestimate). Only when I want to attempt
to say more about the dates of more recent common ancestors (calibrating a
molecular clock) or other more difficult problems, does it become critical
to use more accurate methods.
>>
>> (2) use it, but ignore the parameter estimates?
Garbage in: garbage out. We would only know beforehand what parameter
values to use once they have been established for a given class of
genes (enzymes like reverse transcriptase have a different percentage
of invariant sites than structural proteins like Gag p24 or
immune-evading proteins like Envelope gp120), or for a given class of
genes within a given class of organisms (the Env protein of HIV-1
evolves quite differently than the Env gene of human influenza virus,
because flu hops from one host to the next before a significant
antibody response is made, whereas HIV-1 is often transmitted long
after seroconversion).
>>
>> (3) optimize only the alpha and determine theta empirically,
>> i. e. by counting the invariable sites?
This will work well if we choose the right data set. Include "outgroups"
that are distant enough from our group(s) of interest.
>>
>> (4) introduce some constraint between alpha and theta, which
>> turns the "I+Gamma" effectively into a one-parameter model?
I think we should expect each protein-coding gene to have a different
pattern of evolution. Perhaps there is similarity among a given class of
genes, such as those that encode mammalian tyrosine kinases. But I doubt
that we will ever accurately model the evolution of all protein-coding genes
with a single parameter.
I have some MASSIVE data sets from lentiviruses, and I am also gathering
mitochondrial complete genomes and other data sets. I would love to
collaborate with someone who has other data sets and/or methods that we
could compare. My funding is strictly for primate lentiviruses, so I've
been working up the other data in my off hours.
Brian Foley, PhD
----------------------------------------
Gangolf Jobb