Hi all,
with your help, I was able to solve my little optimization problem and,
thereby, to save the I+Gamma (at least for me): The trick is to bound
alpha > 1. Smaller alphas should not be used. This avoids a situation
where both parameters have nearly the same effect on the distribution
shape and, as a consequence, 'fight' to explain the data. On the other
hand, this constrained I+Gamma has still the advantages it was made for:
It can produce two-peaked rate distributions as well as one-peaked ones,
ranging from homogeneity to extremely L-shaped and anything between.
However, it is better not to interpret the heterogeneity parameters.
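For anyone who wants to reproduce this constraint in their own optimizer (a sketch of the idea only, not code from any program in this thread): one convenient way to bound alpha > 1 is to optimize an unconstrained variable x and map it to alpha through a softplus transform.

```python
import math

def alpha_from_x(x):
    """Map an unconstrained optimizer variable x to alpha > 1.

    softplus(x) = log(1 + exp(x)) is smooth and strictly positive,
    so 1 + softplus(x) is always strictly greater than 1.
    """
    # softplus, written stably for large |x|
    sp = x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))
    return 1.0 + sp

def x_from_alpha(alpha):
    """Inverse map, for initializing the optimizer from an alpha > 1."""
    return math.log(math.expm1(alpha - 1.0))

a = alpha_from_x(x_from_alpha(2.5))
print(a)  # recovers 2.5 up to floating-point error
```

The optimizer then works on x with no bounds at all, and every candidate it proposes corresponds to a legal alpha.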
Thanks to all who contributed valuable information. As this might be
useful to others, I have included the most informative answers below.
Best,
Gangolf Jobb
----------------------------------------
Hi all,
just some questions.
The "I+Gamma" is a very popular model of sequence evolution with rate
heterogeneity among sites. It is a superposition of two simpler models,
which are both able to account for invariable sites in a sequence
alignment. The first distinguishes two classes of sites, of which one
may evolve while the other is invariable. The other part is Yang's
discrete gamma model, which is used for the variable sites. The
"I+Gamma" has two parameters, namely the fraction of invariable sites
(theta) and the shape parameter of the discrete gamma distribution (alpha).
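For concreteness, here is a small sketch (my own illustration, not taken from any program in this thread) of what sampling site rates under such a model looks like. The normalization used below, with variable-site rates scaled by 1/(1-theta) so the overall mean rate is 1, is one common convention among several.

```python
import random

def sample_ig_rates(theta, alpha, n, seed=42):
    """Sample n site rates under a toy I+Gamma model.

    With probability theta a site is invariable (rate 0); otherwise its
    rate is drawn from a Gamma(shape=alpha, mean=1) distribution and
    rescaled by 1/(1-theta) so the mean rate over all sites is 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n):
        if rng.random() < theta:
            rates.append(0.0)
        else:
            # gammavariate(shape, scale); scale = 1/alpha gives mean 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha) / (1.0 - theta))
    return rates

rates = sample_ig_rates(theta=0.25, alpha=0.5, n=100000)
print(sum(rates) / len(rates))  # close to 1 by construction
```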
When I simulate sequences under the "I+Gamma" and try to re-estimate the
two heterogeneity parameters by maximum likelihood optimization
afterwards, it turns out that the estimated parameters often differ
severely from the original parameters that were previously used in the
simulation. This indicates to me that the "I+Gamma" is
over-parameterized, that alpha and theta are not genuinely independent
parameters in the sense that an under-estimated theta can be easily
compensated by a smaller alpha and vice versa. It is very difficult to
distinguish invariable sites, which are described by theta, from the
slow-evolving sites in the discrete gamma distribution, which are
described by alpha. The good news is that the "I+Gamma" gives higher
likelihoods compared with each of the two simpler models and is
therefore often preferred. The bad news is that parameter estimates do
not mean much as there may exist several combinations of theta and alpha
that appear almost equally probable.
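One way to see the compensation concretely (a toy Poisson-style calculation of my own, not anyone's actual likelihood code): the expected fraction of sites showing no change over a total amount of evolution T has the closed form theta + (1 - theta) * (1 + T/alpha)^(-alpha), since the expectation of exp(-r*T) over a mean-1 gamma rate r is the gamma moment generating function at -T. Rather different (theta, alpha) pairs then predict fairly similar fractions of constant sites.

```python
def p_constant(theta, alpha, T):
    """Expected fraction of constant sites in a toy Poisson model.

    An invariable site (probability theta) never changes; a variable
    site with Gamma(shape=alpha, mean=1) rate r shows no change with
    probability exp(-r*T), whose expectation is (1 + T/alpha)**(-alpha).
    """
    return theta + (1.0 - theta) * (1.0 + T / alpha) ** (-alpha)

# Two quite different parameter pairs give similar constant-site
# fractions across a range of tree lengths T:
for T in (0.5, 1.0, 2.0):
    print(T, round(p_constant(0.30, 2.0, T), 3), round(p_constant(0.15, 0.6, T), 3))
```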
What is the best way to deal with the "I+Gamma"? Should we
(1) not use it at all?
(2) use it, but ignore the parameter estimates?
(3) optimize only the alpha and determine theta empirically,
i. e. by counting the invariable sites?
(4) introduce some constraint between alpha and theta, which
turns the "I+Gamma" effectively into a one-parameter model?
Any suggestions?
Gangolf Jobb
----------------------------------------
Hi Gangolf,
The two parameters are clearly confounded to some extent. It seems that
there is little information in the data, without a large number of
sequences, for determining whether a constant site is "invariable" or
"slowly evolving". You might want to look at these two papers by Jack
Sullivan, in case you haven't already.
Regards,
Thomas
Sullivan, J. and D. L. Swofford. 2001. Should we use model-based methods
for phylogenetic inference when we know assumptions about among-site
rate variation and nucleotide substitution pattern are violated?
Systematic Biology 50:723-729.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of
taxon sampling on estimating rate-heterogeneity parameters of
maximum-likelihood models. Molecular Biology and Evolution. 16:1347-1356.
----------------------------------------
I think this is an interesting question. I too have been concerned
about the use of I+Gamma and the interpretation of estimated parameters.
I would be interested to hear what other responses you get, and I
encourage you to post summaries to the EVOLDIR list.
> When I simulate sequences under the "I+Gamma" and try to re-estimate the
> two heterogeneity parameters by maximum likelihood optimization
> afterwards, it turns out that the estimated parameters often differ
> severely from the original parameters that were previously used in the
> simulation.
I too have observed this.
I recall Ziheng Yang telling me something similar; he performed
experiments sub-sampling sequences from a multiple alignment, and
estimated the parameter you are calling theta (he was not using +Gamma
then, I think). He found that as you included more and more of the
sequences, the parameter you are calling theta got lower and lower.
Evidently the +I part of the model has some trouble distinguishing sites
that never vary, and sites which may vary but happen not to have in the
tree under consideration. But you should check with Ziheng if you are
interested; I am simply reporting what he once told me, and may have got
it wrong.
> This indicates to me that the "I+Gamma" is
> over-parameterized, that alpha and theta are not genuinely
> independent parameters in the sense that an under-estimated theta can be
> easily compensated by a smaller alpha and vice versa.
I would not say it was over-parameterised, when your simulations also
used +I+Gamma; in that case, clearly it is 'correctly parameterised'...
> It is very difficult to distinguish invariable sites, which are
> described by theta, from the slow-evolving sites in the discrete gamma
> distribution, which are described by alpha.
...but the interaction between +I and the slowly-evolving sites under
the Gamma distribution (particularly cases where alpha is small), which
you describe nicely here, means that it is difficult to distinguish the
parameter values accurately. I would imagine (but this is not tested)
that given enough data, it would be possible to infer all parameters
accurately. If you are interested in doing this test, feel free to
contact me, as I have a couple of ideas for interesting ways to go about it!
Personally, I think +I+Gamma is a bad model to use. There are very
few people who would really argue that any site of any sequence was
truly unable to vary over evolution. (Note that this is very different
from a site where we observe no changes in our data set.) Collect
enough data, and you will always find an organism that's managed to
accept a change at a site. What success (in goodness-of-fit statistics)
the +I model has is probably because it is giving a better fit to the
observations at slowly evolving sites than any Gamma distribution can. A
better method would be to allow something other than a Gamma
distribution to account for the range of evolutionary rates exhibited
across typical sequences.
> The good news is that the "I+Gamma" gives higher likelihoods compared
> with each of the two simpler models
This is necessarily true, as either simpler model is nested within the
more complex one (either setting theta=0, or alpha=infinity). Note that
the statistical testing of significance here is slightly
non-standard---see the attached paper.
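For reference, the testing is non-standard because the simpler models sit on a boundary of the parameter space (theta = 0, or alpha at infinity); a common treatment (as in the Ota et al. papers cited later in this thread) takes the null distribution of the LRT statistic to be a 50:50 mixture of a point mass at zero and a chi-square with 1 d.f. A minimal sketch, assuming that mixture:

```python
import math

def boundary_lrt_pvalue(lrt):
    """P-value for a likelihood-ratio statistic when the null value lies
    on a boundary of the parameter space: a 50:50 mixture of a point
    mass at 0 and chi-square with 1 d.f.

    The chi-square(1) survival function is erfc(sqrt(x/2)).
    """
    if lrt <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))

# Under this mixture the 5% critical value is about 2.71, rather than
# the usual chi-square(1) value of 3.84.
print(boundary_lrt_pvalue(2.706))
```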
> and is therefore often preferred. The bad news is that parameter
> estimates do not mean much as there may exist several combinations of
> theta and alpha that appear almost equally probable.
Yes. I suppose the appropriate interpretation is exactly that: the
more complex model is preferred, suggesting that there are both
invariant sites and slowly evolving ones as described by the Gamma
distribution, but that it is difficult to tell the proportions of each,
or which site is which. This should also be apparent by looking at the
estimated std. devs. of the inferred parameters.
I don't know if any existing software allows Empirical Bayes estimation
of posterior probabilities of being an invariant site or of membership
of each of the discrete Gamma classes when doing analysis under +I+Gamma
(analogous to Empirical Bayes estimation of omega in codon models used
to detect positive selection). If so, it would be interesting to check
whether the posterior mean rate at sites was very different under +I,
+Gamma and +I+Gamma, even in cases where the corresponding parameter
estimates under these models were very different.
> What is the best way to deal with the "I+Gamma"? Should we
I don't claim to know, but here are some comments on your suggestions!
> (1) not use it at all?
That's what I've always done!
> (2) use it, but ignore the parameter estimates?
I wouldn't ignore them, just report them and emphasise their (presumably
large) std. devs.
> (3) optimize only the alpha and determine theta empirically,
> i. e. by counting the invariable sites?
No, surely this would be very very bad. If you estimate alpha and it
turns out to be small, then a lot of the sites invariant *in your data
set* are being inferred to be 'variable, but just didn't happen to
change'. It would be a very poor estimate of invariability to count
these sites.
> (4) introduce some constraint between alpha and theta, which
> turns the "I+Gamma" effectively into a one-parameter model?
If you thought of a good way to do this, it would be a bit like my
suggestion above ("A better method would be to allow...").
Nick Goldman
----------------------------------------
On the invariant sites gamma model.
Dear Gangolf,
Much of what you say is correct, but not quite. The I+Gamma ML model was
first developed as part of my thesis well before Ziheng Yang's work. The
probabilistic calculations can use the algorithms in Steel et al.
(1993). It can also use a discretised method first developed
by myself, and described to Ziheng. Ziheng acknowledges this in letters
to me from that period, but he then published the method without
permission or acknowledgement. He repeated this "omission" in his
subsequent "review" articles. An early example of its implementation is
Waddell and Penny (accepted 1993, published 1996 after many
unprofessional errors by the publishers delayed the whole book that was
supposed to appear in 1993!). In this example, the I+G optimal solution
(with constraints) has invariant sites set to zero. If there is no
constraint, then the ML model has both invariant sites (as a negative
number) and a gamma distribution (see also Waddell, Penny and Arnold
1997 and Waddell 2005 for further discussion). The boundary issue
encountered here is explained and discussed in Ota et al. (1999, 2000).
There is no doubt that invariant sites and a gamma (or many other
continuous and discrete distributions) mimic each other to a first
approximation (see Waddell et al. 1997 for a graphical example).
Consequently they are correlated, but they are still distinct
parameters, and only in very special cases are they truly
non-independent (which occurs only when the model is not identifiable;
see Waddell 1995 for examples, while Waddell and Steel 1997 prove that
this model (at least any GTR variant) is identifiable if the parameters
of the site rate distribution are known).
Now to answer your questions, as discussed in nearly all our original
work (see Waddell et al. 1997 for example) on the topic, this model is
not biologically realistic and is an approximation to the effects of a
covarion (Fitch) or other model in which higher order structure and
selection play key roles. This is the key to interpreting the results in
most situations.
So to answer your questions:
(1) Use it like any other model, and if it does not fit the true model,
be suspicious of it as the original work suggests. This is not to say
that the trees it gives are not useful (in fact often quite the
opposite), but their validity is often best checked in practice using
congruence (e.g., Waddell and Shelly 2003).
(2) Reporting the parameter estimates is often justified, if only to
give others a ballpark feeling for the deviation from an identical site
rates "neutral" model.
(3) There are other ML estimators of invariant sites (see Waddell, Cao,
Hauf and Hasegawa 1999, for example). If these differ markedly from the
tree based ML estimates, all the more evidence that there are serious
deficiencies in both, and the very concept of "invariant" sites may
ultimately be misleading.
(4) This is not feasible. These are distinct parameters. Rather, use
other types of distribution to characterise the "site rate variability".
If all infer much the same thing, then that is about as far as you can
go until structural models such as those of Bastolla et al. or Robinson
et al. (which may lead the way to better models) are amenable to ML
analysis with many species.
I hope this helps.
Sincerely,
Peter Waddell.
WADDELL, P.J. (2005). Measuring the fit of sequence data to phylogenetic
model: Allowing for missing data. Molecular Biology and Evolution 22:
295-301 (epub 2004).
WADDELL, P.J., AND SHELLEY, S. (2003). Evaluating placental
inter-ordinal phylogenies with novel sequences including RAG1,
gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino
acid, and codon models. Molecular Phylogenetics and Evolution 28: 197-224.
OTA, R., P.J. WADDELL, M. HASEGAWA, H. SHIMODAIRA, AND H. KISHINO.
(2000). Appropriate likelihood ratio tests and marginal distributions
for evolutionary tree models with constraints on parameters. Molecular
Biology and Evolution 17: 798-803.
WADDELL, P. J., Y. CAO, J. HAUF, AND M. HASEGAWA. (1999). Using novel
phylogenetic methods to evaluate mammalian mtDNA, including AA invariant
sites-LogDet plus site stripping, to detect internal conflicts in the
data, with special reference to the position of hedgehog, armadillo, and
elephant. Systematic Biology 48: 31-53.
OTA, R., P. J. WADDELL, AND H. KISHINO (1999). Statistical distribution
for testing the resolved tree against the star tree. Pp. 15-20 in
"Proceedings of the annual joint conference of the Japanese biometrics
and applied statistics societies," Sin-fonica, Minato-ku, Tokyo.
STEEL, M.A., AND WADDELL, P.J. (1999). Approximating Likelihoods Under
Low but Variable Rates Across Sites. Applied Mathematics Letters 12: 13-19.
WADDELL, P.J., AND M.A. STEEL. (1997). General time reversible distances
with unequal rates across sites: Mixing Γ and inverse Gaussian
distributions with invariant sites. Mol. Phyl. Evol. 8: 398-414.
WADDELL, P.J., D. PENNY, AND T. MOORE. (1997). Extending Hadamard
conjugations to model sequence evolution with variable rates across
sites. Molecular Phylogenetics and Evolution 8: 33-50.
WADDELL, P.J., AND D. PENNY. (1996). Evolutionary trees of apes and
humans from DNA sequences. In: "Handbook of Symbolic Evolution", pp
53-73. Ed. A.J. Lock and C.R. Peters. Clarendon Press, Oxford.
WADDELL, P.J. (1995). Statistical methods of phylogenetic analysis,
including Hadamard conjugations, LogDet transforms, and maximum
likelihood. PhD Thesis. Massey University, New Zealand.
STEEL, M.A., L. SZÉKELY, P.L. ERDÖS, AND P.J. WADDELL. (1993). A
complete family of phylogenetic invariants for any number of taxa under
Kimura's 3ST model. New Zealand Journal of Botany (Conference Issue) 31:
289-296.
----------------------------------------
I am sure someone will trash what I say, but here are some comments. I
will restrict my comments to use of models in likelihood analysis. My
position has been that the I+G model is pathological because of the
strong correlation between parameters. It is bigger than an elephant
and should not be used. The only exception I can foresee is where one
has a lot of sequences and is interested in comparing different rate
distributions, in which case I don't have an objection of using it for
comparison with other models, such as the gamma, the invariable sites
model, the lognormal etc. For most analyses such as tree reconstruction
or branch-length estimation, where rate variation among sites is a
nuisance aspect of the model assumption, which we can't ignore but are
nevertheless not terribly interested in, I don't see the point of using
an overly complex model such as I+G. Most likely these different models
of rate variation will produce similar results, but if not, I suspect we
are in deeper trouble if our conclusion depends on such intricate
differences among such models and the I+G model might not do a better
job than the simpler I or G models. I am happy with either I or G, but
not the mixture. A criticism of the I model is that the MLE of its
parameter (the proportion of invariable sites) is sensitive to
sequence sampling, whereas the parameter is supposed to reflect features
of the molecule rather than the sampling of sequences. When one adds more
sequences to the data, the estimated proportion of invariable sites
drops. The gamma parameter is affected by sequence sampling as well,
but to a lesser extent, which might be considered an advantage over the
I model.
There seems to be a trend in molecular phylogenetics towards using
overly-complex models, which I think is deplorable. There are some
papers in the statistics literature which suggest that in large data
sets the LRT and even the AIC tend to prefer complex parameter-rich
models. People implementing new models in molecular evolution have all
had the pleasant experience that one can easily get a significant
improvement in the fit to data by adding whatever frivolous parameters
one fancies, so I don't think a significant improvement in fit to data
justifies the addition of unnecessary parameters. We should also
consider whether the parameters are biologically meaningful, whether the
analysis is robust to the model assumptions etc.
Best,
Ziheng Yang
----------------------------------------
Dear Gangolf,
I haven't looked particularly hard at the alpha + I question, but some
of what you asked can be answered quickly in a general way.
(1) The fact that parameters are poorly estimated (have a high variance)
does not necessarily mean that they should not be included. Presumably
the reason that they are supported by your likelihood scores is that the
parameter space is justifiably more constrained than it was before you
started.
(2) In your simulations, was the variance in the results of greater
magnitude than the range of confidence you had in the parameters in the
first place (determined by looking at the posterior probability space,
or by bootstrapping)?
(3) It is perfectly acceptable to ignore the parameter estimates if you
have no interest in them. The question of whether the model is useful to
you may depend on your goals. For example, if you are interested in
phylogenetic reconstruction, the question is whether your phylogenetic
results are more reliable with or without the I in I + G, not whether
the proportion I or alpha parameter estimated is unacceptably highly
variable.
(4) Your option (3) is not clearly justifiable. Just because a site
didn't change doesn't mean it might not have changed. The I+G model
takes that into account (that is, such sites might have some probability
of being invariant, and some posterior probability of being in a very
slowly evolving gamma class).
(5) Before doing (4), I would check further whether there is a single
constraint that is consistent throughout the reasonable parameter space.
If a parameter truly cannot be identified or separated from another
parameter, that is the same as saying that the same maximum likelihood
value can be found for any particular parameter setting by adjusting the
other parameter. Again, looking at the posterior space would tell you if
this is precisely true, or just close to true in a qualitative way. The
likelihood ratio increase suggests that it is not precisely true.
At any rate, these are just some quick thoughts; I hope they help.
Cheers,
David Pollock
----------------------------------------
Hello,
The problem you report is genuine and the explanation you gave for the
phenomenon is certainly the right one:
"an under-estimated theta can be easily compensated by a smaller alpha
and vice versa".
I would not say that +I +gamma is "overparametrized" however. Since ML
is a consistent method, your ML estimates will always converge to the
right values as long as:
1) the generating substitution model and the model used for the inference
are the same
2) the two parameters are identifiable
(Note that this is not exactly the definition of consistency you will
see in most phylogenetic papers and my definition might be wrong/incomplete)
As you said, the problem with alpha and theta is that they are barely
identifiable separately (they probably are, but it would require a huge
number of sites) and that the likelihood is almost flat in the region
where theta and alpha can change to compensate each other.
I think the best way to deal with the +I+gamma problem is to provide a
measure of uncertainty with your estimates. I think there are some
methods in the ML framework to do that. I do not know a lot about it but
I believe something standard is to assume that the likelihood is
distributed like a multi-dimensional Gaussian around the ML point. Such
techniques would certainly detect the strong correlation between theta
and alpha (that would appear in the covariance matrix of your Gaussian).
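A minimal sketch of this idea, with made-up curvature numbers (purely illustrative, not from any real analysis): invert the 2x2 observed-information matrix (second derivatives of the negative log-likelihood at the ML point) to get an asymptotic covariance matrix, whose off-diagonal term exposes the theta/alpha correlation.

```python
import math

def corr_from_information(info):
    """Invert a 2x2 observed-information matrix [[a, b], [b, d]] to get
    the asymptotic covariance of two ML estimates, and return
    (std. dev. of theta, std. dev. of alpha, their correlation)."""
    a, b = info[0]
    _, d = info[1]
    det = a * d - b * b
    var_t, var_a, cov = d / det, a / det, -b / det
    return math.sqrt(var_t), math.sqrt(var_a), cov / math.sqrt(var_t * var_a)

# Hypothetical curvature values; a flat ridge in the (theta, alpha)
# likelihood surface shows up as a correlation near +/- 1.
sd_t, sd_a, corr = corr_from_information([[40.0, -30.0], [-30.0, 25.0]])
print(sd_t, sd_a, corr)
```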
I hope someone will provide you with better pointers but see the
literature about the likelihood ratio test because this test relies
on such an assumption. See also p311 (Interval estimates) of
Felsenstein's book.
Regards,
Vivek
----------------------------------------
Gangolf,
I'm curious: do you see this behavior with a wide range of theta/alpha
combinations? If your true alpha is small, then there would be substantial
overlap between the low-rate Gamma categories and the I category, and I
could see why this issue would arise. In this case I don't see it as a big
problem because 'alpha' is not analogous to any real biological parameter;
it's just meant to provide a convenient distribution of rates. The real
measure for me would be to map the posterior probability that each site is
in each rate category, be it the I category or one of the several categories
in the Gamma. If each site ends up having most of its posterior density in a
rate category that includes its original rate, then I think you're still OK.
The low-rate sites should have similar posterior probabilities in the
invariant category and the lowest Gamma category, which is consistent with
what you'd expect.
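This posterior mapping is easy to sketch (the numbers below are purely illustrative): the posterior that a site belongs to a given rate class is the class's prior weight times the site's likelihood under that class, normalized over all classes, with prior theta on the I class and (1 - theta)/K on each of the K gamma classes.

```python
def site_category_posteriors(theta, class_liks, invar_lik):
    """Empirical Bayes posteriors over rate classes at one site.

    class_liks: site likelihoods under each of the K discrete-gamma
    classes; invar_lik: likelihood if the site is invariable (zero
    unless the site is constant in the alignment).
    Returns (P(invariable), [P(gamma class k)]).
    """
    k = len(class_liks)
    w_inv = theta * invar_lik
    w_cls = [(1.0 - theta) / k * lik for lik in class_liks]
    z = w_inv + sum(w_cls)
    return w_inv / z, [w / z for w in w_cls]

# A constant site: decent likelihood under I and the slowest classes.
p_inv, p_cls = site_category_posteriors(0.2, [0.09, 0.05, 0.02, 0.005], 0.1)
print(p_inv, p_cls)
```

As Matt suggests, a low-rate site typically ends up with its posterior mass split between the I class and the slowest gamma class, which is exactly the degeneracy under discussion.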
Some questions:
1. Can you provide details of the simulation conditions and the range of
simulation values where the problem is seen?
2. Are you using Bayesian estimates or ML estimates? If you did this with a
program like MrBayes, I would hope that you would see a large variance on
the estimates of theta and alpha to reflect this degeneracy.
Best,
Matt Dimmic
----------------------------------------
Dear Gangolf,
You will not be surprised to know that you are not the first person to
spot this problem with the +I+G models. I believe that David Swofford and
collaborators spotted similar behaviour in a systematic biology paper a
few years back. I'm afraid I can't supply a reference at the moment
because I'm in the US and don't have access to journals; let
me know if you have trouble finding it and I'll mail it to you when I
get back.
The formal description of this kind of the problem is unidentifiability
(which you probably already know!), caused by trying to estimate too many
parameters from too few degrees of freedom in the data. This is known to
occur with certain models when examining pairs of sequences; for example
JC+G/I, K2P+G/I and REV+G/I. The models with rate variation between these
(e.g. FEL+G/I; HKY+G/I) also show some interesting properties, although
not unidentifiability.
The general case of +I+G models with multiple sequences, however, is a
little more complex, and the models are probably not unidentifiable:
consider the case when the number of sequences becomes very large and the
sequences very divergent. With respect to your questions, I would
personally favour option (1). My justification for this is related to how
DNA is thought to evolve, specifically which bases are thought never to
undergo change in evolution, and why. I don't know if I believe that there
is such a thing as an invariable site, only sites where change is highly
improbable due to selective constraints.
When the +I+G parameters are used, the models they are added to are
relatively naïve, and I believe this is why it appears useful. Perhaps,
rather than trying to include parameters that are difficult to interpret in a
biological manner, we should attempt to build more realistic models that
better describe selection and then adding an I parameter would be less
useful.
Regards,
Simon
----------------------------------------
>>
>> Hi all,
>>
>> just some questions.
>>
>> The "I+Gamma" is very popular model of sequence evolution with rate
>> heterogeneity among sites. It is a superposition of two simpler models,
>> which are both able to account for invariable sites in a sequence
>> alignment.
I suspect that Invariant plus gamma is very good for protein-coding regions.
It may also be good for structural RNA genes. I suspect the gamma alone is
better for non-coding DNA and pseudogenes.
>> The first distinguishes two classes of sites, of which one
>> may evolve while the other is invariable. The other part is Yang's
>> discrete gamma model, which is used for the variable sites. The
>> "I+Gamma" has two parameters, namely the fraction of invariable sites
>> (theta) and the shape parameter of the discrete gamma distribution
>> (alpha).
The problem with gamma alone is that it takes an alpha value of less than
one to get a significant number of invariant sites, and the shape of the
gamma distribution then has few moderately variable sites. See figure 2 in:
Leitner T, Kumar S, Albert J.
Tempo and mode of nucleotide substitutions in gag and env gene fragments in
human immunodeficiency virus type 1 populations with a known transmission
history.
J Virol. 1997 Jun;71(6):4761-70.
PMID: 9151870
When we observe coding regions of HIV-1 and other lentiviruses evolving over
time, by using iterations of tree building, calculating rates with Gary
Olsen's DNArates program (
http://geta.life.uiuc.edu/~gary/programs/DNArates.html ) then rebuilding the
tree using those site-specific rates, then recalculating the rates... We
find that the distribution of numbers of sites and their rates looks like a
gamma distribution with an alpha of about 5 to 7 plus a gamma distribution
with an alpha of 0.1 or 0.2 (i.e., a large peak of invariant sites then a
valley, then another peak (not quite a nicely shaped one like the gamma
distribution, but close) and a long tail (a few sites are EXTREMELY variable
or hypervariable in HIV-1, most likely driven by positive selection of the
host immune system, or in the case of the pol gene protease and RT genes it
can be driven by antiretroviral drug selection pressure in treated
patients).
I have not yet done a comparison with coding regions from other organisms,
but I suspect the results would be qualitatively similar.
The HIV-1 M group rates and more formal discussion of calculating
site-specific rates can be found here:
http://www.santafe.edu/~btk/science-paper/bette.html
>>
>> When I simulate sequences under the "I+Gamma" and try to re-estimate the
>> two heterogeneity parameters by maximum likelihood optimization
>> afterwards, it turns out that the estimated parameters often differ
>> severely from the original parameters that were previously used in the
>> simulation. This indicates to me that the "I+Gamma" is
>> over-parameterized, that alpha and theta are not genuinely independent
>> parameters in the sense that an under-estimated theta can be easily
>> compensated by a smaller alpha and vice versa. It is very difficult to
>> distinguish invariable sites, which are described by theta, from the
>> slow-evolving sites in the discrete gamma distribution, which are
>> described by alpha.
That depends very much on the available data sets. For example, within the
HIV-1 M group, which has apparently been evolving in humans for roughly 70
years following a single chimpanzee to human transmission event, the silent
sites are not yet fully saturated with mutations, so truly "invariable" vs
just "slow" is not proven. However, if we include HIV-1 N and O group
sequences and chimpanzee SIV sequences, then we jump out to a data set that
has apparently shared a common ancestor at least tens of thousands of years
ago and silent sites are saturated. Including all primate lentiviruses
gives us a data set which has apparently been evolving from a single common
ancestor for tens of millions of years, at a rate that is estimated to be
close to 10 million times faster than the rate of eukaryotic DNA evolution,
so this is potentially equivalent to 10 to the 14th power years or more of
eukaryotic evolution. The gag and pol gene and protein sequences are still
very alignable, because there are many absolutely invariable sites. All
RNA-dependent DNA polymerases have the same catalytic core, and despite it
being impossible to say which family of retroviruses the lentiviruses came
from, it is very evident that the primate lentiviruses shared a common
ancestor with non-primate lentiviruses (EIAVs, FIVs, Visna/CAEV, BIV/JDV,
etc) and that there was a single common ancestor for the primates (as
opposed to multiple transfers from non-primates into primates).
So anyway, within a single group of sequences, for example the complete
mitochondrial genome of fruit flies, there may not be enough evolutionary
distance to prove which sites are invariant. But including outgroups, such
as the complete mitochondrial genomes of other insects, spiders, springtails
etc, it is possible to get a data set that includes sequences of the
appropriate distances.
>> The good news is that the "I+Gamma" gives higher
>> likelihoods compared with each of the two simpler models and is
>> therefore often preferred. The bad news is that parameter estimates do
>> not mean much as there may exist several combinations of theta and alpha
>> that appear almost equally probable.
>>
>> What is the best way to deal with the "I+Gamma"? Should we
>>
>> (1) not use it at all?
Use it when appropriate and/or needed. If I just want to know if a virus
sequence is HIV-1 M group subtype B or HIV-1 M group subtype C, any method,
even a simple Hamming distance (% identity) calculation is good enough, if
the sequence is long enough (and if it is too short, I suspect no
mathematical correction can save me, it can only give me a better estimate
of the probability of the subtype guestimate). Only when I want to attempt
to say more about the dates of more recent common ancestors (calibrating a
molecular clock) or other more difficult problems, does it become critical
to use more accurate methods.
>>
>> (2) use it, but ignore the parameter estimates?
Garbage in: garbage out. We would only know beforehand what parameter
values to use once they have been established for a given class of
genes (enzymes like reverse transcriptase have a different percentage
of invariant sites than structural proteins like Gag p24 or
immune-evading proteins like Envelope gp120), or for a given class of
genes within a given class of organisms (the Env protein of HIV-1
evolves quite differently than the Env gene of human influenza virus,
because flu hops from one host to the next before a significant
antibody response is made, whereas HIV-1 is often transmitted long
after seroconversion).
>>
>> (3) optimize only the alpha and determine theta empirically,
>> i. e. by counting the invariable sites?
This will work well if we choose the right data set. Include "outgroups"
that are distant enough from our group(s) of interest.
>>
>> (4) introduce some constraint between alpha and theta, which
>> turns the "I+Gamma" effectively into a one-parameter model?
I think we should expect each protein-coding gene to have a different
pattern of evolution. Perhaps there is similarity among a given class of
genes, such as those that encode mammalian tyrosine kinases. But I doubt
that we will ever accurately model the evolution of all protein-coding genes
with a single parameter.
I have some MASSIVE data sets from lentiviruses, and I am also gathering
mitochondrial complete genomes and other data sets. I would love to
collaborate with someone who has other data sets and/or methods that we
could compare. My funding is strictly for primate lentiviruses, so I've
been working up the other data in my off hours.
Brian Foley, PhD
----------------------------------------
Gangolf Jobb