Hi all,
with your help, I was able to solve my little optimization problem and,
thereby, to save the I+Gamma (at least for me): The trick is to bound
alpha > 1. Smaller alphas should not be used. This avoids a situation
where both parameters have nearly the same effect on the distribution
shape and, as a consequence, 'fight' to explain the data. On the other
hand, this constrained I+Gamma has still the advantages it was made for:
It can produce two-peaked rate distributions as well as one-peaked ones,
ranging from homogeneity to extremely L-shaped and anything between.
However, it is better not to interpret the heterogeneity parameters.
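For anyone who wants to reproduce this constraint in their own optimizer (a sketch of the idea only, not code from any program in this thread): one convenient way to bound alpha > 1 is to optimize an unconstrained variable x and map it to alpha through a softplus transform.

```python
import math

def alpha_from_x(x):
    """Map an unconstrained optimizer variable x to alpha > 1.

    softplus(x) = log(1 + exp(x)) is smooth and strictly positive,
    so 1 + softplus(x) is always strictly greater than 1.
    """
    # softplus, written stably for large |x|
    sp = x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))
    return 1.0 + sp

def x_from_alpha(alpha):
    """Inverse map, for initializing the optimizer from an alpha > 1."""
    return math.log(math.expm1(alpha - 1.0))

a = alpha_from_x(x_from_alpha(2.5))
print(a)  # recovers 2.5 up to floating-point error
```

The optimizer then works on x with no bounds at all, and every candidate it proposes corresponds to a legal alpha.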
Thanks to all who contributed valuable information. As this might be
useful to others, I have included the most informative answers below.
Best,
Gangolf Jobb
----------------------------------------
Hi all,
just some questions.
The "I+Gamma" is a very popular model of sequence evolution with rate
heterogeneity among sites. It is a superposition of two simpler models,
which are both able to account for invariable sites in a sequence
alignment. The first distinguishes two classes of sites, of which one
may evolve while the other is invariable. The other part is Yang's
discrete gamma model, which is used for the variable sites. The
"I+Gamma" has two parameters, namely the fraction of invariable sites
(theta) and the shape parameter of the discrete gamma distribution (alpha).
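For concreteness, here is a small sketch (my own illustration, not taken from any program in this thread) of what sampling site rates under such a model looks like. The normalization used below, with variable-site rates scaled by 1/(1-theta) so the overall mean rate is 1, is one common convention among several.

```python
import random

def sample_ig_rates(theta, alpha, n, seed=42):
    """Sample n site rates under a toy I+Gamma model.

    With probability theta a site is invariable (rate 0); otherwise its
    rate is drawn from a Gamma(shape=alpha, mean=1) distribution and
    rescaled by 1/(1-theta) so the mean rate over all sites is 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n):
        if rng.random() < theta:
            rates.append(0.0)
        else:
            # gammavariate(shape, scale); scale = 1/alpha gives mean 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha) / (1.0 - theta))
    return rates

rates = sample_ig_rates(theta=0.25, alpha=0.5, n=100000)
print(sum(rates) / len(rates))  # close to 1 by construction
```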
When I simulate sequences under the "I+Gamma" and try to re-estimate the
two heterogeneity parameters by maximum likelihood optimization
afterwards, it turns out that the estimated parameters often differ
severely from the original parameters that were previously used in the
simulation. This indicates to me that the "I+Gamma" is
over-parameterized, that alpha and theta are not genuinely independent
parameters in the sense that an under-estimated theta can be easily
compensated by a smaller alpha and vice versa. It is very difficult to
distinguish invariable sites, which are described by theta, from the
slow-evolving sites in the discrete gamma distribution, which are
described by alpha. The good news is that the "I+Gamma" gives higher
likelihoods compared with each of the two simpler models and is
therefore often preferred. The bad news is that parameter estimates do
not mean much as there may exist several combinations of theta and alpha
that appear almost equally probable.
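One way to see the compensation concretely (a toy Poisson-style calculation of my own, not anyone's actual likelihood code): the expected fraction of sites showing no change over a total amount of evolution T has the closed form theta + (1 - theta) * (1 + T/alpha)^(-alpha), since the expectation of exp(-r*T) over a mean-1 gamma rate r is the gamma moment generating function at -T. Rather different (theta, alpha) pairs then predict fairly similar fractions of constant sites.

```python
def p_constant(theta, alpha, T):
    """Expected fraction of constant sites in a toy Poisson model.

    An invariable site (probability theta) never changes; a variable
    site with Gamma(shape=alpha, mean=1) rate r shows no change with
    probability exp(-r*T), whose expectation is (1 + T/alpha)**(-alpha).
    """
    return theta + (1.0 - theta) * (1.0 + T / alpha) ** (-alpha)

# Two quite different parameter pairs give similar constant-site
# fractions across a range of tree lengths T:
for T in (0.5, 1.0, 2.0):
    print(T, round(p_constant(0.30, 2.0, T), 3), round(p_constant(0.15, 0.6, T), 3))
```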
What is the best way to deal with the "I+Gamma"? Should we
(1) not use it at all?
(2) use it, but ignore the parameter estimates?
(3) optimize only the alpha and determine theta empirically,
i. e. by counting the invariable sites?
(4) introduce some constraint between alpha and theta, which
turns the "I+Gamma" effectively into a one-parameter model?
Any suggestions?
Gangolf Jobb
----------------------------------------
Hi Gangolf,
The two parameters are clearly confounded to some extent. It seems that
there is little information in the data, without a large number of
sequences, for determining whether a constant site is "invariable" or
"slowly evolving". You might want to look at these two papers by Jack
Sullivan, in case you haven't already.
Regards,
Thomas
Sullivan, J. and D. L. Swofford. 2001. Should we use model-based methods
for phylogenetic inference when we know assumptions about among-site
rate variation and nucleotide substitution pattern are violated?
Systematic Biology 50:723-729.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of
taxon sampling on estimating rate-heterogeneity parameters of
maximum-likelihood models. Molecular Biology and Evolution. 16:1347-1356.
----------------------------------------
I think this is an interesting question. I too have been concerned
about the use of I+Gamma and the interpretation of estimated parameters.
I would be interested to hear what other responses you get, and I
encourage you to post summaries to the EVOLDIR list.
> When I simulate sequences under the "I+Gamma" and try to re-estimate the
> two heterogeneity parameters by maximum likelihood optimization
> afterwards, it turns out that the estimated parameters often differ
> severely from the original parameters that were previously used in the
> simulation.
I too have observed this.
I recall Ziheng Yang telling me something similar; he performed
experiments sub-sampling sequences from a multiple alignment, and
estimated the parameter you are calling theta (he was not using +Gamma
then, I think). He found that as you included more and more of the
sequences, the parameter you are calling theta got lower and lower.
Evidently the +I part of the model has some trouble distinguishing sites
that never vary, and sites which may vary but happen not to have in the
tree under consideration. But you should check with Ziheng if you are
interested; I am simply reporting what he once told me, and may have got
it wrong.
> This indicates to me that the "I+Gamma" is
> over-parameterized, that alpha and theta are not genuinely
> independent parameters in the sense that an under-estimated theta can be
> easily compensated by a smaller alpha and vice versa.
I would not say it was over-parameterised, when your simulations also
used +I+Gamma; in that case, clearly it is 'correctly parameterised'...
> It is very difficult to distinguish invariable sites, which are
> described by theta, from the slow-evolving sites in the discrete gamma
> distribution, which are described by alpha.
...but the interaction between +I and the slowly-evolving sites under
the Gamma distribution (particularly cases where alpha is small), which
you describe nicely here, means that it is difficult to distinguish the
parameter values accurately. I would imagine (but this is not tested)
that given enough data, it would be possible to infer all parameters
accurately. If you are interested in doing this test, feel free to
contact me, as I have a couple of ideas for interesting ways to go about it!
Personally, I think +I+Gamma is a bad model to use. There are very
few people who would really argue that any site of any sequence was
truly unable to vary over evolution. (Note that this is very different
from a site where we observe no changes in our data set.) Collect
enough data, and you will always find an organism that's managed to
accept a change at a site. What success (in goodness-of-fit statistics)
the +I model has is probably because it is giving a better fit to the
observations at slowly evolving sites than any Gamma distribution can. A
better method would be to allow something other than a Gamma
distribution to account for the range of evolutionary rates exhibited
across typical sequences.
> The good news is that the "I+Gamma" gives higher likelihoods compared
> with each of the two simpler models
This is necessarily true, as either simpler model is nested within the
more complex one (either setting theta=0, or alpha=infinity). Note that
the statistical testing of significance here is slightly
non-standard---see the attached paper.
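For reference, the testing is non-standard because the simpler models sit on a boundary of the parameter space (theta = 0, or alpha at infinity); a common treatment (as in the Ota et al. papers cited later in this thread) takes the null distribution of the LRT statistic to be a 50:50 mixture of a point mass at zero and a chi-square with 1 d.f. A minimal sketch, assuming that mixture:

```python
import math

def boundary_lrt_pvalue(lrt):
    """P-value for a likelihood-ratio statistic when the null value lies
    on a boundary of the parameter space: a 50:50 mixture of a point
    mass at 0 and chi-square with 1 d.f.

    The chi-square(1) survival function is erfc(sqrt(x/2)).
    """
    if lrt <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))

# Under this mixture the 5% critical value is about 2.71, rather than
# the usual chi-square(1) value of 3.84.
print(boundary_lrt_pvalue(2.706))
```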
> and is therefore often preferred. The bad news is that parameter
> estimates do not mean much as there may exist several combinations of
> theta and alpha that appear almost equally probable.
Yes. I suppose the appropriate interpretation is exactly that: the
more complex model is preferred, suggesting that there are both
invariant sites and slowly evolving ones as described by the Gamma
distribution, but that it is difficult to tell the proportions of each,
or which site is which. This should also be apparent by looking at the
estimated std. devs. of the inferred parameters.
I don't know if any existing software allows Empirical Bayes estimation
of posterior probabilities of being an invariant site or of membership
of each of the discrete Gamma classes when doing analysis under +I+Gamma
(analogous to Empirical Bayes estimation of omega in codon models used
to detect positive selection). If so, it would be interesting to check
whether the posterior mean rate at sites was very different under +I,
+Gamma and +I+Gamma, even in cases where the corresponding parameter
estimates under these models were very different.
> What is the best way to deal with the "I+Gamma"? Should we
I don't claim to know, but here are some comments on your suggestions!
> (1) not use it at all?
That's what I've always done!
> (2) use it, but ignore the parameter estimates?
I wouldn't ignore them, just report them and emphasise their (presumably
large) std. devs.
> (3) optimize only the alpha and determine theta empirically,
> i. e. by counting the invariable sites?
No, surely this would be very very bad. If you estimate alpha and it
turns out to be small, then a lot of the sites invariant *in your data
set* are being inferred to be 'variable, but just didn't happen to
change'. It would be a very poor estimate of invariability to count
these sites.
> (4) introduce some constraint between alpha and theta, which
> turns the "I+Gamma" effectively into a one-parameter model?
If you thought of a good way to do this, it would be a bit like my
suggestion above ("A better method would be to allow...").
Nick Goldman
----------------------------------------
On the invariant sites gamma model.
Dear Gangolf,
Much of what you say is correct, but not quite. The I+Gamma ML model was
first developed as part of my thesis well before Ziheng Yang's work. The
probabilistic calculations can use the algorithms in Steel et al.
(1993). It can also use a discretised method first developed
by myself, and described to Ziheng. Ziheng acknowledges this in letters
to me from that period, but he then published the method without
permission or acknowledgement. He repeated this "omission" in his
subsequent "review" articles. An early example of its implementation is
Waddell and Penny (accepted 1993, published 1996 after many
unprofessional errors by the publishers delayed the whole book that was
supposed to appear in 1993!). In this example, the I+G optimal solution
(with constraints) has invariant sites set to zero. If there is no
constraint, then the ML model has both invariant sites (as a negative
number) and a gamma distribution (see also Waddell, Penny and Arnold
1997 and Waddell 2005 for further discussion). The boundary issue
encountered here is explained and discussed in Ota et al. (1999, 2000).
There is no doubt that invariant sites and a gamma (or many other
continuous and discrete distributions) mimic each other to a first
approximation (see Waddell et al. 1997 for a graphical example).
Consequently they are correlated, but they are still distinct
parameters, and only in very special cases are they truly
non-independent (which occurs only when the model is not identifiable;
see Waddell 1995 for examples, while Waddell and Steel 1997 prove that
this model (at least any GTR variant) is identifiable if the parameters
of the site rate distribution are known).
Now to answer your questions, as discussed in nearly all our original
work (see Waddell et al. 1997 for example) on the topic, this model is
not biologically realistic and is an approximation to the effects of a
covarion (Fitch) or other model in which higher order structure and
selection play key roles. This is the key to interpreting the results in
most situations.
So to answer your questions:
(1) Use it like any other model, and if it does not fit the true model,
be suspicious of it as the original work suggests. This is not to say
that the trees it gives are not useful (in fact often quite the
opposite), but their validity is often best checked in practice using
congruence (e.g., Waddell and Shelly 2003).
(2) Reporting the parameter estimates is often justified, if only to
give others a ballpark feeling for the deviation from an identical site
rates "neutral" model.
(3) There are other ML estimators of invariant sites (see Waddell, Cao,
Hauf and Hasegawa 1999, for example). If these differ markedly from the
tree based ML estimates, all the more evidence that there are serious
deficiencies in both, and the very concept of "invariant" sites may
ultimately be misleading.
(4) This is not feasible. These are distinct parameters. Rather, use
other types of distribution to characterise the "site rate variability".
If all infer much the same thing, then that is about as far as you can
go until structural models such as those of Bastolla et al. or Robinson
et al. (which may lead the way to better models) are amenable to ML
analysis with many species.
I hope this helps.
Sincerely,
Peter Waddell.
WADDELL, P.J. (2005). Measuring the fit of sequence data to phylogenetic
model: Allowing for missing data. Molecular Biology and Evolution 22:
295-301 (epub 2004).
WADDELL, P.J., AND SHELLEY, S. (2003). Evaluating placental
inter-ordinal phylogenies with novel sequences including RAG1,
gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino
acid, and codon models. Molecular Phylogenetics and Evolution 28: 197-224.
OTA, R., P.J. WADDELL, M. HASEGAWA, H. SHIMODAIRA, AND H. KISHINO.
(2000). Appropriate likelihood ratio tests and marginal distributions
for evolutionary tree models with constraints on parameters. Molecular
Biology and Evolution 17: 798-803.
WADDELL, P. J., Y. CAO, J. HAUF, AND M. HASEGAWA. (1999). Using novel
phylogenetic methods to evaluate mammalian mtDNA, including AA invariant
sites-LogDet plus site stripping, to detect internal conflicts in the
data, with special reference to the position of hedgehog, armadillo, and
elephant. Systematic Biology 48: 31-53.
OTA, R., P. J. WADDELL, AND H. KISHINO (1999). Statistical distribution
for testing the resolved tree against the star tree. Pp. 15-20 in
"Proceedings of the annual joint conference of the Japanese biometrics
and applied statistics societies," Sin-fonica, Minato-ku, Tokyo.
STEEL, M.A., AND WADDELL, P.J. (1999). Approximating Likelihoods Under
Low but Variable Rates Across Sites. Applied Mathematics Letters 12: 13-19.
WADDELL, P.J., AND M.A. STEEL. (1997). General time reversible distances
with unequal rates across sites: Mixing Γ and inverse Gaussian
distributions with invariant sites. Mol. Phyl. Evol. 8: 398-414.
WADDELL, P.J., D. PENNY, AND T. MOORE. (1997). Extending Hadamard
conjugations to model sequence evolution with variable rates across
sites. Molecular Phylogenetics and Evolution 8: 33-50.
WADDELL, P.J., AND D. PENNY. (1996). Evolutionary trees of apes and
humans from DNA sequences. In: "Handbook of Symbolic Evolution", pp
53-73. Ed. A.J. Lock and C.R. Peters. Clarendon Press, Oxford.
WADDELL, P.J. (1995). Statistical methods of phylogenetic analysis,
including Hadamard conjugations, LogDet transforms, and maximum
likelihood. PhD Thesis. Massey University, New Zealand.
STEEL, M.A., L. SZÉKELY, P.L. ERDÖS, AND P.J. WADDELL. (1993). A
complete family of phylogenetic invariants for any number of taxa under
Kimura's 3ST model. New Zealand Journal of Botany (Conference Issue) 31:
289-296.
----------------------------------------
I am sure someone will trash what I say, but here are some comments. I
will restrict my comments to use of models in likelihood analysis. My
position has been that the I+G model is pathological because of the
strong correlation between parameters. It is bigger than an elephant
and should not be used. The only exception I can foresee is where one
has a lot of sequences and is interested in comparing different rate
distributions, in which case I don't have an objection of using it for
comparison with other models, such as the gamma, the invariable sites
model, the lognormal etc. For most analyses such as tree reconstruction
or branch-length estimation, where rate variation among sites is a
nuisance aspect of the model assumption, which we can't ignore but are
nevertheless not terribly interested in, I don't see the point of using
an overly complex model such as I+G. Most likely these different models
of rate variation will produce similar results, but if not, I suspect we
are in deeper trouble if our conclusion depends on such intricate
differences among such models and the I+G model might not do a better
job than the simpler I or G models. I am happy with either I or G, but
not the mixture. A criticism of the I model is that the MLE of its
parameter (the proportion of invariable sites) is sensitive to
sequence sampling, whereas the parameter is supposed to reflect features
of the molecule rather than the sampling of sequences. When one adds more
sequences to the data, the estimated proportion of invariable sites
drops. The gamma parameter is affected by sequence sampling as well,
but to a lesser extent, which might be considered an advantage over the
I model.
There seems to be a trend in molecular phylogenetics towards using
overly-complex models, which I think is deplorable. There are some
papers in the statistics literature which suggest that in large data
sets the LRT and even the AIC tend to prefer complex parameter-rich
models. People implementing new models in molecular evolution have all
had the pleasant experience that one can easily get a significant
improvement in the fit to data by adding whatever frivolous parameters
one fancies, so I don't think a significant improvement in fit to data
justifies the addition of unnecessary parameters. We should also
consider whether the parameters are biologically meaningful, whether the
analysis is robust to the model assumptions etc.
Best,
Ziheng Yang
----------------------------------------
Dear Gangolf,
I haven't looked particularly hard at the alpha + I question, but some
of what you asked can be answered quickly in a general way.
(1) The fact that parameters are poorly estimated (have a high variance)
does not necessarily mean that they should not be included. Presumably
the reason that they are supported by your likelihood scores is that the
parameter space is justifiably more constrained than it was before you
started.
(2) In your simulations, was the variance in the results of greater
magnitude than the range of confidence you had in the parameters in the
first place (determined by looking at the posterior probability space,
or by bootstrapping)?
(3) It is perfectly acceptable to ignore the parameter estimates if you
have no interest in them. The question of whether the model is useful to
you may depend on your goals. For example, if you are interested in
phylogenetic reconstruction, the question is whether your phylogenetic
results are more reliable with or without the I in I + G, not whether
the proportion I or alpha parameter estimated is unacceptably highly
variable.
(4) Your option (3) is not clearly justifiable. Just because a site
didn't change doesn't mean it might not have changed. The I+G model
takes that into account (that is, such sites might have some probability
of being invariant, and some posterior probability of being in a very
slowly evolving gamma class).
(5) Before doing (4), I would check further whether there is a single
constraint that is consistent throughout the reasonable parameter space.
If a parameter truly cannot be identified or separated from another
parameter, that is the same as saying that the same maximum likelihood
value can be found for any particular parameter setting by adjusting the
other parameter. Again, looking at the posterior space would tell you if
this is precisely true, or just close to true in a qualitative way. The
likelihood ratio increase suggests that it is not precisely true.
At any rate, these are just some quick thoughts; I hope they help.
Cheers,
David Pollock
----------------------------------------
Hello,
The problem you report is genuine and the explanation you gave for the
phenomenon is certainly the right one:
"an under-estimated theta can be easily compensated by a smaller alpha
and vice versa".
I would not say that +I +gamma is "overparametrized" however. Since ML
is a consistent method, your ML estimates will always converge to the
right values as long as:
1) the generating substitution model and the model used for the inference
are the same
2) the two parameters are identifiable
(Note that this is not exactly the definition of consistency you will
see in most phylogenetic papers and my definition might be wrong/incomplete)
As you said, the problem with alpha and theta is that they are barely
identifiable separately (they probably are, but it would require a huge
number of sites) and that the likelihood is almost flat in the region
where theta and alpha can change to compensate each other.
I think the best way to deal with the +I+gamma problem is to provide a
measure of uncertainty with your estimates. I think there are some
methods in the ML framework to do that. I do not know a lot about it but
I believe something standard is to assume that the likelihood is
distributed like a multi-dimensional Gaussian around the ML point. Such
techniques would certainly detect the strong correlation between theta
and alpha (that would appear in the covariance matrix of your Gaussian).
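A minimal sketch of this idea, with made-up curvature numbers (purely illustrative, not from any real analysis): invert the 2x2 observed-information matrix (second derivatives of the negative log-likelihood at the ML point) to get an asymptotic covariance matrix, whose off-diagonal term exposes the theta/alpha correlation.

```python
import math

def corr_from_information(info):
    """Invert a 2x2 observed-information matrix [[a, b], [b, d]] to get
    the asymptotic covariance of two ML estimates, and return
    (std. dev. of theta, std. dev. of alpha, their correlation)."""
    a, b = info[0]
    _, d = info[1]
    det = a * d - b * b
    var_t, var_a, cov = d / det, a / det, -b / det
    return math.sqrt(var_t), math.sqrt(var_a), cov / math.sqrt(var_t * var_a)

# Hypothetical curvature values; a flat ridge in the (theta, alpha)
# likelihood surface shows up as a correlation near +/- 1.
sd_t, sd_a, corr = corr_from_information([[40.0, -30.0], [-30.0, 25.0]])
print(sd_t, sd_a, corr)
```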
I hope someone will provide you with better pointers but see the
literature about the likelihood ratio test because this test relies
on such an assumption. See also p311 (Interval estimates) of
Felsenstein's book.
Regards,
Vivek
----------------------------------------
Gangolf,
I'm curious: do you see this behavior with a wide range of theta/alpha
combinations? If your true alpha is small, then there would be substantial
overlap between the low-rate Gamma categories and the I category, and I
could see why this issue would arise. In this case I don't see it as a big
problem because 'alpha' is not analogous to any real biological parameter;
it's just meant to provide a convenient distribution of rates. The real
measure for me would be to map the posterior probability that each site is
in each rate category, be it the I category or one of the several categories
in the Gamma. If each site ends up having most of its posterior density in a
rate category that includes its original rate, then I think you're still OK.
The low-rate sites should have similar posterior probabilities in the
invariant category and the lowest Gamma category, which is consistent with
what you'd expect.
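This posterior mapping is easy to sketch (the numbers below are purely illustrative): the posterior that a site belongs to a given rate class is the class's prior weight times the site's likelihood under that class, normalized over all classes, with prior theta on the I class and (1 - theta)/K on each of the K gamma classes.

```python
def site_category_posteriors(theta, class_liks, invar_lik):
    """Empirical Bayes posteriors over rate classes at one site.

    class_liks: site likelihoods under each of the K discrete-gamma
    classes; invar_lik: likelihood if the site is invariable (zero
    unless the site is constant in the alignment).
    Returns (P(invariable), [P(gamma class k)]).
    """
    k = len(class_liks)
    w_inv = theta * invar_lik
    w_cls = [(1.0 - theta) / k * lik for lik in class_liks]
    z = w_inv + sum(w_cls)
    return w_inv / z, [w / z for w in w_cls]

# A constant site: decent likelihood under I and the slowest classes.
p_inv, p_cls = site_category_posteriors(0.2, [0.09, 0.05, 0.02, 0.005], 0.1)
print(p_inv, p_cls)
```

As Matt suggests, a low-rate site typically ends up with its posterior mass split between the I class and the slowest gamma class, which is exactly the degeneracy under discussion.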
Some questions:
1. Can you provide details of the simulation conditions and the range of
simulation values where the problem is seen?
2. Are you using Bayesian estimates or ML estimates? If you did this with a
program like MrBayes, I would hope that you would see a large variance on
the estimates of theta and alpha to reflect this degeneracy.
Best,
Matt Dimmic
----------------------------------------
Dear Gangolf,
You will not be surprised to know that you are not the first person to
spot this problem with the +I+G models. I believe that David Swofford and
collaborators spotted similar behaviour in a systematic biology paper a
few years back. I'm afraid I can't supply a reference at the moment
because I'm in the US and don't have access to journals; let
me know if you have trouble finding it and I'll mail it to you when I
get back.
The formal description of this kind of the problem is unidentifiability
(which you probably already know!), caused by trying to estimate too many
parameters from too few degrees of freedom in the data. This is known to
occur with certain models when examining pairs of sequences; for example
JC+G/I, K2P+G/I and REV+G/I. The models with rate variation between these
(e.g. FEL+G/I; HKY+G/I) also show some interesting properties, although
not unidentifiability.
The general case of +I+G models with multiple sequences, however, is a
little more complex, and the models are probably not unidentifiable:
consider the case when the number of sequences becomes very large and the
sequences very divergent. With respect to your questions, I would
personally favour option (1). My justification for this is related to how
DNA is thought to evolve, specifically which bases are thought never to
undergo change in evolution, and why. I don't know if I believe that there
is such a thing as an invariable site, only sites where change is highly
improbable due to selective constraints.
When the +I+G parameters are used, the models they are added to are
relatively naïve, and I believe this is why it appears useful. Perhaps,
rather than trying to include parameters that are difficult to interpret in a
biological manner, we should attempt to build more realistic models that
better describe selection and then adding an I parameter would be less
useful.
Regards,
Simon
----------------------------------------
>>
>> Hi all,
>>
>> just some questions.
>>
>> The "I+Gamma" is very popular model of sequence evolution with rate
>> heterogeneity among sites. It is a superposition of two simpler models,
>> which are both able to account for invariable sites in a sequence
>> alignment.
I suspect that Invariant plus gamma is very good for protein-coding regions.
It may also be good for structural RNA genes. I suspect the gamma alone is
better for non-coding DNA and pseudogenes.
>> The first distinguishes two classes of sites, of which one
>> may evolve while the other is invariable. The other part is Yang's
>> discrete gamma model, which is used for the variable sites. The
>> "I+Gamma" has two parameters, namely the fraction of invariable sites
>> (theta) and the shape parameter of the discrete gamma distribution
>> (alpha).
The problem with gamma alone is that it takes an alpha value of less than
one to get a significant number of invariant sites, and the shape of the
gamma distribution then has few moderately variable sites. See figure 2 in:
Leitner T, Kumar S, Albert J.
Tempo and mode of nucleotide substitutions in gag and env gene fragments in
human immunodeficiency virus type 1 populations with a known transmission
history.
J Virol. 1997 Jun;71(6):4761-70.
PMID: 9151870
When we observe coding regions of HIV-1 and other lentiviruses evolving over
time, by using iterations of tree building, calculating rates with Gary
Olsen's DNArates program (
http://geta.life.uiuc.edu/~gary/programs/DNArates.html ) then rebuilding the
tree using those site-specific rates, then recalculating the rates... We
find that the distribution of numbers of sites and their rates looks like a
gamma distribution with an alpha of about 5 to 7 plus a gamma distribution
with an alpha of 0.1 or 0.2 (i.e., a large peak of invariant sites then a
valley, then another peak (not quite a nicely shaped one like the gamma
distribution, but close) and a long tail (a few sites are EXTREMELY variable
or hypervariable in HIV-1, most likely driven by positive selection of the
host immune system, or in the case of the pol gene protease and RT genes it
can be driven by antiretroviral drug selection pressure in treated
patients).
I have not yet done a comparison with coding regions from other organisms,
but I suspect the results would be qualitatively similar.
The HIV-1 M group rates and more formal discussion of calculating
site-specific rates can be found here:
http://www.santafe.edu/~btk/science-paper/bette.html
>>
>> When I simulate sequences under the "I+Gamma" and try to re-estimate the
>> two heterogeneity parameters by maximum likelihood optimization
>> afterwards, it turns out that the estimated parameters often differ
>> severely from the original parameters that were previously used in the
>> simulation. This indicates to me that the "I+Gamma" is
>> over-parameterized, that alpha and theta are not genuinely independent
>> parameters in the sense that an under-estimated theta can be easily
>> compensated by a smaller alpha and vice versa. It is very difficult to
>> distinguish invariable sites, which are described by theta, from the
>> slow-evolving sites in the discrete gamma distribution, which are
>> described by alpha.
That depends very much on the available data sets. For example, within the
HIV-1 M group, which has apparently been evolving in humans for roughly 70
years following a single chimpanzee to human transmission event, the silent
sites are not yet fully saturated with mutations, so truly "invariable" vs
just "slow" is not proven. However, if we include HIV-1 N and O group
sequences and chimpanzee SIV sequences, then we jump out to a data set that
has apparently shared a common ancestor at least tens of thousands of years
ago and silent sites are saturated. Including all primate lentiviruses
gives us a data set which has apparently been evolving from a single common
ancestor for tens of millions of years, at a rate that is estimated to be
close to 10 million times faster than the rate of eukaryotic DNA evolution,
so this is potentially equivalent to 10 to the 14th power years or more of
eukaryotic evolution. The gag and pol gene and protein sequences are still
very alignable, because there are many absolutely invariable sites. All
RNA-dependent DNA polymerases have the same catalytic core, and despite it
being impossible to say which family of retroviruses the lentiviruses came
from, it is very evident that the primate lentiviruses shared a common
ancestor with non-primate lentiviruses (EIAVs, FIVs, Visna/CAEV, BIV/JDV,
etc) and that there was a single common ancestor for the primates (as
opposed to multiple transfers from non-primates into primates).
So anyway, within a single group of sequences, for example the complete
mitochondrial genome of fruit flies, there may not be enough evolutionary
distance to prove which sites are invariant. But including outgroups, such
as the complete mitochondrial genomes of other insects, spiders, springtails
etc, it is possible to get a data set that includes sequences of the
appropriate distances.
>> The good news is that the "I+Gamma" gives higher
>> likelihoods compared with each of the two simpler models and is
>> therefore often preferred. The bad news is that parameter estimates do
>> not mean much as there may exist several combinations of theta and alpha
>> that appear almost equally probable.
>>
>> What is the best way to deal with the "I+Gamma"? Should we
>>
>> (1) not use it at all?
Use it when appropriate and/or needed. If I just want to know if a virus
sequence is HIV-1 M group subtype B or HIV-1 M group subtype C, any method,
even a simple Hamming distance (% identity) calculation is good enough, if
the sequence is long enough (and if it is too short, I suspect no
mathematical correction can save me, it can only give me a better estimate
of the probability of the subtype guestimate). Only when I want to attempt
to say more about the dates of more recent common ancestors (calibrating a
molecular clock) or other more difficult problems, does it become critical
to use more accurate methods.
>>
>> (2) use it, but ignore the parameter estimates?
Garbage in: garbage out. We would only know beforehand what parameter
values to use once they have been established for a given class of
genes (enzymes like reverse transcriptase have a different percentage
of invariant sites than structural proteins like Gag p24 or
immune-evading proteins like Envelope gp120), or for a given class of
genes within a given class of organisms (the Env protein of HIV-1
evolves quite differently than the Env gene of human influenza virus,
because flu hops from one host to the next before a significant
antibody response is made, whereas HIV-1 is often transmitted long
after seroconversion).
>>
>> (3) optimize only the alpha and determine theta empirically,
>> i. e. by counting the invariable sites?
This will work well if we choose the right data set. Include "outgroups"
that are distant enough from our group(s) of interest.
>>
>> (4) introduce some constraint between alpha and theta, which
>> turns the "I+Gamma" effectively into a one-parameter model?
I think we should expect each protein-coding gene to have a different
pattern of evolution. Perhaps there is similarity among a given class of
genes, such as those that encode mammalian tyrosine kinases. But I doubt
that we will ever accurately model the evolution of all protein-coding genes
with a single parameter.
I have some MASSIVE data sets from lentiviruses, and I am also gathering
mitochondrial complete genomes and other data sets. I would love to
collaborate with someone who has other data sets and/or methods that we
could compare. My funding is strictly for primate lentiviruses, so I've
been working up the other data in my off hours.
Brian Foley, PhD
----------------------------------------
Gangolf Jobb