Dear Tatjana and other evoldir members,
I want to add a couple of comments on some of the responses you received.
***Consistency of results of ML programs, such as migrate, fluctuate,
lamarc, .... (or Bayesian programs such as IM, etc)
MCMC is difficult
As Joseph clearly outlined, programs based on MCMC can fail for
various reasons; the most common one is that the programs are not
run long enough. This is especially serious with data that are not
very informative (for example, with only a few variable sites) and where
researchers have sampled many individuals.
- few variable sites:
It is obvious that results will be hard to
obtain if there is nothing to show differentiation. With MCMC-ML
(likelihood approximated by MCMC), almost every change will be
accepted, resulting in a Brownian-motion walk through the solution space,
and of course short runs will deliver different results; MCMC
works only in the limit, so for such data every run might be too
short. MCMC-Bayes, in contrast to MCMC-ML, changes the driving
value (often called parameter_0 or Theta_0) using a prior distribution
for the MCMC chain more often than MCMC-ML runs, where the driving
value is typically fixed. If no information is contained in the
data, the prior defines the outcome. It is important to recognize
that under Bayes, failure to achieve good results delivers the prior
distribution (several runs will actually deliver the same result
when run with the same prior), whereas ML delivers different MLEs,
and we then consider the results inconsistent. With
Bayesian programs, like IM, you need to be able to distinguish an
informative posterior (the data overpower the prior) from a
posterior that is similar to the prior (the prior overpowers the data).
To my knowledge, neither IM nor the other Bayesian programs out there
(migrate-bayes included) produce good statistics on this
yet. An immediate solution is to use different prior settings and
compare the posteriors (if the data are informative, they will look
the same). Of course one can always run these programs with no data
and compare with the data runs, but this will work only if all
parameters fail to be pushed by the data [the population sizes are
most often much easier to estimate than the migration rates].
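One way to make the prior-vs-posterior comparison concrete is to quantify how much the posteriors from two runs with different priors overlap. The sketch below uses made-up numbers standing in for posterior samples; `posterior_overlap` is a hypothetical helper of my own, not part of any of the programs mentioned:

```python
import numpy as np

def posterior_overlap(samples_a, samples_b, bins=50):
    """Overlap coefficient of two posterior samples (1 = identical)."""
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(samples_a, bins=edges, density=True)
    pb, _ = np.histogram(samples_b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return np.sum(np.minimum(pa, pb)) * width

# Toy data: two runs that agree versus a run that returns its flat prior
rng = np.random.default_rng(1)
run1 = rng.normal(0.01, 0.002, 10000)      # posterior under prior A
run2 = rng.normal(0.01, 0.002, 10000)      # posterior under prior B
prior_run = rng.uniform(0.0, 0.1, 10000)   # run dominated by the prior

print(posterior_overlap(run1, run2))       # high: data-driven posterior
print(posterior_overlap(run1, prior_run))  # low: prior-driven posterior
```

If the overlap is high under clearly different priors, the data are doing the work; if the posterior tracks each prior, the run is uninformative.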
- variable data but very complex scenario:
I often get email about
difficulties running on machine X that turn out to be
unrelated to computers and related instead to the researchers (and
perhaps the manual). Programs like migrate can estimate a large number of
parameters: in fact, from 1 to n*n+1, where n is the number of
populations. Imagine that a researcher has investigated 10
populations -> typically, a default migrate run will try to estimate
about 100 parameters. Now, if you sample 20 individuals per population,
you will have 200 individuals; for single-locus data this is
not a lot of data to estimate the 100 parameters, and most often
such an analysis will fail because many of the individuals are
"relatives" and so not perfect replicates. Would you run a regression
analysis with 100 parameters and 200 data points? Probably not with
much confidence. Another problem is that sometimes researchers
have sampled 5 populations (25 parameters) and have 300 individuals
per population. In principle this is good, but because of the MCMC
runs, we need to calculate likelihoods on large trees (1500 tips in
this example); in phylogenetic research, people consider this
a __very__ difficult problem. As a result, the MCMC runs will
not explore the solution space exhaustively; in short, the runs
will be too short and return inconsistent results. A remedy for this
is to subsample each population randomly and run 10 or 20 individuals
per population. Pluzhnikov and Donnelly (1996) and Felsenstein
(2006) describe that results from coalescent studies show the smallest
variance when sampling about 6-10 individuals per locus but using
many loci: a study with 1500 individuals and 1 locus will produce
worse results than a study with 150 individuals and 10 loci.
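The parameter count and the subsampling remedy above can be sketched in a few lines. The function names are mine, not migrate's, and the +1 simply follows the n*n+1 count quoted above (the text rounds it to ~100 for 10 populations):

```python
import random

def n_parameters(n_pops):
    """n population sizes plus n*(n-1) migration rates, plus one,
    following the n*n+1 count quoted in the text."""
    return n_pops * n_pops + 1

def subsample(individuals_by_pop, per_pop=10, seed=42):
    """Randomly keep per_pop individuals per population."""
    rng = random.Random(seed)
    return {pop: rng.sample(inds, min(per_pop, len(inds)))
            for pop, inds in individuals_by_pop.items()}

print(n_parameters(10))  # 101 parameters for a default 10-population run

# 5 populations x 300 individuals -> subsample to 10 per population
data = {f"pop{i}": [f"ind{i}_{j}" for j in range(300)] for i in range(5)}
small = subsample(data, per_pop=10)
print(sum(len(v) for v in small.values()))  # 50 tips instead of 1500
```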
*** Estimates from many populations (e.g. migrate, lamarc, batwing,
genetree, ...) versus pairwise estimators (FST-based, IM)
In the
standard statistical literature they tell you that it is a very bad
thing to test pairs of sets of data when there is interaction among
all sets; that's why people use the Bonferroni correction and the like.
For estimating the molecular clock or divergence times, researchers
in earlier days used pairs and calibrated Nei distances against geological
events; these procedures have been almost completely abandoned because
superior methods arrived that take all lineages jointly into account
(for example, mdivtime). The same caveat holds for population
genetics studies; I do not see why pairwise methods (e.g., IM) should
work better than others. For what it's worth, you can run migrate on
pairs, too, but that is not what I would suggest. The comparison
of multi- versus pairwise estimators of population sizes and migration
rates in a structured population certainly needs more attention
[Beerli 2004]. Of course, if you believe that your populations have
very recently split AND you have multiple loci, then you definitely
should try IM.
*** Critique of migrate by Abdo et al.
Abdo et al. (ACJ) criticized
migrate for not being able to estimate migration rates. In my opinion
this critique overshoots quite a bit, but you may want to judge for
yourself. ACJ ran a 9-cell simulation study with various combinations
of population sizes and migration rates:
- 4 cells with moderate
or high population size and moderate or high migration rates delivered
OK results;
- 2 cells (with low population size and moderate to high
migration rates) used data sets that averaged about 1 or 2 variable
sites per dataset [see ***Consistency], and migrate does not work
well on this (I am curious to see which program does [preliminary results
of migrate-bayes might indicate the problem better than migrate-ML]);
- 3 cells showed a very high number of variable sites; these 3 cells
are all associated with a very low migration rate (a migrant every
100,000+ generations). Simulating data for a two-population scenario
thus forces a single migration event at the bottom of the tree and two
very long branches on which each site will change state many times
(randomized), so as a result the populations will be only about 25%
similar per site. It would be interesting to see other programs'
results under such a scenario, although I doubt that this is interesting
for any biologically relevant data. When the article came out I
thought of writing a reply but then reconsidered, thinking that
people would spot the obvious (*); recently, however, I realized that it
still would be worthwhile, and hopefully by summer I will have a more
general manuscript submitted on the topic.
(*) No attempt was made to explore whether the program was run too
short; in fact, for all simulation cells migrate was run at the default
values. I had the impression that all the discussion in the manual about
getting good results was ignored. One fact is certainly true, but we
know this already: if we run MCMC-ML too short, the profile likelihoods
deliver too-narrow confidence limits. MCMC-Bayes seems to fare somewhat
better on this, but then see ***Consistency.
***Email and response to inquiries: Jen complained about
unresponsiveness of the lamarc team. Well, migrate runs its own website
(http://popgen.scs.fsu.edu/ ) and I respond to all email I get about
migrate (of course, occasionally an email might fall prey to our spam
filter or get buried in my email pile). I have received many
email questions about how to run migrate (not from Tatjana, though), and
typically they fall into a couple of categories:
- seeking advice on how to run the program (without reading the manual):
the answer is often very easy and helps to run the program;
- not able to run the program, although having used the manual to
construct the data file: the most common problem is errors in the data
file (miscounted individuals or sites, added blank lines, etc.) or a
program executable transferred from one computer to another
(mac->windows/unix). The data format is definitely not the greatest of
all, but with a simple text editor this typically works out fine; for
microsatellite data you might want to check out MSA;
- the program runs but takes too long, or returns inconsistent results:
see under ***Consistency;
- trying to do likelihood ratio tests or other "obscure" features that
need more explanation in the manual. Of course, getting feedback
helps to improve the manual;
- very few people complain that migrate does not run, but they never
email back to say _why_ it does not run, so there might be a silent
majority out there that is not able to run the program; it is very hard
to know when the silent majority is not speaking up.
It's great that the IM team gives good feedback; I always had the
impression that the lamarc team in Seattle also gives good feedback,
though.
Peter
----
Peter Beerli,
Computational Evolutionary Biology Group
School of Computational Science (SCS)
and Biological Sciences Department
150-T Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120 USA
Webpage: http://www.csit.fsu.edu/~beerli
Phone: 850.645.1324
Fax: 850.644.0094
On Mar 25, 2006, at 1:13 AM, evoldir@evol.biology.mcmaster.ca wrote:
Dear Members
Many thanks for all your helpful suggestions with the program Migrate.
Some people pointed out the fact that it is not a very reliable
program
and others gave some suggestions for the program parameters. I've
added
the answers below just in case someone else might need them.
Many thanks
Tatiana
Tatiana Zerjal
The Sanger Institute
UK
tz1@sanger.ac.uk
Hi Tatiana,
I've played around with Migrate some myself. Before you continue, I
would take a serious look at the following article:
Evaluating the performance of likelihood methods for detecting
population structure and migration.
Zaid Abdo, Keith A. Crandall, Paul Joyce.
Molecular Ecology 2004, 13(4):837.
They find that the results from Migrate are not very reliable or
accurate. I might choose different software (Jody Hey's IM, Pritchard's
STRUCTURE, etc.) to get at this question.
-Jeff
Hi Tatiana -
I've had major problems with consistency for the Migrate, Fluctuate,
and Lamarc programs, and despite multiple runs of the same dataset, I
never got convergence. Depending on which way the wind blows, you will
get all kinds of numbers, like flipping a coin. It doesn't matter what
your starting parameters are or the number of chains... the programs are
horrible. So, I suggest you steer clear of that batch of programs;
there's a paper about Migrate by Paul Joyce, Zaid Abdo, and Keith
Crandall that shows how bad the Migrate program is; it's in Mol Ecol.
Most of the time, Migrate didn't find the simulated theta within the 95%
confidence interval... and worse yet, it was *really* far off most of
the time.
There is a program called IM by Jody Hey (he's at Rutgers) which does
consistently give you "ballpark" theta and migration estimates. I
typically do 10 identical runs to watch if they are converging, and I
mark down the estimates. You have to run 2 populations at a time, so
you get a theta_1, theta_2, theta_ancestral, m1, m2, and time since
divergence. I haven't used the program yet for usats, but it takes
sequence data and usats. I then cancel the runs that are really
different from the others... typically, 2 or 3 will converge on some
strange number while all the others will hit the same ballpark.
Then I
take the run that falls in the middle of that convergent set and
let it
run until the effective sample sizes (ESSs) for time to convergence
are
well over 1000. This means I let the thing run for a few weeks.
You'll
get an output file that has probabilities for each of the estimates
(theta, m, and t) and 90% posterior density confidence intervals. I
used IM for my Mol Ecol paper - it's in the Dec 2005 issue... cave
crayfish with Keith Crandall. And Rasmus Nielsen and Jody Hey are
great
and have published in Genetics using usat data for the IM program -
they
actually write back and respond to your questions, unlike the Lamarc
crew.
I hope that helps,
Jen
Other people may have more detailed/technical suggestions...
A simple thing you can do is repeat the analysis. If the results are
very different (e.g., no or small overlap in the confidence intervals
for parameters), you definitely need to run it for longer. Also, if the
confidence intervals are large, you may need to run it for longer,
although I would not like to specify how large is too large.
Hope this helps.
Karen Bell [karen.bell@wku.edu]
I would suggest you start with FST values. Do some runs with, say, 10
short chains and 1 long chain, starting with different random seed
numbers. Try to play with the parameters that explore the sample space,
such as how many trees to discard, etc. Compare the results and see if
they converged to something common or if the results are completely
disparate. You will acquire confidence in your runs as you see the
results and experiment with the parameters. Once you are confident with
the parameters, prepare the "official" run with the parameter set you
have determined in your trials. Follow the manual for that and try to
optimize the number of runs in relation to the computer time and power
you have available. Do three or four runs, starting the first with FST
parameters and the subsequent ones with what you obtained in the
previous run. Finish with a single very long run.
Good luck
Julianno
Hello Tatiana. I have run Migrate ad infinitum, and so I hope I can
offer some good suggestions. First I'll give you my advice, and then a
blurb from something I wrote on MCMC sampling a while ago, if you are
interested.
Running the Program
1) Run it multiple times, starting from random positions in parameter
space. If you get consistent results, you can be confident that you have
sampled adequately.
2) Use heated chains, typically one cold and three hot (NOTE: this is
not a default for the program, so you will have to specify this in
your
parmfile). This will greatly improve convergence and mixing of the
chains. Please note, however, that adding heated chains will slow
down
your analysis substantially (you are doing four searches rather than
one), so don't use more than four chains (unless you have access to
some
sort of NASA computer).
3) For preliminary runs, you may want to use the Brownian motion
model.
This is much (much) faster than the ladder model, and will give you
reasonable results. If you like, you can take the results from the
Brownian motion model and use them as starting parameters in the
ladder
model.
MCMC Sampling
The concern with MCMC analyses is whether we have achieved
convergence,
either to the posterior probability distribution or the global maximum
peak on a likelihood surface (you didn't mention whether you are
running
the ML or Bayesian version of Migrate). Though we can never prove
that
we are sampling from these distributions, there are steps we can
take to
increase our confidence (Tierney, 1994). First, we can run the
analysis
multiple times starting from random points in parameter space; if the
same results are obtained across runs we can be fairly confident that
the chain has converged on the desired distribution. Secondly, we can
run multiple chains simultaneously with communication between the
chains, preferably with "heating" (see below). This latter method not
only increases the likelihood of convergence, but also increases the
mixing ability of the chain (i.e. explores isolated peaks in parameter
space).
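The "same results across runs" check can be made quantitative with the Gelman-Rubin potential scale reduction factor. Below is a minimal sketch on toy chains; this is not a diagnostic built into Migrate, and `gelman_rubin` is my own helper name:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a list of
    equal-length 1-D sample arrays; values near 1 suggest that the
    independent runs have converged to the same distribution."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = [rng.normal(0, 1, 2000) for _ in range(4)]       # runs agree
bad = [rng.normal(mu, 1, 2000) for mu in (0, 0, 3, 3)]  # runs disagree
print(gelman_rubin(good))  # close to 1.0
print(gelman_rubin(bad))   # well above 1.1: keep running
```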
Clearly, the more samples that are taken, the better the approximation.
But how many is enough? The truth is that we can never be absolutely
sure that we have collected an appropriate number of samples. In other
words, it is not possible to determine suitable run lengths
theoretically, and so this requires some experimentation on the part of
the user. The concerns here are: 1) is the MCMC sample representative of
the distribution that it was sampled from, and 2) are enough samples
collected to
estimate a particular parameter with reasonable precision (i.e. low
variance). Roughly speaking, Monte Carlo error decreases as the
square
root of the number of iterations (e.g. to reduce error by a factor of
10, increase the number of iterations by a factor of 100). As for the
number of short/long chains you will require, this will depend much
more
on the number of populations than the number of samples, as each
population adds parameters to be estimated.
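The square-root scaling of Monte Carlo error described above is easy to verify on a toy estimator; this is only an illustration of the scaling, independent of any of the programs discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_error(n_iter, n_reps=1000):
    """Standard deviation of a Monte Carlo mean estimate (of a standard
    normal) across many replicate runs of n_iter samples each."""
    estimates = rng.normal(0.0, 1.0, size=(n_reps, n_iter)).mean(axis=1)
    return estimates.std()

e_small = mc_error(100)
e_large = mc_error(10000)
# 100x more iterations -> roughly 10x smaller Monte Carlo error
print(e_small / e_large)
```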
A strict number of samples is not the only concern in an MCMC search,
however, because samples from a Markov chain are autocorrelated. In
other words, the absolute number of samples taken is far greater than
the effective number of samples. There are two strategies to get around
this: 1) take a far greater number of samples, or 2) thin your Markov
chain. Option 2 appears to be the preferred method, in part because
autocorrelation can be directly measured and thus controlled for.
Using
an autocorrelation plot from a pilot run, the lag time (n, the
number of
iterations) required for effective independence of samples can be
determined. The user can then take samples every n iterations in the
actual analysis and be confident that samples are roughly independent.
Regardless of the strategy taken, dealing with autocorrelation of MCMC
samples requires far greater run times.
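Assuming the sampled parameter values sit in a plain array, the lag needed for rough independence can be read off an autocorrelation estimate, as sketched below. An AR(1) chain stands in for MCMC output, and the helper names are hypothetical:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a 1-D series at a given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def thinning_lag(x, threshold=0.1, max_lag=1000):
    """Smallest lag at which autocorrelation drops below threshold."""
    for lag in range(1, max_lag):
        if autocorr(x, lag) < threshold:
            return lag
    return max_lag

# AR(1) chain with strong autocorrelation, a stand-in for MCMC output
rng = np.random.default_rng(0)
x = np.empty(50000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

lag = thinning_lag(x)
thinned = x[::lag]  # keep every lag-th sample: roughly independent
print(lag, autocorr(thinned, 1))
```

Taking every lag-th sample in the actual analysis then yields the roughly independent draws the paragraph above describes, at the cost of a proportionally longer run.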
Finally, you should use "heating" (Metropolis-Coupled MCMC) in your
Migrate runs. Without getting too technical here, heating serves to
allow chains to move more easily through parameter space by lowering
peaks and, more importantly, decreasing the depths of valleys. You
can
imagine that parameter space would be better explored if for some
chains
insurmountable valleys did not exist. The benefits of heating are
undoubtedly great, but they also come with a price. Each additional
heated chain added to the analysis considerably increases the time to
completion. The reason is simple: within each chain, each iteration
requires the calculation of a computationally expensive likelihood
function; running n chains therefore requires n calculations of the
likelihood function each iteration. What is more, each chain
requires a
burnin, which is wasted computing effort. This constraint forces
investigators to weigh the need to run multiple heated chains (at least
4 chains are required for sufficient mixing) to better explore parameter
space against the need to run the cold chain long enough to obtain a
sufficiently valid sample from the posterior probability distribution
from which to draw meaningful conclusions. The recent advent of parallel
computing, however, has
greatly diminished this conflict. If you do not have access to
parallel
computing, then I would go with four chains, one cold and three hot
(i.e. chain temperatures of 1, 1.2, 1.5, and 3.0).
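The heating scheme described here (Metropolis-coupled MCMC with the suggested temperatures 1, 1.2, 1.5, and 3.0) can be sketched on a toy bimodal target. This is a bare-bones illustration of the swap mechanics on a made-up distribution, not Migrate's implementation:

```python
import numpy as np

def log_target(x):
    """Toy bimodal target: two well-separated normal peaks at -5, +5."""
    return np.logaddexp(-0.5 * (x + 5) ** 2, -0.5 * (x - 5) ** 2)

def mc3(n_iter=50000, temps=(1.0, 1.2, 1.5, 3.0), seed=0):
    """Metropolis-coupled MCMC: one chain per temperature. Hot chains
    see a flattened target (exponent 1/T), so they cross the valley
    between peaks; neighboring chains occasionally swap states, which
    carries new peaks down to the cold chain."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(temps))
    cold_samples = []
    for _ in range(n_iter):
        # Metropolis update within each chain at its own temperature
        for i, T in enumerate(temps):
            prop = x[i] + rng.normal(0, 1)
            if np.log(rng.random()) < (log_target(prop) - log_target(x[i])) / T:
                x[i] = prop
        # propose a swap between one random neighboring pair of chains
        i = rng.integers(len(temps) - 1)
        log_swap = ((log_target(x[i + 1]) - log_target(x[i]))
                    * (1 / temps[i] - 1 / temps[i + 1]))
        if np.log(rng.random()) < log_swap:
            x[i], x[i + 1] = x[i + 1], x[i]
        cold_samples.append(x[0])
    return np.array(cold_samples)

samples = mc3()
# the cold chain should visit both peaks; an unheated Metropolis chain
# with the same step size would stay stuck on one side
print((samples < 0).mean(), (samples > 0).mean())
```

Without the swaps, the cold chain alone essentially never crosses the valley between the two peaks, which is exactly the mixing problem heating is meant to solve.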
If you are interested, you can read a fuller description of MCMC in my
(now somewhat antiquated) paper, available here:
http://www.ummz.lsa.umich.edu/students/josephwb/
Brown_Bayesian_Paper.pdf
Hope this helps, and good luck.
Joseph.