Dear Members
Many thanks for all your helpful suggestions about the program Migrate.
Some people pointed out that it is not a very reliable program, and
others offered suggestions for the program parameters. I've included
the answers below in case someone else needs them.
Many thanks
Tatiana
Tatiana Zerjal
The Sanger Institute
UK
tz1@sanger.ac.uk
Hi Tatiana,
I've played around with Migrate some myself. Before you continue, I
would take a serious look at the following article:
Evaluating the performance of likelihood methods for detecting
population structure and migration
Zaid Abdo, Keith A. Crandall, Paul Joyce
Molecular Ecology (2004) 13(4): 837
They find that the results from Migrate are not very reliable or
accurate. I might choose different software (Jody Hey's IM, Pritchard's
STRUCTURE, etc.) to get at this question.
-Jeff
Hi Tatiana -
I've had major problems with consistency with the Migrate, Fluctuate,
and Lamarc programs; despite multiple runs of the same dataset, I never
got convergence. Depending on which way the wind blows, you will get
all kinds of numbers, like flipping a coin. It doesn't matter what
your starting parameters are or how many chains you use; the programs
are horrible. So I suggest you steer clear of that batch of programs.
There's a paper about Migrate by Paul Joyce, Zaid Abdo, and Keith
Crandall that shows how badly the Migrate program performs; it's in Mol
Ecol. Most of the time, Migrate didn't find the simulated theta within
the 95% confidence interval, and worse yet, it was *really* far off
most of the time.
There is a program called IM by Jody Hey (he's at Rutgers) which does
consistently give you "ballpark" theta and migration estimates. I
typically do 10 identical runs to watch if they are converging, and I
mark down the estimates. You have to run 2 populations at a time, so
you get a theta_1, theta_2, theta_ancestral, m1, m2, and time since
divergence. I haven't used the program yet for microsatellites (usats),
but it takes both sequence data and usats. I then cancel the runs that are really
different from the others... typically, 2 or 3 will converge on some
strange number while all the others will hit the same ballpark. Then I
take the run that falls in the middle of that convergent set and let it
run until the effective sample sizes (ESSs) for time to convergence are
well over 1000. This means I let the thing run for a few weeks. You'll
get an output file that has probabilities for each of the estimates
(theta, m, and t) and 90% posterior density confidence intervals. I
used IM for my Mol Ecol paper - it's in the Dec 2005 issue, on cave
crayfish with Keith Crandall. Rasmus Nielsen and Jody Hey have
published in Genetics using usat data with the IM program, and they are
great - they actually write back and respond to your questions, unlike
the Lamarc crew.
I hope that helps,
Jen
Other people may have more detailed/technical suggestions...
A simple thing you can do is repeat the analysis. If the results are
very different (e.g., no or only small overlap in the confidence
intervals for parameters), you definitely need to run it for longer.
Also, if the confidence intervals are large, you may need to run it for
longer, although I would not like to specify how large is too large.
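That repeat-and-compare check can be sketched in a few lines; the run labels and interval values below are hypothetical stand-ins, not real Migrate output:

```python
# Sketch: compare 95% confidence intervals for theta across repeated runs.
# The run names and numbers here are made up for illustration only.

def intervals_overlap(a, b):
    """True if two (low, high) confidence intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical 95% CIs for theta from three independent runs
runs = {"run1": (0.004, 0.009), "run2": (0.005, 0.011), "run3": (0.020, 0.035)}

names = list(runs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        if not intervals_overlap(runs[a], runs[b]):
            print(f"{a} and {b} disagree -> run longer")
```

If any pair of runs produces non-overlapping intervals, that is the signal to extend the chains.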
Hope this helps.
Karen Bell [karen.bell@wku.edu]
I would suggest you start with FST values. Do some runs with, say, 10
short chains and 1 long chain, starting with different random seed
numbers. Try to play with the parameters that explore the sample space,
such as how many trees to discard, etc. Compare the results and see if
they converged to something common or if the results are completely
disparate. You will gain confidence in your runs as you see the
results and experiment with the parameters. Once you are confident with
the parameters, you prepare the "official" run with the parameter set
you have determined in your trials. Follow the manual for that and try
to optimize the number of runs in relation to the computer time and
power you have available. Do three or four runs, starting the first
with FST parameters and each subsequent run with what you obtained in
the previous one. Finish with a single very long run.
Good luck
Julianno
Hello Tatiana. I have run Migrate ad infinitum, and so hope I can offer
some good suggestions. First I'll give you my advice, and then a blurb
from something I wrote on MCMC sampling a while ago, if you are
interested.
Running the Program
1) Run it multiple times, starting from random positions in parameter
space. If you get consistent results, you can be confident that you
have sampled adequately.
2) Use heated chains, typically one cold and three hot (NOTE: this is
not a default for the program, so you will have to specify this in your
parmfile). This will greatly improve convergence and mixing of the
chains. Please note, however, that adding heated chains will slow down
your analysis substantially (you are doing four searches rather than
one), so don't use more than four chains (unless you have access to some
sort of NASA computer).
3) For preliminary runs, you may want to use the Brownian motion model.
This is much (much) faster than the ladder model, and will give you
reasonable results. If you like, you can take the results from the
Brownian motion model and use them as starting parameters in the ladder
model.
MCMC Sampling
The concern with MCMC analyses is whether we have achieved convergence,
either to the posterior probability distribution or the global maximum
peak on a likelihood surface (you didn't mention whether you are running
the ML or Bayesian version of Migrate). Though we can never prove that
we are sampling from these distributions, there are steps we can take to
increase our confidence (Tierney, 1994). First, we can run the analysis
multiple times starting from random points in parameter space; if the
same results are obtained across runs we can be fairly confident that
the chain has converged on the desired distribution. Secondly, we can
run multiple chains simultaneously with communication between the
chains, preferably with "heating" (see below). This latter method not
only increases the likelihood of convergence, but also increases the
mixing ability of the chain (i.e. explores isolated peaks in parameter
space).
Clearly, the more samples that are taken, the better the approximation. But
how many is enough? The truth is that we can never be absolutely sure
that we have collected an appropriate number of samples. In other words,
it is not possible to determine suitable run-lengths theoretically, and
so this requires some experimentation on the part of the user. The
concerns here are: 1) is the MCMC sample representative of the
distribution that it was sampled from, and 2) are enough samples collected to
estimate a particular parameter with reasonable precision (i.e. low
variance). Roughly speaking, Monte Carlo error decreases as the square
root of the number of iterations (e.g. to reduce error by a factor of
10, increase the number of iterations by a factor of 100). As for the
number of short/long chains you will require, this will depend much more
on the number of populations than the number of samples, as each
population adds parameters to be estimated.
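The square-root scaling of Monte Carlo error can be seen in a toy simulation; this estimates a generic mean, nothing Migrate-specific:

```python
import random

# Illustrate that Monte Carlo standard error shrinks roughly as 1/sqrt(N):
# estimate the mean of a uniform(0, 1) variable at two sample sizes and
# compare the spread of the estimator across repeated trials.
random.seed(1)

def mc_error(n, reps=200):
    """Standard deviation of the sample-mean estimator over reps trials of size n."""
    means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]
    mu = sum(means) / reps
    return (sum((m - mu) ** 2 for m in means) / reps) ** 0.5

for n in (100, 10_000):
    print(n, round(mc_error(n), 4))
# The error at n = 10,000 should be roughly 10x smaller than at n = 100,
# i.e. 100x more iterations buys one extra decimal digit of precision.
```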
A strict number of samples is not the only concern in an MCMC search,
however, because samples from a Markov chain are autocorrelated. In
other words, the absolute number of samples taken is far greater than
the effective number of samples. There are two strategies to get around
this: 1) take a far greater number of samples, or 2) thin your Markov
chain. Option 2 appears to be the preferred method, in part because
autocorrelation can be directly measured and thus controlled for. Using
an autocorrelation plot from a pilot run, the lag time (n, the number of
iterations) required for effective independence of samples can be
determined. The user can then take samples every n iterations in the
actual analysis and be confident that samples are roughly independent.
Regardless of the strategy taken, dealing with autocorrelation of MCMC
samples requires far greater run times.
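A pilot-run autocorrelation check of this kind might look like the following sketch, which uses a toy AR(1) chain in place of real MCMC output:

```python
import random

# Sketch: measure the autocorrelation of a (toy) Markov chain and pick a
# thinning lag at which samples are roughly independent. The AR(1) chain
# below stands in for a pilot MCMC run; nothing here is Migrate-specific.
random.seed(0)

def ar1_chain(n, rho=0.9):
    """A simple autocorrelated chain: x[t] = rho * x[t-1] + noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + random.gauss(0.0, 1.0)
        out.append(x)
    return out

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((v - mu) ** 2 for v in xs) / n
    cov = sum((xs[i] - mu) * (xs[i + lag] - mu) for i in range(n - lag)) / n
    return cov / var

chain = ar1_chain(20_000)
# Smallest lag at which correlation has essentially died out
lag = next(k for k in range(1, 200) if autocorr(chain, k) < 0.05)
print("thin every", lag, "iterations")
thinned = chain[::lag]  # roughly independent samples
```

In practice the lag would come from an autocorrelation plot of a pilot run, and the sampling increment in the real analysis would be set to that lag.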
Finally, you should use "heating" (Metropolis-Coupled MCMC) in your
Migrate runs. Without getting too technical here, heating serves to
allow chains to move more easily through parameter space by lowering
peaks and, more importantly, decreasing the depths of valleys. You can
imagine that parameter space would be better explored if for some chains
insurmountable valleys did not exist. The benefits of heating are
undoubtedly great, but they also come with a price. Each additional
heated chain added to the analysis considerably increases the time to
completion. The reason is simple: within each chain, each iteration
requires the calculation of a computationally expensive likelihood
function; running n chains therefore requires n calculations of the
likelihood function each iteration. What is more, each chain requires a
burn-in, which is wasted computing effort. This constraint has forced
investigators to weigh the need for multiple heated chains (at least 4
are required for sufficient mixing) to better explore parameter space
against the need to run
the cold chain long enough to obtain a sufficiently valid sample from
the posterior probability distribution from which to draw meaningful
conclusions. The recent advent of parallel computing, however, has
greatly diminished this conflict. If you do not have access to parallel
computing, then I would go with four chains, one cold and three hot
(i.e. chain temperatures of 1, 1.2, 1.5, and 3.0).
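A stripped-down illustration of Metropolis-coupled MCMC with that temperature ladder is below; the bimodal target is a toy stand-in, not anything Migrate actually computes:

```python
import math
import random

# Minimal Metropolis-coupled MCMC (MC^3) on a toy bimodal 1-D target,
# using the suggested temperature ladder (1.0, 1.2, 1.5, 3.0). A lone
# cold chain struggles to cross the valley between the two modes; the
# hot chains flatten the surface and pass good states down via swaps.
random.seed(2)

def log_target(x):
    # Mixture of two well-separated normals (modes near -4 and +4).
    return math.log(math.exp(-0.5 * (x + 4) ** 2)
                    + math.exp(-0.5 * (x - 4) ** 2)
                    + 1e-300)  # guard against underflow far from the modes

temps = [1.0, 1.2, 1.5, 3.0]
states = [0.0] * len(temps)
cold_samples = []

for it in range(50_000):
    # Metropolis update within each chain, targeting p(x)^(1/T)
    for i, t in enumerate(temps):
        prop = states[i] + random.gauss(0.0, 1.0)
        delta = (log_target(prop) - log_target(states[i])) / t
        if random.random() < math.exp(min(0.0, delta)):
            states[i] = prop
    # Propose swapping the states of a random adjacent pair of chains
    i = random.randrange(len(temps) - 1)
    a = (1 / temps[i] - 1 / temps[i + 1]) * (
        log_target(states[i + 1]) - log_target(states[i]))
    if random.random() < math.exp(min(0.0, a)):
        states[i], states[i + 1] = states[i + 1], states[i]
    cold_samples.append(states[0])  # only the cold chain is recorded

# Both modes should now appear in the cold chain's samples.
```

Note that inference uses only the cold chain's samples; the heated chains exist solely to improve mixing, which is why each extra chain adds cost without adding usable samples.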
If you are interested, you can read a fuller description of MCMC in my
(now somewhat antiquated) paper, available here:
http://www.ummz.lsa.umich.edu/students/josephwb/Brown_Bayesian_Paper.pdf
Hope this helps, and good luck.
Joseph.