Dear Members,

Many thanks for all your helpful suggestions with the program Migrate. Some people pointed out that it is not a very reliable program, and others gave suggestions for the program parameters. I've added the answers below in case someone else might need them.

Many thanks,
Tatiana

Tatiana Zerjal
The Sanger Institute, UK
tz1@sanger.ac.uk


Hi Tatiana,

I've played around with Migrate some myself. Before you continue, I would take a serious look at the following article:

Evaluating the performance of likelihood methods for detecting population structure and migration. Zaid Abdo, Keith A. Crandall, Paul Joyce. Molecular Ecology 2004, 13(4): 837.

They find that the results from Migrate are not very reliable or accurate. I might choose different software (Jody Hey's IM, Pritchard's STRUCTURE, etc.) to get at this question.

-Jeff


Hi Tatiana -

I've had major problems with consistency for the Migrate, Fluctuate, and Lamarc programs; despite multiple runs of the same dataset, I never got convergence. Depending on which way the wind blows, you will get all kinds of numbers, like flipping a coin. It doesn't matter what your starting parameters are or the number of chains; the programs are horrible. So I suggest you steer clear of that batch of programs. There's a paper about Migrate by Paul Joyce, Zaid Abdo, and Keith Crandall in Molecular Ecology that shows how badly the Migrate program performs: most of the time, Migrate didn't find the simulated theta within the 95% confidence interval, and worse yet, it was *really* far off most of the time.

There is a program called IM by Jody Hey (he's at Rutgers) which does consistently give you "ballpark" theta and migration estimates. I typically do 10 identical runs to watch whether they are converging, and I mark down the estimates. You have to run 2 populations at a time, so you get a theta_1, theta_2, theta_ancestral, m1, m2, and the time since divergence.
I haven't used the program yet for usats, but it takes both sequence data and usats. I then cancel the runs that are really different from the others; typically, 2 or 3 will converge on some strange number while all the others will hit the same ballpark. Then I take the run that falls in the middle of that convergent set and let it run until the effective sample sizes (ESSs) for time to convergence are well over 1000. This means I let the thing run for a few weeks. You'll get an output file that has probabilities for each of the estimates (theta, m, and t) and 90% posterior density confidence intervals. I used IM for my Mol Ecol paper (it's in the Dec 2005 issue), on cave crayfish, with Keith Crandall. And Rasmus Nielsen and Jody Hey are great and have published in Genetics using usat data with the IM program; they actually write back and respond to your questions, unlike the Lamarc crew.

I hope that helps,
Jen


Other people may have more detailed/technical suggestions... A simple thing you can do is repeat the analysis. If the results are very different (e.g. no or small overlap in the confidence intervals for parameters), you definitely need to run it for longer. Also, if the confidence intervals are large, you may need to run it for longer, although I would not like to specify how large is too large.

Hope this helps.
Karen Bell [karen.bell@wku.edu]


I would suggest you start with FST values. Do some runs with, say, 10 short chains and 1 long chain, starting with different random seed numbers. Try to play with the parameters that explore the sample space, such as how many trees to discard, etc. Compare the results and see if they converged to something common or if the results are completely disparate. You will acquire confidence in your runs as you see the results and experiment with the parameters. Once you are confident with the parameters, you prepare the "official" run with the parameter set you have determined in your trials.
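The replicate-run comparison suggested above (same data, different random seeds, then check whether the parameter estimates agree) can be scripted. Below is a minimal sketch in plain Python; the run labels, theta values, and confidence intervals are invented for illustration and would come from your own output files, not from Migrate directly.

```python
# Hypothetical sketch: compare replicate runs started from different
# random seeds by testing whether their confidence intervals overlap.
# All run labels and numbers below are made up for illustration.

def intervals_overlap(a, b):
    """True if two (low, high) confidence intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def check_convergence(runs):
    """runs: dict mapping run label -> (estimate, (ci_low, ci_high)).
    Returns the pairs of runs whose confidence intervals do NOT overlap;
    an empty list suggests the replicates agree."""
    labels = sorted(runs)
    disparate = []
    for i, x in enumerate(labels):
        for y in labels[i + 1:]:
            if not intervals_overlap(runs[x][1], runs[y][1]):
                disparate.append((x, y))
    return disparate

# Invented theta estimates from three replicate runs:
runs = {
    "seed_101": (0.012, (0.008, 0.018)),
    "seed_202": (0.014, (0.009, 0.020)),
    "seed_303": (0.051, (0.040, 0.065)),  # outlier: likely not converged
}
print(check_convergence(runs))
```

If the list is non-empty (as with the invented outlier run here), that is the signal, per the advice above, to run longer or discard the stray replicate before preparing the "official" run.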
Follow the manual for that and try to optimize the number of runs in relation to the computer time and power you have available. Do three or four runs, starting the first with FST parameters and each subsequent run with what you obtained in the previous one. Finish with a single very long run.

Good luck,
Julianno


Hello Tatiana,

I have run Migrate ad infinitum, and so I hope I can offer some good suggestions. First I'll give you my advice, and then a blurb from something I wrote on MCMC sampling a while ago, if you are interested.

Running the Program

1) Run it multiple times, starting from random positions in parameter space. If you get consistent results, you can be confident that you have sampled adequately.

2) Use heated chains, typically one cold and three hot (NOTE: this is not a default for the program, so you will have to specify it in your parmfile). This will greatly improve convergence and mixing of the chains. Please note, however, that adding heated chains will slow down your analysis substantially (you are doing four searches rather than one), so don't use more than four chains (unless you have access to some sort of NASA computer).

3) For preliminary runs, you may want to use the Brownian motion model. This is much (much) faster than the ladder model and will give you reasonable results. If you like, you can take the results from the Brownian motion model and use them as starting parameters for the ladder model.

MCMC Sampling

The concern with MCMC analyses is whether we have achieved convergence, either to the posterior probability distribution or to the global maximum peak on a likelihood surface (you didn't mention whether you are running the ML or Bayesian version of Migrate). Though we can never prove that we are sampling from these distributions, there are steps we can take to increase our confidence (Tierney, 1994).
First, we can run the analysis multiple times starting from random points in parameter space; if the same results are obtained across runs, we can be fairly confident that the chain has converged on the desired distribution. Secondly, we can run multiple chains simultaneously with communication between the chains, preferably with "heating" (see below). This latter method not only increases the likelihood of convergence, but also increases the mixing ability of the chain (i.e. it explores isolated peaks in parameter space).

Clearly, the more samples that are taken, the better the approximation. But how many is enough? The truth is that we can never be absolutely sure that we have collected an appropriate number of samples. In other words, it is not possible to determine suitable run lengths theoretically, so this requires some experimentation on the part of the user. The concerns here are: 1) is the MCMC sample representative of the distribution it was sampled from, and 2) have enough samples been collected to estimate a particular parameter with reasonable precision (i.e. low variance)? Roughly speaking, Monte Carlo error decreases as the square root of the number of iterations (e.g. to reduce the error by a factor of 10, increase the number of iterations by a factor of 100). As for the number of short/long chains you will require, this will depend much more on the number of populations than on the number of samples, as each population adds parameters to be estimated.

A strict number of samples is not the only concern in an MCMC search, however, because samples from a Markov chain are autocorrelated. In other words, the absolute number of samples taken is far greater than the effective number of samples. There are two strategies to get around this: 1) take a far greater number of samples, or 2) thin your Markov chain. Option 2 appears to be the preferred method, in part because autocorrelation can be directly measured and thus controlled for.
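The autocorrelation measurement and thinning described above can be applied directly to any parameter trace. Here is a minimal plain-Python sketch using a toy AR(1) series in place of a real trace from Migrate or IM; the crude effective-sample-size formula and every number are illustrative assumptions, not output from those programs.

```python
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of trace x at a given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def effective_sample_size(x, max_lag=100):
    """Crude ESS estimate: n / (1 + 2 * sum of autocorrelations),
    truncating the sum once the autocorrelation dies out."""
    s = 0.0
    for lag in range(1, max_lag):
        r = autocorrelation(x, lag)
        if r <= 0:
            break
        s += r
    return len(x) / (1 + 2 * s)

# Toy autocorrelated chain (AR(1)); a real analysis would read in the
# parameter trace written out by the sampler.
random.seed(1)
chain, v = [], 0.0
for _ in range(20000):
    v = 0.9 * v + random.gauss(0, 1)
    chain.append(v)

ess = effective_sample_size(chain)
thinned = chain[::20]  # keep every 20th sample to reduce autocorrelation
print(round(ess), len(thinned))
```

Note how 20,000 raw samples collapse to an effective sample size of roughly a thousand here, which is exactly why the absolute number of iterations overstates what the chain has really learned.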
Using an autocorrelation plot from a pilot run, the lag time (n, the number of iterations) required for effective independence of samples can be determined. The user can then take samples every n iterations in the actual analysis and be confident that the samples are roughly independent. Regardless of the strategy taken, dealing with the autocorrelation of MCMC samples requires far greater run times.

Finally, you should use "heating" (Metropolis-coupled MCMC) in your Migrate runs. Without getting too technical here, heating allows chains to move more easily through parameter space by lowering peaks and, more importantly, decreasing the depths of valleys. You can imagine that parameter space would be better explored if, for some chains, insurmountable valleys did not exist. The benefits of heating are undoubtedly great, but they come with a price: each additional heated chain considerably increases the time to completion. The reason is simple: within each chain, each iteration requires the calculation of a computationally expensive likelihood function, so running n chains requires n calculations of the likelihood function per iteration. What is more, each chain requires a burn-in, which is wasted computing effort. This constraint forced investigators to weigh the need for multiple heated chains (at least 4 are required for sufficient mixing) to better explore parameter space against the need to run the cold chain long enough to obtain a sufficiently valid sample from the posterior probability distribution from which to draw meaningful conclusions. The recent advent of parallel computing, however, has greatly diminished this conflict. If you do not have access to parallel computing, then I would go with four chains, one cold and three hot (i.e. chain temperatures of 1, 1.2, 1.5, and 3.0).
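For intuition, the heating scheme just described can be demonstrated on a toy problem. The sketch below uses the suggested temperature ladder (1, 1.2, 1.5, 3.0) on an invented bimodal density; it is a generic illustration of Metropolis coupling, not Migrate's actual implementation, and every other choice (step size, iteration count, density) is an assumption made up for the demo.

```python
import math, random

def log_density(x):
    """Invented target: two well-separated peaks at -4 and +4,
    with a deep valley between them."""
    a = math.exp(-(x + 4) ** 2) + math.exp(-(x - 4) ** 2)
    return math.log(a) if a > 0 else -1e300

def mc3(temps=(1.0, 1.2, 1.5, 3.0), iters=30000, step=1.0, seed=7):
    rng = random.Random(seed)
    states = [rng.uniform(-1, 1) for _ in temps]
    cold_samples = []
    for _ in range(iters):
        # Within-chain Metropolis update: chain i targets density^(1/T_i),
        # so hot chains see a flattened surface with shallower valleys.
        for i, t in enumerate(temps):
            prop = states[i] + rng.gauss(0, step)
            if math.log(rng.random()) < (log_density(prop) - log_density(states[i])) / t:
                states[i] = prop
        # Propose swapping states between a random adjacent pair of chains;
        # this is how peaks found by hot chains reach the cold chain.
        i = rng.randrange(len(temps) - 1)
        a, b = states[i], states[i + 1]
        log_ratio = (log_density(b) - log_density(a)) / temps[i] \
                  + (log_density(a) - log_density(b)) / temps[i + 1]
        if math.log(rng.random()) < log_ratio:
            states[i], states[i + 1] = b, a
        cold_samples.append(states[0])  # only the cold chain is recorded
    return cold_samples

samples = mc3()
# With heating, the cold chain should visit both peaks rather than
# staying stuck in whichever one it climbed first.
left = sum(1 for x in samples if x < 0) / len(samples)
print(round(left, 2))
```

A single unheated chain on this density would almost never cross the valley at 0; the hot chain crosses it routinely, and the swap moves hand that discovery down the ladder, which is the point of the advice above.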
If you are interested, you can read a fuller description of MCMC in my (now somewhat antiquated) paper, available here: http://www.ummz.lsa.umich.edu/students/josephwb/Brown_Bayesian_Paper.pdf

Hope this helps, and good luck.
Joseph.