Dear Tatjana and other evoldir members,

I want to make a couple of comments on some of the comments you got.

*** Consistency of results of ML programs (such as migrate, fluctuate, lamarc, ...) and Bayesian programs (such as IM, etc.)

MCMC is difficult, as Joseph clearly outlined; programs based on MCMC can fail for various reasons, the most common one being that the programs are not run long enough. This is especially serious with data that is not very informative (for example, has only a few variable sites) and where researchers have sampled many individuals.

- Few variable sites: It is obvious that results will be hard to obtain if there is nothing to show differentiation. With MCMC-ML (likelihood approximated by MCMC) almost every change will be accepted, resulting in a Brownian-motion walk through the solution space, and of course short runs will deliver different results; MCMC works only in the limit, so for such data every run might be too short. MCMC-Bayes, in contrast to MCMC-ML, changes the driving value (often called parameter_0 or Theta_0) using a prior distribution for the MCMC chain more often than MCMC-ML runs, where the driving value is typically fixed. If no information is contained in the data, the prior defines the outcome. It is important to recognize that under Bayes a failure to achieve good results delivers the prior distribution (several runs will actually deliver the same result when run with the same prior), whereas ML delivers different MLE estimates and we then consider the results inconsistent. With Bayesian programs like IM you need to be able to distinguish an informative posterior (the data is overpowering the prior) from a posterior that is similar to the prior (the prior is overpowering the data). To my knowledge, neither IM nor the other Bayesian programs out there (migrate-bayes included) produce good statistics on this yet.
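Since none of these programs report such a statistic yet, a crude do-it-yourself check is to compare samples drawn from the prior against the posterior samples of a run. Below is a minimal sketch in plain Python (the helper names and the toy numbers are made up for illustration; real runs would read the sampled parameter values from the program's output files):

```python
import random

def histogram(samples, lo, hi, bins=40):
    """Normalized histogram (sums to 1) over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        i = min(max(int((x - lo) / width), 0), bins - 1)
        counts[i] += 1
    total = len(samples)
    return [c / total for c in counts]

def overlap(prior_samples, posterior_samples, lo, hi, bins=40):
    """Overlap coefficient between prior and posterior histograms:
    near 1.0 means posterior ~ prior (the prior is overpowering the data);
    near 0 means the data has moved the posterior away from the prior."""
    p = histogram(prior_samples, lo, hi, bins)
    q = histogram(posterior_samples, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(p, q))

random.seed(1)
prior = [random.uniform(0.0, 0.1) for _ in range(20000)]          # flat prior on Theta
informative = [random.gauss(0.02, 0.003) for _ in range(20000)]   # posterior concentrated by data
uninformative = [random.uniform(0.0, 0.1) for _ in range(20000)]  # posterior ~ prior

print(overlap(prior, informative, 0.0, 0.1))    # small: data overpowers prior
print(overlap(prior, uninformative, 0.0, 0.1))  # near 1: prior overpowers data
```

Rerunning with a different prior and checking that the posterior stays put is the companion test mentioned above.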
An immediate solution is to use different prior settings and compare the posteriors (if the data is informative, they will look the same). Of course one can always run these programs with no data and compare with the data runs, but this will work only if all parameters fail to be pushed by the data [the population sizes are most often much easier to estimate than the migration rates].

- Variable data but a very complex scenario: I often get email about difficulties running on machine x that turn out to be problems unrelated to computers but related to the researchers (and perhaps the manual). Programs like migrate can estimate a large number of parameters: in fact from 1 to n*n+1, where n is the number of populations. Imagine a researcher has investigated 10 populations -> typically a default migrate run will try to estimate 100 parameters. Now, if you sample 20 individuals per population you will have 200 individuals; for single-locus data this is not a lot of data to estimate the 100 parameters, and most often such an analysis will fail because many of the individuals are "relatives" and so not perfect replicates. Would you run a regression analysis with 100 parameters and 200 data points? Probably not with lots of confidence. Another problem is that sometimes researchers have sampled 5 populations (25 parameters) and have 300 individuals per population. In principle this is good, but because of the MCMC runs we need to calculate likelihoods on large trees (1500 tips for the example); in phylogenetic research, people believe that this is a __very__ difficult problem. As a result the MCMC runs will not explore the solution space exhaustively; in short, the runs will be too short and return inconsistent results. A remedy for this is to subsample each population randomly and run 10 or 20 individuals per population.
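The parameter-count arithmetic and the subsampling remedy can be sketched in a few lines (hypothetical helper names; the count assumes the full migration-matrix model with n population sizes plus n*(n-1) migration rates, matching the 10-population example above):

```python
import random

def n_parameters(n_pops):
    """Full migration-matrix model: n population sizes (Theta)
    plus n*(n-1) pairwise migration rates."""
    return n_pops + n_pops * (n_pops - 1)

def subsample(population_samples, k, seed=42):
    """Randomly keep at most k individuals per population."""
    rng = random.Random(seed)
    return {pop: rng.sample(inds, min(k, len(inds)))
            for pop, inds in population_samples.items()}

print(n_parameters(10))  # 100 parameters for a default 10-population run
print(n_parameters(5))   # 25 parameters for 5 populations

# e.g. 5 populations with 300 sampled individuals each:
data = {f"pop{i}": [f"pop{i}_ind{j}" for j in range(300)] for i in range(5)}
small = subsample(data, 20)
print(sum(len(v) for v in small.values()))  # 100 tips instead of 1500
```

The regression analogy then becomes concrete: 100 parameters against 200 (non-independent) data points.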
Pluzhnikov and Donnelly (1996) and Felsenstein (2006) describe that results from coalescent studies show the smallest variance when sampling about 6-10 individuals per locus but using many loci: a study with 1500 individuals and 1 locus will produce worse results than a study with 150 individuals and 10 loci.

*** Estimates from many populations (e.g. migrate, lamarc, batwing, genetree, ...) versus pairwise estimators (FST based, IM)

In the standard statistical literature they tell you that it is a very bad thing to test pairs of sets of data when there is interaction among all sets; that's why people use Bonferroni corrections and the like. For estimation of a molecular clock or time of divergence, in earlier times researchers used pairs and calibrated Nei distances against geological events; these procedures have been almost completely abandoned because superior methods arrived that can take all lineages jointly into account (for example, mdivtime). The same caveat holds for population genetics studies; I do not see why pairwise methods (i.e. IM) should work better than others. For what it's worth, you can run migrate on pairs, too, but that is not what I would suggest. The comparison of multi- versus pairwise estimators of population sizes and migration rates in a structured population certainly needs more attention [Beerli 2004]. Of course, if you believe that your populations have very recently split AND you have multiple loci, then you definitely should try IM.

*** Critique of migrate by Abdo et al.

Abdo et al. (ACJ) criticized migrate for not being able to estimate migration rates. In my opinion this critique overshoots quite a bit, but you will want to judge for yourself.
ACJ ran a 9-cell simulation study with various combinations of population sizes and migration rates:

- 4 cells with moderate or high population size and moderate or high migration rates delivered OK results.
- 2 cells (with the low population size and moderate to high migration rates) used data sets that average about 1 or 2 variable sites per data set [see ***Consistency], and migrate does not work well on this (I am curious to see which program does [preliminary results of migrate-bayes might indicate the problem better than migrate-ML]).
- 3 cells showed a very high number of variable sites; these 3 cells are all associated with a very low migration rate (a migrant every 100000+ generations). Simulating data for a two-population scenario thus forces a single migration event at the bottom of the tree and two very long branches on which each site will change state many times (randomized), and as a result the populations will be about 25% similar per site. It would be interesting to see other programs' results under such a scenario, although I doubt that this is interesting for any biologically relevant data.

When the article came out I thought of writing a reply but then reconsidered, thinking that people would spot the obvious (*), but recently I realized that it still would be worthwhile; hopefully by summer I will have a more general manuscript on the topic submitted.

(*) No attempt was made to explore whether the program was run too short; in fact, for all simulation cells migrate was run at the default values. I had the impression all the discussion in the manual about getting good results was ignored. One fact is certainly true, but we know this already: if we run MCMC-ML too short, the profile likelihoods deliver too narrow confidence limits. MCMC-Bayes seems to fare somewhat better on this, but then see ***Consistency.

*** Email and response to inquiries

Jen complained about unresponsiveness of the lamarc team.
Well, migrate runs its own website (http://popgen.scs.fsu.edu/ ) and I respond to all email I get about migrate (of course occasionally an email might fall prey to our spam filter or get buried in my email pile). I have received many email questions about how to run migrate (not from Tatjana, though), and typically they fall into a couple of categories:

- Seeking advice on how to run the program (without reading the manual): the answer is often very easy and helps to run the program.
- Not able to run the program, although having used the manual to construct the data file: the most common problem is that there are errors in the data file (miscounted individuals or sites, added blank lines, etc.) or that the program executable was transferred from one computer to another (mac->windows/unix). The data format is definitely not the greatest of all, but with a simple text editor this typically works out fine; for microsatellite data you might want to check out MSA.
- The program runs but takes too long, or returns inconsistent results: see under ***Consistency.
- Trying to do likelihood ratio tests or other "obscure" features that need more explanation in the manual. Of course, getting feedback helps to improve the manual.
- Very few people complain that migrate does not run at all, but they never reply to say _why_ it does not run, so there might be a silent majority out there that is not able to run the program; it is very hard to know when the silent majority is not speaking up.

It is great that the IM team gives good feedback; I always had the impression that the lamarc team in Seattle also gives good feedback, though.
Peter

----
Peter Beerli, Computational Evolutionary Biology Group
School of Computational Science (SCS) and Biological Sciences Department
150-T Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120 USA
Webpage: http://www.csit.fsu.edu/~beerli
Phone: 850.645.1324 Fax: 850.644.0094

On Mar 25, 2006, at 1:13 AM, evoldir@evol.biology.mcmaster.ca wrote:

Dear Members,

Many thanks for all your helpful suggestions with the program Migrate. Some people pointed out the fact that it is not a very reliable program, and others gave some suggestions for the program parameters. I've added the answers below just in case someone else might need them.

Many thanks,
Tatiana

Tatiana Zerjal
The Sanger Institute UK
tz1@sanger.ac.uk

Hi Tatiana,

I've played around with Migrate some myself. Before you continue, I would take a serious look at the following article: Evaluating the performance of likelihood methods for detecting population structure and migration. Zaid Abdo, Keith A. Crandall, Paul Joyce. Molecular Ecology 2004, 13(4):837. They find that the results from Migrate are not very reliable or accurate. I might choose different software (Jody Hey's IM, Pritchard's STRUCTURE, etc.) to get at this question.

-Jeff

Hi Tatiana -

I've had major problems with consistency for the Migrate, Fluctuate, and Lamarc programs, and despite multiple runs of the same dataset, I never got convergence. Depending on which way the wind blows, you will get all kinds of numbers. Like flipping a coin. It doesn't matter what your starting parameters are or the number of chains... the programs are horrible. So, I suggest you steer clear of that batch of programs - there's a paper about Migrate by Paul Joyce, Zaid Abdo, and Keith Crandall that shows how bad the Migrate program is - it's in Mol Ecol. Most of the time, Migrate didn't find the simulated theta within the 95% confidence interval... and worse yet, it was *really* far off most of the time.
There is a program called IM by Jody Hey (he's at Rutgers) which does consistently give you "ballpark" theta and migration estimates. I typically do 10 identical runs to watch whether they are converging, and I mark down the estimates. You have to run 2 populations at a time, so you get theta_1, theta_2, theta_ancestral, m1, m2, and the time since divergence. I haven't used the program yet for usats, but it takes sequence data and usats. I then cancel the runs that are really different from the others... typically, 2 or 3 will converge on some strange number while all the others will hit the same ballpark. Then I take the run that falls in the middle of that convergent set and let it run until the effective sample sizes (ESSs) for time to convergence are well over 1000. This means I let the thing run for a few weeks. You'll get an output file that has probabilities for each of the estimates (theta, m, and t) and 90% posterior density confidence intervals. I used IM for my Mol Ecol paper - it's in the Dec 2005 issue... cave crayfish with Keith Crandall. And Rasmus Nielsen and Jody Hey are great and have published in Genetics using usat data with the IM program - they actually write back and respond to your questions, unlike the Lamarc crew. I hope that helps,

Jen

Other people may have more detailed/technical suggestions... A simple thing you can do is repeat the analysis. If the results are very different (e.g. no or small overlap in the confidence intervals for parameters) you definitely need to run it for longer. Also, if the confidence intervals are large, you may need to run it for longer, although I would not like to specify how large is too large. Hope this helps.

Karen Bell [karen.bell@wku.edu]

I would suggest you start with FST values. Do some runs with, say, 10 short chains and 1 long chain, starting with different random seed numbers. Try to play with the parameters that explore the sample space, such as how many trees to discard, etc.
Compare the results and see if they converged to something common or if the results are completely disparate. You will acquire confidence in your runs as you see the results and experiment with the parameters. Once you are confident with the parameters, prepare the "official" run with the parameter set you have determined in your trials. Follow the manual for that and try to optimize the number of runs in relation to the computer time and power you have available. Do three or four runs, starting the first with FST parameters and the subsequent ones with what you obtained in the previous run. Do a single very long run at the end. Good luck,

Julianno

Hello Tatiana,

I have run Migrate ad infinitum, and so I hope I can offer some good suggestions. First I'll give you my advice, and then a blurb from something I wrote on MCMC sampling a while ago, if you are interested.

Running the Program

1) Run it multiple times, starting from random positions in parameter space. If you get consistent results, you can be confident that you have sampled adequately.

2) Use heated chains, typically one cold and three hot (NOTE: this is not a default for the program, so you will have to specify this in your parmfile). This will greatly improve convergence and mixing of the chains. Please note, however, that adding heated chains will slow down your analysis substantially (you are doing four searches rather than one), so don't use more than four chains (unless you have access to some sort of NASA computer).

3) For preliminary runs, you may want to use the Brownian motion model. This is much (much) faster than the ladder model and will give you reasonable results. If you like, you can take the results from the Brownian motion model and use them as starting parameters for the ladder model.
MCMC Sampling

The concern with MCMC analyses is whether we have achieved convergence, either to the posterior probability distribution or to the global maximum peak on a likelihood surface (you didn't mention whether you are running the ML or the Bayesian version of Migrate). Though we can never prove that we are sampling from these distributions, there are steps we can take to increase our confidence (Tierney, 1994). First, we can run the analysis multiple times starting from random points in parameter space; if the same results are obtained across runs, we can be fairly confident that the chain has converged on the desired distribution. Second, we can run multiple chains simultaneously with communication between the chains, preferably with "heating" (see below). This latter method not only increases the likelihood of convergence but also increases the mixing ability of the chain (i.e., it explores isolated peaks in parameter space). Clearly, the more samples that are taken, the better the approximation. But how many is enough? The truth is that we can never be absolutely sure that we have collected an appropriate number of samples. In other words, it is not possible to determine suitable run lengths theoretically, so this requires some experimentation on the part of the user. The concerns here are: 1) is the MCMC sample representative of the distribution that it was sampled from, and 2) have enough samples been collected to estimate a particular parameter with reasonable precision (i.e., low variance)? Roughly speaking, Monte Carlo error decreases as the square root of the number of iterations (e.g., to reduce the error by a factor of 10, increase the number of iterations by a factor of 100). As for the number of short/long chains you will require, this will depend much more on the number of populations than on the number of samples, as each population adds parameters to be estimated.
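The square-root rule is easy to verify numerically on a toy Monte Carlo problem (plain Python, illustrative only; it estimates a simple mean rather than a coalescent likelihood):

```python
import math
import random

def mc_standard_error(n_iterations, n_replicates=500, seed=7):
    """Estimate E[X] for X ~ Uniform(0, 1) from n_iterations samples,
    repeat n_replicates times, and return the standard deviation of the
    replicate estimates, i.e. the Monte Carlo error."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        total = sum(rng.random() for _ in range(n_iterations))
        estimates.append(total / n_iterations)
    mean = sum(estimates) / n_replicates
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / n_replicates)

err_100 = mc_standard_error(100)
err_10000 = mc_standard_error(10000)
print(err_100 / err_10000)  # close to 10: 100x the iterations, 10x less error
```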
A strict number of samples is not the only concern in an MCMC search, however, because samples from a Markov chain are autocorrelated. In other words, the absolute number of samples taken is far greater than the effective number of samples. There are two strategies to get around this: 1) take a far greater number of samples, or 2) thin your Markov chain. Option 2 appears to be the preferred method, in part because autocorrelation can be directly measured and thus controlled for. Using an autocorrelation plot from a pilot run, the lag time (n, the number of iterations) required for effective independence of samples can be determined. The user can then take samples every n iterations in the actual analysis and be confident that the samples are roughly independent. Regardless of the strategy taken, dealing with the autocorrelation of MCMC samples requires far greater run times.

Finally, you should use "heating" (Metropolis-Coupled MCMC) in your Migrate runs. Without getting too technical here, heating serves to allow chains to move more easily through parameter space by lowering peaks and, more importantly, decreasing the depths of valleys. You can imagine that parameter space would be better explored if, for some chains, insurmountable valleys did not exist. The benefits of heating are undoubtedly great, but they also come with a price. Each additional heated chain added to the analysis considerably increases the time to completion. The reason is simple: within each chain, each iteration requires the calculation of a computationally expensive likelihood function; running n chains therefore requires n calculations of the likelihood function per iteration. What is more, each chain requires a burn-in, which is wasted computing effort.
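The lag-finding step described above can be sketched as follows. This runs on a synthetic AR(1) trace standing in for a pilot MCMC run, not on Migrate output, and the helper names are made up:

```python
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence at a given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def thinning_lag(chain, threshold=0.1, max_lag=200):
    """Smallest lag at which the autocorrelation drops below the threshold;
    keeping every that-many-th sample gives roughly independent draws."""
    for lag in range(1, max_lag):
        if autocorrelation(chain, lag) < threshold:
            return lag
    return max_lag

# Pilot chain: an AR(1) process standing in for a strongly
# autocorrelated MCMC trace (correlation 0.9 between successive states).
rng = random.Random(3)
chain, x = [], 0.0
for _ in range(20000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)

lag = thinning_lag(chain)
thinned = chain[::lag]
print(lag, len(thinned))  # the thinned trace is many times shorter
```

The ratio len(chain) / len(thinned) is a rough picture of how much larger the absolute sample must be than the effective sample.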
This constraint forces investigators to consider the tradeoff between the necessity of running multiple heated chains (at least 4 chains are required for sufficient mixing) to better explore parameter space and the requirement of running the cold chain long enough to obtain a sufficiently valid sample from the posterior probability distribution from which to draw meaningful conclusions. The recent advent of parallel computing, however, has greatly diminished this conflict. If you do not have access to parallel computing, then I would go with four chains, one cold and three hot (i.e., chain temperatures of 1, 1.2, 1.5, and 3.0). If you are interested, you can read a fuller description of MCMC in my (now somewhat antiquated) paper, available here: http://www.ummz.lsa.umich.edu/students/josephwb/Brown_Bayesian_Paper.pdf

Hope this helps, and good luck.

Joseph.
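For the curious, the heating idea can be demonstrated on a toy bimodal surface. This is a didactic sketch of Metropolis-coupled MCMC using the temperature ladder 1, 1.2, 1.5, 3.0 mentioned above, not Migrate's actual implementation, and the target function is invented for illustration:

```python
import math
import random

def log_target(x):
    """Toy bimodal log-density: two well-separated peaks, standing in
    for a likelihood surface with a (nearly) insurmountable valley."""
    return math.log(math.exp(-0.5 * (x + 4.0) ** 2)
                    + math.exp(-0.5 * (x - 4.0) ** 2))

def mc3(n_iter=60000, temps=(1.0, 1.2, 1.5, 3.0), seed=11):
    """Minimal Metropolis-coupled MCMC: one chain per temperature; hot
    chains see a flattened surface (log-density divided by T), and
    neighboring chains occasionally swap states. Only the cold chain
    (T = 1) is recorded."""
    rng = random.Random(seed)
    states = [-4.0] * len(temps)  # all chains start in the left peak
    cold_samples = []
    for _ in range(n_iter):
        # standard Metropolis update within each chain, tempered by T
        for i, t in enumerate(temps):
            prop = states[i] + rng.gauss(0.0, 1.0)
            if math.log(rng.random()) < (log_target(prop) - log_target(states[i])) / t:
                states[i] = prop
        # propose swapping the states of a random neighboring pair
        i = rng.randrange(len(temps) - 1)
        a, b = states[i], states[i + 1]
        log_r = (log_target(b) - log_target(a)) / temps[i] \
              + (log_target(a) - log_target(b)) / temps[i + 1]
        if math.log(rng.random()) < log_r:
            states[i], states[i + 1] = b, a
        cold_samples.append(states[0])
    return cold_samples

samples = mc3()
right = sum(1 for x in samples if x > 0) / len(samples)
print(right)  # the cold chain reaches both peaks via swaps from hot chains
```

A plain single-chain Metropolis run started at -4 would almost never cross the valley here; the hot chain crosses easily and hands its position down the ladder, which is exactly the benefit Joseph describes.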