Dear Evoldir members,

A few days ago I sent some questions about a technical problem I was having in building phylogenies with both mitochondrial and nuclear markers. I am posting all the answers I received, since some people are facing the same problem. Thanks so much to everyone who took the time to answer me!

Obs: For now, I am using the MrBayes approach of partitioning datasets and it is apparently working.

Sibelle Vilaça - sibelletorres@gmail.com

2008/11/29

Dear Evoldir members,

I am working with mitochondrial and nuclear sequences (cytochrome b and b-fibrinogen intron 5) to try to delineate population structure, and I am trying to build a phylogenetic tree with these two markers. However, I am having problems preparing the input for some programs (I am currently working with MrBayes and PHYML). For each individual, I have sequences of both the nuclear and the mitochondrial marker. Should I analyze these two markers as two datasets? How can I do it? Should I concatenate them, treat them as a single sequence, and run a model test to determine the best substitution model? In a recent paper, Cabanne et al. (2008) concatenated the nuclear and mitochondrial sequences and used the IUPAC ambiguity codes for the polymorphic sites. Since I know the phase of my nuclear sequences, is there a way that I can use it in the phylogeny with the mitochondrial data?

Any comments would be of great help.

Sibelle

--
Sibelle Torres Vilaça
Laboratório de Biodiversidade e Evolução Molecular
ICB-UFMG
Av. Antonio Carlos, 6627 CP 486 - Sala L3-244
31270-010 Belo Horizonte, MG, Brasil
Tel: +55 31 3409-2566
sibelletorres@gmail.com

Dear Sibelle,

You should analyse your data separately. The markers are not linked and thus will not be co-inherited. The true tree for these markers would be highly reticulated. You will also need to account for the fact that there are two alleles for the nuclear gene. If you have heterozygous individuals, then PHASE is very good software for dealing with this.
Cheers
Brent

Sibelle,

Both the programs you mention allow you to partition a single data set and use different parameters for each partition in a combined analysis. That seems a good way for you to proceed here. Check the respective manuals for details, but it's not difficult.

You can also analyze them together under a single model. Using a gamma distribution of site rates can cover much of the difference among partitions, and the results would not be useless. This would also allow you to use some fast likelihood programs like RAxML and GARLI.

In MrBayes, at least, you can't run partitioned analyses using different basic models; only the parameters can vary among partitions. But all other models are special cases of GTR + I + G, so MrBayes will approximate whatever model is appropriate for each partition.

Here, for comparison, is a MrBayes block that does a partitioned analysis of a mitochondrial data set, by codon position.

begin mrbayes;
    lset nst=6 rates=invgamma;
    charset 1position = 1-10858\3;
    charset 2position = 2-10859\3;
    charset 3position = 3-10860\3;
    partition by_position = 3:1position, 2position, 3position;
    set partition = by_position;
    unlink shape=(all);
    unlink revmat=(all);
    unlink pinvar=(all);
    unlink statefreq=(all);
    showmodel;
    mcmcp ngen00000 samplefreq=500 printfreq00 savebrlens=yes nchains=4;
    quit;
end;

Regards,
John H.

Hi,

If model testing selects different models for the two genes when they are analysed independently, it would be inappropriate to concatenate them as a single sequence. Good luck with your analysis.

David Liberles
University of Wyoming

It depends on your data, but it's likely that you'll want to partition each gene by codon position and run a separate model on each. Below is an example from a dataset I have, with three genes (28S, EF1a, and wingless). It's possible, however, that making this many partitions will create problems with too many parameters.
There's a program called Tracer (tree.bio.ed.ac.uk/software/tracer/) that will help check whether there are problems (based on the MrBayes output). One nice thing about it is that you can do the run fully partitioned and look at how the parameters converged in the different partitions, so if they are similar (for example, if first and second positions are both extremely slow-evolving) you might want to combine them in order to reduce the number of parameters. The paper by Praz et al. 2008 (Mol. Phyl. Evol. 49(1):185-197) gives a good explanation of using it. Here's how partitioning is done in a MrBayes block:

BEGIN mrbayes;
    charset 28S=1-1599 1600;
    charset EF1=1601-2176 2428-2703 3017-3268;
    charset EF1nt1=1601-2174\3 2428-2701\3 3017-3266\3;
    charset EF1nt2=1602-2175\3 2429-2702\3 3018-3267\3;
    charset EF1nt3=1603-2176\3 2430-2703\3 3019-3268\3;
    charset EF1introns=2177 2178-2426 2427 2704 2705-3015 3016 3269;
    charset wg=3270-3430 3432-3914;
    charset wgnt1=3272-3428\3 3432-3912\3;
    charset wgnt2=3270-3429\3 3433-3913\3;
    charset wgnt3=3271-3430\3 3434-3914\3;
    partition codongene = 8:28S,EF1nt1,EF1nt2,EF1nt3,EF1introns,wgnt1,wgnt2,wgnt3;
    set partition = codongene;

    [models]
    [28S: GTR+I+G]
    lset applyto=(1) nst=6 rates=invgamma;
    prset applyto=(1) statefreqpr=dirichlet(1,1,1,1);
    [EF1 nt1: HKY+I+G]
    lset applyto=(2) nst=2 rates=invgamma;
    prset applyto=(2) statefreqpr=dirichlet(1,1,1,1);
    [EF1 nt2: GTR+I+G]
    lset applyto=(3) nst=6 rates=invgamma;
    prset applyto=(3) statefreqpr=dirichlet(1,1,1,1);
    [EF1 nt3: HKY+G]
    lset applyto=(4) nst=2 rates=gamma;
    prset applyto=(4) statefreqpr=dirichlet(1,1,1,1);
    [EF1 introns: GTR+I+G]
    lset applyto=(5) nst=6 rates=invgamma;
    prset applyto=(5) statefreqpr=dirichlet(1,1,1,1);
    [wg nt1: SYM+G]
    lset applyto=(6) nst=6 rates=gamma;
    prset applyto=(6) statefreqpr=fixed(equal);
    [wg nt2: K80]
    lset applyto=(7) nst=2 rates=equal;
    prset applyto=(7) statefreqpr=fixed(equal);
    [wg nt3: GTR+I+G]
    lset applyto=(8) nst=6 rates=invgamma;
    prset applyto=(8) statefreqpr=dirichlet(1,1,1,1);

    unlink statefreq = (all);
    unlink revmat = (all);
    unlink shape = (all);
    unlink pinvar = (all);

And then put in your run settings and end the block.

Good luck,
Karl

You might look at the BEST program and PNAS paper on this page; it would work for datasets with both alleles for each individual.
http://www.oeb.harvard.edu/faculty/edwards/people/postdocs/Liang.htm

Hi Sibelle,

I would be happy to start a correspondence with you regarding the analysis of unlinked DNA data sets. I suggest a correspondence because there are several issues that might take a few exchanges to cover. May I suggest a paper of mine that discusses some of these: Maureira Butler, Pfeil, et al. (2008) Sys Biol (attached).

Briefly, issues to consider are:

1) Are the markers carrying one history? (Are they affected internally by recombination?)
2) If each marker is carrying one history, do they share the same history? (E.g., has hybridization or lineage sorting with recombination given rise to each marker tracking a different history?)
3) If they do have the same history, do they share the same signal? (Issues regarding analysis method, model choice, possible effects of selection, long branch attraction, saturation of 3rd positions, effects of taxon sampling, etc.)

If you can be satisfied of all three things (i.e., that the markers share a history, are not affected by recombination, selection, etc., and that you have found the appropriate models for each marker and possibly for partitions within each marker) then you can combine the data.

HOWEVER, a practical problem remains: if you have more than one allele per nuclear marker, it is not clear how these could be concatenated with other markers. Using the ambiguity codes is dangerous, because it assumes that alleles from each individual will be monophyletic. It therefore precludes the possibility that some alleles from one individual may be more closely related to alleles from other individuals.
If your sampling is thorough within species, this is something you might reasonably expect to find, especially if your organisms are regularly outcrossing and heterozygosity is high.

My approach to this is to analyse the alleles as separate terminals in the analysis, and not to combine the markers initially. I then look over the results for each marker and interpret what is going on without combination (if no marker shows allelic polymorphism, and the markers show the same signal, etc., as above, then they can easily be combined). It may be that in many cases alleles from one individual are always monophyletic; then a single allele can be chosen to represent the individual and combined with the mtDNA marker, assuming combinability is acceptable, as above.

There are a number of papers cited in the attached article that refer to the general problems that various processes (such as hybridization, lineage sorting, selection, etc.) can cause for phylogenetic analysis. The main ones include Wendel and Doyle, Maddison, and Avise, to name a few.

Please feel free to ask me any further questions.

cheers,
Bernard Pfeil

Hi,

Regarding your phylogeny question: I personally use PAUP, and you must concatenate the sequences. You are trying to build a 'total evidence tree' in which the information held in both markers is expressed in one phylogeny. PAUP is fairly easy to use, and I would recommend you learn the command line version on the PC. Sorry I can't answer your question directly. Hope this helps.

Jack

Hi,

MrBayes can use "partitions". Briefly, datasets are concatenated and each gene is defined as a "data partition". Then, during the analysis, you specify a different substitution model for each dataset/partition, while forcing the tree to be the same. This gives the analysis more flexibility than simply lumping all the data together, which forces both the tree and the substitution model to be the same.
If your nuclear alleles are phased, you should not waste this information. One possibility is to duplicate each mitochondrial sequence and join each of the two resulting mt copies to one of the nuclear haplotypes. If you decide to go this way and need some technical assistance, please email me back.

regards
Francesco

Hi Sibelle,

I worked on this subject a couple of years back and came up with plenty of suggestions in my Sys. Biol. and Evol. Bioinformatics papers, which you will be able to download from my publications page, accessed via the link below. It was predominantly working with parsimony, but I now know that there have been significant advances with Bayesian models that partition your data. Total Evidence always appears to be a good approach, as long as you appreciate where the majority of the support is coming from.

Good luck and best wishes
Si Creer

Simon Creer
Research Fellow
Molecular Ecology and Fisheries Genetics Laboratory
School of Biological Sciences
Environment Centre Wales
Bangor University
Gwynedd LL57 2UW
Tel: +44(0)1248 382302
Fax: +44(0)1248 382569
web: http://www.bangor.ac.uk/~bssa0d/
Skype: spideycreer

sibelletorres@gmail.com
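
As a rough sketch of the partitioned MrBayes setup that several of the replies describe, here is what a block might look like for a two-marker alignment in which cytochrome b is concatenated with the nuclear intron. The character ranges (1-1140 and 1141-1700), the charset names, and the partition name are hypothetical placeholders, not values taken from any of the messages; replace them with the coordinates of your own alignment. The GTR+I+G setting is only illustrative and should be replaced by whatever your own model tests support for each partition.

```nexus
begin mrbayes;
    [Hypothetical coordinates: cyt b = sites 1-1140, intron = sites 1141-1700]
    charset cytb_pos1 = 1-1140\3;
    charset cytb_pos2 = 2-1140\3;
    charset cytb_pos3 = 3-1140\3;
    charset fib5 = 1141-1700;

    [One partition per codon position of cyt b, plus one for the intron]
    partition by_marker = 4: cytb_pos1, cytb_pos2, cytb_pos3, fib5;
    set partition = by_marker;

    [Illustrative model choice; the topology is shared across partitions]
    lset applyto=(all) nst=6 rates=invgamma;

    [Estimate substitution parameters separately for each partition]
    unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all);

    [Allow each partition its own overall rate]
    prset applyto=(all) ratepr=variable;
end;
```

Run settings (for example an mcmcp line, as in the blocks quoted in the replies) would go before the closing end; statement.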