Here are the responses to my recent enquiry about how best to estimate the magnitude of selection coefficients from longitudinal data on allele frequency. My original query is given below.

Firstly, many thanks to all who responded. I make no comment on the individual responses but would note that we intend to track the spread of insecticide resistance in large geographic regions; the presumed intense selection and (unfortunately) large mosquito population size make me inclined to ignore the effects of drift.

*****************original question*****************

I want to estimate selection coefficients from a longitudinal series of diploid genotype frequencies. The gene in question encodes insecticide resistance so selection is relatively intense. I intend to use maximum likelihood to fit initial allele frequency, selection coefficient and dominance (i.e. P(0), s and h in standard terminology). Given these three parameters we can predict genotype frequencies at any given time point and use the multinomial distribution to get the log-likelihood (LL) of obtaining the observed numbers of the three genotypes.

So I have three questions:

(1) Is this a sensible way of doing it, or are there hidden pitfalls and/or is there a better way?

(2) Has anyone done it before and published the method so we can cite it? It seems like it should be a standard type of analysis but we haven't been able to track a previous one down yet.

(3) Better still, is there a public access programme that we can download? If not, we'll write our own and make it available.

Comment from myself (you always think of something straight after you've sent the message):

Why check fit to the three genotype frequencies? I'll track allele frequency so why not count the alleles in the three genotypes at each time point and then use the binomial to get LL from the predicted allele frequency.
This also avoids having to assume the genotypes are in H-W, which may be problematic if selection has already taken place (because, by definition, selection will cause deviations from H-W).

A sensible procedure, but with one trap. Remember to condition on your observations, that is, update the initial allele frequency every time you make an observation, and evaluate s and h based on each interval of observation. A. G. Clark used this procedure around 1980 in publications in Genetics and Heredity, as far as I remember.

/Freddy

Freddy Bugge Christiansen, Bioinformatics Research Center (BiRC), University of Aarhus, C.F. Møllers Alle, Bldg. 1110, DK-8000 Århus C.

Dear Ian Hastings,

Regarding your question on Evoldir, could you possibly post the most interesting answers? I am working at IGC, Portugal in Evolutionary Genetics and I will have data on experimentally evolved populations for which I'll need the same kind of estimations.

Best regards
Ivo Chelo

Ivo M. Chelo, PhD.
Evolutionary Genetics
IGC Instituto Gulbenkian de Ciência
Lisbon, Portugal

Dear Ian,

I am currently working on a very similar issue, and I would be glad to know if you got some feedback on your question. I have a piece of code in R to estimate p(0) and the relative fitness of 3 possible genotypes using maximum likelihood, and it seems to work (although it is very experimental). The algorithm relies on the assumption that the population size is infinite (no drift), so that all stochasticity comes from sampling. Accounting both for drift and sampling requires much more complex stats, including random effects etc., but it is not impossible (in theory).

In case you are comfortable with R, let me know if you are interested in beta testing my code. Otherwise, I am also curious about existing software that can do the job.

Cheers,
Arnaud Le Rouzic
CEES, Dept. of Biology
P.O. Box 1066 Blindern
0316 Oslo
Norway

I've done it using Bayesian methods: see the attached manuscript.
I can send you the BUGS code if you're interested. It should be easy to adapt to your problem.

Bob

[ms is O'Hara 2005 Proc. Roy Soc]

--
Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Hi Ian,

John Novembre just made me aware of your posting. You might want to take a look at:
http://www.genetics.org/cgi/content/full/179/1/497
If that method seems to apply to your situation I believe John has a program that he probably wouldn't mind sharing with you.

Best wishes,
Rasmus Nielsen

Ian --

A major issue with which you will have to be concerned is sample size. Several years ago, Joel Kingsolver and several coauthors published a series of papers on estimates of selection. In one paper, which I don't seem to have here, they analyzed the relationship between selection estimates and the sample size used in the analysis. They found that at small sample sizes there was a wide range of estimates, including very high ones. As the sample size increased, the maximum value of selection tended to decrease. This indicated that high selection estimates were artifacts of sampling error.

One of the papers Kingsolver et al. did at the time, but not the one that demonstrated the effect of sample size, was:

Kingsolver, J. G., H. E. Hoekstra, J. M. Hoekstra, D. Berrigan, S. N. Vignieri, C. E. Hill, A. Hoang, P. Gilbert, and P. Beerli. 2001. The strength of phenotypic selection in natural populations. Am. Natur. 157:245-261.

Good luck with your work.

-- Mike Bell

Michael A.
Bell, Professor
Department of Ecology and Evolution
Stony Brook University
Stony Brook, NY 11794-5245, USA
phone: 1-631-632-8574
fax: 1-631-689-6682

Hi Ian,

This software package might be helpful to you:

PGEToolbox - Matlab toolbox for Population Genetics and Evolution
http://www.bioinformatics.org/pgetoolbox/

The function called STNPDFH calculates the stationary distribution of the frequency X of a newly arisen mutation under selection with dominance factor h. Another function called STNPDFSMPL computes the frequency spectrum for the mutation.

Reference: Population Genetics of Polymorphism and Divergence for Diploid Selection Models With Arbitrary Dominance (2004) Scott Williamson

Best regards,
James J. Cai, Ph.D.
Petrov Lab
Department of Biology
Stanford University
Stanford, CA 94305, USA
e-mail: jamescai@stanford.edu

Dear Ian

I'm writing regarding your evoldir question. You may want to look at this paper, which does exactly what you want to do:

Lenormand, T., and M. Raymond. 2000. Analysis of clines with variable selection and variable migration. American Naturalist 155:70-82.

(you can find the pdf on my website)

Pierrick Labbé and I have also submitted a paper doing this on a long historical cline series to Genetics (it's under review). I'm sure Pierrick could share the ms with you if you're interested (see his email Cc).

One big pitfall is that you want to make sure that the frequency change is due to selection and not something else (in particular dispersal), but there are many other pitfalls...

Best
thomas

Thomas Lenormand
CEFE - UMR 5175
1919 route de Mende
F-34293 Montpellier cedex 5

Ian --

1. The proposed analysis is fine -- an excellent strategy.
2. Not sure where it is first mentioned in the literature. Let me give you a little perspective. Up to the 1960s methods could not be published unless they were doable by desk-calculator and had closed-form expressions for the estimates. This is not true of the ML selection curve analyses.
3.
It became clear to multiple people by the 1970s that your proposed approach was the right one.
4. I would guess the first publication mentioning this should be in the early 1970s -- but I can't think offhand where (Evolution? Genetics?). Some phrases like "population cage" and "selection curve" may be important -- and you should look in population genetics texts of the era and any books (Endler's book on natural selection in the wild? Brian FJ Manly's 1985 book?)

J.F.

----
Joe Felsenstein joe@gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA

Dear Ian,

I am not sure whether the dominance coefficient can be estimated accurately through changes in frequency only, without measuring fitnesses for each genotype. As for the selection coefficient, you can always estimate an "efficient" selection coefficient (assuming no dominance for instance) from the allele frequency change. One easy way of doing this is by taking the slope of log(p/(1-p)) as a function of time, as we propose in Chevin & Hospital (2008) (latest Genetics issue). This definition goes back to Fisher, and has been used in experimental evolution (see Lenski et al 1991).

You may also need to account for the changes in frequency attributable to genetic drift. If you are not aware of the effective population size for this species, it can be estimated jointly with s using the method proposed by Bollback, York & Nielsen (2008).

Hope this helps.

Cheers,
Luis.

--
Luis-Miguel CHEVIN
Doctorant, UMR de Génétique Végétale du Moulon & laboratoire Ecologie Systématique Evolution, bât 360, Université Paris Sud XI.
01 69 15 70 49
URL : http://www.ese.u-psud.fr/bases/upresa/pages/chevin/index.html

Ian Hastings
Liverpool School of Tropical Medicine
Pembroke Place, Liverpool L3 5QA
0151 705 3183 (office)
0151 705 3147 (group secretary)
Email: hastings@liverpool.ac.uk
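
For readers who want to experiment, the binomial fitting procedure described in the original question, together with the log(p/(1-p)) slope suggested in Luis-Miguel Chevin's reply, might be sketched roughly as follows. This is only an illustration under stated assumptions, not any respondent's code: it assumes discrete generations, random mating, no drift, and genotype fitnesses of 1, 1 + h*s and 1 + s for the susceptible homozygote, the heterozygote and the resistant homozygote (one common convention; the thread does not fix one). All function names are hypothetical, and the crude grid search merely stands in for a proper optimiser.

```python
import math


def next_freq(p, s, h):
    """One generation of deterministic selection on resistance allele frequency p.

    Assumed fitnesses: SS = 1, RS = 1 + h*s, RR = 1 + s (an assumption of this sketch).
    """
    q = 1.0 - p
    w_rr, w_rs, w_ss = 1.0 + s, 1.0 + h * s, 1.0
    w_bar = p * p * w_rr + 2.0 * p * q * w_rs + q * q * w_ss
    return p * (p * w_rr + q * w_rs) / w_bar


def trajectory(p0, s, h, generations):
    """Predicted allele frequencies for generations 0..generations inclusive."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_freq(freqs[-1], s, h))
    return freqs


def log_likelihood(p0, s, h, samples):
    """Binomial log-likelihood of the observed allele counts.

    samples: list of (generation, resistant_allele_count, total_allele_count).
    The binomial coefficient is omitted because it does not depend on p0, s or h.
    """
    freqs = trajectory(p0, s, h, max(t for t, _, _ in samples))
    ll = 0.0
    for t, k, n in samples:
        p = min(max(freqs[t], 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return ll


def grid_fit(samples, p0_grid, s_grid, h_grid):
    """Crude grid search; returns (best_ll, p0, s, h) over the supplied grids."""
    best = None
    for p0 in p0_grid:
        for s in s_grid:
            for h in h_grid:
                ll = log_likelihood(p0, s, h, samples)
                if best is None or ll > best[0]:
                    best = (ll, p0, s, h)
    return best


def logit_slope(samples):
    """Least-squares slope of log(p/(1-p)) against generation.

    Uses observed frequencies k/n, so every sample needs 0 < k < n.
    """
    xs = [t for t, _, _ in samples]
    ys = [math.log(k / (n - k)) for _, k, n in samples]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx
```

A call such as grid_fit(samples, p0_grid, s_grid, h_grid) returns the best grid point and its log-likelihood, and comparing log-likelihoods across values of h gives a feel for how weakly dominance is constrained by frequency data alone, the concern raised in Luis-Miguel Chevin's reply. The sketch ignores drift and does not implement the conditioning on successive observations that Freddy Christiansen recommends.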