Dear EvolDir,

I asked a question about how to calculate multi-allele Fst in a forward-time simulation setting and received many helpful comments. Because other people may be interested in this problem, I am posting some of the comments here for future reference.

Background (Original question): I am the maintainer of simuPOP, a forward-time population genetics simulation environment. This program has an Fst estimator based on Weir and Cockerham 1984. However, because simuPOP simulates large populations and the WC84 algorithm is designed to *estimate* Fst of a population from a random sample, I am trying to implement an *observed* Fst statistic using observed and expected heterozygosities of a population. The problem is that although the diallelic case is well documented in population genetics textbooks, I could not find a reliable reference on how to calculate H_obs (step 8 of http://www.uwyo.edu/dbmcd/popecol/Maylects/FST.html) if there are more than two alleles at a locus.

Clarification: Several people pointed out that the calculation of H_obs is easy because it is merely the frequency of heterozygotes in the population. However, my question was actually how to handle the different heterozygotes (A1A2, A1A3, A2A3, etc.) when calculating Fst from expected and observed heterozygote frequencies. Although a combined H_obs could be used, Fst could perhaps also be estimated separately for each heterozygote class and then combined in some way. It turned out that I was heading the wrong way: I should not have calculated H_obs (and H_exp) in the first place.

Answer 1: Fst cannot be estimated if there are more than two alleles at a locus. Maybe you want to use Gst (Nei 1973)? There are also Rst (Slatkin 1995) and more recently Dest (Jost 2008).
Feel free to take a look at my Dest estimator at http://www.ngcrawford.com/django/jost/
- Nick

Answer 2: In my (perhaps fuzzy) understanding, the difference between W&C's theta (Fst) and Gst is the difference between an estimate of the parameter (the mean of Fst over many realizations of evolution) and the realized (~observed) value. Nei wrote a short paper on this issue (http://www.jstor.org/pss/2408586). I recollect that there are published methods for calculating Gst at multiallelic loci (some relevant papers come up if you search Google Scholar for "gst multiallelic"). I hope this helps.
Richard

Answer 3: You need to use Gst, the k-allele analog of Fst. Nei was the one who originally described Gst. See:
Nei M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences USA 70:3321-3323.
Nei M. 1977. F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet. 41:225-233.
Takahata N, Nei M. 1984. FST and GST statistics in the finite island model. Genetics 107:501-504.
You might also see this paper: http://www.amjbot.org/cgi/content/full/89/3/460
Matt

Summary Answer: It turns out that Gst is equivalent to Wright's Fst in the diallelic case, so it can be considered an extension of Wright's Fst to the multi-allele, multi-locus case. As Nei pointed out (Nei 1986, Evolution, Vol 40, No. 3), Gst calculates the fixation index for existing populations for which genotype frequencies are available, whereas Fst (Weir and Cockerham 1984) treats existing populations as samples from infinitely many populations derived from the same ancestral population. For estimating Fst in large simulated populations where genotype frequencies are known, Gst appears to be more appropriate.

However, the calculation and interpretation of Fst are still controversial, after 60+ years of study. :-) New statistics are available (Slatkin's Rst and Jost's Dest), Weir has extended his Fst estimator (Weir and Hill 2002, Annu.
Rev. Genet. 36:721-50), and there are different ways to calculate Gst (Culley et al. 2002, American Journal of Botany 89:460-465). For the sake of time, I have only implemented the original Nei 1973 version of Gst in simuPOP. New statistics could be added later.

Further discussion: To test the new Gst statistic, I have written a short simuPOP script that calculates Fst and Gst for a large evolving population and for small samples drawn from it. The script is posted on the simuPOP online cookbook: http://simupop.sourceforge.net/cookbook/pmwiki.php/Cookbook/PopStructure . Briefly, three randomly initialized subpopulations (sizek) are evolved separately for 400 generations. Every 20 generations, Fst and Gst are calculated for the whole population and for a random sample (500 individuals from each subpopulation) drawn from it. In my limited simulations, the WC84 Fst calculated for the whole population was always greater than Gst, that is, it over-estimated the true Fst, and the Fst estimated from the sample did not perform any better.

If anyone sees an unsolved problem from such a simulation, I will be happy to assist you.

Cheers,
--
Bo Peng, Ph.D.
Instructor
Department of Epidemiology
The University of Texas M. D. Anderson Cancer Center
bpeng@mdanderson.org
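
For readers who want to see the arithmetic behind the Gst discussed in the summary answer, here is a minimal sketch in plain Python. It is not the simuPOP implementation. It assumes the usual reading of Nei (1973): H_S is the weighted mean of the within-subpopulation gene diversities 1 - sum(p_i^2), H_T is the gene diversity computed from the weighted mean allele frequencies, and Gst = (H_T - H_S) / H_T. The function name, the equal default weights, and the example frequencies are hypothetical and chosen only for illustration.

```python
def gst_single_locus(subpop_freqs, weights=None):
    """Return (H_S, H_T, G_ST) for one locus with any number of alleles.

    subpop_freqs: one list of allele frequencies per subpopulation,
                  all in the same allele order, each summing to 1.
    weights:      relative subpopulation weights summing to 1
                  (assumption: equal weights when omitted).
    """
    n_sub = len(subpop_freqs)
    if weights is None:
        weights = [1.0 / n_sub] * n_sub
    n_alleles = len(subpop_freqs[0])

    # Weighted mean frequency of each allele over subpopulations.
    mean_freqs = [
        sum(w * freqs[i] for w, freqs in zip(weights, subpop_freqs))
        for i in range(n_alleles)
    ]

    # H_S: weighted mean of within-subpopulation gene diversity.
    h_s = sum(
        w * (1.0 - sum(p * p for p in freqs))
        for w, freqs in zip(weights, subpop_freqs)
    )
    # H_T: gene diversity of the pooled (total) population.
    h_t = 1.0 - sum(p * p for p in mean_freqs)

    # G_ST is undefined when the locus is monomorphic overall.
    g_st = (h_t - h_s) / h_t if h_t > 0 else float("nan")
    return h_s, h_t, g_st


if __name__ == "__main__":
    # Hypothetical three-allele locus in two subpopulations.
    print(gst_single_locus([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]))

    # Diallelic check: with equal weights, G_ST should match
    # Wright's Fst = var(p) / (pbar * (1 - pbar)).
    p = [0.2, 0.6]
    pbar = sum(p) / len(p)
    var_p = sum((x - pbar) ** 2 for x in p) / len(p)
    print(var_p / (pbar * (1 - pbar)))
    print(gst_single_locus([[x, 1 - x] for x in p])[2])
```

The last two printed values agree, which illustrates the statement above that Gst reduces to Wright's Fst when there are only two alleles. No heterozygote classes (A1A2, A1A3, ...) need to be tracked separately, because only allele frequencies enter the calculation.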