Dear EvolDir,
I asked a question about how to calculate multi-allele Fst in a
forward-time simulation setting and received many helpful
comments. Because other people may be interested in this problem, I am
posting some of the comments here for future reference.
Background (Original question):
I am the maintainer of a forward-time population genetics simulation
environment, simuPOP. This program has an Fst estimator based on Weir and
Cockerham 1984. However, because simuPOP simulates large populations and
the WC84 algorithm is designed to *estimate* Fst of a population from a
random sample, I am trying to implement an *observed* Fst statistic using
observed and expected heterozygosities of a population. The problem is
that although the diallelic case is well documented in population genetics
textbooks, I could not find a reliable reference on how to calculate
H_obs (step 8 of http://www.uwyo.edu/dbmcd/popecol/Maylects/FST.html)
if there are more than two alleles at a locus.
Clarification:
Several people pointed out that the calculation of H_obs is easy because
it is merely the frequency of heterozygotes in the population. However,
my question was actually how to handle different heterozygotes (A1A2,
A1A3, A2A3 etc) in the calculation of F_st using expected and observed
heterozygote frequencies. Although a combined H_obs could be used, it is
possible that Fst based on different heterozygotes could be estimated
separately, and be combined in some way. It turned out that I was
heading the wrong way. I should not have calculated H_obs (and H_exp)
in the first place.
Answer 1:
Fst cannot be estimated if there are more than two alleles at a locus.
Maybe you want to use Gst (Nei 1973)? There are also Rst (Slatkin 1995)
and more recently Dest (Jost 2008). Feel free to take a look at my Dest
estimator at http://www.ngcrawford.com/django/jost/
- Nick
Answer 2:
In my (perhaps fuzzy) understanding the difference between W&C's
theta (Fst) and Gst is the difference between the estimate of the
parameter (the mean of Fst over many realizations of evolution) and
realized (~observed) value. Nei wrote a short paper on this issue
(http://www.jstor.org/pss/2408586).
I recollect that there are published methods for calculating Gst at
multiallelic loci (some relevant papers come up if you search Google
Scholar for "gst multiallelic"). I hope this helps.
Richard
Answer 3:
You need to use Gst, the k allele analog of Fst. Nei was the one who
originally described Gst. See:
Nei M. 1973. Analysis of gene diversity in subdivided populations.
Proceedings of the National Academy of Sciences USA 70:3321-3323.
Nei M. 1977. F-statistics and analysis of gene diversity in subdivided
populations. Annals of Human Genetics 41:225-233.
Takahata N, and Nei M. 1984. Fst and Gst statistics in the finite island
model. Genetics 107:501-504.
You might also see this paper:
http://www.amjbot.org/cgi/content/full/89/3/460
Matt
Summary Answer:
It turned out that Gst is equivalent to Wright's Fst in the diallelic
case, so it can be considered an extension of Wright's Fst to the
multi-allele, multi-locus cases. As Nei pointed out (Nei 1986, Evolution,
Vol 40, No. 3), Gst calculates the fixation index for existing populations
for which genotype frequencies are available, whereas Fst (Weir and
Cockerham 1984) treats existing populations as samples of infinitely many
populations derived from the same ancestral population. For estimating Fst
of large simulated populations where genotype frequencies are known,
Gst appears to be more appropriate.
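
For readers who want to check the diallelic equivalence numerically, here
is a minimal Python sketch of the usual Gst = (Ht - Hs) / Ht formulation
for one locus with any number of alleles. It is an illustration only and
is not the code used in simuPOP. The function names (gene_diversity,
nei_gst), the optional weights argument, and the convention of returning
0 for a monomorphic locus are assumptions made for this example.

```python
def gene_diversity(freqs):
    """Expected heterozygosity 1 - sum(p_k^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def nei_gst(subpop_freqs, weights=None):
    """Gst = (Ht - Hs) / Ht at a single locus (hypothetical helper).

    subpop_freqs: one list of allele frequencies per subpopulation,
                  all lists covering the same alleles in the same order.
    weights:      relative subpopulation sizes; equal weights if omitted
                  (an assumption of this sketch).
    """
    n = len(subpop_freqs)
    if weights is None:
        weights = [1.0] * n
    total = float(sum(weights))
    weights = [w / total for w in weights]
    n_alleles = len(subpop_freqs[0])

    # Hs: weighted mean of the within-subpopulation gene diversities.
    hs = sum(w * gene_diversity(f) for w, f in zip(weights, subpop_freqs))

    # Ht: gene diversity computed from the weighted mean allele frequencies.
    mean_freqs = [
        sum(w * f[k] for w, f in zip(weights, subpop_freqs))
        for k in range(n_alleles)
    ]
    ht = gene_diversity(mean_freqs)

    if ht == 0.0:
        return 0.0  # monomorphic locus: Gst undefined, 0 by convention here
    return (ht - hs) / ht
```

For two subpopulations with allele frequencies (0.2, 0.8) and (0.8, 0.2),
nei_gst([[0.2, 0.8], [0.8, 0.2]]) gives approximately 0.36, which matches
Wright's Fst = var(p) / (pbar (1 - pbar)) = 0.09 / 0.25. The same function
accepts three or more alleles without any special handling of the
different heterozygote classes.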
However, the calculation and interpretation of Fst are still
controversial after 60+ years of study. :-) New statistics are available
(Slatkin's Rst and Jost's Dest), Weir has extended his Fst estimator (Weir
and Hill 2002, Annu. Rev. Genet. 36:721-50), and there are different
ways to calculate Gst (Culley et al. 2002, American Journal of Botany
89:460-465). In the interest of time, I have only implemented the original
Nei 1973 version of Gst in simuPOP. New statistics could be added later.
Further discussion:
To test the new Gst statistic, I have written a short
simuPOP script to calculate Fst and Gst for a large
evolving population, and for small samples drawn from this
population. The script is posted on the simuPOP online cookbook:
http://simupop.sourceforge.net/cookbook/pmwiki.php/Cookbook/PopStructure
Briefly, three randomly initialized subpopulations (sizek)
are evolved separately for 400 generations. Every 20 generations, Fst
and Gst are calculated for the whole population, and for a random sample
(500 individuals from each subpopulation) drawn from this population. My
limited simulations show that Fst for the whole population always
over-estimates true Fst (it is greater than Gst), and the Fst estimate
from the sample does not perform any better. If anyone sees an unsolved
problem from such a simulation, I will be happy to assist you.
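
The design of this experiment can be sketched without simuPOP. The
following pure-Python example is not the cookbook script and is only a
rough analog under stated assumptions: it resamples haploid gene copies
(Wright-Fisher drift, no migration or mutation) at a single locus, tracks
only Gst (the Weir and Cockerham estimator is not implemented), and uses
made-up parameter values (n_genes, n_alleles, sample_genes, seed). All
function names are hypothetical.

```python
import random


def allele_freqs(genes, n_alleles):
    """Allele frequencies from a list of allele indices."""
    counts = [0] * n_alleles
    for a in genes:
        counts[a] += 1
    return [c / len(genes) for c in counts]


def gst(subpop_freqs):
    """Gst = (Ht - Hs) / Ht with equal subpopulation weights."""
    n = len(subpop_freqs)
    n_alleles = len(subpop_freqs[0])
    hs = sum(1.0 - sum(p * p for p in f) for f in subpop_freqs) / n
    mean = [sum(f[k] for f in subpop_freqs) / n for k in range(n_alleles)]
    ht = 1.0 - sum(p * p for p in mean)
    return (ht - hs) / ht if ht > 0 else 0.0


def simulate(n_subpops=3, n_genes=1000, n_alleles=4, generations=400,
             every=20, sample_genes=100, seed=1):
    """Return (generation, Gst of whole population, Gst of a sample) records."""
    rng = random.Random(seed)
    # Randomly initialized, fully isolated subpopulations of gene copies.
    pops = [[rng.randrange(n_alleles) for _ in range(n_genes)]
            for _ in range(n_subpops)]
    records = []
    for gen in range(1, generations + 1):
        # Drift: each subpopulation is resampled with replacement from itself.
        pops = [rng.choices(p, k=n_genes) for p in pops]
        if gen % every == 0:
            whole = gst([allele_freqs(p, n_alleles) for p in pops])
            # Draw a random sample (without replacement) from each subpopulation.
            sampled = gst([allele_freqs(rng.sample(p, sample_genes), n_alleles)
                           for p in pops])
            records.append((gen, whole, sampled))
    return records


if __name__ == "__main__":
    for gen, whole, sampled in simulate():
        print(gen, round(whole, 4), round(sampled, 4))
```

Comparing the two Gst columns shows how much the sample-based value
departs from the value computed on the whole population. Comparing
against the Weir and Cockerham estimator would require adding that
estimator, which this sketch leaves out.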
Cheers,
--
Bo Peng, Ph.D.
Instructor
Department of Epidemiology
The University of Texas
M. D. Anderson Cancer Center
bpeng@mdanderson.org