Thanks to everyone who responded to my question "Sample size calculation for 'F' statistics". For your interest, I append the original question and copies of all responses.

***** original question *****

I want to look at parasites in a number of separate hosts to check for non-random breeding within hosts. Obviously there will be differences in genotypes between hosts (Fst), and non-random mating would show up as a significant Fis value. The problem is: what sample size would I need to detect a given value of Fis?

It looks like most people avoid the issue of sample size or power calculations when doing F statistics. The best recent one I could come up with is

N. Patterson, A. L. Price, and D. Reich. Population structure and eigenanalysis. PLoS Genetics 2 (12):2074-2093, 2006.

They present calculations for Fst between two populations and state that the minimum Fst that can reasonably be detected is 1/sqrt(nm), where n is the number of individuals and m is the number of molecular markers. Would it be outrageous to extrapolate this to Fis so that, for example, examining parasites from 100 animals would have power to detect a minimum Fis of 1/sqrt(100)=0.1?

***** Responses (in chronological order of receipt) were as follows: *****

From Adam Porter:

The thing about power in F-statistical analyses is that the error variance depends on allele frequencies, in addition to the number of loci and populations (and scoring error). Frequencies near 1/2 give the highest power; if you're dealing with rare alleles, you need a lot more individuals to get the same insight. So, you have to make assumptions about the distribution of allele frequencies, which entails accounting for multiple and varying numbers of alleles at different loci.

Another potential source of variation, in host-parasite cases with few founders and very high rates of increase: you might get stochasticity in mating patterns that could produce transient inbreeding or outbreeding patterns within a single host. This would be problematic only if you measure the offspring from the first generation, but it would require sampling from more hosts regardless of the numbers of individuals or loci.

Given the array of variance sources, personally I would try to simulate the process, varying the true inbreeding parameter and sampling regime, to see what variance I got. The simulation is pretty straightforward, since you can calculate the expected genotype frequencies as deviations from Hardy-Weinberg, then sample from those and recalculate the statistic. It should be possible to get a decent idea using a spreadsheet.

*****

From Peter Smouse (I can forward the pdfs on request - Ian):

This is the sort of question people asked of breeding designs in Quantitative Genetics. I am attaching two PDFs of our own papers. The equation you want to look at (in the first TwoGener paper) is (10), and the translation would be that Phi-ft becomes Fst. In the second paper, we pick the argument apart a bit more, because there are design issues.

You should look at Falconer, re the variance of an intra-class correlation, and you should track it back to its original source. Try Falconer and Mackay (the most up-to-date version of the book), and in a pinch, get in touch with Trudy Mackay. She'll be able to guide your efforts.

Note that the number of loci does not (for that treatment) come into the argument, but then Falconer was talking about pedigree relationships and groupings, where we are averaging over a very large number of loci, not assaying particular groups of convenient genetic marker loci. Fst and Fis are going to be a little trickier, because there is extra replication involved. Good luck with it.

PS, John Nason recently asked me a similar question, so he is probably working on it as well.

*****

From Rodney J. Dyer:

In a perfect world the assumptions behind the 1/sqrt(nm) would hold up, but unfortunately, we do not live in such a world. As a result, IMHO the best approach would be to subsample your data set as you go to find out when the variance of your estimator stabilizes. We did this in

Dyer, RJ and VL Sork. 2001. Pollen Pool Heterogeneity in Shortleaf Pine, Pinus echinata Mill. Molecular Ecology 10: 859-866.

It will work the same for you for Fis. You may also want to think about the Ayres & Balding estimators for the distribution of inbreeding rather than point estimates (Fis is in general non-symmetric).

*****

From Bruce Weir:

If it is Fis you want, then the easiest thing is to recognize that the chi-square goodness-of-fit test for Hardy-Weinberg is n(m-1)f^2, where n is the sample size, m is the number of alleles at the locus and f is the sample value of Fis. If f is replaced by the parametric value, then n(m-1)f^2 is the noncentrality parameter for the noncentral chi-square distribution. E.g., for a two-allele locus, the noncentrality is nf^2 and this should be 10.5 for 90% power, so n should be at least 10.5/f^2. I give tables for the noncentrality parameter in my Genetic Data Analysis book.

*****

From Jaret Peter Bilewitch:

Here's an article that may help with the questions you raise:

Ryman et al. 2006. Power for detecting genetic divergence: differences between statistical methods and marker loci. Molecular Ecology 15: 2031-2045.

They focus more on comparisons of different tests but do examine sample size as well.

*****

From Thierry de Meeûs (I can forward the pdfs if requested):

If you estimate Fis with unbiased statistics (i.e. Weir and Cockerham's, 1984), sample size is not a relevant issue. Of course, the bigger the subsamples are and the more subsamples there are, the more accurate the estimate will be.

Of course, sample sizes are of critical importance for statistical inference (is Fis significantly different from 0?). If the infrapopulations of your parasites are small, you can get around this caveat by increasing the number of infrapopulations (i.e. the number of sampled hosts) and the number of loci. Under the null hypothesis, the Fis estimator is not as variable as the Fst estimator, so it quickly provides fairly accurate measures and statistics.

Fis is a standardised measure of non-random association of alleles within subsamples, relative to the polymorphism present within subsamples. It is thus totally independent of Fst, which measures the standardised non-random association between alleles across subsamples relative to total genetic diversity. Fis can measure non-random breeding (in the sense of non-random union of gametes within subpopulations), but you have to assume it comes from no other source (null alleles, micro-Wahlund).

If you are specifically looking at random pairing of sexual partners, have access to pairs and have genotyped them, then you can test it directly through the method used by Prugnolle et al. (2004) and Chevillon et al. (2007). I have also attached an article that reviews and discusses these topics, in case you need it.

*****

From Jérôme Goudet:

Regarding your mail to evoldir concerning the power issue, why not do it yourself? You could fairly simply use Easypop, for instance, with a given level of selfing, and then see how many individuals/loci are necessary to achieve significance by subsampling the simulated data sets. On the other hand, Fis is riddled with other problems such as null alleles, hidden population substructure, etc., which is I think the main reason why people do not bother going for a power calculation.

*****

"Hastings, Ian"
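
***** illustrative sketch *****

Two of the suggestions above are easy to try numerically: Bruce Weir's noncentrality rule (n f^2 for a two-allele locus) and Adam Porter's idea of sampling from genotype frequencies that deviate from Hardy-Weinberg. What follows is only a minimal sketch under stated assumptions, not anyone's published method: a single two-allele locus, a 5% significance level, one randomly sampled population with no host structure, and no null alleles or scoring error. The function names (power_1df, min_n, simulate_power) and the default allele frequency p=0.5 are invented for illustration.

```python
import math
import random
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_1df(n, f, alpha=0.05):
    """Approximate power of the 1-df chi-square Hardy-Weinberg test.

    Uses the noncentrality n*f^2 (two-allele locus). A noncentral
    chi-square with 1 df is (Z + sqrt(n)*f)^2, so the power reduces
    to two normal tail areas.
    """
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(n) * abs(f)
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def min_n(f, power=0.90, alpha=0.05):
    """Smallest n with roughly the requested power (normal approximation)."""
    noncentrality = (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2
    return math.ceil(noncentrality / f ** 2)


def simulate_power(n, f, p=0.5, reps=2000, alpha=0.05, seed=1):
    """Monte Carlo power: sample n genotypes, recompute Fis, test n*f_hat^2.

    Assumes f is large enough that all genotype frequencies are >= 0.
    """
    rng = random.Random(seed)
    q = 1 - p
    # Expected genotype frequencies as deviations from Hardy-Weinberg
    weights = [p * p + f * p * q, 2 * p * q * (1 - f), q * q + f * p * q]
    crit = Z.inv_cdf(1 - alpha / 2) ** 2  # chi-square(1 df) critical value
    hits = 0
    for _ in range(reps):
        draws = rng.choices((0, 1, 2), weights=weights, k=n)  # AA, Aa, aa
        n_hom = draws.count(0)
        n_het = draws.count(1)
        p_hat = (2 * n_hom + n_het) / (2 * n)
        h_exp = 2 * p_hat * (1 - p_hat)
        if h_exp == 0:
            continue  # monomorphic sample: Fis undefined, counted as no detection
        f_hat = 1 - (n_het / n) / h_exp
        if n * f_hat ** 2 > crit:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("n needed for Fis = 0.1:", min_n(0.1))
    print("analytic power, n = 100:", round(power_1df(100, 0.1), 3))
    print("simulated power, n = 100:", simulate_power(100, 0.1))
```

Under these assumptions the analytic power at n f^2 = 10.5 comes out close to 90%, in line with Weir's figure, whereas 100 individuals scored at one two-allele locus give a power of only about 0.17 for Fis = 0.1. Multiple loci, multi-allelic markers, rarer alleles and structure among hosts would all change these numbers, which is where simulating the actual sampling design becomes useful.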