Dear all,
I recently posted this question:
I'm working on population genetics of reef fishes; I've used microsatellite
loci to detect population structure and I've found significant Fst values
among different sampling locations. However, when running STRUCTURE I
failed to find any partition among samples. Does anybody have a clue about
why this is happening?
Answers are listed below:
1) With sufficiently polymorphic data, Fst tests are powerful enough to
detect weak differentiation between a priori groups. STRUCTURE tries to find
structure in a dataset without a priori groups, which is much more
challenging, and will often fail when the Fst between real groups is too low
(e.g. Fst < 0.03).
2) Your experience is not unique. For example, we recently published a
paper in Molecular Ecology on our work on collared lizards. We found
significant isolation by distance and a significant Fst, but STRUCTURE gave
us nothing.
We then used a different Bayesian assignment program that also incorporates
spatial information, BAPS, and got beautiful results. In this case, I had
translocated these populations into an area in which they had gone extinct,
and we have followed in detail their entire colonization and dispersal
history for over 30 years. BAPS reconstructed this known history very
accurately. The BAPS results are in our paper (Neuwald, J. L., and A. R.
Templeton. 2013. Genetic restoration in the eastern collared lizard under
prescribed woodland burning. Molecular Ecology 22:3666-3679). Right now
I'm in Israel working on an endangered salamander. When we applied
STRUCTURE to our data, we got just two divisions (the Galilee and Mt.
Carmel, which are isolated from one another and show extreme genetic
differentiation). However, when we applied BAPS, in addition to this major
subdivision, it subdivided the Galilee into 10 subpopulations, all of which
made excellent sense given the topography of the area and our previous
studies on dispersal. Using these 10 subpopulations, we had highly
significant results with Fst and AMOVA - all completely invisible to
STRUCTURE.
I have also had experience with STRUCTURE in my work in human genetics, and
have found I can get just about any result I want by playing with K, which
is notoriously hard to estimate in a statistically meaningful fashion. I
truly do not understand the popularity of STRUCTURE. I advise you to simply
avoid its use, and go to other programs such as BAPS. A non-parametric
alternative that has been used mostly in the human genetic literature is
the program Awclust (http://awclust.sourceforge.net/docs/index.html).
3) My guess is that you are looking at two different scales in your data
(I am also not sure what parameters you used in STRUCTURE). You may have
local structure (your individuals within a site may be more related to each
other, which then produces a significant Fst between populations) while
there is still enough migration (in the sense of genetic exchange) between
groups to lead STRUCTURE to treat your population as panmictic.
I would suggest that you have a look at: Gauffre B, Estoup A, Bretagnolle V,
Cosson JF (2008) Spatial genetic structure of a small rodent in a
heterogeneous landscape. Mol Ecol 17:4619-4629, and perhaps at the papers
that cite it.
4) Personally I only use Bayesian clustering when I am desperate, e.g. when
I suspect that a strong FIS comes from Wahlund effects but have no clue
where they originate. The assumptions of panmixia and linkage equilibrium
(the latter being impossible to reach in real populations), together with
the fact that I do not really understand what this kind of software does,
make me quite reluctant. I prefer using older methods that are directly
connected to demography in a way I can understand.
In your case, you might have a continuous (or nearly so) increase of
differentiation with some factor(s), the most obvious being geographic
distance. You might also have multiple hierarchical levels. All of these
factors might prevent STRUCTURE from finding anything.
Try to study isolation by distance and, if it works, you will get much more
information than STRUCTURE will ever give you.
It is also a good idea to check that all your loci behave the same way (for
both FST and FIS). If one or two loci behave unusually compared to all the
others, this might be the signature of technical or non-neutral factors
that may also interfere with STRUCTURE.
5) Fst can also become significant at very small values if the sample size
is large. In fact, and this is true for most statistical applications, if
you increase your sample size enough you will eventually get significant
results, even if they are biologically irrelevant.
It is difficult to evaluate your question without having seen your STRUCTURE
results or knowing your run settings. Maybe you ran STRUCTURE incorrectly,
e.g. with runs that were too short? Can you ask experienced STRUCTURE users
around you? Or are you experienced yourself? I don't know. STRUCTURE is also
known to be not so good at picking up subtle population division. Check out
DAPC, which may help more (paper available here:
https://dl.dropboxusercontent.com/u/40499866/Jombart-T._Discriminant-analysis-of-principal-components-A-new-method-for-the-analysis-of-genetically-structured-populations_2010.pdf).
You can also investigate the hypothesis of panmixia with migrate-n. One of
my papers, in which population structure was a confusing issue, can be
downloaded here:
https://dl.dropboxusercontent.com/u/40499866/kraus-Global%20lack%20of%20flyway%20structure%20in%20a%20cosmopolitan%20bird%20revealed%20by%20a%20genome%20wide%20survey%20of%20single%20nucleotide%20polymorphisms.pdf.
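The sample-size point in answer 5 can be checked with a small simulation.
The sketch below is an illustration only: it assumes haploid, biallelic loci
and a plain Nei-style Gst as the Fst estimator, and the function names, the
allele frequencies (0.50 vs 0.56) and the number of loci are hypothetical
choices, not anything taken from the respondents or from a particular
package.

```python
import random

def allele_freqs(pop):
    """Frequency of allele 1 at each locus; pop is a list of 0/1 allele lists."""
    return [sum(col) / len(pop) for col in zip(*pop)]

def gst(pop_a, pop_b):
    """Nei-style Gst = (Ht - Hs) / Ht, with Hs and Ht summed over loci."""
    hs_sum = ht_sum = 0.0
    for pa, pb in zip(allele_freqs(pop_a), allele_freqs(pop_b)):
        p_bar = (pa + pb) / 2
        hs_sum += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2  # mean within-population diversity
        ht_sum += 2 * p_bar * (1 - p_bar)                       # diversity of the pooled frequencies
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else 0.0

def permutation_p(pop_a, pop_b, n_perm, rng):
    """One-sided p-value: share of label shuffles with Gst >= the observed Gst."""
    observed = gst(pop_a, pop_b)
    pooled = pop_a + pop_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if gst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def simulate(n, freq, n_loci, rng):
    """Draw n haploid individuals with allele-1 frequency `freq` at every locus."""
    return [[1 if rng.random() < freq else 0 for _ in range(n_loci)] for _ in range(n)]

def demo(n_per_pop, seed=1):
    """Same true frequencies every time; only the sample size changes."""
    rng = random.Random(seed)
    pop_a = simulate(n_per_pop, 0.50, 20, rng)  # assumed frequencies, 20 assumed loci
    pop_b = simulate(n_per_pop, 0.56, 20, rng)
    return permutation_p(pop_a, pop_b, 200, rng)

if __name__ == "__main__":
    for n in (20, 1000):
        value, p = demo(n)
        print(f"n per population = {n}: Gst = {value:.4f}, permutation p = {p:.3f}")
```

The true differentiation is identical in both runs (well below 0.01 under
these assumed frequencies); what changes with the sample size is how easily
the permutation test rejects the null hypothesis.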
6) Your result may not be that unexpected. When you calculate Fst you supply
much more information than structure had - the population designations of
each individual. Try running the version of structure where you provide
training samples for each population. If you use - say - half of your data
as training, you may find the rest fall neatly into their
7) Because of how p-values for FST are usually calculated it is possible to
get a 'significant' FST when in reality there is little or no population
structure. You should interpret the FST value itself, rather than the
p-value. Another option if you have info about sampling locations is to
check the option in STRUCTURE that uses this as a prior. Doing this will
pick up more subtle structure within your sample.
8) If by 'significant' you mean you get p-values below say 0.05, this
doesn't mean there is real structure. P-value testing coupled with Fst-like
measures is notorious for type I errors (see here
).
Additionally, I have read somewhere that structure can't detect
differentiation where Fst < 0.01, however this figure may not always be
valid when using microsatellites since they often misbehave when coupled
with Fst (see below).
I would recommend you first check to make sure your microsatellite
loci are suitable for use with Fst, and don't suffer from the well-known
problem of negative bias as a result of high diversity (see here
). Following this, testing the significance
of genetic differentiation is much more appropriately done using a
bootstrapping method whereby 95% confidence intervals can be used.
If you want any more information about how you can do such analyses, I and
some colleagues have an R package, diveRsity (and associated web app
http://glimmer.rstudio.com/kkeenan/diveRsity-online/) which will allow you
to calculate Fst, Gst, G'st and Jost's D, compare the relationship of each
statistic and calculate 95% confidence intervals for each.
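As an illustration of the bootstrapping idea in answer 8, here is a minimal
sketch that resamples loci with replacement and reads off a percentile 95%
interval. It is not the diveRsity implementation, which may resample
differently; the estimator is a plain Nei-style Gst, and the function names
and the example allele frequencies are made up for the illustration.

```python
import random

def locus_components(freqs_by_pop):
    """Return (Hs, Ht) for one locus.

    freqs_by_pop holds one list of allele frequencies per population,
    all in the same allele order and each summing to 1.
    """
    n_pops = len(freqs_by_pop)
    hs = sum(1 - sum(p * p for p in freqs) for freqs in freqs_by_pop) / n_pops
    mean_freqs = [sum(col) / n_pops for col in zip(*freqs_by_pop)]
    ht = 1 - sum(p * p for p in mean_freqs)
    return hs, ht

def gst(components):
    """Multilocus Gst as a ratio of sums over loci."""
    hs_sum = sum(hs for hs, _ in components)
    ht_sum = sum(ht for _, ht in components)
    return (ht_sum - hs_sum) / ht_sum

def bootstrap_ci(components, n_boot=1000, seed=0):
    """Percentile 95% interval from resampling loci with replacement."""
    rng = random.Random(seed)
    n_loci = len(components)
    values = sorted(
        gst([components[rng.randrange(n_loci)] for _ in range(n_loci)])
        for _ in range(n_boot)
    )
    # Nearest-rank approximation to the 2.5th and 97.5th percentiles.
    lower = values[int(0.025 * (n_boot - 1))]
    upper = values[int(0.975 * (n_boot - 1))]
    return lower, upper

# Hypothetical allele frequencies: 5 loci, 2 populations, 3 alleles per locus.
loci = [
    [[0.60, 0.30, 0.10], [0.50, 0.35, 0.15]],
    [[0.20, 0.50, 0.30], [0.25, 0.40, 0.35]],
    [[0.70, 0.20, 0.10], [0.65, 0.20, 0.15]],
    [[0.40, 0.40, 0.20], [0.30, 0.45, 0.25]],
    [[0.10, 0.60, 0.30], [0.15, 0.50, 0.35]],
]
components = [locus_components(locus) for locus in loci]
point = gst(components)
low, high = bootstrap_ci(components)
print(f"Gst = {point:.4f}, 95% bootstrap interval = [{low:.4f}, {high:.4f}]")
```

Because this uncorrected Gst cannot fall below zero, its interval cannot
include zero either; a real analysis would combine the resampling step with
a bias-corrected estimator.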
9) It's somewhat counterintuitive, but Fst can be more sensitive at
detecting differentiation than STRUCTURE is.
One might imagine that a genotypic approach, capturing recent information
from an array of markers, would have more power for
detecting differentiation than allele-frequency based approaches. But
that may not be true.
You can test this with simulations for your situation, as Katherine
Harrisson did using EASYPOP in
Harrisson KA, Pavlova A, Amos JN, Takeuchi N, Lill A, Radford JQ, Sunnucks
P. (2012) Fine-scale effects of habitat loss and fragmentation despite
large-scale gene flow for some regionally declining woodland bird species.
Landscape Ecology, 27, 813-827.
10) This type of result is not unexpected. When you calculate Fst, you
provide substantially more information than you provide to Structure: the
population from which each observation came.
Lacking this information, Structure has to integrate over all possible
population allocations of individuals to the specified number of populations
- with corresponding uncertainty about allele frequencies in each
population.
You can, however, use Structure differently to include some population
allocation information.
For example you can find settings in Structure which use a subset of the
data, with their population allocations, as training dataset, and then
classify the other individuals according to the proportion of their genome
from each population. If you do this, I anticipate Structure will allocate
a large proportion of the test data to the appropriate population.
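The training/test idea in answers 6 and 10 can be illustrated without
STRUCTURE itself. The sketch below swaps in a simple frequency-based
assignment test (allele frequencies estimated from the training half, each
remaining individual assigned to the population under which its genotype is
most likely). This is not STRUCTURE's own model for using prior population
information; the simulated biallelic loci, the frequencies and all function
names are assumptions made for the example.

```python
import math
import random

def simulate(n, freqs, rng):
    """Diploid genotypes as counts (0, 1 or 2) of a reference allele per locus."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs] for _ in range(n)]

def estimate_freqs(training):
    """Reference-allele frequency per locus, with a pseudo-count so none is 0 or 1."""
    n_copies = 2 * len(training)
    return [(sum(col) + 0.5) / (n_copies + 1) for col in zip(*training)]

def log_likelihood(genotype, freqs):
    """Log-probability of a genotype under Hardy-Weinberg and independent loci."""
    total = 0.0
    for g, p in zip(genotype, freqs):
        total += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1 - p)
    return total

def assign(genotype, freqs_by_pop):
    """Index of the population whose allele frequencies make the genotype most likely."""
    scores = [log_likelihood(genotype, freqs) for freqs in freqs_by_pop]
    return scores.index(max(scores))

def holdout_accuracy(true_freqs_by_pop, n_per_pop=100, seed=0):
    """Train on half of each simulated sample, assign the rest, return the share correct."""
    rng = random.Random(seed)
    samples = [simulate(n_per_pop, freqs, rng) for freqs in true_freqs_by_pop]
    half = n_per_pop // 2
    trained = [estimate_freqs(sample[:half]) for sample in samples]
    correct = total = 0
    for pop_index, sample in enumerate(samples):
        for genotype in sample[half:]:
            correct += assign(genotype, trained) == pop_index
            total += 1
    return correct / total

# Two weakly differentiated populations at 30 loci (assumed frequencies).
accuracy = holdout_accuracy([[0.45] * 30, [0.55] * 30])
print(f"share of held-out individuals assigned correctly: {accuracy:.2f}")
```

The population labels of the training half do the work here: the method
never has to discover the groups, only to decide between groups it has
already been told about.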
11) This is common, which is why you need to use a number of methods to
investigate population structure: population trees, PCA plots, FST, etc.
STRUCTURE is not good when there is isolation by distance, so this could be
an issue.
Finally, I have heard that if there are lots of unique alleles in each
population, this can obscure the structure.
One thing to try is to use long runs (burn-in, etc.) and a LOCPRIOR model.
12) Bayesian clustering algorithms are better able to partition samples when
FST values are high and when genetic differentiation among populations is
strong. As genetic differentiation among populations gets weaker, Bayesian
clustering algorithms have less variance to work with, and are less able to
correctly identify population structure. I did a simulation paper in which
I evaluated 3 common (non-spatial) Bayesian clustering algorithms
(STRUCTURE, BAPS, PARTITION) to determine their relative utility for
detecting population structure as the level of differentiation decreased. I
have attached it here, though others have noted similar phenomena.
Clustering algorithms that take spatial data into account will likely be
better at detecting structure at lower levels of differentiation, but may
not be appropriate for your study system. I hope you find this information
helpful, and good luck in your research.
13) I had a similar problem recently with a dataset I was working on and the
STRUCTURE manual mentions this is a common phenomenon. Have you tried
running it with the LOCPRIOR model selected? It basically takes into
account your own "populations" of where you collected each individual to
assist the algorithm in finding structure in the data. There is a section
in the manual on the LOCPRIOR model and it is pretty straightforward. It
did in fact improve the results of my analysis. Let me know if you have any
other questions.
14) As you probably know, STRUCTURE might not find structure if it is weak.
Even if the Fst is significant, if its value is low the signal might not be
strong enough to be detected by STRUCTURE.
Have you tried using sampling locations as prior information? As you can
check in the manual, this might help the clustering when the signal is
relatively weak, without leading to spurious results.
15) We've had similar things happen to us with some of our data sets (see
results of analysis of wingless alleles in attached paper). When you
calculate Fst values using Genepop or other similar programs, you are
assigning individuals to populations a priori (without reference to the
data). Structure assigns individuals to populations on the basis of the
data itself. This is a very useful attribute of Structure, but it comes
with a cost: a loss of statistical power for detecting differences among
populations when they are only moderately differentiated from each other.
This loss of statistical power is particularly evident when the sample sizes
for some of the populations being considered are small.
16) The lack of partitions in your samples could, in my view, be due to an
isolation-by-distance pattern. Could that be the case?
In addition, significant FST values do not imply the existence of genetic
structure (e.g. the FST could be very low even if it is significant).
The first thing I would do is to perform an MDS and a PCA (probably with more
than two components) to explore how the samples fall in a plot.
Secondly, you could try to test isolation by distance with a Mantel test,
and then you could use DAPC (Discriminant Analysis of Principal
Components) or sPCA (spatial Principal Component Analysis), both implemented
in the R package "adegenet".
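The Mantel test suggested in answer 16 (and the isolation-by-distance check
in answer 4) can be sketched in a few lines. This is a minimal permutation
version written for illustration, not the implementation of any particular
package; the sampling positions and the pairwise genetic distances are
invented, and the function names are hypothetical.

```python
import random

def upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix, reading rows and columns in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mantel(genetic, geographic, n_perm=999, seed=0):
    """Correlation between two distance matrices and a one-sided permutation p-value.

    Populations are relabelled at random in the genetic matrix (rows and
    columns together), which keeps the matrix structure intact.
    """
    rng = random.Random(seed)
    identity = list(range(len(genetic)))
    geo = upper_triangle(geographic, identity)
    r_obs = pearson(upper_triangle(genetic, identity), geo)
    hits = 0
    order = identity[:]
    for _ in range(n_perm):
        rng.shuffle(order)
        if pearson(upper_triangle(genetic, order), geo) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Invented example: six sites along a coast and pairwise Fst rising with distance.
positions_km = [0, 10, 25, 45, 70, 100]
geographic = [[abs(a - b) for b in positions_km] for a in positions_km]
genetic = [
    [0.0005 * d + (0.002 if d and (i + j) % 2 else 0.0) for j, d in enumerate(row)]
    for i, row in enumerate(geographic)
]
r, p = mantel(genetic, geographic)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

Whole populations are permuted rather than individual matrix cells because
the pairwise distances are not independent of one another.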
17) I think it depends on how many values of K you are looking at and the
parameters you set, such as burn-in and the number of iterations. Good
practice is pretty computer-intensive. The range of K values tested should
extend to one more than the number of groups you expect (K+1). A burn-in of
10,000 is sufficient but 100,000 is best. Use at least 100,000 MCMC
replicates, but 1 million is best. 10-20 iterations (independent runs) per K
are also suggested. Also, when you average across your iterations, be sure
to use CLUMPP to align the runs so that you avoid issues like "label
switching".
I would also suggest you email Vikram Chhatre, who has written a program for
automating STRUCTURE analysis. His webpage is:
http://www.crypticlineage.net/index.html
Also there is a Google Discussion Group for STRUCTURE, which may be of some
help.
https://groups.google.com/forum/#!forum/structure-software
I suggest you read Gilbert et al. (2012) in Molecular Ecology. Publication
title: Recommendations for utilizing and reporting population genetic
analyses: the reproducibility of genetic clustering using the program
STRUCTURE.
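The run settings recommended in answer 17 can be organised as a run plan
before anything is launched. The sketch below only builds the command lines;
it does not run STRUCTURE. The file names, the output layout and the seed
scheme are placeholders, and the command-line flags (-K, -m, -e, -i, -o, -D)
and the BURNIN/NUMREPS keywords should be checked against the manual for
your version of STRUCTURE.

```python
def build_run_plan(n_groups, runs_per_k=10, input_file="genotypes.str",
                   mainparams="mainparams", extraparams="extraparams",
                   base_seed=1000):
    """Return one command (as an argument list) per (K, run) pair.

    K goes from 1 to n_groups + 1, following the K+1 suggestion.
    All file names and the seed scheme are placeholders.
    """
    plan = []
    for k in range(1, n_groups + 2):
        for run in range(1, runs_per_k + 1):
            seed = base_seed + len(plan)  # a distinct seed for every run
            plan.append([
                "structure",
                "-K", str(k),
                "-m", mainparams,
                "-e", extraparams,
                "-i", input_file,
                "-o", f"results/K{k}_run{run}",
                "-D", str(seed),
            ])
    return plan

# Chain lengths are set in the mainparams file rather than on the command
# line; these two lines use the upper values suggested in answer 17.
MAINPARAMS_LINES = [
    "#define BURNIN   100000",
    "#define NUMREPS  1000000",
]

if __name__ == "__main__":
    print("\n".join(MAINPARAMS_LINES))
    for command in build_run_plan(n_groups=3, runs_per_k=2):
        print(" ".join(command))
```

Writing the plan out first makes the cost visible: with 4 expected groups
and 10 runs per K this is already 50 separate chains, each with its own seed
and output file for later alignment with CLUMPP.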
Jessy Castellanos Gell
Genética para la Conservación
Centro de Investigaciones Marinas
Calle 16 No.114 entre 1ra. y 3ra. Miramar,
Playa, Ciudad de la Habana CP 10300. CUBA.
Tel.(537)203 06 17
jessy@cim.uh.cu
jessy@fbio.uh.cu