I have been away from lab for a week, but below I finally summarize the
replies that I received to my evoldir query, which was:
I have a question regarding the delta k method of inferring the most
likely k value in STRUCTURE. This method is discussed in the paper
Evanno G, S Regnaut, J Goudet 2005. Detecting the number of clusters of
individuals using the software STRUCTURE: a simulation study. Mol Ecol
14: 2611-2620.
Obviously, there is no way to validate k =1. But since k = 2 is based on
prior values (in part) generated for k =1, is it also invalid? The
reason I ask is that I often get the highest delta k value for k = 2,
which for my pops makes little sense. Intuitively, k = 2 would seem to
not be able to be validated by Evanno et al.'s method, but I need some
reinforcement for this assessment! Any input will be appreciated.
First off, a number of evoldir readers affirmed my concern - they, too,
have received the highest delta k values for k = 2, which they either
found not very informative or peculiar. Below, reader responses are
bracketed in quotation marks. My further comments are without quotes.
"There's no reason I can think of why evaluating at K=2 would be
invalid. You may be finding that K=2 is most probable because you have
hierarchical structure in your data set -- the Evanno method finds the
uppermost level of structure in a given data set. You may want to subset
your data based on the results of the individual assignments for K=2,
and run STRUCTURE on those subsets, and evaluate them with the Evanno
method. See Coulon et al 2008 (Mol Ecol 17, 1685-1701) for more
details."
This did occur to me, i.e., that STRUCTURE was capturing the highest
level of hierarchy, thus it would be appropriate to then test each
cluster independently for substructure. However, I noted that the vast
majority of my original 45 populations were being assigned to cluster 1
with high probabilities, and only one pop was almost completely admixed.
What disturbed me, however, was that one subpopulation from a cluster of
three very closely related populations from a previous published paper,
was being assigned to the second cluster in this mega-analysis.
"A more statistically motivated alternative to the Delta K method of
evanno is the DIC. We have implemented it in TESS (similar to structure
but using spatial coordinates in addition to the genetic data). You
might have a look at it
TESS 2.0: http://www-timc.imag.fr/Eric.Durand/soft.html."
My doctoral student is in fact using TESS. Meanwhile, realizing that a
minority of my 45 pops were at HWE, and a many were in fact inbred, I
began using InStruct, which drops STRUCTURE's HWE assumption and also
estimates inbreeding in each inferred population. InStruct also
implements the DIC (Deviance Information Criteria) to assess the optimal
k value. In fact finds k =7 to be most likely. Interestingly, after the
large peak of delta k at 2, 6 and 7 have next highest scores and are
statistically equal.
"In your case it might be better to follow the common sense approach
advocated in the STRUCTURE manual, which as far as I remember is to pick
the K where the likelihood, L(K), stops making large improvements, and
to combine this with prior biological knowledge. The delta K method is
designed to find this K objectively by calculating something close to
the maximum retardation, delta K [or abs(L"(K))], of the L(K) vs K
curve. In cases where the answer isn't obvious from the L(K) plot (it's
always worth looking at the plots), then I'd guess that the delta K plot
will also be hard to interpret.
"For example, if the "true" K is 1, then the delta K plot should be
approximately flat because there should be no K around which L(K) rises
substantially from K-1 to K and then rises much less from K to K+1 (this
is the signal that max delta K is trying to find). Of course the delta K
plot won't be exactly flat, but if there is no clear spike (e.g. the
spike at K=5 in Fig. 2D of Evanno et al.) then max delta K will reflect
noise. Perhaps this is what you have? It would be nice to have an
objective way to tell signal from noise, but for now it's best to
include a large dose of common sense (but see the PLoS Genetics paper
below).
"Other cases where the answer might not be obvious are where the model
of discrete admixed populations doesn't represent very well the true
population, or where the number of loci or samples is too small. Imagine
a situation with no prior information on population membership where
there is clear structure reflected in the L(K) plot, but it isn't
obvious which of a range of K is best. In this case the delta K method
at least has the advantage of objectivity, although reporting a
plausible range of K might make more sense than giving a point estimate.
"One thing I don't understand about the delta K method is the motivation
for using uses abs(L"(K)) instead of -L"(K), which is the true
retardation of the curve. Using the absolute value of L"(K) instead of
its negative ought to make the method trip up by picking out values of K
where L(K) jumps suddenly as well as those where it slows suddenly.
Perhaps this situation doesn't arise very often. Does using -L"(K)
instead of abs(L"(K)) make a difference in you case?
"Incidentally, there's an equivalent version of this approach for
determining the number of significant axes of variation in a principal
components analysis:
http://www.er.uqam.ca/nobel/r17165/RECHERCHE/COMMUNICATIONS/2006/IMPS/IM
PS_PRESENTATION_2006.pdf
It's based on finding the minimum acceleration, which is equivalent to
finding the maximum retardation.
"You might also want to look at another objective method of finding K:
PLoS Genet 2(12): e190 http://dx.doi.org/10.1371/journal.pgen.0020190
This one is related to PCA and is unique as far as I know in having
statistical and population genetic justification (the delta K method
being objective but ad hoc). However you'd need to check that it's valid
for your sample size and type of data."
In the past, I have used the "common sense approach." In other data
sets, there was convergence between the former and delta k. Choosing
the k value where L(k) first plateaus would lead me to k = 6 or 7 for
this larger data set.
"There is indeed no way to test for K=1 using Delta K. However, every
paper stresses the fact that there is no best way to define K and a
combination of the two methods (ln P(D) and delta K) has to be used. If
K=1 and we force the model for K=2, most individuals will have a
probability around 0.5 and 0.5 to belong to pop1 and pop2. In that case,
it's easy to see that K=2 is not the true K.
"I also found that in many cases, the delta K is highest for K=2. As
mentioned in Evanno et al. paper, " [...] when confronted with more
complex hierarchical level of population structure. In such situations,
the UPPERMOST hierarchical level of population structure is detected.
Subsequent analyses of subsets [...] allow finding the hidden
within-group structure.
"With real data sets, it's very likely that it exists complex
hierarchical level of population structure and it seems to me that this
is why we observe most of the time K=2 using Delta K. Subsets need to be
analysed!"
"My personal feeling on the subject is that when the best value of K is
2, if this one is not valid because the best value of K would have been
one, the probability of assignment of your individuals will all be
intermediate. However if you have good probabilities of assignment for
K=2 (let's say the majority of your individuals with a probability
higher than 0.8) then you can considerate that the test is OK."
"I have written to the authors about this, and the response is that
there is no way to distinguish between k=1 and k=2. I believe delta k
for k=2 is valid because delta is based on the difference between k=1
and k=2 data. The k=1 data from STRUCTURE are legitimate, so their use
in calculating delta k is fine, too. The problem is that there are no
data for k=0, so you can't calculate difference between k=1 and k=0. If
k=2 does not make sense for your pops but k=1 does, then you might want
consider that you have one admixed population."
My issue is that I believe that the real k is actually much higher.
Why? Because my study organism has very low gene flow, both by seed and
pollen, and the populations show significant isolation by distance at
the macro level. InStruct agrees with me, and the DIC puts k = 7 as
most likely. Admixture is largely restricted to populations in close
geographic proximity.
I hope this helps everyone else who wrote indicating they had similar
issues. The link to download InStruct (or run it on Cornell's server)
is: http://cbsuapps.tc.cornell.edu/InStruct.aspx
Sincere thanks to everyone who sent suggestions.
Alan Meerow
PLEASE NOTE THE NEW PHONE NUMBER!
Alan W. Meerow, Ph.D., Research Geneticist and Systematist
USDA-ARS-SHRS, National Germplasm Repository
13601 Old Cutler Road, Miami, FL 33158 USA
voice: 786-573-7075; FAX: 786-573-7110
email: alan.meerow@ars.usda.gov
Alan.Meerow@ARS.USDA.GOV