I have been away from lab for a week, but below I finally summarize the replies that I received to my evoldir query, which was: I have a question regarding the delta k method of inferring the most likely k value in STRUCTURE. This method is discussed in the paper Evanno G, S Regnaut, J Goudet 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14: 2611-2620. Obviously, there is no way to validate k =1. But since k = 2 is based on prior values (in part) generated for k =1, is it also invalid? The reason I ask is that I often get the highest delta k value for k = 2, which for my pops makes little sense. Intuitively, k = 2 would seem to not be able to be validated by Evanno et al.'s method, but I need some reinforcement for this assessment! Any input will be appreciated. First off, a number of evoldir readers affirmed my concern - they, too, have received the highest delta k values for k = 2, which they either found not very informative or peculiar. Below, reader responses are bracketed in quotation marks. My further comments are without quotes. "There's no reason I can think of why evaluating at K=2 would be invalid. You may be finding that K=2 is most probable because you have hierarchical structure in your data set -- the Evanno method finds the uppermost level of structure in a given data set. You may want to subset your data based on the results of the individual assignments for K=2, and run STRUCTURE on those subsets, and evaluate them with the Evanno method. See Coulon et al 2008 (Mol Ecol 17, 1685-1701) for more details." This did occur to me, i.e., that STRUCTURE was capturing the highest level of hierarchy, thus it would be appropriate to then test each cluster independently for substructure. However, I noted that the vast majority of my original 45 populations were being assigned to cluster 1 with high probabilities, and only one pop was almost completely admixed. What disturbed me, however, was that one subpopulation from a cluster of three very closely related populations from a previous published paper, was being assigned to the second cluster in this mega-analysis. "A more statistically motivated alternative to the Delta K method of evanno is the DIC. We have implemented it in TESS (similar to structure but using spatial coordinates in addition to the genetic data). You might have a look at it TESS 2.0: http://www-timc.imag.fr/Eric.Durand/soft.html." My doctoral student is in fact using TESS. Meanwhile, realizing that a minority of my 45 pops were at HWE, and a many were in fact inbred, I began using InStruct, which drops STRUCTURE's HWE assumption and also estimates inbreeding in each inferred population. InStruct also implements the DIC (Deviance Information Criteria) to assess the optimal k value. In fact finds k =7 to be most likely. Interestingly, after the large peak of delta k at 2, 6 and 7 have next highest scores and are statistically equal. "In your case it might be better to follow the common sense approach advocated in the STRUCTURE manual, which as far as I remember is to pick the K where the likelihood, L(K), stops making large improvements, and to combine this with prior biological knowledge. The delta K method is designed to find this K objectively by calculating something close to the maximum retardation, delta K [or abs(L"(K))], of the L(K) vs K curve. In cases where the answer isn't obvious from the L(K) plot (it's always worth looking at the plots), then I'd guess that the delta K plot will also be hard to interpret. "For example, if the "true" K is 1, then the delta K plot should be approximately flat because there should be no K around which L(K) rises substantially from K-1 to K and then rises much less from K to K+1 (this is the signal that max delta K is trying to find). Of course the delta K plot won't be exactly flat, but if there is no clear spike (e.g. the spike at K=5 in Fig. 2D of Evanno et al.) then max delta K will reflect noise. Perhaps this is what you have? It would be nice to have an objective way to tell signal from noise, but for now it's best to include a large dose of common sense (but see the PLoS Genetics paper below). "Other cases where the answer might not be obvious are where the model of discrete admixed populations doesn't represent very well the true population, or where the number of loci or samples is too small. Imagine a situation with no prior information on population membership where there is clear structure reflected in the L(K) plot, but it isn't obvious which of a range of K is best. In this case the delta K method at least has the advantage of objectivity, although reporting a plausible range of K might make more sense than giving a point estimate. "One thing I don't understand about the delta K method is the motivation for using uses abs(L"(K)) instead of -L"(K), which is the true retardation of the curve. Using the absolute value of L"(K) instead of its negative ought to make the method trip up by picking out values of K where L(K) jumps suddenly as well as those where it slows suddenly. Perhaps this situation doesn't arise very often. Does using -L"(K) instead of abs(L"(K)) make a difference in you case? "Incidentally, there's an equivalent version of this approach for determining the number of significant axes of variation in a principal components analysis: http://www.er.uqam.ca/nobel/r17165/RECHERCHE/COMMUNICATIONS/2006/IMPS/IM PS_PRESENTATION_2006.pdf It's based on finding the minimum acceleration, which is equivalent to finding the maximum retardation. "You might also want to look at another objective method of finding K: PLoS Genet 2(12): e190 http://dx.doi.org/10.1371/journal.pgen.0020190 This one is related to PCA and is unique as far as I know in having statistical and population genetic justification (the delta K method being objective but ad hoc). However you'd need to check that it's valid for your sample size and type of data." In the past, I have used the "common sense approach." In other data sets, there was convergence between the former and delta k. Choosing the k value where L(k) first plateaus would lead me to k = 6 or 7 for this larger data set. "There is indeed no way to test for K=1 using Delta K. However, every paper stresses the fact that there is no best way to define K and a combination of the two methods (ln P(D) and delta K) has to be used. If K=1 and we force the model for K=2, most individuals will have a probability around 0.5 and 0.5 to belong to pop1 and pop2. In that case, it's easy to see that K=2 is not the true K. "I also found that in many cases, the delta K is highest for K=2. As mentioned in Evanno et al. paper, " [...] when confronted with more complex hierarchical level of population structure. In such situations, the UPPERMOST hierarchical level of population structure is detected. Subsequent analyses of subsets [...] allow finding the hidden within-group structure. "With real data sets, it's very likely that it exists complex hierarchical level of population structure and it seems to me that this is why we observe most of the time K=2 using Delta K. Subsets need to be analysed!" "My personal feeling on the subject is that when the best value of K is 2, if this one is not valid because the best value of K would have been one, the probability of assignment of your individuals will all be intermediate. However if you have good probabilities of assignment for K=2 (let's say the majority of your individuals with a probability higher than 0.8) then you can considerate that the test is OK." "I have written to the authors about this, and the response is that there is no way to distinguish between k=1 and k=2. I believe delta k for k=2 is valid because delta is based on the difference between k=1 and k=2 data. The k=1 data from STRUCTURE are legitimate, so their use in calculating delta k is fine, too. The problem is that there are no data for k=0, so you can't calculate difference between k=1 and k=0. If k=2 does not make sense for your pops but k=1 does, then you might want consider that you have one admixed population." My issue is that I believe that the real k is actually much higher. Why? Because my study organism has very low gene flow, both by seed and pollen, and the populations show significant isolation by distance at the macro level. InStruct agrees with me, and the DIC puts k = 7 as most likely. Admixture is largely restricted to populations in close geographic proximity. I hope this helps everyone else who wrote indicating they had similar issues. The link to download InStruct (or run it on Cornell's server) is: http://cbsuapps.tc.cornell.edu/InStruct.aspx Sincere thanks to everyone who sent suggestions. Alan Meerow PLEASE NOTE THE NEW PHONE NUMBER! Alan W. Meerow, Ph.D., Research Geneticist and Systematist USDA-ARS-SHRS, National Germplasm Repository 13601 Old Cutler Road, Miami, FL 33158 USA voice: 786-573-7075; FAX: 786-573-7110 email: alan.meerow@ars.usda.gov Alan.Meerow@ARS.USDA.GOV