Hi,

Following my question about bootstrapping a distance matrix to produce a
phylogenetic tree with bootstrap confidence values, I got many useful
replies. My original question is below and the summary of the replies is
further down in this message.

On 15-12-18 10:57 AM, Eric Normandeau wrote:
> Hi,
>
> I'm creating a tree from a distance matrix using phylip neighbor and
> phylip drawgram. I would like to have bootstrap values for the
> branches but all I have is this one distance matrix. Is this possible?
>
> The distance matrix is calculated from a Genotyping by Sequencing
> (GBS) dataset with a few thousand SNPs. For each individual pair, the
> distance is basically the proportion of genotypes that differ for loci
> where both individuals are genotyped.
>
> I could easily create different distance matrices by randomly sampling
> only some of the SNPs to calculate the distances, but that would not
> make much sense. For example, if I use a low proportion of the SNPs
> (ie: 5%), the bootstrap values will end up being lower because the
> matrices will be more different from one another than if I use a high
> proportion of the SNPs (ie: 50+%). It feels like I can control the
> bootstrap values artificially, so that is not a good avenue.
>
> Is there a way, given a single distance matrix, to create a tree with
> bootstrap values?
>
> I'll be happy to summarize the answers and post them back on EvolDir.
>
> Eric
>

Here is the summary of the very useful replies:

1) I cannot bootstrap the distance matrix itself
2) I can bootstrap the markers (random choice with replacement) and
   produce 1000 distance matrices
3) What I was describing (sampling only a subset of the SNPs) is a
   jackknife procedure, not a bootstrap procedure

David Remington also suggested that I could use a nucleotide
substitution model to calculate a more appropriate distance metric.
However, he mentions that this approach wouldn't work if the SNP dataset
has been pre-filtered, for example by retaining only the variable
markers, since the filtering would make the nucleotide substitution
model incorrect. In our case, we have to work with pre-selected SNPs.
Otherwise we would be using hundreds of thousands of SNPs with a lot of
missing data, most of which would be variable in only one individual and
are thus more likely to be sequencing errors.

David also asked whether this was a within-species study. It is not: it
uses specimens from a variety of related species with low numbers of
samples per species. That is why we went for a phylogenetic approach.

Jack Cameron suggested using a Canberra distance, which is a robust
Manhattan metric, and then using Pvclust to bootstrap once I have the
distance matrices. The Canberra distance between two vectors of numbers
P and Q is the sum, over each index i, of |Pi - Qi| / (|Pi| + |Qi|).
Since my pairwise comparisons between two samples use all the loci for
which both samples were genotyped, the length of the P and Q vectors
varies among the pairwise comparisons, so I would then divide the sum by
that length to normalize it.

This leaves the problem of encoding the possible genotypes (AA, AB, BB)
as numbers, probably with AA=1, AB=2 and BB=3. In this context, however,
I do not see how dividing by (|Pi| + |Qi|) makes any sense. It basically
gives more weight to a case of AA-AB than to a case of AB-BB, since the
latter will be divided by 5, compared to 3 for the former. My
implementation of the distance is equivalent to using only the numerator
(|Pi - Qi|) part of the Canberra distance and then dividing the sum by
the length of the vectors.

Any further thoughts on using a different distance metric would be
highly welcome!
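
To make the weighting issue concrete, here is a minimal Python sketch of
the two distances discussed above, plus the marker bootstrap (resampling
loci with replacement) suggested in the replies. It is not my actual
pipeline. The function names, the None value for missing genotypes and
the AA=1, AB=2, BB=3 encoding are assumptions made for illustration only.

```python
import random

# Assumed encoding (illustrative only): AA=1, AB=2, BB=3, None = missing.
AA, AB, BB = 1, 2, 3


def shared_loci(p, q):
    """Return genotype pairs for loci where both samples are genotyped."""
    return [(a, b) for a, b in zip(p, q) if a is not None and b is not None]


def mean_abs_distance(p, q):
    """Numerator-only distance: mean of |Pi - Qi| over shared loci."""
    pairs = shared_loci(p, q)
    if not pairs:
        return float("nan")  # no locus genotyped in both samples
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def mean_canberra_distance(p, q):
    """Canberra terms |Pi - Qi| / (|Pi| + |Qi|), averaged over shared loci."""
    pairs = shared_loci(p, q)
    if not pairs:
        return float("nan")
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in pairs) / len(pairs)


def bootstrap_loci(genotypes, rng):
    """Resample loci with replacement, using the same draw for all samples.

    `genotypes` maps a sample name to its list of genotypes, one per locus.
    The replicate has as many loci as the original dataset.
    """
    n_loci = len(next(iter(genotypes.values())))
    cols = [rng.randrange(n_loci) for _ in range(n_loci)]
    return {name: [g[i] for i in cols] for name, g in genotypes.items()}


if __name__ == "__main__":
    # The Canberra denominator weights AA-AB (1/3) more than AB-BB (1/5),
    # while the numerator-only distance treats both the same (1.0).
    print(mean_canberra_distance([AA], [AB]), mean_canberra_distance([AB], [BB]))
    print(mean_abs_distance([AA], [AB]), mean_abs_distance([AB], [BB]))

    toy = {"s1": [AA, AB, None, BB], "s2": [AA, BB, AB, BB]}
    replicate = bootstrap_loci(toy, random.Random(42))
    print(mean_abs_distance(replicate["s1"], replicate["s2"]))
```

Calling bootstrap_loci once per replicate (for example 1000 times) and
computing all pairwise distances on each replicate would give one
distance matrix per replicate for the tree-building step.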

I will then implement the following:

1) Use my VCF file with my SNP markers and bootstrap it 1000 times
2) Use Phylip to compute the consensus tree and bootstrap confidence
   values
3) Keep an eye out for how to use a different distance metric

Many thanks to all who replied to my question (in chronological order of
replies):

- Bernhard Haubold
- Peter Smouse
- Francisco Bilbao Moore
- David Remington
- Murray Cox
- Jack Cameron
- Frank E. Anderson
- Mary Kuhner
- Miguel Navascués
- Louis Ranjard

Eric Normandeau - Bioinformaticien
Laboratoire de Louis Bernatchez
Ressources Aquatiques Québec (RAQ)
Institut de Biologie Intégrative et des Systèmes (IBIS)
Pavillon Charles-Eugène-Marchand
1030, Avenue de la Médecine
Local 1143
Université Laval
Québec (Québec) G1V 0A6
Canada
Tél: 418 656-2131 poste 8455
Courriel: eric.normandeau@bio.ulaval.ca