Hi,

Following my question about bootstrapping a distance matrix to produce a 
phylogenetic tree with bootstrap confidence values, I got many useful 
replies. My original question is below and the replies summary is 
further down this message.

On 15-12-18 10:57 AM, Eric Normandeau wrote:
> Hi,
>
> I'm creating a tree from a distance matrix using phylip neighbor and 
> phylip drawgram. I would like to have bootstrap values for the 
> branches but all I have is this one distance matrix. Is this possible?
>
> The distance matrix is calculated from a Genotyping by Sequencing 
> (GBS) dataset with a few thousand SNPs. For each individual pair, the 
> distance is basically the proportion of genotypes that differ for loci 
> where both individuals are genotyped.
>
> I could easily create different distance matrices by randomly sampling 
> only some of the SNPs to calculate the distances, but that would not 
> make much sense. For example, if I use a low proportion of the SNPs 
> (e.g. 5%), the bootstrap values will end up being lower because the 
> matrices will be more different from one another than if I use a high 
> proportion of the SNPs (e.g. 50+%). It feels like I can control the 
> bootstrap values artificially, so that is not a good avenue.
>
> Is there a way, given a single distance matrix, to create a tree with 
> bootstrap values?
>
> I'll be happy to summarize the answers and post them back on EvolDir.
>
> Eric
>
Here is the summary of the very useful replies:

1) I cannot bootstrap the distance matrix itself
2) I can bootstrap the markers (sample them at random with replacement) 
and produce 1000 distance matrices
3) Randomly sampling only some of the SNPs, as I described, is a 
jackknife procedure, not a bootstrap procedure
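
Point 2 can be sketched in Python as follows. This is only a minimal 
illustration, not the actual pipeline: it assumes the genotypes have 
already been parsed from the VCF into one list per individual (None for 
a missing genotype), and the function names and toy data are made up for 
the example. Each replicate draws as many loci as the original dataset, 
with replacement, and the distance is the proportion of differing 
genotypes among loci where both individuals are genotyped.

```python
import random

def pairwise_distance(g1, g2, loci):
    """Proportion of differing genotypes over the loci genotyped in both."""
    shared = [(g1[i], g2[i]) for i in loci
              if g1[i] is not None and g2[i] is not None]
    if not shared:
        return float("nan")  # no locus in common for this pair
    return sum(a != b for a, b in shared) / len(shared)

def distance_matrix(genotypes, loci):
    """Square distance matrix, rows and columns in sorted sample order."""
    names = sorted(genotypes)
    return [[pairwise_distance(genotypes[a], genotypes[b], loci)
             for b in names] for a in names]

def bootstrap_matrices(genotypes, n_replicates=1000, seed=0):
    """One distance matrix per bootstrap replicate of the markers."""
    rng = random.Random(seed)
    n_loci = len(next(iter(genotypes.values())))
    replicates = []
    for _ in range(n_replicates):
        # Bootstrap: sample n_loci locus indices WITH replacement.
        loci = [rng.randrange(n_loci) for _ in range(n_loci)]
        replicates.append(distance_matrix(genotypes, loci))
    return replicates

# Hypothetical toy data: 3 individuals, 6 SNPs.
genotypes = {
    "ind1": ["AA", "AB", "BB", "AA", None, "AB"],
    "ind2": ["AA", "BB", "BB", "AB", "AA", None],
    "ind3": ["BB", "BB", "AB", "AB", "AA", "AB"],
}
matrices = bootstrap_matrices(genotypes, n_replicates=100)
```

Because every replicate has the same number of loci as the original 
dataset, there is no sampling proportion left to tune, which is what 
made the subsampling approach feel arbitrary.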

David Remington also suggested that I could use a nucleotide 
substitution model to calculate a more appropriate distance metric. 
However, he mentions that this approach wouldn't work if the SNP dataset 
has been pre-filtered, for example by retaining only the variable 
markers, as it would make the nucleotide substitution model incorrect. 
In our case, we have to work with pre-selected SNPs, otherwise we would 
be using hundreds of thousands of SNPs with a lot of missing data, most 
of which would be variable in only one individual and thus more likely 
to be sequencing errors.

David also asked if this was a within-species study and it is not. It 
uses specimens from a variety of related species with low numbers of 
samples per species. That is why we went for a phylogenetic approach.

Jack Cameron suggested using a Canberra distance, which is a robust 
Manhattan metric, and then using Pvclust to bootstrap once I have the 
distance matrices. The Canberra distance between two vectors of numbers 
P and Q is the sum, over each index i, of 
(|Pi - Qi|) / (|Pi| + |Qi|). Since each pairwise comparison between two 
samples uses all the loci for which both samples were genotyped, the 
length of the P and Q vectors varies among the pairwise comparisons, so 
I would then divide by that length to normalize the distances. This 
leaves the problem of encoding the possible genotypes (AA, AB, BB) as 
numbers, probably with AA=1, AB=2 and BB=3. In this context, however, I 
do not see how dividing by (|Pi| + |Qi|) makes any sense. It basically 
gives more weight to a case of AA-AB than to a case of AB-BB since the 
latter will be divided by 5, compared to 3 for the former. My 
implementation of the distance is equivalent to using only the numerator 
(|Pi - Qi|) part of the Canberra distance and then dividing the sum by 
the length of the vectors.
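
To make the weighting issue concrete, here is a small Python sketch. It 
assumes the AA=1, AB=2, BB=3 encoding mentioned above; the function 
names are hypothetical, and both distances are divided by the number of 
shared loci as described.

```python
CODES = {"AA": 1, "AB": 2, "BB": 3}  # assumed encoding from the text above

def encode_shared(g1, g2):
    """Numeric (Pi, Qi) pairs for loci genotyped in both samples."""
    return [(CODES[a], CODES[b]) for a, b in zip(g1, g2)
            if a is not None and b is not None]

def canberra_normalized(pairs):
    """Sum of |Pi - Qi| / (|Pi| + |Qi|), divided by the vector length."""
    if not pairs:
        return float("nan")
    total = 0.0
    for p, q in pairs:
        denom = abs(p) + abs(q)
        if denom:  # skip 0/0 terms; cannot occur with the 1/2/3 encoding
            total += abs(p - q) / denom
    return total / len(pairs)

def numerator_only_normalized(pairs):
    """Sum of |Pi - Qi| only, divided by the vector length."""
    if not pairs:
        return float("nan")
    return sum(abs(p - q) for p, q in pairs) / len(pairs)

# A single AA-AB difference versus a single AB-BB difference.
aa_ab = encode_shared(["AA"], ["AB"])
ab_bb = encode_shared(["AB"], ["BB"])
print(canberra_normalized(aa_ab), canberra_normalized(ab_bb))
print(numerator_only_normalized(aa_ab), numerator_only_normalized(ab_bb))
```

With this encoding the Canberra term is 1/3 for AA-AB but 1/5 for AB-BB, 
while the numerator-only version gives 1 in both cases.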

Any further thoughts on using a different distance metric would be 
highly welcome!

I will then implement the following:

1) Use my VCF file with my SNP markers and bootstrap it 1000 times
2) Use Phylip to compute the consensus tree and bootstrap confidence values
3) Keep an eye out for how to use a different distance metric
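
For step 2, the replicate matrices need to reach Phylip in a form it can 
read. The Python sketch below is one possible way to do it, under the 
assumption that all replicates are written one after another in square 
PHYLIP distance format (taxon count on the first line, then one row per 
taxon with the name padded to 10 characters). As far as I understand, 
neighbor can then be run with its multiple data sets option (M) and the 
resulting trees passed to consense. The function names and the output 
file name are hypothetical.

```python
def format_phylip_matrix(names, matrix):
    """One square distance matrix as PHYLIP-formatted text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP names occupy 10 characters
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"

def write_replicates(path, names, matrices):
    """Write all replicate matrices back to back into a single file."""
    with open(path, "w") as handle:
        for matrix in matrices:
            handle.write(format_phylip_matrix(names, matrix))

# Hypothetical example with two taxa and two replicates.
names = ["ind1", "ind2"]
replicates = [
    [[0.0, 0.5], [0.5, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
]
text = "".join(format_phylip_matrix(names, m) for m in replicates)
# write_replicates("infile", names, replicates)  # uncomment to write to disk
```

Sample names longer than 10 characters are truncated here, so they would 
have to stay unique after truncation.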

Many thanks to all who replied to my question (in chronological order of 
replies):

- Bernhard Haubold
- Peter Smouse
- Francisco Bilbao Moore
- David Remington
- Murray Cox
- Jack Cameron
- Frank E. Anderson
- Mary Kuhner
- Miguel Navascués
- Louis Ranjard

Eric Normandeau - Bioinformatician
Laboratoire de Louis Bernatchez
Ressources Aquatiques Québec (RAQ)

Institut de Biologie Intégrative et des Systèmes (IBIS)
Pavillon Charles-Eugène-Marchand
1030, Avenue de la Médecine
Local 1143
Université Laval
Québec (Québec) G1V 0A6
Canada

Tel: 418 656-2131 ext. 8455
Email: eric.normandeau@bio.ulaval.ca
