Hi,
Following my question about bootstrapping a distance matrix to produce a
phylogenetic tree with bootstrap confidence values, I got many useful
replies. My original question is below and the replies summary is
further down this message.
On 15-12-18 10:57 AM, Eric Normandeau wrote:
> Hi,
>
> I'm creating a tree from a distance matrix using phylip neighbor and
> phylip drawgram. I would like to have bootstrap values for the
> branches but all I have is this one distance matrix. Is this possible?
>
> The distance matrix is calculated from a Genotyping by Sequencing
> (GBS) dataset with a few thousand SNPs. For each individual pair, the
> distance is basically the proportion of genotypes that differ for loci
> where both individuals are genotyped.
>
> I could easily create different distance matrices by randomly sampling
> only some of the SNPs to calculate the distances, but that would not
> make much sense. For example, if I use a low proportion of the SNPs
> (e.g., 5%), the bootstrap values will end up being lower because the
> matrices will be more different from one another than if I use a high
> proportion of the SNPs (e.g., 50+%). It feels like I can control the
> bootstrap values artificially, so that is not a good avenue.
>
> Is there a way, given a single distance matrix, to create a tree with
> bootstrap values?
>
> I'll be happy to summarize the answers and post them back on EvolDir.
>
> Eric
>
Here is the summary of the very useful replies:
1) I cannot bootstrap the distance matrix
2) I can bootstrap the markers (random choice with replacement) and
produce 1000 distance matrices
3) I was describing a jackknife procedure, not a bootstrap procedure
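Point 2 above can be sketched as follows. This is only a minimal illustration, assuming genotypes are already loaded into a dictionary of per-sample lists (the names `genotypes`, `pairwise_distance` and `bootstrap_matrices` are hypothetical, and VCF parsing is not shown). Each replicate draws as many loci as the original dataset, with replacement, so there is no sampling proportion to tune.

```python
import random

MISSING = None  # assumption: an ungenotyped locus is stored as None


def pairwise_distance(a, b, loci):
    """Proportion of differing genotypes over loci where both samples are genotyped."""
    shared = [i for i in loci if a[i] is not MISSING and b[i] is not MISSING]
    if not shared:
        return float("nan")  # no locus genotyped in both samples
    return sum(a[i] != b[i] for i in shared) / len(shared)


def distance_matrix(genotypes, loci):
    """Square distance matrix, samples in sorted name order."""
    names = sorted(genotypes)
    return [[pairwise_distance(genotypes[x], genotypes[y], loci) for y in names]
            for x in names]


def bootstrap_matrices(genotypes, n_replicates, seed=0):
    """Yield one distance matrix per bootstrap replicate of the loci."""
    rng = random.Random(seed)
    n_loci = len(next(iter(genotypes.values())))
    for _ in range(n_replicates):
        # Draw n_loci locus indices WITH replacement (the bootstrap step)
        loci = [rng.randrange(n_loci) for _ in range(n_loci)]
        yield distance_matrix(genotypes, loci)


# Toy data, invented for illustration only
genotypes = {
    "ind1": ["AA", "AB", "BB", None, "AA"],
    "ind2": ["AA", "BB", "BB", "AB", "AA"],
    "ind3": ["AB", "AB", None, "AB", "BB"],
}
replicates = list(bootstrap_matrices(genotypes, n_replicates=1000))
```

The same locus sample is used for every pair within a replicate, so each replicate matrix is internally consistent.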
David Remington also suggested that I could use a nucleotide
substitution model to calculate a more appropriate distance metric.
However, he mentions that this approach wouldn't work if the SNP dataset
has been pre-filtered, for example by retaining only the variable
markers, as it would make the nucleotide substitution model incorrect.
In our case, we have to work with pre-selected SNPs, otherwise we would
be using hundreds of thousands of SNPs with a lot of missing data and
most of which would be variable in only one individual, thus more
probably sequencing errors.
David also asked if this was a within-species study and it is not. It
uses specimens from a variety of related species with low numbers of
samples per species. That is why we went for a phylogenetic approach.
Jack Cameron suggested using a Canberra distance, which is a robust
Manhattan metric and then to use Pvclust to bootstrap once I have the
distance matrices. The Canberra distance between two numeric vectors P
and Q is the sum, over each index i, of
(|Pi - Qi|) / (|Pi| + |Qi|). Since my pairwise comparisons between two
samples use all the loci for which both samples were genotyped, the
length of the P and Q vectors varies among pairwise comparisons, so I
would then divide by the length to normalize them. This
leaves the problem of encoding the possible genotypes (AA, AB, BB) as
numbers, probably with AA=1, AB=2 and BB=3. In this context, however, I
do not see how dividing by (|Pi| + |Qi|) makes any sense. It basically
gives more weight to a case of AA-AB than to a case of AB-BB since the
latter will be divided by 5, compared to 3 for the former. My implementation of
the distance is equivalent to using only the numerator (|Pi - Qi|) part
of the Canberra distance and then dividing the sum by the length of the
vectors.
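The weighting issue can be checked with a few lines of arithmetic. This is a small sketch using the AA=1, AB=2, BB=3 coding mentioned above; the function names are hypothetical and not taken from any library.

```python
CODES = {"AA": 1, "AB": 2, "BB": 3}  # coding proposed in the text


def canberra_term(p, q):
    """Contribution of one locus to the Canberra distance."""
    return abs(p - q) / (abs(p) + abs(q))


def numerator_only_distance(p, q):
    """Sum of |Pi - Qi| divided by the vector length (no per-locus denominator)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


aa_ab = canberra_term(CODES["AA"], CODES["AB"])  # |1 - 2| / (1 + 2) = 1/3
ab_bb = canberra_term(CODES["AB"], CODES["BB"])  # |2 - 3| / (2 + 3) = 1/5

# Without the denominator, both one-allele differences weigh the same
same_weight = abs(CODES["AA"] - CODES["AB"]) == abs(CODES["AB"] - CODES["BB"])
```

Under this coding an AA-BB pair contributes 2 to the numerator-only sum, so it counts twice as much as a one-allele difference rather than as a single mismatch.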
Any further thoughts on using a different distance metric would be
highly welcome!
I will then implement the following:
1) Use my VCF file with my SNP markers and bootstrap it 1000 times
2) Use Phylip to compute the consensus tree and bootstrap confidence values
3) Keep an eye out for how to use a different distance metric
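For step 2, the replicate matrices need to reach Phylip as one input file. Below is a rough sketch, assuming the replicate matrices already exist as nested lists; `format_phylip_matrix` and `write_replicates` are hypothetical helper names. It writes square matrices back to back in the Phylip distance format (taxon count on the first line, then names padded to 10 characters).

```python
def format_phylip_matrix(names, matrix):
    """Return one square distance matrix as Phylip-format text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # Phylip reads the first 10 characters of each row as the taxon name
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


def write_replicates(path, names, matrices):
    """Write all replicate matrices, one after another, into a single file."""
    with open(path, "w") as handle:
        for matrix in matrices:
            handle.write(format_phylip_matrix(names, matrix))


# Toy example, values invented for illustration only
example = format_phylip_matrix(
    ["ind1", "ind2"],
    [[0.0, 0.25], [0.25, 0.0]],
)
```

The resulting file should be readable by neighbor with the multiple data sets option (M) turned on, and the trees it outputs can then be passed to consense. Pairs with no shared loci would need handling before writing, since a missing distance is not valid input.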
Many thanks to all who replied to my question (in chronological order of
replies):
- Bernhard Haubold
- Peter Smouse
- Francisco Bilbao Moore
- David Remington
- Murray Cox
- Jack Cameron
- Frank E. Anderson
- Mary Kuhner
- Miguel Navascués
- Louis Ranjard
Eric Normandeau - Bioinformaticien
Laboratoire de Louis Bernatchez
Ressources Aquatiques Québec (RAQ)
Institut de Biologie Intégrative et des Systèmes (IBIS)
Pavillon Charles-Eugène-Marchand
1030, Avenue de la Médecine
Local 1143
Université Laval
Québec (Québec) G1V 0A6
Canada
Tél: 418 656-2131 poste 8455
Courriel: eric.normandeau@bio.ulaval.ca