Dear EvolDir, a few days ago I posted a query about how to construct a distance tree for 1 million taxa. I posted this query on behalf of someone else (the "OP", if you like), who has assembled a very large dataset of pairwise Smith-Waterman distances between various proteins. The OP's plan was to use these distances to give some structure to a landscape of proteins (orthologs and paralogs from a variety of taxa) by clustering them by distance. Various responders assumed/hoped that the OP also has a sequence alignment, but that's actually not the case. Here's a summary of the responses received: Several responders suggested tools that operate directly on distances. Sergios-Orestis Kolokotronis suggested that QuickTree/QuickJoin ( http://www.daimi.au.dk/~mailund/quick-join.html) might be of use. Jeff Boore suggested a pipeline called PHRINGE ( http://oomycetes.genomeprojectsolutions-databases.com/) that is quite close, in spirit, to what the OP is trying to achieve. A number of responders (Jason Stajich, Paramvir Dehal, Mark Holder) suggested FastTree (http://www.microbesonline.org/fasttree/). However, this requires a protein alignment, which the OP doesn't have. Brian Foley discussed the problem of aligning this many sequences, with some special case suggestions for doing this using an HMMer model. Not really the situation that the OP is dealing with, but an interesting problem nonetheless. Others (Xuhua Xia, Jason Caravas) discussed the issue of allocating memory for this many pairwise distances - the implication being that, if that's taken care of, number-crunching thru all these pairs using neighbor-joining (however implemented) is something that is done in linear time and therefore manageable. For example, Xuhua Xia suggested that this could be done using DAMBE, memory requirements permitting. Jason Caravas suggested that one could also re-implement the neighbor-joining algorithm, which isn't that complex (and usefully described in Joe Felsenstein's book "Inferring Phylogenies"). Ruchira S. Datta recommended a paper ( http://bioinformatics.oxfordjournals.org/content/24/13/i41.abstract?keytype=ref&ijkey=tHbUa129vjtj6z) that discusses algorithms that don't require the entire distance matrix in memory at once. Thank you all for your responses! Rutger -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com rutgeraldo@gmail.com