Dear All

I would like to thank everyone who contributed to the huge response
regarding my recent query, "How many clones should I sequence?",
reproduced below:

"When cloning and identifying individual alleles of mixed PCR products, a
key question is how many clones do I have to sequence to ensure that I
have approached a good probability (e.g. 95%) that I have identified all
different copies. In the past when I have been isolating two
heterozygotic copies of a nuclear marker, simply sequencing five
colonies always identified the two variants. But how many colonies would
I have to sequence if there were 3, 4, 5, ... 20 etc.?"

Many solutions have been supplied and I have taken the liberty of
reproducing most contributions below. Some favour experimental
approaches such as SSCP; others offer standard probability solutions,
binomial calculations and R scripts, the Poisson distribution, accumulation
curves and online calculators, along with repeated warnings about
recombinant cloning issues. Thank you for the comprehensive response, and
please take what you like from the following collation.

Sincere apologies to any contributors that I may not have replied
directly to and please note that some contributions have not been
reposted due to txt/.html conversion issues/duplication/attachment problems.

Hopefully, we should be able to better gauge how many sequencing
reactions we should now be expending on complex cloning mixtures.

With very best wishes and thanks

Si

Dear Si Creer,

If the two copies are present in equal numbers in your PCR mix, this is
easy to calculate. The probability of obtaining sequences of both copies
is (1 - probability that all sequenced clones are identical):
1 - 2*(0.5)^n
, with n = the number of sequenced clones.

If you want to have 0.95 probability of having sequenced both copies,
1 - 2*(0.5)^n = 0.95 ->
n = 5.3

This means that you will have to sequence 6 clones to have a >95%
probability. With 5 sequenced clones, the probability of having both
copies is 1 - 2*(0.5)^5 = 0.9375.
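
The arithmetic above can be checked with a few lines of R. This is only a minimal sketch: the function name p_both is made up for illustration, and it assumes, as the message does, that the two copies are equally represented among the clones.

```r
# Probability of seeing both of two equally frequent copies among n clones:
# 1 minus the chance that all n clones carry the same copy.
p_both <- function(n) 1 - 2 * 0.5^n

p_both(5)   # 0.9375
p_both(6)   # 0.96875

# Smallest whole number of clones giving at least a 95% probability
n_needed <- min(which(p_both(1:50) >= 0.95))
n_needed    # 6
```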

Duur

Duur Aanen
Laboratory of Genetics
Wageningen University and Research Center
The Netherlands
Tel. +31(0)317 482706
Mobile: +31 (0)6 10327948
Fax: +31 (0)317 483146
http://www.gen.wur.nl/uk/staff/postdocs/duur+aanen/

Dear Si,

calculating the probability of missing one of the two alleles under
the assumption that the two alleles are amplified and cloned with the
same probability is very easy. Sort of white and black balls picked
out of a sack (with replacement). I kind of remember the number of
clones to be six if you want to go under a 5% probability of missing
one of the two alleles in a heterozygote. But you can calculate this
based on the ploidy of your system.

But do the assumptions hold ? I am not sure.
This is a starting point, nevertheless.

cheers
francesco

++++++++++++++++++++++++
Francesco Nardi, Dr.

Dept. of Evolutionary Biology
University of Siena

Master in Bioinformatics
University of Siena

via Aldo Moro 2 - 53100 Siena
Italy

Ph.: +39.0577.234420 (lab. 4398)
Fax.: +39.0577.234476

Here is a quick and easy way:

Assuming you have a diploid organism and thus two PCR products in equal
or near proportions and have one sequence already in the bag, then the
chance that you pull the wrong one (probability of a miss) next time is
0.5 and the chance that you do so the next time as well is 0.5 x
0.5..........

Since you have set the acceptable probability of a miss at 0.05, all
you have to do is keep pushing the buttons on your calculator to
multiply 0.5 by itself until the answer on the display is <0.05.  If you
have kept count of the number of times you pressed the button, then that
is how many clones you need to sequence.

This method works for any probability level or any fractional
representation of what you are after in the mixture.  It is a standard
library screening question and can be solved with a fancy formula using
logs and other horrid stuff.  My way is easier if you can keep count.
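
For anyone who would rather not keep count, the "fancy formula using logs" can be sketched in R as below. The function name clones_needed and its arguments are made up for illustration; p_hit is the assumed fraction of the mixture made up by the sequence you are after, and alpha is the acceptable probability of a miss.

```r
# Smallest n for which the chance of n consecutive misses, (1 - p_hit)^n,
# is at or below alpha.
clones_needed <- function(p_hit, alpha = 0.05) {
  ceiling(log(alpha) / log(1 - p_hit))
}

clones_needed(0.5)   # 5, since 0.5^5 = 0.03125 < 0.05
clones_needed(1/3)   # 8, since (2/3)^8 is about 0.039
```

With one sequence already in the bag, the first result corresponds to five further clones, i.e. six in total for the two-product case.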

Hope this helps

Geoff

Dr Geoffrey K. Chambers

Reader in Cell and Molecular Biosciences

School of Biological Sciences
Victoria University
PO Box 600
Wellington 6140
NEW ZEALAND

Ph:   +64-(0)4-463-6091
Fax: +64-(0)4-463-5331

E-Mail:  Geoff.Chambers@vuw.ac.nz

Please visit my Personal Home Page
on the SBS website: http://www.vuw.ac.nz/sbs

Hi Si
This may help
Tom Gilbert
Bower MA, Spencer M, Matsumura S, Nisbet RER and Howe CJ 2005. How many
clones need to be sequenced from a single forensic or ancient DNA sample
in order to determine a reliable consensus sequence? Nucleic Acids
Research 33, 8, 2549-2556.

Hi Si,

You could always plot the cumulative number of clones analysed against
the number of different sequences identified. As the curve begins to
level off (i.e. you are identifying fewer new alleles), you know that
you are approaching the maximum number of different sequences contained
in your PCR product. This does assume that all your alleles were
amplified equally well; if some amplify more readily than
others, you might miss some of the rarer ones. In any case, this is a
pretty good method for checking that you are reaching saturation of
alleles. Regards, Sophie
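
A curve of this kind takes only a few lines of R to draw. The sketch below uses simulated data purely for illustration: the five alleles, their frequencies and the 40 clones are all made up, and in practice the vector clones would hold the allele identified for each clone in the order the clones were sequenced.

```r
set.seed(42)  # arbitrary seed so the simulated example is repeatable

# Hypothetical mixture: five alleles at unequal, made-up frequencies
alleles <- c("A", "B", "C", "D", "E")
freqs   <- c(0.40, 0.30, 0.15, 0.10, 0.05)

# Simulated order in which 40 clones are sequenced
clones <- sample(alleles, size = 40, replace = TRUE, prob = freqs)

# Running count of distinct alleles seen after each clone
distinct_seen <- cumsum(!duplicated(clones))

plot(seq_along(clones), distinct_seen, type = "s",
     xlab = "Clones sequenced", ylab = "Distinct sequences found")
```

A flat stretch at the right-hand end of the curve suggests saturation, with the caveat already noted that rare alleles can still be missed.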

hi Si

I think it is basically a binomial sampling probability issue, like that
dealt with in:
Sjögren P, Wyöni P-I (1994) Conservation genetics and detection of rare
alleles in finite populations. Cons. Biol., 8(1), 267-270.
They made a program called GENESAMP I think, which will tell you how
many samples you need to detect an allele at X% occurrence with 95%
certainty.

Your problem is that you will have to do a lot more work if some of the
alleles are rarer than others.
Up to a certain complexity, we would probably use SSCP in this
circumstance, and it works well.
Sunnucks P, Wilson ACC, Beheregaray LB, et al. (2000) SSCP is not so
difficult: the application and utility of single-stranded conformation
polymorphism in evolutionary biology and molecular ecology. Molecular
Ecology, 9, 1699-1710.

Paul

hi si

am faced with a similar problem, but perhaps even
more complicated - mtDNA heteroplasmy, with an
unknown number of copies at different frequencies
in the tissue.  the frequency thing makes it even
more complicated.

one thing to be careful of - the possibility that
jump-pcr may generate recombinants, making it
look like you have more alleles than you have.

cheers
brent

Dear Si,

We wrote a paper about this a few years ago.  It was designed around
ancient DNA mixtures, but I think it addresses your problem.
I've attached a copy. We also made a web-site where you could type in
the sequence and figure out how confident you are about the sequence:
http://www.mcdonald.cam.ac.uk/projects/genetics/coco/web_aDNAv112.html
Hope this helps,

Ellen Nisbet
Department of Biochemistry
University of Cambridge

Dear Si,

See attached reprint; this formula should answer your question.

Best,
Andrew
**********************
J. Andrew DeWoody
1159 Forestry Building
Purdue University
West Lafayette, IN 47907
765-496-6109
765-496-2422 (fax)
http://www.agriculture.purdue.edu/fnr/html/faculty/DeWoody/index.html
DeWoody JA and Avise JC 2001. Genetic perspectives on the natural history
of fish mating systems. Journal of Heredity 92, 2, 167-172.

Hello Si,

This is a problem I've come across too. If I understand the problem right,
you have mixed PCR products from multiple individuals (or single PCR
products from mixed DNA samples?). Do you know the allele frequency
distribution? The simplest way to solve it would be to assume that every
allele in the mixture is different, i.e. it's an extension of the (binomial)
heterozygote problem, where the probability of sequencing a particular
arrangement of copies of each sequence is multinomial and every allele has a
1/(number of alleles) probability of being sequenced at each round of
sequencing. There is probably a simple analytical solution to this, but I
couldn't think of it so I wrote a quick simulation (R code attached), which
gives an approximate solution (table below). Looking at the plot (also
attached), the relationship between number of alleles and number of
sequences required is almost linear from 2 to 30 so a good rule of thumb
seems to be:

sequences required for 95% confidence of not missing anything
= number of alleles * 6.5 - 13

These numbers are based on 10,000 simulations, so should be pretty close to
the right answer. I'll be interested to know if someone sends you an exact
solution.

Hope this is useful!
Paul

# Function using n.sim simulations to work out how many clones would need to
# be sequenced, given a PCR that is an equal mixture of n different sequences.
# The function returns the number of sequences required for 95% confidence of
# missing none, for a range of n (allele.range).
#
# NOTE: the default values of max.sequences and n.sim arrived garbled and are
# reconstructed here; the message above reports 10,000 simulations, so pass
# n.sim=10000 to match it.

    HowManySequences<-
      function(allele.range,max.sequences=10000,n.sim=1000,conf=0.95)
      {
        sequences.required<-vector(mode='integer',length=length(allele.range))
        names(sequences.required)<-paste(allele.range,'alleles',sep='')
        last.sequences.required<-allele.range[1]
        for(j in allele.range)
        {
          mean.x<-vector(mode='numeric',length=max.sequences)
          for(n.sequences in last.sequences.required:max.sequences)
          {
            x<-
              unlist(
                lapply(
                  as.list(1:n.sim),
                  function(i)
                  {
                    trial<-rmultinom(n.sequences,1,prob=rep(1/j,j))
                    # TRUE if every allele was picked up
                    sum(apply(trial,1,sum)==0)==0
                  }))
            mean.x[n.sequences]<-mean(x)
            # stop at the first sample size that exceeds the confidence level
            if(mean.x[n.sequences]>conf) break
          }
          is.na(mean.x[mean.x<conf])<-TRUE
          sequences.required[j-allele.range[1]+1]<-which.min(mean.x-conf)
          last.sequences.required<-sequences.required[j-allele.range[1]+1]
          print(sequences.required[j-allele.range[1]+1])
        }
        plot(
          type='n',
          x=allele.range,y=sequences.required,
          axes=FALSE,
          xlab='Number of alleles',
          ylab=(paste('Number of clones required for ',100*conf,'% confidence',sep='')))
        text(x=allele.range,y=sequences.required,labels=sequences.required,cex=0.7)
        axis(1,at=allele.range,labels=allele.range)
        axis(2,at=sequences.required,labels=sequences.required)
        return(as.data.frame(cbind(allele.range,sequences.required)))
      }

    xx<-
      HowManySequences(2:30)

    # dotted line: linear fit of clones required against number of alleles
    abline(lm(sequences.required~allele.range,data=xx),lty=3)

Hi Si,
Saw your posting on Evodir.

I think your problem could be seen as a poisson distribution problem,
that depends on the frequencies in the population of each allele
you're searching for.  Since you probably don't know those
frequencies when you start, it becomes more of a statistical power
problem -- how rare of an allele you want/need to be able to detect,
given the resources you have at your disposal.

So, the allele frequency is p for some allele, which can be treated
as the probability of detection in a sample size of 1, or as the mean
number of times you'd detect it with infinite trials (sequenced
clones) of sample size 1.  With N = a finite number of trials, you
might get it once, twice, etc., or not at all.  You want the number
of trials you need to get it at least once.  The poisson distribution
describes the number of times you'll get it with a given sample
size.  The probability of sampling it at least once is the complement
of the probability of not sampling it at all.

The Poisson probability of sampling a given number k copies of your
favorite allele in N attempts is P(k) = (exp[-Np] [Np]^k)/k!.  Here,
you want the probability of not sampling any copies, which is the k=0
case.  So, P(k=0) = exp[-Np].

Therefore, the probability of sampling at least 1 copy of a
particular allele after sequencing N clones is P(k>0)= 1 -  exp[-Np].

So, stick that in a spreadsheet, and twiddle with values of N and p
to see what sample sizes you need to get P to your acceptable
threshold (you suggested 95%).  Of course, if you are searching for a
particular allele, you might well get a copy long before you reach N
sequenced clones.  But the non-copies will be other alleles in the
population, so where you stop sampling will determine the
probabilities of detecting even rarer alleles in the population.  And
the frequencies of alleles you do sample will estimate the
frequencies in the base population.

With p=0.5 (i.e., the frequency in a heterozygote), you need N=6 to
get a 95% probability of detecting a given allele - that fits with your
experience (N=5 gives a 92% chance).
With p=0.05 (i.e., a moderately rare allele), you need N=60 to get a
95% probability of detecting it.
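
Instead of a spreadsheet, the same twiddling can be done in R. This is a minimal sketch with made-up function names; it also includes the exact binomial value 1 - (1-p)^N for comparison, since the Poisson form is an approximation to it.

```r
# Poisson approximation: chance of seeing an allele of frequency p
# at least once among N sequenced clones
p_detect_pois <- function(N, p) 1 - exp(-N * p)

# Exact binomial counterpart, for comparison
p_detect_binom <- function(N, p) 1 - (1 - p)^N

# Smallest N reaching the target probability under the Poisson approximation
n_for_target <- function(p, target = 0.95) ceiling(-log(1 - target) / p)

n_for_target(0.5)          # 6
n_for_target(0.05)         # 60
p_detect_pois(5, 0.5)      # about 0.918
p_detect_binom(60, 0.05)   # about 0.954
```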

So, give that logic the 24 hr test --  see if it makes sense both
today and tomorrow.
-Adam

  > Adam Porter
  > Department of Plant, Soil and Insect Sciences &
  >   Graduate Program in Organismic & Evolutionary Biology
  > Hatch Laboratory Bldg rm 13
  > 140 Holdsworth Way
  > University of Massachusetts
  > Amherst MA 01003-9320
  > USA
  >
  > fx: (413) 545-0231
  > aporter@ent.umass.edu
  > http://www-unix.oit.umass.edu/~aporter/

Si,

I don't have a direct answer to your question, but you need to consider why
you're seeing multiple variants.  It could very well be an artifact of PCR
recombination, so you need to be very careful that the variants you are
recovering from cloning are biologically meaningful.  Google scholar on 'pcr
recombination' turns up a lot of information on this phenomenon.  Here's a
recent biotechniques article on it:

Detection of high levels of recombination generated during PCR amplification
of RNA templates
Wayne Yu, Karl J. Rusterholtz, Amber T. Krummel, and Niles Lehman
BioTechniques Vol. 40, No. 4: pp 499-507 (Apr 2006)

Good luck!

Jamie
Grad student
Ecology & Evolution
Cornell University

In probability theory that is a famous problem called the Coupon Collector's
Problem or the Classical Occupancy Problem.  Assuming that there are cereal
boxes with coupons in them, and there are N of them and they are equiprobable,
how many boxes do we have to buy to have a probability P of collecting all N
coupons?  And conversely, what is the expected number of boxes we have to buy
if we keep going until we get at least one of all N coupons?

It is in many probability textbooks.  A Google search on the phrase
"coupon collector's problem" gets 598 hits, "classical occupancy problem"
gets 797.
     It is covered in William Feller's famous probability textbook:
Feller, W.  1968.  An Introduction to Probability Theory and Its Applications.
     3rd edition. John Wiley and Sons, New York.
where it will be found on pages 101-102.  The formula is tiresome.
Here is a link to a copy of the corresponding pages for an earlier edition:
     http://www25.brinkster.com/ranmath/problems/feller/occ01.htm
and here is an exposition of the problem on the web:
       http://gpu.sourceforge.net/gpu_p2p/node12.html
A useful way to do it numerically is recursive.  We maintain an array A(n,k)
which gives the probability that after buying  n  cereal boxes we have  k
different coupons.  Initially set  A(0,0) = 1  and all other A(i,j) = 0.
Then if there are N coupons in all, we have the recursion:

           A(n,k)  =    (1-(k-1)/N) A(n-1, k-1)  +  (k/N) A(n-1, k)

and we use this to compute the A(n,k) for successively larger numbers of
n.  The probability you want is A(n,N).
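
The recursion translates almost directly into R. The following is a minimal sketch, not a tested library routine: the function names and the n_max cut-off are made up for illustration, and because R indexes from 1 the array entry A[n + 1, k + 1] stores A(n,k).

```r
# Probability that all N equally frequent alleles have been seen,
# for n = 1..n_max sequenced clones, using the recursion above.
prob_all_seen <- function(N, n_max) {
  A <- matrix(0, nrow = n_max + 1, ncol = N + 1)
  A[1, 1] <- 1                                  # A(0,0) = 1
  for (n in 1:n_max) {
    for (k in 1:min(n, N)) {
      A[n + 1, k + 1] <- (1 - (k - 1) / N) * A[n, k] + (k / N) * A[n, k + 1]
    }
  }
  A[-1, N + 1]                                  # A(n,N) for n = 1..n_max
}

# Smallest n with A(n,N) at or above the target probability
clones_for_all <- function(N, target = 0.95, n_max = 200) {
  which(prob_all_seen(N, n_max) >= target)[1]
}

sapply(2:5, clones_for_all)   # 6 11 16 21
```

These values can be compared with the simulated figures tabulated elsewhere in this collation.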

J.F.
----
Joe Felsenstein         joe@gs.washington.edu
   Department of Genome Sciences and Department of Biology,
   University of Washington, Box 355065, Seattle, WA 98195-5065 USA

Output of Paul Johnson's simulation (see his message and R function above):

            allele.range sequences.required
2alleles             2                  6
3alleles             3                 11
4alleles             4                 16
5alleles             5                 21
6alleles             6                 27
7alleles             7                 33
8alleles             8                 39
9alleles             9                 45
10alleles           10                 50
11alleles           11                 57
12alleles           12                 64
13alleles           13                 69
14alleles           14                 77
15alleles           15                 84
16alleles           16                 90
17alleles           17                 96
18alleles           18                104
19alleles           19                111
20alleles           20                116
21alleles           21                125
22alleles           22                131
23alleles           23                139
24alleles           24                145
25alleles           25                152
26alleles           26                158
27alleles           27                167
28alleles           28                174
29alleles           29                182
30alleles           30                187

_____________________________________________________
Paul Johnson
Robertson Centre for Biostatistics
Level 11, Boyd Orr Building
University of Glasgow
University Avenue
Glasgow G12 8QQ, UK
Tel/fax: +44 141 330 4744/5094
paulj@stats.gla.ac.uk
http://www.stats.gla.ac.uk/~paulj/index.html

I use a back-of-the-envelope calculation based on the binomial
distribution:

If a rare (allele, sequence) is present in the population at 10%
(p=0.1) then I have an 88% chance of detecting it at least once if I
examine 20 clones (p=0.1, q=0.9, n=20, prob=sum(prob | x (for
successes)=1 to 20)).  See attached spreadsheet.  You can
arbitrarily pick a rate, p, for rare alleles depending on where the
sample comes from, or you could get your rate of rare alleles from
previous screens.  In dengue, other researchers have detected
variants amongst clones of a PCR from a serum sample from a single
host at a rate of approximately 5%.
I've used this in several grant applications and papers (e.g.,
bottom of page 2).

Shannon
Bennett SN, Holmes EC, Chirivella M, Rodriguez DM, Beltran M, Vorndam V,
Gubler DJ and McMillan WO 2003. Selection-driven evolution of emergent
dengue virus. Mol. Biol. Evol. 20 10 1650-1658.

Dear Si Creer,

The theoretical number of colonies depends on the number of alleles,
as you indicated, but also on their frequency in the population of
clones.  Both of these are unknown.  If they were known, you could
use a multinomial distribution to get at the sampling probabilities.

For example, in the simple case of two alleles, with one rare (0.1)
and one common (0.9) in the population of clones, you can use the
binomial distribution to calculate what the probability of detecting
the rare allele is for the number of clones sequenced.  In this
example, if you sequenced 10 clones, you would have a 65.1% chance of
encountering the rare allele.  If you only sequenced 5 clones, you
would only have a 41% chance of detecting the rare allele.  In R:

  > 1-dbinom(0, size=10, prob=0.1)
[1] 0.6513216
  > 1-dbinom(0, size=5, prob=0.1)
[1] 0.40951
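
If frequencies for several alleles are known or assumed, the multinomial version of the same calculation can be approximated by simulation. The sketch below is illustrative only: the function name, the number of simulations and the three-allele frequencies are made up.

```r
set.seed(1)  # arbitrary seed for repeatability

# Estimated probability that every allele appears at least once among
# n clones, given allele frequencies 'freqs' (which should sum to 1)
p_all_detected <- function(n, freqs, n.sim = 20000) {
  # one column per simulated cloning experiment, one row per allele
  counts <- rmultinom(n.sim, size = n, prob = freqs)
  mean(colSums(counts == 0) == 0)
}

p_all_detected(10, c(0.9, 0.1))        # close to the 0.651 obtained above
p_all_detected(20, c(0.6, 0.3, 0.1))   # made-up three-allele mixture
```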

In your case you are dealing with determining the number of alleles
empirically.  You could use methods developed for sampling effort
(species accumulation curves for species richness) to determine
whether the number of alleles is approaching an asymptote.

I hope that helps.

Regards, Alex

Hi Si
I don't know of any formula (and I would be interested to see if anyone
else has one), but I have pondered the same question. I work on MHC
genes, and currently work in a system where I amplify up to 4 sequences
per PCR.  I have compared cloning with SSCP and DGGE, and found that
screening 20 clones usually gave me all the sequences that were resolved
by DGGE - although occasionally I did resolve extra alleles on the DGGE
that I didn't find amongst the clones even after screening 20 of them,
so 20 may not always be enough.  I ended up screening the clones by
re-amplifying the insert from them and running them on an SSCP gel, then
just sequencing the variants rather than all 20.
Would you mind sending me any other answers to this question?
Kind regards,
Hilary Miller

FRST NZ Science and Technology Postdoctoral fellow
Allan Wilson Centre for Molecular Ecology and Evolution
School of Biological Sciences
Victoria University of Wellington
PO Box 600
Wellington, New Zealand
Ph: +64 4 463 7443
Fax: +64 4 463 5331
email: hilary.miller@vuw.ac.nz
http://www.vuw.ac.nz/sbs/tuatara/PeoplePages/HilaryMiller2007.aspx

Hi Si

I think this is straightforward when the number of copies are equal.
Ask: "what is the chance of missing a sequence after n samplings with
i alleles?".  When n is 5, I think you still have a 6.25% chance of
missing one of two alleles (0.5^4).
P = 0.03 (0.5^5) for n = 6 when i = 2
P = 0.04 (0.667^8) for n = 10 when i = 3
P = 0.04 (0.75^11) for n = 14 when i = 4
P = 0.04 (0.8^14) for n = 18 when i = 5
P = 0.045 (0.83^17) for n = 22 when i = 6
P = 0.046 (0.86^20) for n = 26 when i = 7

So at the 5% level, you need to sequence 3-4 times as many clones as
you have sequences (but it gets a bit bigger as i increases).
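
As a cross-check, the chance of missing at least one of i equally frequent sequences after n clones can also be computed exactly by inclusion-exclusion. This is a minimal sketch with made-up function names, and it relies on the same equal-copy-number assumption; under it, the two-allele figure agrees with the value above, while the three-allele case at n = 10 comes out just over 5%, so the approximate figures may be slightly optimistic.

```r
# Exact probability that all i equally frequent sequences are seen in n clones:
# sum over k of (-1)^k * choose(i, k) * (1 - k/i)^n
p_all_seen <- function(i, n) {
  k <- 0:i
  sum((-1)^k * choose(i, k) * (1 - k / i)^n)
}

# Probability of missing at least one sequence
p_miss <- function(i, n) 1 - p_all_seen(i, n)

p_miss(2, 6)    # 0.03125
p_miss(3, 10)   # about 0.052
p_miss(3, 11)   # about 0.035
```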

It is more complicated when the copy number varies, as it involves
drawing from a frequency spectrum.  Let me know what you find out,
since I need to think about this in sequencing cDNAs.

cheers, graham wallis

Hi Si,

I will try from my other mail. The title is "Mark-recapture cloning: a
straightforward and cost-effective cloning method for population
genetics of single-copy nuclear DNA sequences in diploids", published in
Molecular Ecology Notes (Bierne et al.), so you should not have any
problem finding it if this attachment doesn't work either...

Good luck,
Petri
Bierne N et al 2007. Mark-recapture cloning: a straightforward and
cost-effective cloning method for population genetics of single-copy
nuclear DNA sequences in diploids. Mol Ecol Notes
doi:10.1111/j.1471-8286.2007.01685.x

Dear Si:

Regarding the calculation for a mixture of three products or above;

The fractional representation of what you want if all are in equal
numbers is 0.33.

So p(hit) = 0.33 and p(miss) = 1- p(hit) = 0.66

So here you need to do (0.66)^n until n is big enough that the value < 0.05

I think you can see that you will need a bigger n than for the two
product case

You can find the exact formula using the ratio of natural logs etc. in
most Mol Biol texts and lab guides under screening libraries.  I don't
have an exact reference to hand.

Hope this fills the concept gap

Cheers

Geoff

Hi Si,

I think this is the same problem that's addressed by rarefaction curves
in ecology.  There the question is, How many species are there if I've
sampled X times and observed Y distinct species.  The rate of increase
in Y must drop with increasing X and this theory estimates the asymptote
of Y as X goes to infinity.

I don't particularly know the literature but wikipedia, google or
similar are probably fine places to start.  Nor do I think I'm the first
person to recognize the applicability of this theory to exactly your
problem.  Hope this helps!

Best,
Dan

---------------------------------
Daniel M. Weinreich
Department of Ecology and Evolutionary Biology, and
     Center for Computational Molecular Biology
Brown University Box G-W, Providence, RI 02912
For FedEx add: Sydney Frank Hall/LSB 157, 60 Olive Street
Office phone: 401/863-3937 lab -2749 fax -2166
http://www.brown.edu/Departments/EEB/weinreich/weinreichindex.htm
http://www.brown.edu/Research/CCMB/

Hi Simon,

I don't really have the answer to your question - but just wanted to
point your attention to the 'jumping pcr'-problem in cloning since the
extent of that phenomenon in your samples will influence heavily how
many colonies to sequence (a paper on the subject is attached). You
probably know all about it. I didn't when I started cloning recently -
and in some cases had to sequence 10 - 15 colonies instead of the
intended 5 to pick up two alleles (since about 25% were false
recombinants)......

Good luck.

Best wishes,

Tove
Bradley RD and Hillis DM 1997 Recombinant DNA sequences generated by PCR
amplification. Mol. Biol. Evol. 14, 5, 592-593.

-- 
Si Creer
Post Doctoral Research Fellow
Molecular Ecology and Fisheries Genetics Group
School of Biological Sciences
University Wales, Bangor
Bangor
Gwynedd
LL57 2UW
UK

e-mail: s.creer@bangor.ac.uk
Tel: +1248 382302
Fax: +1248 371644
Home Page: http://biology.bangor.ac.uk/~bssa0d/

"S.Creer" <bssa0d@bangor.ac.uk>