Dear All

I would like to thank everyone who contributed to the huge response regarding my recent query, "How many clones should I sequence?", reproduced below:

"When cloning and identifying individual alleles of mixed PCR products, a key question is how many clones do I have to sequence to ensure that I have approached a good probability (e.g. 95%) that I have identified all different copies. In the past when I have been isolating two heterozygotic copies of a nuclear marker, simply sequencing five colonies always identified the two variants. But how many colonies would I have to sequence if there were 3, 4, 5, ... 20 etc.?"

Many solutions have been supplied and I have taken the liberty of reproducing most contributions below. Some favour experimental approaches such as SSCP; others favour standard probability solutions, binomial calculations and R scripts, the Poisson distribution, accumulation curves and online calculators, and there were repeated warnings about recombinant cloning issues. Thank you for the comprehensive response and please take what you would like from the following collation. Sincere apologies to any contributors that I may not have replied to directly, and please note that some contributions have not been reposted due to txt/.html conversion issues, duplication or attachment problems. Hopefully, we should now be better able to gauge how many sequencing reactions to expend on complex cloning mixtures.

With very best wishes and thanks
Si

Dear Si Creer,

If the two copies are present in equal numbers in your PCR mix, this is easy to calculate. The probability of obtaining sequences of both copies is (1 - probability that all sequenced clones are identical): 1 - 2*(0.5)^n, with n = the number of sequenced clones. If you want to have 0.95 probability of having sequenced both copies, 1 - 2*(0.5)^n = 0.95 -> n = 5.3. This means that you will have to sequence 6 clones to have a >95% probability. With 5 sequenced clones, the probability of having both copies is 1 - 2*(0.5)^5 = 0.9375.
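As a hedged illustration of the calculation above, the following R sketch tabulates 1 - 2*(0.5)^n and finds the smallest n reaching a target probability. It assumes, as the message does, that both copies are equally represented among the clones; the names p_both, target, n_min and n_exact are illustrative and not part of the original message.

```r
# Probability that n clones drawn from an equal two-allele mixture
# include both alleles (assumption: equal amplification and cloning).
p_both <- function(n) 1 - 2 * 0.5^n

n <- 1:8
print(data.frame(clones = n, p_both = p_both(n)))

# Smallest whole number of clones that reaches the target probability.
target <- 0.95
n_min <- min(n[p_both(n) >= target])

# The same threshold solved with logs: 1 - 2 * 0.5^n = target.
n_exact <- log((1 - target) / 2) / log(0.5)

print(c(n_min = n_min, n_exact = n_exact))
```

Changing target shows how quickly the required number of clones grows when more certainty is demanded.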
Duur

Duur Aanen
Laboratory of Genetics
Wageningen University and Research Center
The Netherlands
Tel. +31(0)317 482706
Mobile: +31 (0)6 10327948
Fax: +31 (0)317 483146
http://www.gen.wur.nl/uk/staff/postdocs/duur+aanen/

Dear Si,

Calculating the probability of missing one of the two alleles, under the assumption that the two alleles are amplified and cloned with the same probability, is very easy. It is like picking white and black balls out of a sack (with replacement). As I remember it, the number of clones is six if you want to go under a 5% probability of missing one of the two alleles in a heterozygote. But you can calculate this based on the ploidy of your system. But do the assumptions hold? I am not sure. This is a starting point, nevertheless.

cheers
francesco

++++++++++++++++++++++++
Francesco Nardi, Dr.
Dept. of Evolutionary Biology, University of Siena
Master in Bioinformatics, University of Siena
via Aldo Moro 2 - 53100 Siena
Italy
Ph.: +39.0577.234420 (lab. 4398)
Fax.: +39.0577.234476

Here is a quick and easy way:

Assume you have a diploid organism, and thus two PCR products in equal or near-equal proportions, and that you have one sequence already in the bag. Then the chance that you pull the wrong one (probability of a miss) next time is 0.5, and the chance that you do so the next time as well is 0.5 x 0.5, and so on. As you have set the acceptable probability of a miss at 0.05, all you have to do is keep pushing the buttons on your calculator to multiply 0.5 by itself until the answer on the display is <0.05. If you have kept count of the number of times you pressed the button, then that is how many clones you need to sequence.

This method works for any probability level or any fractional representation of what you are after in the mixture. It is a standard library screening question and can be solved with a fancy formula using logs and other horrid stuff. My way is easier if you can keep count.

Hope this helps
Geoff

Dr Geoffrey K. Chambers
Reader in Cell and Molecular Biosciences
School of Biological Sciences
Victoria University
PO Box 600
Wellington 6140
NEW ZEALAND
Ph: +64-(0)4-463-6091
Fax: +64-(0)4-463-5331
E-Mail: Geoff.Chambers@vuw.ac.nz
Please visit my Personal Home Page on the SBS website: http://www.vuw.ac.nz/sbs

Hi Si

This may help

Tom Gilbert

Bower MA, Spencer M, Matsumura S, Nisbet RER and Howe CJ 2005. How many clones need to be sequenced from a single forensic or ancient DNA sample in order to determine a reliable consensus sequence? Nucleic Acids Research 33(8), 2549-2556.

Hi Si,

You could always plot the cumulative number of clones analysed against the number of different sequences identified. As the curve begins to level off (i.e. you are identifying fewer new alleles), you know that you are approaching the maximum number of different sequences contained in your PCR product. This does assume that all your alleles were amplified equally well; if some amplify more readily than others, you might miss some of the rarer ones. In any case, this is a pretty good method for checking that you are reaching saturation of alleles.

Regards,
Sophie

hi Si

I think it is basically a binomial sampling probability issue, like that dealt with in:

Sjögren P, Wyöni P-I (1994) Conservation genetics and detection of rare alleles in finite populations. Cons. Biol., 8(1), 267-270.

They made a program called GENESAMP, I think, which will tell you how many samples you need to detect an allele at X% occurrence with 95% certainty. Your problem is that you will have to do a lot more work if some of the alleles are rarer than others.

Up to a certain complexity, we would probably use SSCP in this circumstance, and it works well.

Sunnucks P, Wilson ACC, Beheregaray LB, et al. (2000) SSCP is not so difficult: the application and utility of single-stranded conformation polymorphism in evolutionary biology and molecular ecology. Molecular Ecology, 9, 1699-1710.
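The accumulation-curve check suggested in Sophie's message above can be tried out on simulated data before committing sequencing reactions. The R sketch below is only an illustration under stated assumptions: the allele frequencies, the number of clones (40) and the random seed are invented for the example, and clones are drawn independently with replacement.

```r
set.seed(1)

# Hypothetical, unequal allele frequencies in the pool of clones.
allele_freqs <- c(0.40, 0.30, 0.15, 0.10, 0.05)

# Simulate the order in which 40 sequenced clones come back.
clones <- sample(seq_along(allele_freqs), size = 40,
                 replace = TRUE, prob = allele_freqs)

# Number of distinct alleles seen after each successive clone:
# a clone adds 1 only the first time its allele appears.
distinct_so_far <- cumsum(!duplicated(clones))

plot(seq_along(clones), distinct_so_far, type = "s",
     xlab = "Cumulative number of clones sequenced",
     ylab = "Number of different sequences identified")
```

With real data, clones would simply be the allele label assigned to each clone in the order sequenced. A curve that has been flat for many clones suggests saturation, although rare alleles can still be missed.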
Paul

hi si

am faced with a similar problem, but perhaps even more complicated - mtDNA heteroplasmy, with an unknown number of copies at different frequencies in the tissue. the frequency thing makes it even more complicated. one thing to be careful of - the possibility that jump-pcr may generate recombinants, making it look like you have more alleles than you have.

cheers
brent

Dear Si,

We wrote a paper about this a few years ago. It was designed around ancient DNA mixtures, but I think it addresses your problem. I've attached a copy. We also made a web-site where you can type in the sequence and figure out how confident you are about the sequence:
http://www.mcdonald.cam.ac.uk/projects/genetics/coco/web_aDNAv112.html

Hope this helps,
Ellen Nisbet
Department of Biochemistry
University of Cambridge

Dear Si,

See attached reprint; this formula should answer your question.

Best,
Andrew

**********************
J. Andrew DeWoody
1159 Forestry Building
Purdue University
West Lafayette, IN 47907
765-496-6109
765-496-2422 (fax)
http://www.agriculture.purdue.edu/fnr/html/faculty/DeWoody/index.html

DeWoody JA and Avise JC 2001. Genetic perspectives on the natural history of fish mating systems. Journal of Heredity 92(2), 167-172.

Hello Si,

This is a problem I've come across too. If I understand the problem right, you have mixed PCR products from multiple individuals (or single PCR products from mixed DNA samples?). Do you know the allele frequency distribution? The simplest way to solve it would be to assume that every allele in the mixture is different, i.e. it's an extension of the (binomial) heterozygote problem, where the probability of sequencing a particular arrangement of copies of each sequence is multinomial and every allele has a 1/(number of alleles) probability of being sequenced at each round of sequencing.
There is probably a simple analytical solution to this, but I couldn't think of it so I wrote a quick simulation (R code attached), which gives an approximate solution (table below). Looking at the plot (also attached), the relationship between number of alleles and number of sequences required is almost linear from 2 to 30, so a good rule of thumb seems to be:

sequences required for 95% confidence of not missing anything = number of alleles * 6.5 - 13

These numbers are based on 10,000 simulations, so should be pretty close to the right answer. I'll be interested to know if someone sends you an exact solution.

Hope this is useful!
Paul

# Function using n.sim simulations to work out how many clones would need to be
# sequenced given a PCR that is an equal mixture of n different sequences. The
# function returns the number of sequences required for 95% confidence of
# missing none for a range of n (allele.range).
# allele.range must be a run of consecutive integers, e.g. 2:30.
# (Use n.sim=10000 for the 10,000 simulations quoted above.)
HowManySequences <- function(allele.range, max.sequences=10000, n.sim=1000, conf=0.95)
{
  sequences.required <- vector(mode='integer', length=length(allele.range))
  names(sequences.required) <- paste(allele.range, 'alleles', sep='')
  last.sequences.required <- allele.range[1]
  for(j in allele.range)
  {
    mean.x <- vector(mode='numeric', length=max.sequences)
    # Start the search where the previous (smaller) number of alleles ended
    for(n.sequences in last.sequences.required:max.sequences)
    {
      x <- unlist(lapply(as.list(1:n.sim), function(i)
      {
        # One column per sequenced clone, one row per allele
        trial <- rmultinom(n.sequences, 1, prob=rep(1/j, j))
        sum(apply(trial, 1, sum)==0)==0   # TRUE if every allele was picked up
      }))
      mean.x[n.sequences] <- mean(x)
      if(mean.x[n.sequences] > conf) break
    }
    # Discard sample sizes that fall short of conf, then keep the first that reaches it
    is.na(mean.x[mean.x < conf]) <- TRUE
    sequences.required[j-allele.range[1]+1] <- which.min(mean.x-conf)
    last.sequences.required <- sequences.required[j-allele.range[1]+1]
    print(sequences.required[j-allele.range[1]+1])
  }
  plot(type='n', x=allele.range, y=sequences.required, axes=FALSE,
       xlab='Number of alleles',
       ylab=(paste('Number of clones required for ', 100*conf, '% confidence', sep='')))
  text(x=allele.range, y=sequences.required, labels=sequences.required, cex=0.7)
  axis(1, at=allele.range, labels=allele.range)
  axis(2, at=sequences.required, labels=sequences.required)
  return(as.data.frame(cbind(allele.range, sequences.required)))
}

xx <- HowManySequences(2:30)
abline(lm(sequences.required~allele.range, data=xx), lty=3)

Hi Si,

Saw your posting on Evodir. I think your problem could be seen as a Poisson distribution problem that depends on the frequencies in the population of each allele you're searching for. Since you probably don't know those frequencies when you start, it becomes more of a statistical power problem -- how rare of an allele you want/need to be able to detect, given the resources you have at your disposal.

So, the allele frequency is p for some allele, which can be treated as the probability of detection in a sample size of 1, or as the mean number of times you'd detect it with infinite trials (sequenced clones) of sample size 1. With N = a finite number of trials, you might get it once, twice, etc., or not at all. You want the number of trials you need to get it at least once. The Poisson distribution describes the number of times you'll get it with a given sample size. The probability of sampling it at least once is the complement of the probability of not sampling it at all.

The Poisson probability of sampling a given number k copies of your favorite allele in N attempts is
P(k) = (exp[-Np] [Np]^k)/k!.
Here, you want the probability of not sampling any copies, which is the k=0 case. So,
P(k=0) = exp[-Np].
Therefore, the probability of sampling at least 1 copy of a particular allele after sequencing N clones is
P(k>0) = 1 - exp[-Np].

So, stick that in a spreadsheet, and twiddle with values of N and p to see what sample sizes you need to get P to your acceptable threshold (you suggested 95%). Of course, if you are searching for a particular allele, you might well get a copy long before you reach N sequenced clones.
But the non-copies will be other alleles in the population, so where you stop sampling will determine the probabilities of detecting even rarer alleles in the population. And the frequencies of alleles you do sample will estimate the frequencies in the base population.

With p=0.5 (i.e., the frequency in a heterozygote), you need N=6 to get a 95% frequency of detecting both alleles - that fits with your experience (N=5 gives a 91% chance). With p=0.05 (i.e., a moderately rare allele), you need N=60 to get a 95% frequency of detecting it.

So, give that logic the 24 hr test -- see if it makes sense both today and tomorrow.

-Adam

Adam Porter
Department of Plant, Soil and Insect Sciences &
Graduate Program in Organismic & Evolutionary Biology
Hatch Laboratory Bldg rm 13
140 Holdsworth Way
University of Massachusetts
Amherst MA 01003-9320
USA

fx: (413) 545-0231
aporter@ent.umass.edu
http://www-unix.oit.umass.edu/~aporter/

Si,

I don't have a direct answer to your question, but you need to consider why you're seeing multiple variants. It could very well be an artifact of PCR recombination, so you need to be very careful that the variants you are recovering from cloning are biologically meaningful. Google Scholar on 'pcr recombination' turns up a lot of information on this phenomenon. Here's a recent BioTechniques article on it:

Detection of high levels of recombination generated during PCR amplification of RNA templates
Wayne Yu, Karl J. Rusterholtz, Amber T. Krummel, and Niles Lehman
BioTechniques Vol. 40, No. 4: pp 499-507 (Apr 2006)

Good luck!
Jamie
Grad student
Ecology & Evolution
Cornell University

In probability theory that is a famous problem called the Coupon Collector's Problem or the Classical Occupancy Problem. Assuming that there are cereal boxes with coupons in them, and there are N of them and they are equiprobable, how many boxes do we have to buy to have a probability P of collecting all N coupons?
And conversely, what is the expected number of boxes we have to buy if we keep going until we get at least one of all N coupons?

It is in many probability textbooks. A Google search on the phrase "coupon collector's problem" gets 598 hits, "classical occupancy problem" gets 797. It is covered in William Feller's famous probability textbook:

Feller, W. 1968. An Introduction to Probability Theory and Its Applications. 3rd edition. John Wiley and Sons, New York.

where it will be found on pages 101-102. The formula is tiresome. Here is a link to a copy of the corresponding pages for an earlier edition:
http://www25.brinkster.com/ranmath/problems/feller/occ01.htm
and here is an exposition of the problem on the web:
http://gpu.sourceforge.net/gpu_p2p/node12.html

A useful way to do it numerically is recursive. We maintain an array A(n,k) which gives the probability that after buying n cereal boxes we have k different coupons. Initially set A(0,0) = 1 and all other A(i,j) = 0. Then if there are N coupons in all, we have the recursion:

A(n,k) = (1-(k-1)/N) A(n-1, k-1) + (k/N) A(n-1, k)

and we use this to compute the A(n,k) for successively larger values of n. The probability you want is A(n,N).

J.F.
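The recursion above can be sketched in a few lines of R. This is an illustrative implementation rather than code from the original message: the function names prob_all_seen and clones_needed, and the max_clones safety limit, are assumptions made for the example, and all N alleles are taken to be equally frequent.

```r
# A(n, N): probability that n clones contain all N equally frequent alleles,
# computed with A(n,k) = (1-(k-1)/N) A(n-1,k-1) + (k/N) A(n-1,k).
prob_all_seen <- function(n_clones, N) {
  A <- c(1, rep(0, N))           # A[k + 1] stores A(n, k); this is n = 0
  k <- 1:N
  for (n in seq_len(n_clones)) {
    new_A <- numeric(N + 1)      # A(n, 0) = 0 once a clone has been drawn
    new_A[k + 1] <- (1 - (k - 1) / N) * A[k] + (k / N) * A[k + 1]
    A <- new_A
  }
  A[N + 1]
}

# Smallest number of clones giving at least `conf` probability of seeing
# every allele. max_clones is only a guard against an endless loop.
clones_needed <- function(N, conf = 0.95, max_clones = 10000) {
  n <- N
  while (prob_all_seen(n, N) < conf) {
    n <- n + 1
    if (n > max_clones) stop("conf not reached within max_clones")
  }
  n
}

print(sapply(2:10, clones_needed))
```

For 2, 3 and 4 equally frequent alleles this gives 6, 11 and 16 clones at the 95% level. Unequal allele frequencies would need a different calculation, since the recursion relies on every allele having probability 1/N.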
----
Joe Felsenstein joe@gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA

          allele.range  sequences.required
2alleles             2                   6
3alleles             3                  11
4alleles             4                  16
5alleles             5                  21
6alleles             6                  27
7alleles             7                  33
8alleles             8                  39
9alleles             9                  45
10alleles           10                  50
11alleles           11                  57
12alleles           12                  64
13alleles           13                  69
14alleles           14                  77
15alleles           15                  84
16alleles           16                  90
17alleles           17                  96
18alleles           18                 104
19alleles           19                 111
20alleles           20                 116
21alleles           21                 125
22alleles           22                 131
23alleles           23                 139
24alleles           24                 145
25alleles           25                 152
26alleles           26                 158
27alleles           27                 167
28alleles           28                 174
29alleles           29                 182
30alleles           30                 187

_____________________________________________________
Paul Johnson
Robertson Centre for Biostatistics
Level 11, Boyd Orr Building
University of Glasgow
University Avenue
Glasgow G12 8QQ, UK
Tel/fax: +44 141 330 4744/5094
paulj@stats.gla.ac.uk
http://www.stats.gla.ac.uk/~paulj/index.html

I use a back-of-the-envelope calculation based on the binomial distribution: if a rare (allele, sequence) is present in the population at 10% (p=0.1), then I have an 88% chance of detecting it at least once if I examine 20 clones (p=0.1, q=0.9, n=20, prob = sum of prob(x) for x successes = 1 to 20). See attached spreadsheet. You can arbitrarily pick a rate, p, for rare alleles depending on where the sample comes from, or you could get your rate of rare alleles from previous screens. In dengue, other researchers have detected variants amongst clones of a PCR from a serum sample from a single host at a rate of approximately 5%. I've used this in several grant applications and papers (e.g., bottom of page 2).

Shannon

Bennett SN, Holmes EC, Chirivella M, Rodriguez DM, Beltran M, Vorndam V, Gubler DJ and McMillan WO 2003. Selection-driven evolution of emergent dengue virus. Mol. Biol. Evol. 20(10), 1650-1658.
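The back-of-the-envelope binomial calculation above can be reproduced without a spreadsheet. The R sketch below is an illustration only: it assumes clones are independent draws and that the rare variant has a known frequency p among them, and the function names p_detect and clones_for_detection are invented for the example.

```r
# Chance of seeing a variant of frequency p at least once among n clones:
# the complement of missing it n times in a row.
p_detect <- function(n, p) 1 - (1 - p)^n

# Smallest n for which that chance reaches `target`
# (the "formula using logs" for library screening).
clones_for_detection <- function(p, target = 0.95) {
  ceiling(log(1 - target) / log(1 - p))
}

print(p_detect(20, 0.1))                 # about 0.88, as quoted above
print(1 - dbinom(0, size = 20, prob = 0.1))  # same value via the binomial
print(clones_for_detection(c(0.5, 0.1, 0.05)))
```

Note that this answers a narrower question than the original query: it gives the chance of detecting one particular variant, not the chance of recovering every variant in the mixture.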
Dear Si Creer,

The theoretical number of colonies depends on the number of alleles, as you indicated, but also on their frequency in the population of clones. Both of these are unknown. If they were known, you could use a multinomial distribution to get at the sampling probabilities.

For example, in the simple case of two alleles, with one rare (0.1) and one common (0.9) in the population of clones, you can use the binomial distribution to calculate the probability of detecting the rare allele for a given number of clones sequenced. In this example, if you sequenced 10 clones, you would have a 65.1% chance of encountering the rare allele. If you only sequenced 5 clones, you would only have a 41% chance of detecting the rare allele. In R:

> 1-dbinom(0, size=10, prob=0.1)
[1] 0.6513216
> 1-dbinom(0, size=5, prob=0.1)
[1] 0.40951

In your case you are dealing with determining the number of alleles empirically. You could use methods developed for sampling effort (species accumulation curves for species richness) to determine whether the number of alleles is approaching an asymptote.

I hope that helps.

Regards,
Alex

Hi Si

I don't know of any formula (and I would be interested to see if anyone else has one), but I have pondered the same question. I work on MHC genes, and currently work in a system where I amplify up to 4 sequences per PCR. I have compared cloning with SSCP and DGGE, and found that screening 20 clones usually gave me all the sequences that were resolved by DGGE - although occasionally I did resolve extra alleles on the DGGE that I didn't find amongst the clones even after screening 20 of them, so 20 may not always be enough. I ended up screening the clones by re-amplifying the insert from them and running them on an SSCP gel, then just sequencing the variants rather than all 20.

Would you mind sending me any other answers to this question?
Kind regards,
Hilary Miller

FRST NZ Science and Technology Postdoctoral fellow
Allan Wilson Centre for Molecular Ecology and Evolution
School of Biological Sciences
Victoria University of Wellington
PO Box 600
Wellington, New Zealand
Ph: +64 4 463 7443
Fax: +64 4 463 5331
email: hilary.miller@vuw.ac.nz
http://www.vuw.ac.nz/sbs/tuatara/PeoplePages/HilaryMiller2007.aspx

Hi Si

I think this is straightforward when the number of copies are equal. Ask: "what is the chance of missing a sequence after n samplings with i alleles?". When n is 5, I think you still have a 6.25% chance of missing one of two alleles (0.5^4).

P = 0.03 (0.5^5) for n = 6 when i = 2
P = 0.04 (0.667^8) for n = 10 when i = 3
P = 0.04 (0.75^11) for n = 14 when i = 4
P = 0.04 (0.8^14) for n = 18 when i = 5
P = 0.045 (0.83^17) for n = 22 when i = 6
P = 0.046 (0.86^20) for n = 26 when i = 7

So at the 5% level, you need to sequence 3-4 times as many clones as you have sequences (but it gets a bit bigger as i increases).

It is more complicated when the copy number varies, as it involves drawing from a frequency spectrum. Let me know what you find out, since I need to think about this in sequencing cDNAs.

cheers, graham wallis

Hi Si,

I will try from my other mail. The title is "Mark-recapture cloning: a straightforward and cost-effective cloning method for population genetics of single copy nuclear sequences in diploids", published in Molecular Ecology Notes (Bierne et al.), so you should not have any problem finding it if this attachment doesn't work either...

Good luck,
Petri

Bierne N et al 2007. Mark-recapture cloning: a straightforward and cost-effective cloning method for population genetics of single-copy nuclear DNA sequences in diploids. Mol Ecol Notes doi:10.1111/j.1471-8286.2007.01685.x

Dear Si:

Regarding the calculation for a mixture of three products or above: the fractional representation of what you want, if all are in equal numbers, is 0.33.
So p(hit) = 0.33 and p(miss) = 1 - p(hit) = 0.66.

So here you need to do (0.66)^n until n is big enough that the value is < 0.05.

I think you can see that you will need a bigger n than for the two-product case.

You can find the exact formula using the ratio of natural logs etc. in most Mol Biol texts and lab guides under screening libraries. I don't have an exact reference to hand.

Hope this fills the concept gap

Cheers
Geoff

Hi Si,

I think this is the same problem that's addressed by rarefaction curves in ecology. There the question is: how many species are there if I've sampled X times and observed Y distinct species? The rate of increase in Y must drop with increasing X, and this theory estimates the asymptote of Y as X goes to infinity.

I don't particularly know the literature, but wikipedia, google or similar are probably fine places to start. Nor do I think I'm the first person to recognize the applicability of this theory to exactly your problem.

Hope this helps!

Best,
Dan

---------------------------------
Daniel M. Weinreich
Department of Ecology and Evolutionary Biology, and
Center for Computational Molecular Biology
Brown University
Box G-W, Providence, RI 02912
For FedEx add: Sydney Frank Hall/LSB 157, 60 Olive Street
Office phone: 401/863-3937 lab -2749 fax -2166
http://www.brown.edu/Departments/EEB/weinreich/weinreichindex.htm
http://www.brown.edu/Research/CCMB/

Hi Simon,

I don't really have the answer to your question - but just wanted to point your attention to the 'jumping pcr' problem in cloning, since the extent of that phenomenon in your samples will heavily influence how many colonies to sequence (a paper on the subject is attached). You probably know all about it. I didn't when I started cloning recently - and in some cases had to sequence 10 - 15 colonies instead of the intended 5 to pick up two alleles (since about 25% were false recombinants)...

Good luck.
Best wishes,
Tove

Bradley RD and Hillis DM 1997. Recombinant DNA sequences generated by PCR amplification. Mol. Biol. Evol. 14(5), 592-593.

--
Si Creer
Post Doctoral Research Fellow
Molecular Ecology and Fisheries Genetics Group
School of Biological Sciences
University Wales, Bangor
Bangor
Gwynedd LL57 2UW
UK
e-mail: s.creer@bangor.ac.uk
Tel: +1248 382302
Fax: +1248 371644
Home Page: http://biology.bangor.ac.uk/~bssa0d/