Next.Generation.InHouse.Computing

Thank you very much to everyone who provided such useful input to my original question:

"With the costs of outsourcing the standard bioinformatics needed for next-gen data quickly reaching dizzying heights (I was just quoted 1000 Euros for clustering and contig assembly of one cDNA library), I wonder if any advice could be given on a decent, powerful set-up for in-lab use? The trade-off between computing time and accumulating outsourcing costs is important, yet we would not be too upset if contig assembly and clustering of a cDNA library took one week on our own machine. Does anyone have a powerful setup of their own, which is not an expensive cluster system, able to execute the basic necessities, e.g. alignments, BLASTs, SNP searches?"

There seem to be a few ways to tackle this dilemma:

1. Outsource to a company/lab, with prices ranging dramatically.
2. Use a cluster/server system at your institute or region, e.g. AceNet. The speed of processing largely depends on the amount of power you are allocated; some analyses may take a long time if the cores are at capacity.
3. Purchase your own desktop "supercomputer", along with software.
4. Use a virtual cloud system for remote processing of data. This, however, seems to be hampered by the time it takes to upload your data into the system via the internet, and some people have also had problems with the output files.

We have decided to go with our own desktop solution (~$4000 CA), along with the CLC Genomics Workbench software (~$6000 US), for a number of reasons:

1. The software comes highly recommended and makes full use of high-powered desktops, for example identifying SNPs in assembled cDNA libraries in around 30 minutes (as opposed to the days quoted by outsourcing companies).
2. We want to have hands-on experience and expertise within the lab.
3. Our own setup will streamline data processing and customization.
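The cost trade-off above is easy to put rough numbers on. A minimal sketch (the break-even function is illustrative only, and it treats the CAD, USD and Euro figures quoted in this thread as roughly comparable, while ignoring the staff time and maintenance costs that several replies below stress):

```python
import math

def breakeven_libraries(hardware_cost, software_cost, cost_per_library):
    """Number of outsourced libraries whose combined price would
    match the one-off in-house hardware + software outlay."""
    return math.ceil((hardware_cost + software_cost) / cost_per_library)

# Figures quoted in this thread: ~$4000 CA hardware, ~$6000 US for
# CLC Genomics Workbench, vs. a ~1000 Euro quote per cDNA library
# (currencies treated as roughly equal purely for illustration).
print(breakeven_libraries(4000, 6000, 1000))  # → 10
```

On those numbers the in-house route pays for itself somewhere around the tenth library, before counting the analyst time that several replies below warn about.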
There seem to be a lot of people doing in-house analyses, and as such an array of software and a community for support has cropped up: http://seqanswers.com/forums/index.php

See below for an example of a powerful machine's specs, followed by all the correspondence received:

Mainboard: ASUS Z8NA-D6C ATX Dual LGA1366 Intel 5500 DDR3 PCI-E16 - $318
CPU #1: Intel Xeon E5520 Quad Core Processor LGA1366 2.26GHZ 8MB Cache 45NM - $450
CPU #2: Intel Xeon E5520 Quad Core Processor LGA1366 2.26GHZ 8MB Cache 45NM - $450
Memory: Corsair XMS3 CMX8GX3M4A1333C9 24GB DDR3 4X6GB DDR3-1333 CL 9-9-9-24 - $1080
HDD: Western Digital WD10EADS Caviar Green 1TB SATA 32MB Cache 3.5IN Hard Drive OEM - $95
Video Card: Sapphire ATI Radeon 5870 1GB PCIe DDR5 Graphics Card - $460
Sound Card: on board
Card Reader: Nmedia C68 3.5IN USB2.0 Flash Card Reader Silver Black with Front USB - $24
Optical Drive: LG GH22NS50 Black 22X SATA DVD Writer OEM - $33
Case: Supermicro eATX Server Case eATX 4X5.25 6X3.5INT No PS Front USB & Audio - $610
O/S: Microsoft Windows 7 Professional Edition 64BIT DVD - $164

With 24GB memory: $3684 + tax = $4162.92
With 32GB memory: $3964 + tax = $4479.32

-------

Hi Jack,

There is no need to spend $1000 doing a next-gen cDNA assembly, and shame on anybody who would even charge that! For about $1500 you can put together a Linux system to handle most of your bioinformatic needs, including assembly. Really, any workstation will do that has at least 8GB RAM and a quad core. What sorts of data are you talking about - Illumina, SOLiD, or 454? If you want to chat on the phone about assembly strategies and necessary hardware, feel free to give me a call at 520-336-7592.

Best,
Mike Barker
--
Michael S. Barker, Ph.D.
The Biodiversity Research Centre, University of British Columbia

-----------------

Dear Jack,

I bought myself a system built around an Asus Z8NA-D6 i5500 RGVA motherboard with 2 Intel Xeon processors. You will have 2 quad cores; that makes 8 cores behaving as one.
These Intels are hyperthreaded, which means they will actually behave like 16 processors. The whole thing will be considerably cheaper than 5k Euro. It is only one box, and will work nicely with standard Linux distributions. This is a standard ATX set-up with up to 24 GB of RAM. Please confront your local computer store with these details and they can tell you more. This is good enough to do everything you wish for, but not yet whole-genome de novo assembly of vertebrate-sized genomes. For that you would need 256 or better 512 GB RAM, which is only possible with server configurations, prices starting at around 30k Euro - and not so easy to maintain... More questions? Let me know.

Cheers,
Robert

Robert H. S. KRAUS
PhD student, Resource Ecology Group
Wageningen University, The Netherlands
Phone: +31 317 4 83530/84700
Fax: +31 317 4 84845
Email: robert.kraus@wur.nl
Web: http://www.reg.wur.nl/UK/Staff/Kraus/

-------------------

Mr Lighten,

You really have to make sure, when you are estimating costs, that you factor in your own time. There is the time you spend finding the right software, then installing it, fixing it to work for you (which may take some IT work), then the learning curve, actually doing the analysis the first time, then with different parameters, and so on until it makes sense. Can you give a guarantee about the outcome? If it doesn't work the first time, a company is on the hook for doing it until you think it's right. I am just saying that maybe 1000 Euros isn't so much to pay for it. I am an academic at heart, and I am not (honestly) asking for work, but I am on a crusade to remind academics that their time is worth money - and just how much time it takes.

Good luck,
Susan

Susan I. Fuerstenberg, Ph.D.
President and Chief Operating Officer
Genome Project Solutions
1.877.867.0146 (Toll-Free)
www.genomeprojectsolutions.com

------

Dear Jack,

I have used the program NGen from Lasergene to do trimming and assembly of 454 reads.
This is a bit expensive (and has to be run on a 64-bit PC) but is really easy to use and seems to work well.

Cheers,
/Robert

----------------

Jack,

We have been using CLC Workbench in our lab for the past 2 years and we're very happy with it. There is a licence cost to it, but there is the possibility of a free trial of 2 weeks, I think. This could be plenty for your needs. Assembly of a full Titanium plate and finding putative SNPs in it takes about... 30 min. http://www.clcbio.com/index.php?id'

Louis

-------------------

Jack,

There are several ways you can go about getting data processed:

1. You can pay to have it done. For example, the Georgia Genomics Facility will assist with this for 100 USD per hour; a typical 454 run would take 0.5-3 hours depending on the complexity.
2. You can do it with software placed on your university's computing servers. I have personally worked with Kansas State University's BeoCat cluster (http://www.cis.ksu.edu/beocat) and UGA's R-cluster (http://rcc.uga.edu/) to assemble 454 data. These are great resources within the university systems, and are for the most part under-utilized. They will typically load any licensed software that is needed by the research community.
3. As most bioinformatic software packages are Linux-based, you could simply set up your own computer. I have a dual-boot (Mac OS and Linux) Mac Pro with dual quad-core processors and 32GB RAM. Unless you are trying to cluster several runs of data, any computer with at least 16GB RAM should be sufficient.
4. If none of those are available, we have tested and can provide instructions to set up a virtual Linux (Ubuntu) machine within VMware Fusion (19 USD) on a Mac. Depending again on the RAM needs of the software, running it within a virtual OS window works as well as within real hardware. There is similar software for a PC (VMware Player, free!) that could do the same thing... although we have not yet tested the PC configuration.
Ken

______________________
Ken Jones, PhD
Assistant Research Scientist
Georgia Genomics Facility
Riverbend North, Room 129
110 Riverbend Rd.
Athens, Georgia 30602
Email: kenjones@uga.edu
Office: 706.542.6877
GGF main: 706.542.6409
GGF fax: 706.542.6414
Website: http://dna.uga.edu

----------------------

Hi Jack,

Since I last posted, I've been working with CLCbio. The software is expensive, but it is beautiful. It provides readily interpretable output, and it's powerful and flexible (but not so flexible as to be unapproachable, like bfast). It is expensive - but a one-time license is not so frightful if your lab is doing a lot of next-gen stuff. See if it does what you want! http://www.clcbio.com/

Cheers,
Marta

---------------------

Jack,

You have several resources available to you at Dal. I was a postdoc there for four years and I would suggest getting in touch with the CGEB people (http://cgeb.dal.ca/) for some help. Also, the NRC has a cluster that you can use if you talk to Sheldon (Sheldon.Briand@nrc-cnrc.gc.ca). I often ran Staden and Artemis off their cluster when I was there.

-Chris

---------------

Regarding the Galaxy Cloud system: Yeah, I've used it. It does take forever to upload. You can upload archived files (although I have not done this, for no good reason), which would help somewhat. I've used it basically for quality-control-type stuff. It is really easy to do things like plot the average quality score for each position in a read. I am not really a 'bioinformaticist', so these types of things are really helpful to me. I've also used it to map 454 reads (already assembled) to a genome; the mapping went smoothly, but the export has not. I am waiting for some tech support on this issue. I've done all of my RNA-seq-type work on local computers here at the University of Illinois. We have a centralized computing facility that charges us something like $1/day/CPU, which winds up being almost nothing for the types of things I'm doing.
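The per-position average quality mentioned above as a Galaxy task is also easy to compute locally. A minimal sketch in pure Python; the helper name is hypothetical (not part of Galaxy or any tool named in this thread), and it assumes Sanger/Phred+33 quality encoding - Illumina pipelines of this era often used Phred+64 instead, so check your data first:

```python
def mean_quality_per_position(fastq_lines):
    """Average Phred quality at each read position across all reads.

    Assumes Phred+33 encoding (subtract 33 from each ASCII value).
    """
    totals, counts = [], []
    # In a FASTQ record the quality string is every 4th line.
    for qual in fastq_lines[3::4]:
        for i, ch in enumerate(qual):
            if i == len(totals):  # grow the lists for the longest read seen
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - 33
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy 4-base reads: 'I' is Phred 40, '!' is Phred 0.
reads = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!II",
]
print(mean_quality_per_position(reads))  # → [20.0, 20.0, 40.0, 40.0]
```

Feeding the resulting list to any plotting tool reproduces the quality-by-position profile described above.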
If you can get your dept to go in for something like that, it would be great. (I can't remember if that is what you were thinking.) I can dig up the details of what we have here, but it's probably much bigger than what a single user (or even a few labs) would want. Hope this helps, and let me know if you need more info. If you see my friend Ryan Kerney (who I think is still up there), say hi!

Chris

-----------------

Dear Jackie,

I don't really think there is any way around this apart from learning bioinformatics. Working with RNA-seq, I find that each transcriptome is different, each project has different goals and ambitions, and the analysis really has to be individually tailored to each project. There is actually a lot of biological and scientific reasoning going into every step of "simple" bioinformatic procedures such as assembly, BLAST searches, SNP detection and gene annotation, and it is a lot harder than anyone who hasn't done it thinks. I do think sequencing companies are not really clear enough about this when they advertise their services as "£1000 for your favourite genome". You could try to inherit some set-up from someone else, but if you don't understand how it works, it will be really difficult for you to get it to work for you. It is a little bit like doing a PCR using a protocol: it might work, but if it doesn't, you really need to understand what is happening in order to troubleshoot it and to tailor it to your specific product and template. And really understanding a PCR reaction takes a lot of learning, experimentation and skills development. I'd advise you to find someone who has done something very similar to what you want to do, and then hire them or get them to collaborate with you. Maybe you can also delve into the bewildering expanse of papers on different bioinformatic software, or maybe try to get a placement in a bioinformatics lab where you could surround yourself with people who have the skills you need?
Best,
Magdalena

------------------

Yes, this is an issue, although it is worth taking into consideration the cost of maintaining a suitable system in house. If anyone is interested, we have compute capacity and bioinformatics staff as part of our genomics facility. We are using it, for example, to assemble genomes, map reads against reference genomes, manage large databases of reads and reference information, perform phylogenetic analysis, etc. Depending on what you need, we can certainly provide access and thereby save you the tasks of purchasing and maintaining your equipment, managing the software suite, etc. Those costs do also accumulate over time and should not be underestimated. Anyone interested should please contact me directly.

Cheers,
Brook
--
Brook Milligan
Department of Biology, New Mexico State University
Las Cruces, New Mexico 88003 U.S.A.
Internet: brook@nmsu.edu
Telephone: (575) 646-7980
FAX: (575) 646-5665

-----------

Jack -

It kind of depends on your budget and your computing & installation expertise. Amazon's EC2 is a decent option:
http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html
http://blog.cyclecomputing.com/amazon-ec2/

For US researchers, the NSF TeraGrid is a great option, with an easy-to-obtain development account providing a reasonable amount of time before applying for a full allotment of CPU hours: https://www.teragrid.org/

A $6000 (or maybe less) Linux machine with 16 cores and 32GB of RAM would suit a lot of the things you are talking about, depending on the genome size and the kind of NGS data you are planning to process. We bought a couple of servers from Thinkmate (http://www.thinkmate.com/), but other vendors with pretty good costs are out there. The problem is less the initial hardware than the upkeep and administration costs of running the systems, so that needs to be factored in too.

Hope that helps.

-jason
--
Jason E. Stajich, PhD
Assistant Professor
Department of Plant Pathology & Microbiology
University of California, Riverside, CA 92521
jason.stajich@ucr.edu
Office: 951.827.2363
http://lab.stajich.org/
http://twitter.com/stajichlab
http://fungalgenomes.org/blog/

--
Jack Lighten, Ph.D. Candidate
Bentzen Lab, Room 6078, Department of Biology
Dalhousie University, Halifax, NS, B3H 4J1, Canada
Office: (902) 494-1398
Email: Jackie.Lighten@Dal.Ca
Profile: www.marinebiodiversity.ca/CHONe/Members/lightenj/profile/bio

Jackie Lighten