Dear EvolDir Colleagues,

Many thanks to everyone who kindly replied to my question about batch downloading sequences from GenBank. All of your suggestions are very useful and work well. For users like me who are not familiar with scripting, the simplest way is to type a range query such as "EF100000:EF102000[ACCN]" into the search field. Alternatively, you can write a script in Perl or R, or use other software such as BioEdit or MacVector. I have attached the answers to my question below. Heartfelt thanks to the colleagues who helped me, and to EvolDir for providing such a good platform for communication. Have a nice time, and good luck with your work and life!

Sincerely yours,
Jinyong

Dear Jinyong and all other users of GenBank,

there is an easy (but not well known) way to achieve this without any additional software. In your example, just type the following term into the GenBank search field:

EF100000:EF102000[ACCN]

Please note the colon between the first and last accession numbers and the final term in square brackets.

Yours
Martin

Dr. Martin Wiemers
Department für Populationsökologie
Fakultät für Lebenswissenschaften
Universität Wien
Althanstr. 14
A-1090 Wien
Austria
Tel. +43 1 4277 57403
e-mail: martin.wiemers@univie.ac.at
http://www.univie.ac.at/population-ecology/

http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide
should probably work

Go to the NCBI home page, select CoreNucleotide (or just Nucleotide) from the Search pull-down menu, type in EF100000, EF102000, ... (as many as you want), and then click on Go. Once you get the records you can save them in any format you want. I would think you don't need a script for this. Good luck.

Raja
Georgetown University

If you have some familiarity with Perl, I'd recommend taking a look at BioPerl (www.bioperl.org). It comes with a variety of example scripts that probably already do what you want, or will be easy to modify.
To run the example scripts you don't in fact even need to know Perl, though to modify them you obviously do. You might also simply ask your question on the bioperl mailing list.

-hilmar

BioEdit can do this (under File->retrieve from GenBank or GenPept), although it's a bit clunky because you can't just enter a range of accession numbers - you have to enter each as a separate number, with a carriage return in between. What I usually do is use Excel to make a list (if you enter the first number and then "drag down" the cells, it will give you a sequential list of numbers), save it as text, and then cut and paste the list into the window in BioEdit. Don't forget to tick the box for "accession", not "gi", or it won't find anything. Really long lists will take a while to download, so be patient - my suspicion is that the bottleneck is at the GenBank end, not the program. BioEdit is free and runs on Windows - if you do a Google search you should find it in multiple places - I got it here:

http://www.mbio.ncsu.edu/BioEdit/bioedit.html

Hope that helps,
Regards,
Laura

--
Laura B. Geyer, PhD
Smithsonian Tropical Research Institute
geyerl@si.edu
lbgeyer@gmail.com (preferred)

From the US:
Smithsonian Tropical Research Institute
Attn: Laura Geyer - Naos
Unit 948
APO AA 34002-0948
703/487-3770 ext. 8730

Internationally:
Instituto Smithsonian de Investigaciones Tropicales
Att: Laura Geyer - Naos
Apartado 0843 - 03092
Balboa, Ancón
Panamá, República de Panamá
(+507) 212-8730
Fax: (+507) 212-8790 or (+507) 212-8791

Dear Jinyong,

http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide

Please follow the instructions at the website above. Good luck!

Renyi

Hi Jinyong,

The best way to do this is Batch Entrez:

http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide

Hope this helps,
Dave Arde

--
David H.
Ardell
Linnaeus Centre for Bioinformatics, Box 598, SE-751 24 Uppsala, Sweden
T:+46(0).18.471.6694 F:+46(0).18.471.6698
http://www.lcb.uu.se/~dave

I've used Geneious (www.geneious.com) for the kind of task you've described. It works well, and it has a number of flexible output options.

Kind regards - Rich

Rich Cronn, Research Geneticist
US Forest Service, Pacific NW Research Station
3200 SW Jefferson Way, Corvallis, OR 97331
541-750-7291 phone * 541-750-7329 fax * rcronn@fs.fed.us

I've used R with the packages "ape" and "seqinr" to batch download sequences from GenBank. You need to know the accession numbers (which it sounds like you do), and it helps if they're sequential. Also, you need to know R.

Hope this helps!

--
Brian J. Knaus
Ph.D. Candidate
Department of Botany and Plant Pathology
Oregon State University
2082 Cordley Hall
Corvallis, OR 97331-2902
http://oregonstate.edu/~knausb

Hello,

By far the best method I have seen for batch downloading and managing sequences from GenBank is Geneious:

http://www.geneious.com/

It is an amazing little program.

Good luck,
Rebecca
Rebecca Johnson

If you have GenBank and blastall installed on a local system, you can use fastacmd from the command line:

> fastacmd -d /common/data/genbank -p T -s 32699731,3915985

would retrieve the sequences with GI numbers 32699731 and 3915985. It also accepts accession numbers.

> fastacmd -

gives the parameter options for the command.

There are also Perl scripts that connect to the databank via the internet. One version of this is available at the NCBI ( http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html ), another through the BioPerl project (see http://www.bioperl.org/wiki/Bptutorial.pl#Accessing_sequence_data_from_local_and_remote_databases).

Good luck
Peter

J.
Peter Gogarten
Professor of Molecular and Cell Biology
University of Connecticut
91 North Eagleville Road
Storrs, CT 06269-3125 USA
Phone: 860 486 4061 (office) 860 486 1887 (lab)
FAX: 860 486 4331
Email: gogarten@uconn.edu
www: http://gogarten.uconn.edu/

I saw your post to the evoldir (quoted below). Though I'm not very familiar with it, I believe that MEGA can do something like this. On the other hand, this could also be done with a very elementary Perl script using one or more of the BioPerl modules that are specifically designed for downloading sequence records from GenBank (and other databases). Also, with a Perl script it should be a simple matter to generate sequential accession numbers, so you wouldn't have to read them from a file somewhere.

If you use R and install the package ape, there is a function called read.GenBank: you feed it a character vector of accession numbers and it fetches all the sequences for you.

cheers
Einar

Hi Jinyong,

I use GenBankRetriever, a Java program. It was written by Stephen Smith (Yale University) and can be downloaded from http://yphy.org/viburnum/GBR.zip

Best,
Sergios-Orestis Kolokotronis

Sackler Institute for Comparative Genomics
American Museum of Natural History
Central Park West @ 79th St.
New York, NY 10024 -USA-
tel +1 212 313 7648
koloko@amnh.org
http://koloko.net

Dept. of Ecology, Evolution, and Environmental Biology (E3B)
Columbia University
sk2059@columbia.edu

If you use Linux and have Perl and lynx (a text-based web browser freely available for Linux) on your system, then I have a script that I wrote which can do batch downloads from GenBank. Please let me know if you have these, and I would be happy to send you the script.

Take care,
Melanie

Carol Mariani

http://www.macvector.com/MacVector%20Content/databasesearching.html

Database Searching
NCBI Entrez

The National Center for Biotechnology Information (NCBI) maintains a number of sequence databases that can be accessed over the Internet.
MacVector has a built-in Internet database browser that lets you search the NCBI's Entrez databases for sequences using combinations of search terms such as author, organism, keywords, accession number etc. Sequences that match the search terms can be downloaded either individually or in batches to your desktop or to a folder on your hard drive. Sequences are retrieved with all of their associated annotations and features intact. These can be viewed and edited in MacVector, or used for additional analyses.

MacVector allows you to download sequences and store them in files on your computer.

"Macdonald, Stuart"

Type:

EF100000:EF102000[ACCN]

into the GenBank window. Change "Display" to FASTA, and "Send to" to Text or a file. This works for up to 500 contiguous accessions. For >500, or for non-contiguous accessions, go to:

http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide

and follow the instructions in (2).

Dear Jinyong,

Regarding your question on evoldir: you can actually do that from within the NCBI website itself. The approach is as follows:

(1) Create a file with a list of GI or accession numbers and save it locally
(2) Go to the following page: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi
(3) Enter the filename or browse to choose it from your system directory
(4) Designate the database as Nucleotide or Protein, as appropriate
(5) Press Retrieve; you will see a list of document summaries
(6) Select a format for the sequence data
(7) Press Save to download; you will be prompted to name the file

Hope this helps.

Kind regards,
Anders

Unfortunately, I do not think you can simply put an interval of accession numbers in your file (say, "EF100000-EF102000"); instead, you need to list each accession number explicitly on a separate line. Hopefully you (or someone at your institute) can create such a file automatically (using some scripting language or Excel, perhaps).

Best
Anders

Anders Gorm Pedersen, Associate Professor, Ph.D.
Molecular Evolution Group
Center for Biological Sequence Analysis
Technical University of Denmark
Bldg. 208, DK-2800 Lyngby, Denmark
Tel.: (+45) 45 25 24 84
Fax: (+45) 45 93 15 85
http://www.cbs.dtu.dk/researchgroups/molevolution.php

I just double-checked, and you can actually create such lists automatically in Excel:

(1) Enter the first name in your list in some cell on an Excel sheet (e.g., "EF100000")
(2) Enter the second name in an adjacent cell (e.g., "EF100001")
(3) Select the two cells
(4) Pull the "fill handle" (the small dot in the lower right corner of your selection); Excel will then automatically extend the pattern across all the cells you drag over.

Kind regards,
Anders

I am responding to your GenBank question on the evolution directory. There is a nice feature on the NCBI website for searching for, and downloading, batches of sequences. Just go to NCBI tools, and then to the Batch Entrez link:

http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide

This allows you to upload a file that contains a list of the accession numbers you want. Then, after the search finds these sequences, you can simply download them all in FASTA format (just click "Display FASTA", and then "send to file"). Hopefully this will work for your situation.

Cheers,
Jamie

Jamie Oaks
Museum of Natural Science
119 Foster Hall
Louisiana State University
Baton Rouge, LA 70803 USA
Office Phone: 225-578-5393
Office Fax: 225-578-3075
E-mail: joaks1@lsu.edu

jinyong.hu@googlemail.com
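
Several of the replies above note that Batch Entrez needs an explicit list with one accession number per line, and suggest building it in Excel or with a script. For readers who prefer the scripted route, the following is a minimal Python sketch of that step only. It is not code sent by any of the respondents: the function names and the output filename accessions.txt are hypothetical, and it assumes accessions made of a letter prefix followed by digits, such as EF100000.

```python
import re


def expand_accession_range(first, last):
    """Return every accession from `first` to `last`, inclusive.

    Assumes both accessions share the same letter prefix and the same
    number of digits (e.g. "EF100000" and "EF102000").
    """
    pattern = re.compile(r"^([A-Za-z_]+)(\d+)$")
    m_first = pattern.match(first)
    m_last = pattern.match(last)
    if not m_first or not m_last:
        raise ValueError("expected accessions like 'EF100000'")

    prefix, start_digits = m_first.groups()
    last_prefix, end_digits = m_last.groups()
    if prefix != last_prefix or len(start_digits) != len(end_digits):
        raise ValueError("both accessions must share prefix and digit count")

    start, end = int(start_digits), int(end_digits)
    if start > end:
        raise ValueError("first accession must not come after the last")

    # Zero-pad so that e.g. "AB000009" is followed by "AB000010".
    width = len(start_digits)
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


def write_accession_list(accessions, path):
    """Write one accession per line, the layout described for Batch Entrez."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(accessions) + "\n")


if __name__ == "__main__":
    accessions = expand_accession_range("EF100000", "EF102000")
    write_accession_list(accessions, "accessions.txt")  # hypothetical filename
    print(f"wrote {len(accessions)} accessions")
```

The resulting text file can then be uploaded on the Batch Entrez page in the same way as a list made in Excel.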