Dear All again As is customary in Q&A sessions on EvolDir, I am reporting back the gargantuan response to my query below, regarding an expedient way to change TERMINAL ONLY "-" characters to Ns in large or multiple sequence alignments of VARIABLE LENGTH. The initial query is below for reference. A really big thank you to everyone - it really stimulated a lot of interest and indeed thought on the many ways to skin this particular rabbit. On the GUI front, it became clear that I'm unfamiliar with MacClade, Mesquite and SeaView and will be investigating all soon. A mix of summaries and cut/pastes are below for all to refer to. Some of my alignments are pretty big though and so I am shying away from GUIs and would prefer Perl or Python for these sorts of jobs. We THINK that sed (Linux) is not powerful enough on its own and so various Python and Perl scripts were donated for testing. The sources and some code (copy, save and make executable chmod +x filename, http://www.linux.org/lessons/beginner/l14/lesson14b.html) are also pasted below for beta trialling. I also copy in some thoughts regarding the treatment of gaps and Ns in datasets for consideration. Many thanks once again to the senders of dozens of responses and best wishes to all once again.. Si For GUIs, various options were suggested: MacClade, if you have a Mac and a licence for $125 http://macclade.org/macclade.html Re. your question in evoldir, I think Mesquite will do the job for you (open your alignment and in the menu "Matrix", select "Alter/Transform" and "Terminal Gaps to ?"). Till someone send you a real script, I suggest using Seaview software for alignment manipulation. http://pbil.univ-lyon1.fr/software/seaview.html Best wishes, Eszter Ari PERL Scripts (also check out the BioPerl website http://bioperl.org/: Available from: Pauline Garnier-Gere pauline@pierroton.inra.fr Francesco Nardi nardifra@unisi.it Olaf Bininda-Emonds olaf.bininda@uni-oldenburg.de seqConverter script for lots of alignment manipulations John Wares Jpwares@uga.edu Charles Kieswetter ckies@bu.edu Nick Crawford ngcrawfo@bu.edu Some scripts posted below: Hi Si - This is pretty straightforward with Bioperl (as long as you have BioPerl installed on your Linux/Unix machine). Here's one solution: #!/usr/bin/perl -w use strict; use Bio::AlignIO; my $in = Bio::AlignIO->new(-format => 'fasta', -file => shift @ARGV); my $out = Bio::AlignIO->new(-format => 'fasta'); while( my $aln = $in->next_aln ) { for my $seq ( $aln->each_seq ) { my $str = $seq->seq; if( $str =~ /^(-+)/ ) { my $rep = length($1); # replace from the 5' end substr($str,0,$rep,'N'x$rep); } if( $str =~ /(-+)$/ ) { my $rep = length($1); # replace from the 3' end substr($str,-1 * $rep,length($str),'N'x$rep); } $seq->seq($str); } # don't print the /start-end info in the FASTA ID $aln->set_displayname_flat(1); $out->write_aln($aln); } -jason Franceso Nardi #!/usr/bin/perl use Bio::SeqIO; if (! $ARGV[0]){ print "\n\nUsage: fillNs.pl /home/../../filename.fasta \n"; print "(..and be careful, this is very basic) \n\n"; } else{ $in = Bio::SeqIO->new( -file => "<$ARGV[0]" , -format => "fasta"); $out = Bio::SeqIO->new( -file => ">$ARGV[0].wNs" , -format => "fasta"); while ( $seq = $in->next_seq() ){ $s=$seq->seq; if ($s =~ m/[^-].*[^-]/){ $s = "N" x length($`)."$&"."N" x length($'); $seq->seq($s); $out->write_seq($seq); } else { die "Sorry, could not find the sequence in ".$seq->id.", quitting...\n\n"; }; }; }; PYTHON Scripts: Available from: Nicolas Feau nicolas.feau@bordeaux.inra.fr Eric Fontanillas efontanillas@sb-roscoff.fr Dear Si, This was such an easy thing to do in python that I wrote a script to do it in 5 minutes. To run it, just follow the instructions at the top of the script. They should work for a mac and linux (which will come with Python installed), not sure about windows though. I have tried it out on my alignments, and it seems to do the trick, and deals with sequences on >1 line no problem. I have put the script up on my website (www.robertlanfear.com) where you can download it. Let me know if it does what you want. It would be trivial to extend it to alter all '.fasta' files in a single directory, happy to extend it like that if it would be useful. Yours Rob GAWK solutions: Hi Simon, Not an elegant solution as one should be able to replace a variable length pattern with a variable length string of Ns using regular expressions but it should do the trick very fast if you have gawk in your system. For clarity, I divide the problem into two steps: Original file Taxons: Taxon 1 -----ATGCTG--TGACTG----TGACT--- Taxon 2 ---GTATGTTG--TGACTGCT--TGACCGTC First step (process Taxons file): gawk '{if ($1 != "Taxon") {match($0,/^-+/); start_len = RLENGTH; match($0,/-+$/); end_len = RLENGTH; print start_len, substr($0, start_len + 1, length($0) - start_len - end_len), end_len} else print $0}' Taxons >| Taxons.length Taxon 1 5 ATGCTG--TGACTG----TGACT 3 Taxon 2 3 GTATGTTG--TGACTGCT--TGACCGTC -1 Second step (process Taxons.length file): gawk '{if ($1 != "Taxon") {for (i=1; i <= $1; i++) printf "N"; printf $2; for (i=1; i <= $3; i++) printf "N"; printf "\n"} else print $0}' Taxons.length >| Taxons.processed Taxon 1 NNNNNATGCTG--TGACTG----TGACTNNN Taxon 2 NNNGTATGTTG--TGACTGCT--TGACCGTC Hope this helps! -- Sergi Castellano Sure, no problem! I am myself interested in seeing other solutions. I usually leave my perl for more complicated things, but I would love to see a perl one-liner doing the trick with a short regular expression. That may also be possible with sed and awk, but it did not seem obvious to me at the time. In addition, I would usually skip creating and intermediate file using, for example, a named file: gawk '{if ($1 != "Taxon") {for (i=1; i <= $1; i++) printf "N"; printf $2; for (i=1; i <= $3; i++) printf "N"; printf "\n"} else print $0}' <(gawk '{if ($1 != "Taxon") {match($0,/^-+/); start_len = RLENGTH; match($0,/-+$/); end_len = RLENGTH; print start_len, substr($0, start_len + 1, length($0) - start_len - end_len), end_len} else print $0}' Taxons) >| Taxons.processed This results in a typical gawk/bash one-liner to quickly write in the command-line. This does not make it that more elegant though! Gap and N discussions Dear Simon, "N"s in the 5' and 3' region are per se also false charcters, because you dont know if there is any nucleotide or a gap? Therefore a gap wouldnt be indicated by an "N", i.e. "any possible nucleotide". Thats why we prefer to set a "?" in these regions to distinguish the positions from real gaps ("-") or ambiguous base-calls ("N") received from the sequencing procedure in the rest of the alignment. All the best, Alexander > Dear All > > > > Alignment programs like MUSCLE and Clustal often output alignments with > > "-" symbols indicating indels (real events) within sequence alignments, > > but also "-" symbols at the 5' and 3' ends of sequences. The latter > > however, are not real evolutionary events and really should be Ns > > (missing data), depending on the sort of analytical framework you use. Most phylogenetic reconstruction software does not deal well with either gap characters or N's (or R, Y and other ambiguity or mixed base IUPAC codes) in sequences. As you note, insertions and deletions (which end up represented by dashes) are real events, but the problem is that there are many types of these real events, including duplications of flanking sequences to create a direct repeat, insertions of transposable elements, insertions of single bases, etc each with very different mechanisms, probabilities and evolutionary "costs". Many phylogenetic programs have options available to treat gaps as missing information (the gap region is not counted in pairwise comparisons) or to strip all gaps (any column in the alignment containing a gap character in any sequence is not counted). With an N character, some programs treat it as a .25 match to any base, some treat it as a full match to any base, some treat it as "missing information" and don't count it at all. The point is, that it is important to consider what information can be correctly gained, vs what type of "noise" you are adding to your data set, by replacing terminal missing information with N characters. -------- Original Message -------- Subject: Other: Script for editing alignments? Date: Thu, 12 Aug 2010 02:00:09 -0400 (EDT) From: evoldir@evol.biology.mcmaster.ca Reply-To: brian@helix.biology.mcmaster.ca To: s.creer@bangor.ac.uk Dear All Alignment programs like MUSCLE and Clustal often output alignments with "-" symbols indicating indels (real events) within sequence alignments, but also "-" symbols at the 5' and 3' ends of sequences. The latter however, are not real evolutionary events and really should be Ns (missing data), depending on the sort of analytical framework you use. If there is sufficient heterogeneity and signal within the 5' and 3' ends of sequences, the "-"s can be manually edited in a text editor to Ns with no problem, if the alignment is small. If it is large (e.g. 2000 seqs), or there are lots of alignments, it becomes a lengthy task. I'm investigating such alignments presently and so was wondering if anyone had a clever way of implementing sed, or had a Perl script that would perform such a task. Simply put, it would require replacing the 5' and 3' "-" below only with Ns and leaving the within sequence "-"s alone. The sequences naturally may span more than one line. >Taxon 1 -----ATGCTG--TGACTG----TGACT--- >Taxon 2 ---GTATGTTG--TGACTGCT--TGACCGTC to >Taxon 1 NNNNNATGCTG--TGACTG----TGACTNNN >Taxon 2 NNNGTATGTTG--TGACTGCT--TGACCGTC It's a simple task, but I haven't seen any scripts out there to do the job. If there are any scripters out there who can help, or if someone knows of an application that would help, it would be great to hear from you. With best wishes and thanks Si Creer -- Simon Creer Senior Research Fellow Molecular Ecology and Fisheries Genetics Lab Environment Centre Wales Building School of Biological Sciences Bangor University Bangor Gwynedd LL57 2UW UK e-mail: s.creer@bangor.ac.uk Tel: +1248 382302 Fax: +1248 371644 Home Page: http://biology.bangor.ac.uk/~bssa0d/ "Creer,Simon"