Table of Contents

Name

prss - test a protein sequence similarity for significance

Synopsis

prss3 [-Q -d # -f # -g # -h -O file -s SMATRIX -w # -z 1,3 -Z # ] sequence-file-1 sequence-file-2 [ #-of-shuffles ]

prss3 [-dfghsw] - interactive mode

Description

prss3 is used to evaluate the significance of a protein or DNA sequence similarity score. It compares two sequences and calculates optimal similarity scores, then repeatedly shuffles the second sequence and calculates optimal similarity scores for each shuffle using the Smith-Waterman algorithm. An extreme value distribution is then fit to the shuffled-sequence scores. The characteristic parameters of the extreme value distribution are used to estimate the probability that each of the unshuffled sequence scores would be obtained by chance in one sequence, or in a number of sequences equal to the number of shuffles.

This program is derived from rdf2, described by Pearson and Lipman, PNAS (1988) 85:2444-2448, and Pearson (Meth. Enz. 183:63-98). Use of the extreme value distribution for estimating the probabilities of similarity scores was described by Altschul and Karlin, PNAS (1990) 87:2264-2268. The 'z-values' calculated by rdf2 are not as informative as the P-values and expectations calculated by prdf. prss3 calculates optimal scores using the same rigorous Smith-Waterman algorithm (Smith and Waterman, J. Mol. Biol. (1981) 147:195-197) used by the ssearch3 program.
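The extreme value fit described above can be illustrated with a short Python sketch. It is not the code used by prss3: it assumes a Gumbel-type extreme value distribution fitted by the method of moments, and the names fit_evd_moments, p_value and expectation are hypothetical helpers introduced only for this illustration.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_evd_moments(shuffled_scores):
    """Estimate Gumbel parameters (lam, mu) from the mean and
    standard deviation of the shuffled-sequence scores."""
    mean = statistics.fmean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # scale parameter (lambda)
    mu = mean - EULER_GAMMA / lam          # location parameter
    return lam, mu


def p_value(score, lam, mu):
    """Probability that one shuffled comparison scores >= score."""
    # 1 - exp(-y), written with expm1 for accuracy when y is tiny
    return -math.expm1(-math.exp(-lam * (score - mu)))


def expectation(score, lam, mu, n):
    """Expected number of scores >= score in n comparisons
    (for example n = number of shuffles)."""
    return n * p_value(score, lam, mu)
```

Given a list of shuffled scores, lam, mu = fit_evd_moments(scores) followed by p_value(unshuffled_score, lam, mu) gives the single-sequence probability, and expectation(...) scales it to a chosen number of comparisons.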

prss3 also allows a more sophisticated shuffling method: residues can be shuffled within a local window, so that the order of residues 1-10, 11-20, etc., is destroyed, but a residue in the first ten is never swapped with a residue outside the first ten, and so on for each local window.
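A minimal Python sketch of the local window shuffle described above. The function name window_shuffle is hypothetical, and the handling of a final window shorter than the window size (shuffled on its own here) is an assumption, not something the prss3 documentation specifies.

```python
import random


def window_shuffle(seq, window, rng=None):
    """Shuffle residues only within consecutive blocks of `window` residues.

    The overall composition of each block is preserved; residues never
    move from one block to another.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    rng = rng or random.Random()
    shuffled = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])  # last block may be shorter
        rng.shuffle(block)                       # in-place shuffle of this block
        shuffled.extend(block)
    return "".join(shuffled)
```

With a window at least as long as the sequence, this reduces to an ordinary uniform shuffle of the whole sequence.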

Output

The SW alignment score is shown in the first column of the histogram. The number of shuffled sequences with this score is shown in the second column and the expected number is shown in the third column. The "=" and "*" reflect the number of sequences observed in the shuffled pair and the number expected, respectively.

The statistics at the bottom of the output indicate the score for the unshuffled sequences and the range of shuffled scores along with an indication of the significance of the alignment.

Examples

(1) prss3 -w 10 musplfm.aa lcbo.aa

Compare the amino acid sequence in the file musplfm.aa with that in lcbo.aa, then shuffle lcbo.aa 200 times using a local shuffle with a window of 10. Report the significance of the unshuffled musplfm/lcbo comparison scores with respect to the shuffled scores.

(2) prss3 musplfm.aa lcbo.aa 1000

Compare the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa, shuffling lcbo.aa 1000 times.

(3) prss3

Run prss3 in interactive mode. The program will prompt for the file names of the two query sequence files and the number of shuffles to be used.

Options

prss3 can be directed to change the scoring matrix, gap penalties, and shuffle parameters by entering options on the command line (preceded by a `-'). All of the options should precede the file names and the number of shuffles.

-d # Number of shuffles (200 is the default)

-f # Penalty for the first residue in a gap (-12 by default) for proteins.

-g # Penalty for additional residues in a gap (-2 by default) for proteins.

-h Do not display histogram of similarity scores.

-Q -q
quiet - do not prompt for filename.

-O filename
send copy of results to filename.

-s str
specify the scoring matrix. BLOSUM50 is used by default for proteins; +5/-4 is used by default for DNA. prss3 recognizes the same scoring matrices as fasta3, ssearch3, fastx3, etc.; e.g. BL50, P250, BL62, BL80, MD10, MD20, and other matrices in BLAST1.4 matrix format.

-w # Use a local window shuffle with a window size of #.

-z # Calculate statistical significance using the mean/variance (moments) approach used by fasta3/ssearch or from maximum likelihood estimates of lambda and K.

-Z # Present statistical significance as if a `#' entry database had been searched (e.g. -Z 50000 presents statistical significance as if 50,000 sequences had been compared).
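For scripted use, the rule that options come before the file names and the shuffle count can be captured in a small Python helper. This is only a sketch: build_prss3_command is a hypothetical function, it covers a subset of the options listed above, and it assumes prss3 accepts each option and its value as separate arguments.

```python
def build_prss3_command(file1, file2, shuffles=None, window=None,
                        matrix=None, db_size=None,
                        quiet=True, no_histogram=False):
    """Assemble a prss3 argument list: options first, then the two
    sequence files, then the optional number of shuffles."""
    cmd = ["prss3"]
    if quiet:
        cmd.append("-q")          # do not prompt for file names
    if no_histogram:
        cmd.append("-h")          # suppress the score histogram
    for flag, value in (("-s", matrix), ("-w", window), ("-Z", db_size)):
        if value is not None:
            cmd += [flag, str(value)]
    cmd += [file1, file2]
    if shuffles is not None:
        cmd.append(str(shuffles))  # trailing positional shuffle count
    return cmd
```

The resulting list can be passed to subprocess.run() from the Python standard library on a system where prss3 is installed.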

Environment Variables

(SMATRIX) the filename of an alternative scoring matrix file. For protein sequences, BLOSUM50 is used by default; PAM250 can be used with the command line option -s P250 (or with -s pam250.mat). BLOSUM62 (-s BL62) and PAM120 (-s P120) can be selected in the same way.

See Also

ssearch3(1), fasta3(1).

Author

Bill Pearson
wrp@virginia.EDU

The curve fitting routines in scaleswe.c were adapted from code provided by Phil Green, U. of Washington.

