======================================================================
Neighborhood Correlation Version 2.0
======================================================================

For details of Neighborhood Correlation please refer to
http://www.neighborhoodcorrelation.org/ or the publication:

    Sequence Similarity Network Reveals Common Ancestry of
    Multidomain Proteins
    Song N, Joseph JM, Davis GB, Durand D
    PLoS Computational Biology 4(5): e1000063
    doi:10.1371/journal.pcbi.1000063

This program implements Neighborhood Correlation, as described in the
above publication. It is designed to take BLAST BIT-scores as input.

======================================================================
Changes since version 1.0
======================================================================

Version 2.0 is a complete rewrite of the Neighborhood Correlation
implementation. It is meant to replace Version 1.0 (previously
referred to as the "reference implementation"). Version 2.0 has been
optimized to accommodate large datasets through fast computation and
greatly reduced memory usage.

This implementation adds a dependency on the Numpy numerical package.
It also requires that a C compiler be available on the system. See
"Prerequisites", below.

Performance is greatly improved over Version 1.0. As a rough guide,
the set of Mouse and Human sequences used in our analysis included
26,197 sequences. From this, all-against-all BLAST yielded
approximately 4.8 million pairwise relations. For this dataset,
Neighborhood Correlation Version 2.0 can be expected to consume
approximately 125MB of memory. Running time for this dataset is
approximately 45 minutes on an Intel Pentium D at 3.2GHz. Version 1.0
required more than 1GB of memory and 16 hours.

The input and output of both versions should be equivalent, save the
following modification: Version 1.0 reported NC scores for pairs that
satisfied the condition (NC(x,y) >= nc_thresh || BLAST score (x,y)
exists).
Now, this has been simplified to only (NC(x,y) >= nc_thresh).

Both versions remain available at
http://www.neighborhoodcorrelation.org.

======================================================================
Support
======================================================================

If you encounter any difficulties with this program, or have related
questions, please contact Jacob Joseph.

======================================================================
Licensing
======================================================================

(C) 2009 Jacob Joseph, and Carnegie Mellon University

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.

======================================================================
Prerequisites
======================================================================

This program requires the Numpy numerical package for Python. This
package is not included with the standard Python installation. It may
be downloaded from http://numpy.scipy.org/.

Additionally, a C compiler is required during installation to compile
the inner loop of the algorithm. Such a compiler will be available on
nearly all UNIX-based systems.

This program has been tested with Python version 2.5 on Linux and
MacOS. Though untested on other systems, it has no OS-specific
dependencies and should work on any Python installation with Numpy.

Note that we have observed Bus Errors on MacOS with old (1.0.1)
versions of Numpy. If you experience difficulties, please first try
installing a recent version of Numpy.

======================================================================
Installation
======================================================================

This program may be installed to your (Linux) system as follows.
Other platforms may require slight variations.

Extract the package; e.g.,

    tar -xzf neighborhood_correlation-2.0.tar.gz

Build the package for your system:

    cd neighborhood_correlation-2.0
    python setup.py build

Install to the system:

    python setup.py install

Leave the Installation directory:

    cd

Local Install: (optional)

If you do not have root access to your machine, or you would like to
install to a local or non-default directory, you may supply a --prefix
argument to the installation command; e.g.,

    python setup.py install --prefix=$HOME/usr

On a 64-bit Linux machine with Python 2.5, this will install to the
directories $HOME/usr/bin and $HOME/usr/lib64/python2.5/site-packages/.

If you install in a non-standard location, Python may not be able to
locate the required files. This problem will manifest itself as an
ImportError, or as the message "WARNING: Neighborhood Correlation C
helper not found."

If you have installed using a non-standard directory, you will need to
set your PYTHONPATH and PATH to correspond to what you specified. The
method of setting environment variables depends upon the UNIX shell
you are using. The correct path to set also depends upon the specifics
of your system; e.g., Python version, platform, operating system,
existing environment variables, and so on.

For our example (installation with Python 2.5 on a 64-bit Linux
machine, into the directory $HOME/usr, and using the bash shell), you
would add the following lines to $HOME/.bashrc:

    export PYTHONPATH=$HOME/usr/lib64/python2.5/site-packages:$PYTHONPATH
    export PATH=$HOME/usr/bin:$PATH

To have these changes take effect immediately in your running shell,
type:

    source $HOME/.bashrc

======================================================================
Running Neighborhood Correlation
======================================================================

The executable NC_standalone should now be installed to your system.
An explanation of program parameters is as follows:

Usage: NC_standalone -f [options]

Output of Neighborhood Correlation scores will be printed to stdout in
the same three-column format as used for input.

Options:

  -f, --flatfile  (required)
        A white-space delimited file of BIT-scores from BLAST.
        Three columns are expected, of the format:

        --------------------------------------
        seq_id_0 seq_id_1 bit_score
        seq_id_2 seq_id_3 bit_score
        ....
        --------------------------------------

        No column heading should be provided. Please refer to
        http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.614
        for an explanation of BIT-score.

  -o, --output
        Write score output to a file. If omitted, the score list is
        printed to stdout.

  -h, --help
        Print this help message.

  --num_residues
        Number of residues in the sequence database. Used to
        calculate SMIN, the lowest expected bit_score. If
        unspecified, an estimate of 537 residues per sequence will be
        used, which is an average of mouse and human SwissProt
        sequences.

  --nc_thresh <0.05>
        NC reporting threshold. Calculated values below this
        threshold will not be reported. Conservatively defaults to
        0.05.

  --smin_factor <0.95>
        SMIN factor. Calculate SMIN from the expected random
        BIT-score scaled by this factor. Defaults to 0.95.

======================================================================
Input and Output
======================================================================

This program accepts as input a list of pairwise sequence
similarities, as BLAST BIT-scores. Please refer to
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.614
for discussion of the BIT-score.

Specifically, this program will read a white-space delimited file, in
a three-column format:

--------------------------------------
seq_id_0 seq_id_1 bit_score
seq_id_2 seq_id_3 bit_score
....
--------------------------------------

No column heading should be provided.

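As a minimal sketch of the three-column input format described above,
the following Python function checks each line before a file is handed
to NC_standalone. The name read_bit_scores is hypothetical and not
part of this package; the check assumes only that every non-empty line
holds two identifiers and a numeric score.

```python
def read_bit_scores(lines):
    """Parse white-space delimited 'seq_id seq_id bit_score' lines.

    Hypothetical helper for illustration; not part of NC_standalone.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 columns, found %d"
                             % (line_number, len(fields)))
        seq_a, seq_b, score_text = fields
        try:
            score = float(score_text)
        except ValueError:
            # A column heading would end up here, since its third
            # field is not a number.
            raise ValueError("line %d: bit_score %r is not a number"
                             % (line_number, score_text))
        pairs.append((seq_a, seq_b, score))
    return pairs
```

Because the function accepts any iterable of lines, an open file
object can be passed to it directly.
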
Output is provided in the same three-column format:

--------------------------------------
seq_id_0 seq_id_1 nc_score
seq_id_2 seq_id_3 nc_score
....
--------------------------------------

Note that the output score file will not have the same pairs of
sequences. BLAST will only output scores for pairs which are
significant. Because Neighborhood Correlation considers the entire set
of BLAST scores related to a particular pair of sequences, a sequence
pair that does not have a significant BLAST score may still have a
reasonably high Neighborhood Correlation score. Conversely, a pair of
sequences with a high BIT-score may not yield a high Neighborhood
Correlation score.

The '--nc_thresh' parameter can be used to report only scores above a
given threshold (default=0.05). Lower thresholds may be used for
greater completeness, producing an output file of more pairs. If
'--nc_thresh' is set to zero, all pairs will be printed, and the
number of lines in the output file will be the square of the number of
sequences. This threshold affects only the printed output and will not
greatly affect runtime; all scores will still be calculated.

======================================================================
Running BLAST
======================================================================

Though comprehensive BLAST usage is beyond the scope of this document,
a few aspects of BLAST execution are critical. In particular,

- Expectation value (blastall -e) should be set to the number 10*N,
  where N is the number of sequences in your dataset. This ensures
  that BLAST returns similarity scores for all alignments.

- Effective search space length (blastall -Y) should be set to R^2,
  where R is the number of residues in your dataset.

Standalone blast executables are available from NCBI at
http://www.ncbi.nlm.nih.gov/BLAST/download.shtml.

NCBI BLAST is able to output much more information in addition to the
scores required for input to Neighborhood Correlation.
If you are in need of robust BLAST parsing, we recommend you consider
that provided by the BioPython project (http://biopython.org).

======================================================================
Example Dataset
======================================================================

The file 'human_mouse_bit_scores.dat' contains all-against-all BLAST
results for all Human and Mouse sequences in SwissProt, version 50.9.
Fragments have been excluded. This is the dataset used for the Song
et al. PLoS Computational Biology paper.

Execution may be accomplished with:

    NC_standalone -f human_mouse_bit_scores.dat
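
The two BLAST settings described under "Running BLAST" amount to
simple arithmetic. The sketch below, using a hypothetical helper name,
computes them for the 26,197 sequences quoted earlier in this README.
The residue count of that dataset is not given in this document, so
the 537 residues-per-sequence estimate mentioned under --num_residues
is used purely as a stand-in; substitute the true count for your own
database.

```python
def blast_settings(num_sequences, num_residues):
    """Return (expectation value, effective search space) for blastall.

    Hypothetical helper for illustration: -e is 10*N and -Y is R^2,
    following the "Running BLAST" section.
    """
    expectation = 10 * num_sequences
    search_space = num_residues ** 2
    return expectation, search_space


n_seqs = 26197              # sequence count quoted in this README
n_residues = 537 * n_seqs   # assumption: stand-in estimate, not a measured count
e_value, y_value = blast_settings(n_seqs, n_residues)
print("-e %d -Y %d" % (e_value, y_value))
```

Only the -e and -Y values are produced here; any other blastall
arguments are left to your own BLAST setup.
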