AutoNOE CS-Rosetta

This tutorial shows how to run autoNOE CS-Rosetta calculations as published (references upcoming). Thanks to the setup tools, this tutorial is very similar to Tutorial: Standard CS-Rosetta. The main difference is that we want to start a swarm of runs with different restraint weights, which we do with the specialized setup command setup_autoNOE rather than setup_run.

1. Provided Input Files:

The inputs for this tutorial are found in tutorials/inputs/NeR103A and comprise the following files:

NeR103A_CASD_ali_noesy.list    original peak list for the aliphatic carbon channel
NeR103A_CASD_aro_noesy.list    original peak list for the aromatic carbon channel
NeR103A_CASD_n_noesy.list      original peak list for the nitrogen channel
ner103a.bmrb                   resonance assignments in BMRB format

The fragments and supporting files, which can be generated as shown in the tutorial for fragment picking, are:

ner103.autotrim.tab     backbone chemical shifts in TALOS file format
ner103.fasta            corresponding FASTA sequence for structure calculation
ner103.frags3.dat.gz    fragment library with 3mer fragments
ner103.frags9.dat.gz    fragment library with 9mer fragments

2. Preparation of Input Files:

2.1 Setup of CS-Rosetta target

An initial target record can be generated with the provided files from the tutorial directory.

$ setup_target -target ner103a -method autoNOE -frags ner103.frags3.dat.gz ner103.frags9.dat.gz -fasta ner103.fasta -cs ner103.autotrim.tab 

2.2 Preparation of Resonance Assignments

The .prot file will provide the resonance assignments to the autoNOE module of Rosetta in the correct format. The sequence numbering has to be consistent with the sequence numbering in the fragments, and thus we will trim accordingly.

$ bmrb2prot ner103a.bmrb ner103a.prot
$ renumber_prot -fasta ner103.fasta ner103a.prot ner103a.trim.prot

2.3 Preparation of Peak Lists

Peak lists come in widely different formats, and even when the same general format is used there are often subtle differences in how the data is annotated. To prepare peak files in a consistent format that can be read by autoNOE-Rosetta, one can use the provided clean_peak_file application. The user specifies which columns of the input file are mapped to which columns of the final peak file, and how many header lines are skipped before data is extracted.

To map columns of the input to the output, the user first gives a list of column numbers to read from the input, followed by a list of the same length with pre-determined names such as "N", "H", "id", "I". The element names "H", "C" and "N" can be given in upper case or lower case. Frequency columns with the same case are considered to be directly correlated. For example, "H n h" specifies that the first column is a free proton dimension, whereas the second and third columns correspond to the labelled dimension, i.e., every pair of nitrogen and proton frequencies in the second and third columns stems from an amide proton and its attached nitrogen. The assignment "h N H" is therefore equivalent to "H n h".

The user should know the meaning of the columns in the given input file, but might also be able to figure it out from the distribution of frequencies. In our case the translation command looks as follows:

$ clean_peak_file NeR103A_CASD_ali_noesy.list ali.peaks -cols 1 2 3 4 -skip 1 -names h C H I

That means the first 3 columns specify frequencies: the 2nd and 3rd are the directly correlated carbon and proton (e.g. of a methyl group), whereas the 1st can be any proton. The 4th column specifies the peak intensity.
Another example is:

$ clean_peak_file xeasy.peaks ali.peaks -cols 1 2 3 4 7 -skip 6 -names id C h H I 

where the first column is a peak identifier that is extracted alongside the frequencies and intensities. If no peak identifier (id) is extracted, peaks are numbered consecutively starting at 1.
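
When the column layout of a peak list is not documented, the value range of each column usually gives it away: proton shifts lie roughly between 0 and 12 ppm, aliphatic carbons roughly between 10 and 75 ppm, and aromatic carbons and amide nitrogens above 100 ppm. The following sketch prints the minimum and maximum of every numeric column of a whitespace-separated file. The function column_ranges and its arguments are made up for this illustration and are not part of the CS-Rosetta toolbox.

```shell
# column_ranges FILE [SKIP]
# Print "column min max" for every numeric column of a whitespace-separated
# peak list, ignoring the first SKIP header lines (default 0).
# Hypothetical helper for illustration; not a CS-Rosetta tool.
column_ranges() {
  awk -v skip="${2:-0}" '
    NR <= skip { next }
    {
      for (i = 1; i <= NF; i++) {
        # only consider fields that look like plain numbers
        if ($i ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) {
          v = $i + 0
          if (!(i in seen) || v < min[i]) min[i] = v
          if (!(i in seen) || v > max[i]) max[i] = v
          seen[i] = 1
          if (i > ncol) ncol = i
        }
      }
    }
    END {
      for (i = 1; i <= ncol; i++)
        if (i in seen) printf "%d %g %g\n", i, min[i], max[i]
    }
  ' "$1"
}
```

For the aliphatic list of this tutorial one could call column_ranges NeR103A_CASD_ali_noesy.list 1. A column spanning roughly 10 to 75 would then be the carbon dimension, and a column with very large values the intensity.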

2.4 Adding Peak-List and Resonance Assignments to the CS-Rosetta target

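The curated peak lists (ali.peaks, plus aro.peaks and n.peaks prepared in the same way from the aromatic and nitrogen lists) and the resonance assignments can now be added to the CS-Rosetta target that has been created in step 2.1: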

$ setup_target -target ner103a -method autoNOE -peaks ali.peaks aro.peaks n.peaks -shifts ner103a.trim.prot

In some cases one wants to specify a different list of resonance assignments for each list of peaks. In this case multiple entries can be given to the -shifts parameter of the setup command, and each entry will be mapped to the corresponding entry of the -peaks parameter list. Note that matching is based on list position alone and not on filenames. This requires that the parameter lists of -peaks and -shifts have matching lengths:

$ setup_target -target ner103a -method autoNOE -peaks ali.peaks aro.peaks n.peaks -shifts ner103a.trim.ali.prot ner103a.trim.aro.prot ner103a.trim.n.prot
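
Because the pairing is purely positional, it can help to print it before calling setup_target. The sketch below is a hypothetical helper (show_pairing is not a CS-Rosetta command) that takes the two lists as space-separated strings, rejects mismatched lengths, and prints which shift file would go with which peak file. It assumes, based on the single-file command above, that one -shifts entry applies to all peak lists.

```shell
# show_pairing "PEAK FILES" "SHIFT FILES"
# Print the positional pairing of peak lists and shift lists.
# Fails if the shift list has neither one entry nor as many entries
# as the peak list. Hypothetical helper; filenames must not contain spaces.
show_pairing() {
  peaks=$1
  shifts=$2
  set -- $peaks;  n_peaks=$#
  set -- $shifts; n_shifts=$#
  if [ "$n_shifts" -ne 1 ] && [ "$n_shifts" -ne "$n_peaks" ]; then
    echo "error: $n_peaks peak lists but $n_shifts shift lists" >&2
    return 1
  fi
  i=0
  for p in $peaks; do
    set -- $shifts
    # with several shift lists, drop the first i entries so that $1 is entry i+1
    if [ "$n_shifts" -gt 1 ]; then shift "$i"; fi
    echo "$p <- $1"
    i=$((i + 1))
  done
}
```

Calling show_pairing "ali.peaks aro.peaks n.peaks" "ner103a.trim.ali.prot ner103a.trim.aro.prot ner103a.trim.n.prot" lists the three pairs in order, which makes an accidentally swapped entry easy to spot.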

3. Starting an autoNOE-Rosetta calculation

3.1 Setting up the run directories

For AutoNOE-Rosetta the optimal restraint weight depends on various factors, such as the quality of the input data, and the compatibility of the generated distance restraints with low-scoring protein structures in Rosetta. In our research we found that the optimal restraint weight is best determined by running autoNOE-Rosetta with multiple weights and then selecting the optimal run afterwards. To this end, we generate a swarm of runs with

$ setup_autoNOE -target ner103a -method autoNOE -dir . -job slurm 

This will generate runs for the restraint weights 5, 10, 25 and 50. If the peak list has been picked manually, these settings are usually sufficient. For automatically picked peak lists we recommend also running the additional restraint weights 1 and 2.
This is done by specifying the list of restraint weights explicitly in the setup command:

$ setup_autoNOE -target ner103a -method autoNOE -dir . -job slurm -cst 1 2 5 10 25 50

With the default weights, the command will generate the following directory structure:

ner103a
    cst_05
    cst_10
    cst_25
    cst_50

Each subdirectory contains the directories already known from RASREC or plain CS-Rosetta runs:

ner103a
    cst_05
        inputs
        run
        test
        setup_command.sh
    cst_10
        ...

3.2 Running the pre-assignment step

Before starting the actual multi-processor simulation we have to run a script that has been generated in the run directory. This script runs the automatic NOE assignment module of Rosetta once, which has the benefit that we can check that NOE assignment actually works and that no trivial problem with the input files remains before we run the multi-processor version. The output of this pre-assignment step is required for the multi-processor simulation, so this step is mandatory. We want to run the script for every restraint weight, so we use a for-loop (bash):

cd ner103a
for cst in $( ls -d cst*); do 
  cd $cst/run 
  . initialize_assignments_phaseI.sh
  cd ../..
done

3.3 Starting the AutoNOE-RASREC simulation

If the pre-assignment step from Section 3.2 was successful, we can proceed and start the actual simulations on our cluster using the provided queuing system.
The following commands, for instance, would start all simulations using 192 cores each on our cluster, which uses the SLURM queuing system.

for cst in $( ls -d cst*); do 
  cd $cst/run 
  sbatch -n 192 production.slurm.job
  cd ../..
done

If the queuing system is MOAB or similar (using msub to queue jobs), the equivalent loop looks as follows on a cluster whose nodes have either 8 cores with hyper-threading or 16 cores. Note that -job moab must be used in the setup_autoNOE command in this case.

for cst in $( ls -d cst*); do 
  cd $cst/run 
  msub -l nodes=12:ppn=16 -n 200 production.slurm.job
  cd ../..
done

The runs are finished when the directory fullatom_pool_stage8 has appeared in every cst_xx/run directory.
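
This completion check can be scripted. The following is a minimal sketch, assuming the directory layout shown in Section 3.1; the function name unfinished_runs is made up for this illustration and is not part of the toolbox.

```shell
# unfinished_runs [TOP_DIR]
# Print every cst_*/run directory below TOP_DIR (default: current directory)
# that does not yet contain fullatom_pool_stage8.
# Prints nothing once all runs are finished. Hypothetical helper.
unfinished_runs() {
  top=${1:-.}
  for run in "$top"/cst_*/run; do
    [ -d "$run" ] || continue                           # skip if nothing matched
    [ -d "$run/fullatom_pool_stage8" ] || echo "$run"   # final pool still missing
  done
}
```

Called as unfinished_runs ner103a, it lists the runs that are still in progress, so empty output means the analysis can be started.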

Analysis of AutoNOE-Rosetta runs

To analyse the runs and to determine which restraint weight to use, we run the provided application autoNOE_select_final_run as follows:

cd ner103a # the top-level directory with all the cst_xx directories as direct sub-directories. 
autoNOE_select_final_run [-ensemble final.pdb]

The optional argument [-ensemble final.pdb] requests that the best 10 models be written to a multi-model PDB file and specifies its name.
The command runs all necessary analyses and prints a summary such as the following:

------------------------------------- all runs -----------------------------------------
                              cst_25/run cst= 25.0 score=  -115.1 prec=  1.3 fraction_converged=1.00 target_fct= 57.8
                              cst_10/run cst= 10.0 score=  -123.8 prec=  1.4 fraction_converged=0.99 target_fct=137.1
                              cst_50/run cst= 50.0 score=  -105.4 prec=  1.8 fraction_converged=0.99 target_fct= 27.8
                              cst_05/run cst=  5.0 score=  -123.3 prec=  2.4 fraction_converged=0.93 target_fct=217.7
--------------------------- selected by weighted formula -------------------------------
                              cst_25/run cst= 25.0 score=  -115.1 prec=  1.3 fraction_converged=1.00 target_fct= 57.8

In this case the run with a restraint weight of 25 was most successful and is selected. One should also check the "fraction_converged" and "target_fct" entries of the final run. The fraction of converged residues should be higher than 0.90 (90%) and the target_fct should be lower than 500. If this is not the case, the results of the autoNOE-Rosetta structure calculation cannot be trusted. These are empirical values that we found from a benchmark of 50 proteins.
If you find that your calculations pass these tests but converged to the wrong structure, please contact us, as such cases are most informative for the future development of the software.
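
The two acceptance criteria can be applied mechanically to the summary. The sketch below reads result lines in the format shown above from standard input and labels each run; check_runs is a hypothetical name, and the parsing assumes the key=value layout of the example output, including the optional space after the equals sign.

```shell
# check_runs
# Read autoNOE_select_final_run summary lines from stdin and print
# "<run> ok" if fraction_converged > 0.90 and target_fct < 500,
# otherwise "<run> suspicious". Lines without both entries are ignored.
# Hypothetical helper; thresholds are the empirical values from the text.
check_runs() {
  awk '
    {
      frac = ""; fct = ""
      for (i = 1; i <= NF; i++) {
        n = index($i, "=")
        if (n == 0) continue
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        if (val == "" && i < NF) val = $(i + 1)   # value separated by a space
        if (key == "fraction_converged") frac = val
        if (key == "target_fct") fct = val
      }
      if (frac == "" || fct == "") next           # separator or other line
      verdict = (frac + 0 > 0.90 && fct + 0 < 500) ? "ok" : "suspicious"
      print $1, verdict
    }
  '
}
```

Assuming the summary is written to standard output, it could be piped directly, as in autoNOE_select_final_run | check_runs. The selected run then appears twice, once from each part of the summary.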

Adding more runs

It is possible to repeat runs at individual restraint weights, which can help to get better results in cases where the precision is not optimal.
To this end, go into the run-directory that you want to redo and use the command copy_flag_files.

cd ner103a/cst_25/run
copy_flag_files.sh ../run_2
cd ../run_2
. initialize_assignments_phaseI.sh
sbatch -n 192 production.slurm.job

...and wait.
Running the analysis step after completion of the additional runs (autoNOE_select_final_run) will now also consider the additional run.