CNRG: Alphafold | Carl R. Woese Institute for Genomic Biology

• Alphafold is a highly accurate protein structure prediction program.
• More information at https://github.com/deepmind/alphafold/

How to Run

Load the alphafold module. This makes alphafold, singularity, and the alphafold databases available in your session.

module load alphafold/2.3.2

Create a scratch folder. This is important because it sends temporary data to the node's local scratch disk instead of the system's /tmp folder. /tmp has limited space; if it fills up, the node becomes unresponsive and jobs fail.

mkdir /scratch/$SLURM_JOB_ID 
export TMPDIR=/scratch/$SLURM_JOB_ID
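The scratch setup above can also be wrapped so that the folder is removed automatically when the job script exits, even if a command fails partway through. This is only a sketch: setup_scratch and JOB_SCRATCH are hypothetical names, not part of the alphafold module, and the /scratch path is taken from the commands above.

```shell
#!/bin/bash
# Sketch: create a per-job scratch folder and clean it up on exit.
# setup_scratch and JOB_SCRATCH are hypothetical names used for illustration.
setup_scratch() {
    local root="$1" job_id="$2"
    JOB_SCRATCH="${root}/${job_id}"
    # -p avoids an error if the folder already exists.
    mkdir -p "$JOB_SCRATCH" || return 1
    # Point temporary files at the scratch folder instead of /tmp.
    export TMPDIR="$JOB_SCRATCH"
    # Remove the scratch folder when the script exits, whether it succeeded or not.
    trap 'rm -rf -- "$JOB_SCRATCH"' EXIT
}

# Only run inside a Slurm job, where SLURM_JOB_ID is defined.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    setup_scratch /scratch "$SLURM_JOB_ID"
fi
```

With the exit trap in place, a separate rm command at the end of the job script is no longer strictly needed, although leaving it in does no harm.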

Run alphafold with run_singularity.py. This is a wrapper script that makes the alphafold singularity container easier to run.

run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --output-dir example_output --fasta-paths example.fasta

--data-dir should be set to $BIODB, which points to the location of the alphafold databases.
--cpus should be set to $SLURM_NTASKS, a variable equal to the number of processors you reserved.
--use-gpu enables the use of GPUs. Singularity will automatically use as many GPUs as you have reserved.
--output-dir specifies where the output files should go. Change this to a folder in your home folder.
--fasta-paths specifies your input FASTA files. Only one sequence per file is allowed. To run on multiple sequences, put each sequence in its own file and specify multiple files as shown below.

--fasta-paths example.fasta,example2.fasta,example3.fasta
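If your sequences are all in a single multi-sequence FASTA file, something like the following could split it into one file per sequence and build the comma-separated list shown above. This is a rough sketch: split_fasta and fasta_list are hypothetical helpers, and the seq_001.fasta naming scheme is an assumption chosen only so that each file gets a unique basename.

```shell
#!/bin/bash
# Sketch: split a multi-sequence FASTA file into one file per sequence.
# split_fasta and fasta_list are hypothetical helpers, not part of alphafold.
split_fasta() {
    local input="$1" outdir="$2"
    mkdir -p "$outdir" || return 1
    # Start a new output file at every header line (lines beginning with ">").
    awk -v dir="$outdir" '
        /^>/ { if (out) close(out); n++; out = sprintf("%s/seq_%03d.fasta", dir, n) }
        out  { print > out }
    ' "$input"
}

# Print the .fasta files in a folder as a single comma-separated list.
fasta_list() {
    local dir="$1" list="" f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue
        list="${list:+$list,}$f"
    done
    printf '%s\n' "$list"
}
```

For example, after running split_fasta all_sequences.fasta split (both names are placeholders), the list could be passed as --fasta-paths "$(fasta_list split)".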

Example Job Script

#!/bin/bash 
# ----------------SLURM Parameters---------------- 
#SBATCH -n 4 
#SBATCH -N 1 
#SBATCH -p gpu 
#SBATCH --gres=gpu:1 
#SBATCH --mem 70G 

# ----------------Load Modules-------------------- 
module load alphafold/2.3.2 
# ----------------Commands------------------------ 
mkdir /scratch/$SLURM_JOB_ID 
export TMPDIR=/scratch/$SLURM_JOB_ID 

run_singularity.py --data-dir $BIODB --cpus $SLURM_NTASKS --use-gpu --db-preset full_dbs --output-dir output \
--fasta-paths example.fasta

rm -fr /scratch/$SLURM_JOB_ID

Submit Job

Submit the job script to the cluster:

sbatch example.sh

Parameters

These are all the parameters for run_singularity.py. You can also list them by running run_singularity.py --help

  -h, --help            show this help message and exit 
 --fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...] 
                       Paths to FASTA files, each containing one sequence. 
                       All FASTA paths must have a unique basename as the 
                       basename is used to name the output directories for 
                       each prediction. 
 --max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE 
                       Maximum template release date to consider (ISO-8601 
                       format - i.e. YYYY-MM-DD). Important if folding 
                       historical test sets. 
 --db-preset {reduced_dbs,full_dbs} 
                       Choose preset model configuration - no ensembling with 
                       uniref90 + bfd + uniclust30 (full_dbs), or 8 model 
                       ensemblings with uniref90 + bfd + uniclust30 (casp14). 
 --model-preset {monomer,monomer_casp14,monomer_ptm,multimer} 
                       Choose preset model configuration - the monomer model, 
                       the monomer model with extra ensembling, monomer model 
                       with pTM head, or multimer model 
 --num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL 
                       How many predictions (each with a different random 
                       seed) will be generated per model. E.g. if this is 2 
                       and there are 5 models then there will be 10 
                       predictions per input. Note: this FLAG only applies if 
                       model_preset=multimer 
 --benchmark, -b       Run multiple JAX model evaluations to obtain a timing 
                       that excludes the compilation time, which should be 
                       more indicative of the time required for inferencing 
                       many proteins. 
 --use-precomputed-msas 
                       Whether to read MSAs that have been written to disk 
                       instead of running the MSA tools. The MSA files are 
                       looked up in the output directory, so it must stay the 
                       same between multiple runs that are to reuse the MSAs. 
                       WARNING: This will not check if the sequence, database 
                       or configuration have changed. 
 --data-dir DATA_DIR, -d DATA_DIR 
                       Path to directory with supporting data: AlphaFold 
                       parameters and genetic and template databases. Set to 
                       the target of download_all_databases.sh. 
 --docker-image DOCKER_IMAGE 
                       Alphafold docker image. 
 --output-dir OUTPUT_DIR, -o OUTPUT_DIR 
                       Output directory for results. 
 --use-gpu             Enable NVIDIA runtime to run with GPUs. 
 --models-to-relax MODELS_TO_RELAX 
                       Whether to run the final relaxation step on the 
                       predicted models. Turning relax off might result in 
                       predictions with distracting stereochemical violations 
                       but might help in case you are having issues with the 
                       relaxation stage. 
 --enable-gpu-relax    Run relax on GPU if GPU is enabled. 
 --gpu-devices GPU_DEVICES 
                       Comma separated list of devices to pass to 
                       NVIDIA_VISIBLE_DEVICES. 
 --cpus CPUS, -c CPUS  Number of CPUs to use.
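The --fasta-paths help text above notes that every FASTA path must have a unique basename, because the basename names the output directory for each prediction. A quick pre-submission check might look like the sketch below. check_unique_basenames is a hypothetical helper, not something provided by run_singularity.py.

```shell
#!/bin/bash
# Sketch: fail early if two FASTA paths share the same basename.
# check_unique_basenames is a hypothetical helper used for illustration.
check_unique_basenames() {
    local dupes f
    # List each basename, then keep only the ones that occur more than once.
    dupes="$(for f in "$@"; do basename "$f"; done | sort | uniq -d)"
    if [ -n "$dupes" ]; then
        echo "Duplicate FASTA basenames: $dupes" >&2
        return 1
    fi
    return 0
}
```

For example, check_unique_basenames run1/example.fasta run2/example.fasta would report a duplicate, since both files would map to the same output directory name.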

Issues

If you receive an error like the one below, you most likely need to increase the amount of memory you are reserving in your job script.

RuntimeError: HHSearch failed
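To check whether a finished job hit this error, you could search its log file for the message. The sketch below assumes the job's output went to Slurm's default log file, slurm-<jobid>.out, in the directory you submitted from; has_hhsearch_failure is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: detect the HHSearch failure message in a job log.
# has_hhsearch_failure is a hypothetical helper used for illustration.
has_hhsearch_failure() {
    grep -q 'RuntimeError: HHSearch failed' "$1"
}

# Assumes Slurm's default log name; adjust if your script sets its own output file.
log="slurm-${SLURM_JOB_ID:-}.out"
if [ -f "$log" ] && has_hhsearch_failure "$log"; then
    echo "HHSearch failed in $log: consider raising '#SBATCH --mem' and resubmitting."
fi
```

If the message is found, increase the --mem value in your job script and resubmit.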

References

• https://github.com/deepmind/alphafold
• https://github.com/dialvarezs/alphafold
• https://hub.docker.com/r/catgumag/alphafold
