Developer guide

Developer guide

Introduction

GenomeDepot is based on the Django framework. Django architecture follows the Model-View-Template pattern, where Model handles the data representation, Template handles the presentation logic and View handles the business logic that takes up information from the user and displays model data to the user. GenomeDepot communicates with the web server over the Web Server Gateway Interface (WSGI), so Django views in GenomeDepot are synchronous in nature. However, there are several pages in GenomeDepot that employ asynchronous Javascript: BLAST search, database search in genes and annotations, comparative analyses, data export. In addition, long-running tasks started from the GenomeDepot administration panel, such as genome import or re-building BLAST databases, are executed asynchronously by the Django Q worker. Data models are defined in genomebrowser/browser/models.py.

Where GenomeDepot stores sequence data

In GenomeDepot, only protein sequences are stored in the MySQL database. Nucleotide sequences are always stored in static files. STATIC_ROOT variable in the .env file defines the directory for static files. This directory has the “genomes” subdirectory for genome files. The genomes/gbff directory contains genomes in the Genbank format and the genomes/jbrowse directory contains files served by the embedded genome viewer (indexed FASTA file, indexed GFF3 files etc.) for each genome. The web server must have access to static files. BLAST databases are stored in the appdata subdirectory of the project directory.

Genome import pipeline

GenomeDepot imports genomes in batches. Since each run of the genome import pipeline re-generates BLAST databases and runs eggNOG-mapper, single genome import is impractical. A reasonable size of a batch for import is between 50 and 1000 genomes.

The genome import pipeline runs several external tools generating ortholog family mappings, predicted operons and gene functional assignments. The first of them is eggNOG-mapper for fast functional annotation and orthology predictions. It processes all proteins from the imported genomes in chunks containing 200,000 sequences. GenomeDepot relies on eggNOG-mapper orthology predictions in comparative analyses and functional classifications. After the import of eggNOG-mapper output into database, the pipeline runs POEM for operon predictions, for one genome at a time. After importing operon predictions, the pipeline generates static files for the embedded Jbrowse browser and BLAST databases. At this moment, imported genomes become visible to web portal visitors. And finally, the genome import pipeline starts annotation pipeline, which runs an array of annotation tools.

Annotation pipeline

The GenomeDepot annotation pipeline is an extensible toolkit of specialized annotation tools. Each tool in the pipeline can be configured and turned on or off in the administration panel. The annotation pipeline can be started as a part of the genome import process or independently for genomes that have been imported into GenomeDepot. The annotation pipeline runs gene annotation tools one by one for selected genomes, one genome at a time.

Architecture of annotation pipeline plug-ins

The annotation pipeline of GenomeDepot can be expanded with additional annotation tools. The annotation tools accept nucleotide or protein sequences as an input and generate functional annotations for individual genes. GenomeDepot interacts with annotation tools through plug-in Python modules that organize input files, execute external tools, process tool-specific outputs and generate tab-separated files imported by the GenomeDepot pipeline into the database.

A plug-in module must contain the application function that accepts two arguments. The first argument is an object of the Annotator class, and the second is a dictionary of genomes, with genome name as a key and a path to the GenBank file as a value. The application function returns full path of tab-separated text file with gene annotations.

Typically, a plug-in module implements three functions: preprocess, run and postprocess. The preprocess function creates a working directory in the GenomeDepot temporary directory and writes all input files into the working directory. The preprocess function also generates bash script that activates a Conda environment for the annotation tool, executes the tool and deactivates the Conda environment. The run function executes the bash script created by preprocess. The postprocess function reads output file(s) generated by the annotation tool and creates an output file for import into GenomeDepot in the temporary directory. Finally, the postprocess function deletes the working directory.

Example:

from subprocess import Popen, PIPE, CalledProcessError
def application(genomes, annotator):
    wrapper_script, working_dir = preprocess(genomes, annotator)
    run(wrapper_script)
    output_filename = postprocess(working_dir)
    return output_filename 
def preprocess(genomes, annotator):
    # create working directory
    # create input file
    # create wrapper bash script for a tool
    return wrapper_script, working_dir
def run(wrapper_script):
    with Popen([‘bash’, wrapper_script], stdout=PIPE, bufsize=1, universal_newlines=True) as proc:
        for line in proc.stdout:
            print(line, end='')
    if proc.returncode != 0:
        raise CalledProcessError(proc.returncode, proc.args)
def postprocess(working_dir)
    # read tool output from working_dir
    # write annotations into output_file in temporary directory
    # delete working directory
    return output_file

Plug-in file name and location

Plug-in module files have to be placed into the genomebrowser/browser/pipeline/plugins directory, and file name should start with “genomedepot-“. Fo example, if a tool name is mytool, the plug-in file name is genomebrowser/browser/pipeline/plugins/genomedepot-mytool.py.

Annotation tool installation

Genome annotation tools for the GenomeDepot annotation pipeline have to be installed in separate conda environments. A name for a tool must start with “genomedepot-” (for example, genomedepot-mytool) to avoid conflicts with existing conda environments.

If there is no conda package for an annotation tool, the tools should be installed in a subdirectory of genomedepot/external_tools (for example, genomedepot/external_tools/mytool). Reference data files should be copied into a subdirectory of genomedepot/external_refdata (for example, genomedepot/external_refdata/mytool). Paths to executable files and reference data of the tool can be stored in GenomeDepot configuration parameters.

Plug-in configuration

Plug-in configuration parameters must include a name matching the name of the plug-in file. All plug-in parameters start with “plugins.” follwed by the tool name. For example, for the genomedepot-mytool.py module, configuration parameters start with “plugins.mytool.”.

Annotation tool plug-ins can be enabled or disabled in the GenomeDepot administration portal. To enable a plug-in, add or edit the **“plugins..enable”** parameter in the Configuration page (admin/browser/config/) and set its value to 1. To disable a plug-in, set the value to 0.

Other common plug-in configuration parameters are:

display_name: the plug-in name displayed in the administration portal pages (for example, parameter plugins.mytool.display_name defines the name of mytool in the “Run annotation tools” page)
conda_env: the name of Conda environment where the tool is installed. Usually, conda environments are called genomedepot- (for example, "genomedepot-mytool") to avoid conflicts with existing conda environments.
threads: number of threads for execution of the tool

Additional configuration parameters may be used to store location of reference data files for the tool, threshold values etc.

The annotation pipeline calls the application function of a plug-in module passing two arguments. The second argument is an object of the Annotator class called annotator, which has the annotator.config dictionary with all configuration parameters.

For example, annotator.config[‘plugins.amrfinder.threads’] stores a number of threads available for AMRFinderPlus (as a string).

Writing input files with nucleotide and protein sequences for annotation tools

Many annotation tools accept input sequence files in one of standard file formats. GenomeDepot has utility functions for export genome and protein sequences in FASTA and GenBank formats in the browser.pipeline.util module:

export_proteins: exports proteins from one or more genomes into a single FASTA file. Arguments: list of protein identifiers, output file name
export_proteins_bygenome: exports proteins from one or more genomes into FASTA files, one file per genome. Arguments: dictionary of genome names and GenBank file paths, output directory
export_nucl_bygenome: exports genome sequences into FASTA files. Arguments: dictionary of genome names and GenBank file paths, output directory

Example:

from pathlib import Path
from browser.models import Genome
from browser.pipeline.util import export_proteins, export_nucl_bygenome

# Exports all proteins from all genomes into the proteins.faa file
output_fasta_file = 'proteins.faa'
genome_ids = Genome.objects.values_list('id', flat=True)
export_proteins(genome_ids, output_fasta_file)

# Exports nucleotide sequences of all genomes into the /tmp/fna directory
output_dir = '/tmp/fna'
Path(output_dir).mkdir(exist_ok=True)
genomes = Genome.objects.values_list('name', 'gbk_filepath')
genome_data = {x[0]:x[1] for x in genomes}
export_nucl_bygenome(genome_data, output_dir)

Temporary files

A temporary directory is stored in configuration parameter core.temp_dir. All temporary files should be kept in the temporary directory. It is highly recommended to create a working directory for each run of an annotation tool inside the temporary directory, and remove the working directory when the output files are no longer needed.

Example:

from pathlib import Path

def application(genomes, annotator):
	preprocess(genomes, annotator)
	
def preprocess(genomes, annotator):
	tool_name = 'mytool'
	temp_dir = annotator.config['core.temp_dir']
	working_dir = os.path.join(temp_dir, tool_name)
	Path(working_dir).mkdir(parents=True,exist_ok=True)

Plug-in output file

The application function of a plug-in module must return an absolute path for a tab-separated text file with gene annotations. The file must contain seven columns:

Gene locus tag (as in the database).
Genome name (as in the database).
Annotation source (name of a tool producing the annotation, a database, etc. Up to 30 symbols long).
Annotation URL (link to external resource. Up to 300 symbols long).
Annotation key (Short category name, like “Protein family” or “Domain”. Up to 30 symbols long).
Annotation value (For example, identifier of a protein family or domain. Up to 50 symbols long).
Annotation note (free-text description).

GenomeDepot annotation pipeline imports gene annotations into the database and links them to genes with locus tag and genome name from the first and second columns. Lines starting with # are ignored.

Developer guide