FilTar

Using RNA-Seq data to improve microRNA target prediction accuracy in animals

Usage

Preliminary Information

No matter how the user wishes to use FilTar, it is always invoked using the same format:

snakemake {name_of_target_file}

FilTar is run in this way because it does not exist as a standard binary executable, or as a script to be read by an interpreter, but rather as a collection of snakefiles, scripts, text files and directory structures which relate to each other to achieve a common purpose, i.e. the repository/directory itself is the tool.

As a result of this emphasis on directory structure, the user must ensure that they are in the root directory of the repository when invoking snakemake.

Most parameters can be set by passing the --config option to snakemake and giving individual parameter values. Alternatively, the workflow can be configured by editing the config/basic.yaml file. See the 'configurations' page for more detail on this.
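As a rough illustration of the --config route, the following Python sketch assembles a snakemake command line from a target file name and a dictionary of overrides. The helper name build_snakemake_command is hypothetical, and the key 'reannotation' and its value 'False' are assumptions made for illustration; the 'configurations' page is the authority on which keys and values FilTar accepts.

```python
import shlex


def build_snakemake_command(target, config=None):
    """Return the argument list for `snakemake {target} --config KEY=VALUE ...`."""
    argv = ["snakemake", target]
    if config:
        # --config takes one or more KEY=VALUE pairs. It is placed after the
        # target so that the target is not read as another config entry.
        argv.append("--config")
        argv.extend(f"{key}={value}" for key, value in config.items())
    return argv


# 'reannotation' and 'False' are illustrative assumptions, not confirmed names.
command = build_snakemake_command("target_predictions.txt", {"reannotation": "False"})
print(shlex.join(command))
```

The printed command would still need to be run from the root of the FilTar repository.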

Record Metadata

Before the workflow is executed, the user should specify which datasets are available for potential use by FilTar.

This is done by adding entries to the file 'metadata.tsv', in which each record represents a single sequencing run dataset. The fields of this table are the species of the sequencing dataset, the biological context from which the RNA was sampled, the sample accession, the run accession, and a column denoting whether the dataset contains single-end or paired-end sequencing data.

When specifying the species, a three-letter code must be used, made up of the first letter of the genus name followed by the first two letters of the species name (e.g. 'hsa' for Homo sapiens). All letters must be in lowercase.
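The naming rule above can be sketched in a few lines of Python. The function name species_code is hypothetical and is not part of FilTar; it simply applies the stated rule to a binomial name.

```python
def species_code(binomial_name):
    """Build a three-letter code: first letter of the genus plus the
    first two letters of the species name, all in lowercase."""
    parts = binomial_name.split()
    if len(parts) < 2:
        raise ValueError("expected a name of the form 'Genus species'")
    genus, species = parts[0], parts[1]
    return (genus[0] + species[:2]).lower()


print(species_code("Homo sapiens"))
print(species_code("Mus musculus"))
```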

The biological context is defined at the user-level, but generally refers to any biological condition/state of interest to the researchers using the tool - e.g. 'healthy kidney', 'lung primary tumour' etc.

The sample accession and the run accession refer to the metadata attributes (of the same name) of these datasets within the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA). Briefly, the 'sample accession' refers to the set of all metadata attributes for a given set of biological samples. Using this nomenclature, a set of biological replicate samples would be registered using the same sample accession. In contrast, the 'run accession' refers to a single sequencing dataset produced from a single sequencing run, and is linked with a single fastq file or a single pair of fastq files (for paired-end sequencing). Please refer to the SRA and ENA documentation for further information relating to these database accession types.
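To make the relationship between sample and run accessions concrete, here is a minimal Python sketch that groups run accessions by species, context and sample accession. The column names, the layout values and the accession strings (SAMPLE_A, RUN_001 and so on) are all placeholders invented for illustration; they are not the real header of metadata.tsv or real SRA/ENA identifiers.

```python
import csv
import io
from collections import defaultdict

# Placeholder table: column names and values are assumptions for illustration.
EXAMPLE_TSV = (
    "species\tcontext\tsample_accession\trun_accession\tlayout\n"
    "hsa\thealthy kidney\tSAMPLE_A\tRUN_001\tpaired\n"
    "hsa\thealthy kidney\tSAMPLE_A\tRUN_002\tpaired\n"
    "mmu\tlung primary tumour\tSAMPLE_B\tRUN_003\tsingle\n"
)


def runs_by_sample(tsv_text):
    """Map (species, context, sample accession) to its list of run accessions."""
    grouped = defaultdict(list)
    for record in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (record["species"], record["context"], record["sample_accession"])
        grouped[key].append(record["run_accession"])
    return dict(grouped)


for key, runs in runs_by_sample(EXAMPLE_TSV).items():
    print(key, runs)
```

In this placeholder table, the two runs sharing SAMPLE_A are grouped together, mirroring how replicates registered under one sample accession each contribute their own run accession.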

Standard Usage

Standard and default usage refers to usage of the tool in which the user reannotates 3'UTRs, filters mRNA targets by expression level, and uses TargetScan7:

snakemake target_predictions.txt

Users can also optionally derive the genomic co-ordinates of predicted miRNA targets by running the following command:

snakemake {species}_target_predictions_with_gene_coords.txt

However, users must ensure that the reannotation option is set to false when running this step. This step can also be used with target prediction data generated using the FilTarDB web application. In that case, users must ensure that the transcript name, miRNA name, target site 'start' and target site 'end' columns are named 'Gene ID', 'Mirbase ID', 'UTR start' and 'UTR end', respectively.
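The column renaming described above could be done by hand, or with a short script along the following lines. This is only a sketch: it assumes a tab-delimited table, and the original column names on the left of the RENAMES mapping ('transcript', 'mirna', 'start', 'end') are hypothetical stand-ins for whatever names the downloaded file actually uses. Only the four names on the right come from the requirement stated above.

```python
import csv
import io

# Left-hand names are hypothetical; right-hand names are the required ones.
RENAMES = {
    "transcript": "Gene ID",
    "mirna": "Mirbase ID",
    "start": "UTR start",
    "end": "UTR end",
}


def rename_header(tsv_text, renames):
    """Return the table with header names replaced according to `renames`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""
    # Columns without an entry in `renames` keep their original name.
    header = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


table = "transcript\tmirna\tstart\tend\tscore\nT1\tM1\t10\t17\t-0.2\n"
print(rename_header(table, RENAMES))
```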

Intermediate files

Most intermediate files generated whilst FilTar is running are automatically deleted once the process has completed. This prevents FilTar from reusing stale intermediate files and therefore returning identical output even when the user has changed their configuration options.

Exceptions to this general rule are files which will not change as a result of different user input, such as raw genomic sequence and index files. Alignment and pseudoalignment intermediate files are also not deleted, as these only need to be updated if the user changes the number of raw sequencing files for a given context.

One slight problem with this approach is that it leads to some inefficiency, as most intermediate files have to be regenerated for every individual run of FilTar. Advanced users with a good knowledge of both FilTar and the Snakemake job scheduling system can choose to edit the FilTar source files to remove Snakemake temp() tags where they think it is appropriate to do so. However, caution is advised with this approach.

Modulating snakemake behaviour

Many options can be passed to the snakemake command in order to modulate its behaviour; the official Snakemake documentation should be consulted as an exhaustive reference.

It is particularly important to note that Snakemake has its own built-in job scheduling system to manage the execution of different rules. Many rules can be executed in parallel using the --cores {num_cores} option. Combining this option with execution in high-performance computing environments enables rules to be executed across many different cores.

Warnings

  1. Storage space: The script generating the context++ scores depends on RNAplfold, which itself generates ~100KB of data per transcript per species per tissue. The option to use RNAplfold with FilTar is therefore disabled unless this dependency is already present in the user's environment; this protects the user from unintentionally consuming their own computational resources. Users who are aware of the risks can enable automated use of RNAplfold through conda by uncommenting the relevant conda directive for the rule generating context++ scores in the modules/target_prediction/targetscan/Snakefile file. Note, however, that leaving RNAplfold disabled results in less information being used to score target predictions with TargetScan.
  2. Cleaning output directories: Take care to delete old output files when re-running the same analysis (e.g. same species and tissue) with different 3'UTR annotations, as TargetScan7 will otherwise automatically reuse RNAplfold output generated by a previous analysis. Manually deleting the RNAplfold output protects against this issue.
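To put the storage figure from the first warning into perspective, the following back-of-the-envelope Python sketch multiplies it out. The function name is hypothetical, and the transcript and tissue counts in the example call are illustrative numbers only, not measurements from FilTar; the ~100KB per transcript figure is the approximate value quoted above.

```python
KB_PER_TRANSCRIPT = 100  # approximate figure quoted in the storage warning


def rnaplfold_storage_gb(n_transcripts, n_species=1, n_tissues=1,
                         kb_per_transcript=KB_PER_TRANSCRIPT):
    """Estimate RNAplfold output size in decimal gigabytes (1 GB = 10^6 KB)."""
    total_kb = n_transcripts * n_species * n_tissues * kb_per_transcript
    return total_kb / 1_000_000


# Illustrative inputs: 20,000 transcripts, one species, ten tissues.
print(rnaplfold_storage_gb(20_000, n_species=1, n_tissues=10), "GB")
```

Under these illustrative inputs the estimate comes to about 20 GB, which shows how quickly the output grows as tissues or species are added.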