FilTar

Using RNA-Seq data to improve microRNA target prediction accuracy in animals

Background & Motivation

The primary motivation behind FilTar is to increase the specificity of existing computational miRNA target prediction methods in animals by integrating context-specific (i.e. tissue or cell-line) expression information derived from RNA-Sequencing experiments. More specifically, this information is used for the following:

  1. Filtering of putative miRNA target transcripts according to expression of those transcripts
  2. Context-specific reannotation of the 3'UTRs (three prime untranslated regions) of mRNA transcripts

By doing this, we ensure that a) predicted target transcripts are actually expressed within a given biological context, and b) target predictions are made using an accurate model of the 3'UTR within that biological context.
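The expression filtering step can be illustrated with a minimal Python sketch. This is not FilTar's own code: the function name, the data layout, and the threshold of 0.1 transcripts per million (TPM) are assumptions chosen only for illustration.

```python
# Hypothetical expression threshold in transcripts per million (TPM).
# This value is an illustrative assumption, not necessarily what FilTar uses.
TPM_THRESHOLD = 0.1


def filter_targets_by_expression(predicted_targets, expression, threshold=TPM_THRESHOLD):
    """Keep only predicted targets whose transcript meets the expression threshold.

    predicted_targets: list of (miRNA, transcript) pairs from a prediction tool.
    expression: dict mapping transcript IDs to TPM values in the context of interest.
    Transcripts missing from `expression` are treated as not expressed.
    """
    return [
        (mirna, transcript)
        for mirna, transcript in predicted_targets
        if expression.get(transcript, 0.0) >= threshold
    ]


# Toy data: tx2 is not expressed and tx3 was not quantified at all.
predicted = [("miR-A", "tx1"), ("miR-A", "tx2"), ("miR-B", "tx3")]
tpm = {"tx1": 12.5, "tx2": 0.0}

kept = filter_targets_by_expression(predicted, tpm)
print(kept)
```

Only the interaction involving `tx1` survives the filter, because the other two transcripts are not detectably expressed in this toy context.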

Use of existing animal miRNA target prediction methods

As mentioned previously, FilTar does not implement target prediction 'from scratch'. Instead, it takes the output of existing target prediction methods and adds context-specific information to increase the accuracy of those predictions. It is therefore worth providing a brief summary of the target prediction methods used:

Targetscan7

DOI: 10.7554/eLife.05005

Targetscan7 is the latest iteration of the Targetscan family of algorithms, all of which require a seed match between a miRNA and a given subsequence of the target transcript. A seed match exists when there is full complementarity between the miRNA seed region (i.e. nucleotides 2-7 of the miRNA) and the putative target. Seed matching is the first step of the Targetscan algorithm. In the next steps, the conservation of the 3'UTRs is calculated across 84 vertebrate species (including the reference species). This, together with the output of the seed matching step, is used to calculate the probability of conserved targeting (PCT) of each candidate interaction. The PCT is one of 14 features of a linear regression model, trained on miRNA transfection data, which computes the context++ score. This score is predictive of the efficacy of each candidate interaction: more negative context++ scores indicate greater predicted downregulation of targets.
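The seed matching step described above can be sketched in a few lines of Python. This is a simplified illustration rather than Targetscan's implementation: it only looks for perfect complementarity to nucleotides 2-7 and ignores the site-type classification, conservation, and scoring steps. The function names and the toy 3'UTR are hypothetical, and the let-7a sequence is used purely as an example miRNA.

```python
# RNA complement table for reverse-complementing the seed.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=7):
    """Return the target-site sequence (5'->3') complementary to the miRNA seed.

    `start` and `end` are 1-based, inclusive miRNA positions.
    """
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(mirna, utr):
    """Return 0-based start positions of all perfect seed matches in the 3'UTR."""
    site = seed_site(mirna)
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        # Search again from the next base so overlapping sites are not missed.
        pos = utr.find(site, pos + 1)
    return positions


let7a = "UGAGGUAGUAGGUUGUAUAGUU"   # example miRNA
utr = "AAACUACCUCAGGGUACCUCUU"     # hypothetical 3'UTR fragment

print(seed_site(let7a))            # UACCUC
print(find_seed_matches(let7a, utr))  # [4, 14]
```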

miRanda (v3.3a)

DOI: 10.1186/gb-2003-4-11-p8

The miRanda algorithm shares some similarities with the Targetscan7 algorithm: although it considers pairing between the 3' end of the miRNA and the target transcript, it prioritises pairing at the 5' end of the miRNA. The -strict option can be specified to require seed pairing in all reported interactions. miRanda also models the influence of G:U wobbles as well as the thermodynamic stability of reported interactions.

As miRanda scores the entire alignment between the miRNA and the mRNA, and does not necessarily require seed binding, it may be the algorithm of choice for users interested in noncanonical interactions between the miRNA and the putative target.
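To make the notion of G:U wobbles concrete, the following Python sketch labels each position of an ungapped miRNA:target duplex as a Watson-Crick pair, a G:U wobble, or a mismatch. It is not miRanda's algorithm: miRanda performs a scored, gapped alignment and a thermodynamic calculation, neither of which is reproduced here. The function name and the example site are hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pairs(mirna, site):
    """Label each miRNA base by how it pairs with the opposing target base.

    Both sequences are written 5'->3'. Pairing is antiparallel, so miRNA
    position 1 faces the last base of `site`. No gaps are allowed.
    """
    if len(mirna) != len(site):
        raise ValueError("sequences must be the same length for ungapped pairing")
    labels = []
    for m, t in zip(mirna, reversed(site)):
        if (m, t) in WATSON_CRICK:
            labels.append("WC")
        elif (m, t) in WOBBLE:
            labels.append("GU")
        else:
            labels.append("mismatch")
    return labels


# First eight miRNA bases against a hypothetical site with a G opposite position 6.
labels = classify_pairs("UGAGGUAG", "CUGCCUCA")
print(labels)
```

In this toy duplex, position 6 of the miRNA forms a G:U wobble, which interrupts the seed. A strict seed-matching approach would reject such a site, whereas an alignment-based approach can still report it.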

Implementation as a piece of software

FilTar is implemented using the Snakemake workflow management tool.

DOI: 10.1093/bioinformatics/bty350

The basic philosophy behind Snakemake is to model data processing workflows as a build process, in the same style that makefiles are used to compile software. In this paradigm, a workflow consists of a series of short recipes, or rules, each of which specifies the input files ('ingredients') needed to build a given target or output file. The output of one rule can be specified as the input of another rule, so a workflow is most easily modelled as a directed acyclic graph (DAG) terminating in the production of a given target file. When a target is specified, Snakemake recursively traces back through the DAG to trigger the rules necessary to generate that target file.
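The recursive tracing of the DAG can be illustrated with a toy resolver in plain Python. This is a conceptual sketch and not Snakemake's implementation: the rule table, file names, and `resolve` function are all hypothetical, and there is no cycle detection, wildcard handling, or timestamp checking.

```python
# Hypothetical rules: each output file maps to the input files needed to build it.
RULES = {
    "targets.tsv": ["utrs.fa", "mirnas.fa"],
    "utrs.fa": ["reannotated.bed", "genome.fa"],
    "reannotated.bed": ["alignments.bam"],
}


def resolve(target, rules, existing, order=None):
    """Return the outputs that must be built, dependencies first, to produce `target`.

    Files in `existing` are already present and are not rebuilt.
    """
    if order is None:
        order = []
    if target in existing or target in order:
        return order
    if target not in rules:
        raise FileNotFoundError(f"no rule produces missing input {target!r}")
    for dependency in rules[target]:
        resolve(dependency, rules, existing, order)
    order.append(target)
    return order


existing = {"alignments.bam", "genome.fa", "mirnas.fa"}

# Requesting the final target triggers every upstream rule.
full = resolve("targets.tsv", RULES, existing)
print(full)     # ['reannotated.bed', 'utrs.fa', 'targets.tsv']

# Requesting an intermediate target stops there; no downstream work is scheduled.
partial = resolve("utrs.fa", RULES, existing)
print(partial)  # ['reannotated.bed', 'utrs.fa']
```

Adding a previously built file to `existing` removes it from the plan, which mirrors how intermediate files let a failed workflow resume part-way rather than start again.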

This development paradigm has several useful properties. Primarily, extensive workflow automation means that the user can generate many different files of interest with very little direct work. Users can also generate files at intermediate stages of the canonical workflow/DAG (which can be very useful in their own right) without generating downstream data that may not be needed. For instance, the user may wish to generate reannotated 3'UTR sequences without computing predicted targets for those sequences. Snakemake also makes more efficient use of resources by modularising the build process into a number of discrete processes that run in series and in parallel. As a result, if the build process fails for whatever reason, the user is unlikely to have to restart the entire process, because the intermediate files will already have been generated.

Whilst not essential, users of FilTar are encouraged to familiarise themselves with the Snakemake tool and its documentation. This will help them to better understand how FilTar works and to troubleshoot any problems that arise. The benefits of Snakemake become clearer through use, but can be summarised here as: extensive automation, modularity, reproducibility, extensibility, configurability, and ease of workflow interpretation and maintenance.