CoRAL

Classification of RNAs by Analysis of Length

CoRAL is a machine learning package that can predict the precursor class of small RNAs present in a high-throughput RNA-sequencing dataset. In addition to classification, it also produces information about the features that are most important for discriminating different populations of small non-coding RNAs.

Prediction results
Download CoRAL
Installing CoRAL
Using CoRAL
Citing CoRAL
References

Prediction results for two tissue types

Genome Browser tracks

Download source and annotation packages

Annotation packages: these are required for feature generation and training

Installation instructions

==================
CoRAL dependencies
==================
  libgsl
  bigWigToBedGraph (a UCSC kent utility)
  bedtools
  samtools
  RNAfold
  ruby
  R

# Install gsl On a Debian-based system:
sudo apt-get install libgsl0-dev
  
# Install bigWigToBedGraph on any 64-bit linux:
#   (UCSC also provides MacOS binaries)
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
sudo cp bigWigToBedGraph /usr/local/bin/  # or elsewhere in your $PATH

# Install bedtools
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
tar -zxvf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0
make
sudo cp bin/* /usr/local/bin/   # or elsewhere in your $PATH

==============
Building CoRAL
==============
make

================
Installing CoRAL
================

sudo cp bin/* /usr/local/bin/  # or elsewhere in your $PATH
# -- OR --
# add the following to your rc (e.g., ~/.bashrc):
#   PATH=$PATH:CORALDIR/bin
# where CORALDIR is the location of CoRAL

Usage

### CoRAL
# Classification of RNAs by Analysis of Length (and of other features)

# EXAMPLE USAGE:

# CoRAL is made up of several scripts, all of which need to be pointed to
# a configuration file; use "coral.conf.sample" as a template

### Required external data:
## bam file containing small RNA seq data
## annotation package (from CoRAL site): includes
#   .gff     genome annotation (known locations and types of RNA)
#   class_pri.txt    annotation priority - how to choose amongst annotations when they overlap
#   chromInfo.txt    lengths of chromosomes
## genome sequence (can be obtained from UCSC) (only needed for MFE computation)
#   requires all chrs concatenated into one FASTA file

### Set up some shell variables for clarity in this example
# input dataset
bam=data.bam
# CoRAL configuration file
conf=coral.conf.sample
# assume annotation package and genome sequence are here
annot=~/data/genomes/hsa19

# optionally specify a tmpdir if the default, /tmp, is not suitable
# export TMPDIR=~/data/tmp

### Running CoRAL
## call intervals corresponding to discrete small RNA-producing loci
# produces coral/loci.bed
call_smrna_loci.sh $bam $conf

## compute features
feature_lengths.sh $bam $conf
feature_antisense.sh $bam $conf
feature_entropy.sh $bam  $conf $annot/chromInfo.txt
feature_nuc.sh $bam $conf
feature_mfe.sh $bam $conf $annot/hsa19.fa $annot/chromInfo.txt

## label the loci based on known annotation data - this is only needed for training
annotate_loci.sh coral/loci.bed  $annot/hsa19.gff $annot/class_pri.txt

## generate data_x.txt and data_y.txt for input into the training and/or prediction
make_data_matrix.sh coral

## train a random forest classifier on the generated features for 3 classes
# the result will be in coral/run_xxxxx where xxx is a hash on the parameters used
# parameters used here: require 15 reads at a locus, use these three classes only
coral_train.R -r 15 -c "miRNA,snoRNA_CD,tRNA" \
  coral/data_x.txt coral/data_y.txt coral

## predict on entire dataset and use known data (data_y) to assess training performance
## and the model that was trained and outputted to "coral/run_*/"
# places results in pred_out dir
coral_predict.R -r 15 -y coral/data_y.txt coral/data_x.txt \
  coral/run_* pred_out


### Output file descriptions
data_x.txt 	# data matrix containing all locus and feature data
data_y.txt	# known classes based on the annotation
feat_*.txt	# individual feature data
loci.annot	# locus annotation data
loci.bed	# called loci; chr,start,end,locus_id,read_count,strand
# run_xxxx files:
class_performance.txt	# class-wise recall and ppv
class_sizes.txt		# number of loci in each class
feature_directions.txt	# difference in mean value of features within one class vs others
feature_importance.txt	# number of times features were selected by varSelRF for each class
overall_accuracy	# total performance for multi-class classifier
params.txt		# description of the parameters used for this run
data.Rdata		# the trained model

Citing CoRAL

If you use CoRAL, please cite:
Leung,Y.Y., Ryvkin,P., Ungar,L., Gregory,B.D., and Wang, L.-S. (2013) CoRAL: Predicting non-coding RNAs from small RNA-sequencing. Nucleic acids research.

References


Questions? or

Wang Lab | Penn Center for Bioinformatics | University of Pennsylvania