CoRAL

Classification of RNAs by Analysis of Length

CoRAL is a machine learning package that can predict the precursor class of small RNAs present in a high-throughput RNA-sequencing dataset. In addition to classification, it also produces information about the features that are most important for discriminating different populations of small non-coding RNAs.

Prediction results
Download CoRAL
Installing CoRAL
Using CoRAL
Citing CoRAL
References

Prediction results for two tissue types

Genome Browser tracks

Download source and annotation packages

CoRAL 1.1.1 source (tar.gz)

Annotation packages: these are required for feature generation and training

Human annotation hsa19 (UCSC hg19)
Human genome sequence (UCSC hg19)
Mouse annotation mmu9 (UCSC mm9)
Mouse genome sequence (UCSC mm9)
(forthcoming) Mouse annotation mmu10 (UCSC mm10)
Mouse genome sequence (UCSC mm10)

Installation instructions

==================
CoRAL dependencies
==================
  libgsl
  bigWigToBedGraph (a UCSC kent utility)
  bedtools
  samtools
  RNAfold
  ruby
  R

# Install gsl On a Debian-based system:
sudo apt-get install libgsl0-dev
  
# Install bigWigToBedGraph on any 64-bit linux:
#   (UCSC also provides MacOS binaries)
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
sudo cp bigWigToBedGraph /usr/local/bin/  # or elsewhere in your $PATH

# Install bedtools
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
tar -zxvf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0
make
sudo cp bin/* /usr/local/bin/   # or elsewhere in your $PATH

==============
Building CoRAL
==============
make

================
Installing CoRAL
================

sudo cp bin/* /usr/local/bin/  # or elsewhere in your $PATH
# -- OR --
# add the following to your rc (e.g., ~/.bashrc):
#   PATH=$PATH:CORALDIR/bin
# where CORALDIR is the location of CoRAL

Usage

### CoRAL
# Classification of RNAs by Analysis of Length (and of other features)

# EXAMPLE USAGE:

# CoRAL is made up of several scripts, all of which need to be pointed to
# a configuration file; use "coral.conf.sample" as a template

### Required external data:
## bam file containing small RNA seq data
## annotation package (from CoRAL site): includes
#   .gff     genome annotation (known locations and types of RNA)
#   class_pri.txt    annotation priority - how to choose amongst annotations when they overlap
#   chromInfo.txt    lengths of chromosomes
## genome sequence (can be obtained from UCSC) (only needed for MFE computation)
#   requires all chrs concatenated into one FASTA file

### Set up some shell variables for clarity in this example
# input dataset
bam=data.bam
# CoRAL configuration file
conf=coral.conf.sample
# assume annotation package and genome sequence are here
annot=~/data/genomes/hsa19

# optionally specify a tmpdir if the default, /tmp, is not suitable
# export TMPDIR=~/data/tmp

### Running CoRAL
## call intervals corresponding to discrete small RNA-producing loci
# produces coral/loci.bed
call_smrna_loci.sh $bam $conf

## compute features
feature_lengths.sh $bam $conf
feature_antisense.sh $bam $conf
feature_entropy.sh $bam  $conf $annot/chromInfo.txt
feature_nuc.sh $bam $conf
feature_mfe.sh $bam $conf $annot/hsa19.fa $annot/chromInfo.txt

## label the loci based on known annotation data - this is only needed for training
annotate_loci.sh coral/loci.bed  $annot/hsa19.gff $annot/class_pri.txt

## generate data_x.txt and data_y.txt for input into the training and/or prediction
make_data_matrix.sh coral

## train a random forest classifier on the generated features for 3 classes
# the result will be in coral/run_xxxxx where xxx is a hash on the parameters used
# parameters used here: require 15 reads at a locus, use these three classes only
coral_train.R -r 15 -c "miRNA,snoRNA_CD,tRNA" \
  coral/data_x.txt coral/data_y.txt coral

## predict on entire dataset and use known data (data_y) to assess training performance
## and the model that was trained and outputted to "coral/run_*/"
# places results in pred_out dir
coral_predict.R -r 15 -y coral/data_y.txt coral/data_x.txt \
  coral/run_* pred_out


### Output file descriptions
data_x.txt 	# data matrix containing all locus and feature data
data_y.txt	# known classes based on the annotation
feat_*.txt	# individual feature data
loci.annot	# locus annotation data
loci.bed	# called loci; chr,start,end,locus_id,read_count,strand
# run_xxxx files:
class_performance.txt	# class-wise recall and ppv
class_sizes.txt		# number of loci in each class
feature_directions.txt	# difference in mean value of features within one class vs others
feature_importance.txt	# number of times features were selected by varSelRF for each class
overall_accuracy	# total performance for multi-class classifier
params.txt		# description of the parameters used for this run
data.Rdata		# the trained model

Citing CoRAL

If you use CoRAL, please cite:
Leung,Y.Y., Ryvkin,P., Ungar,L., Gregory,B.D., and Wang, L.-S. (2013) CoRAL: Predicting non-coding RNAs from small RNA-sequencing. Nucleic acids research.

References

[1] Joyce,C.E., Zhou,X., Xia,J., Ryan,C., Thrash,B., Menter,A., Zhang,W. and Bowcock,A.M. (2011) Deep sequencing of small RNAs from human skin reveals major alterations in the psoriasis miRNAome. Human molecular genetics, 20, 4025

Questions? or

Wang Lab | Penn Center for Bioinformatics | University of Pennsylvania