CoRAL is a machine learning package that predicts the precursor class of small RNAs present in a high-throughput RNA-sequencing dataset. In addition to classification, it reports which features are most important for discriminating among different populations of small non-coding RNAs.
Prediction results
Genome Browser tracks
Annotation packages: these are required for feature generation and training
==================
CoRAL dependencies
==================

libgsl
bigWigToBedGraph (a UCSC kent utility)
bedtools
samtools
RNAfold
ruby
R

# Install gsl
On a Debian-based system:
sudo apt-get install libgsl0-dev

# Install bigWigToBedGraph on any 64-bit linux:
# (UCSC also provides MacOS binaries)
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
sudo cp bigWigToBedGraph /usr/local/bin/  # or elsewhere in your $PATH

# Install bedtools
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
tar -zxvf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0
make
sudo cp bin/* /usr/local/bin/  # or elsewhere in your $PATH

==============
Building CoRAL
==============

make

================
Installing CoRAL
================

sudo cp bin/* /usr/local/bin/  # or elsewhere in your $PATH

# -- OR --
# add the following to your rc (e.g., ~/.bashrc):
#   PATH=$PATH:CORALDIR/bin
# where CORALDIR is the location of CoRAL
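Before building, it can help to confirm that the executables in the dependency list above are actually on your $PATH. The following is a minimal sketch, not part of CoRAL: the function name missing_tools and the variable CORAL_TOOLS are made up for this example, Rscript is assumed to be the command that R provides, and libgsl is skipped because it is a library rather than an executable.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CoRAL): print the name of every
# tool given as an argument that cannot be found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executables named in the dependency list. Rscript is an assumption
# (it normally ships with R); libgsl is a library and is not checked here.
CORAL_TOOLS="bigWigToBedGraph bedtools samtools RNAfold ruby Rscript"

# Report anything missing; this only prints a message and never aborts.
missing=$(missing_tools $CORAL_TOOLS)
if [ -n "$missing" ]; then
    echo "Missing CoRAL dependencies:"
    echo "$missing"
else
    echo "All listed executables were found on PATH."
fi
```

A tool reported as missing may simply be installed somewhere outside $PATH, so check that first before reinstalling.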
### CoRAL
# Classification of RNAs by Analysis of Length (and of other features)

# EXAMPLE USAGE:
# CoRAL is made up of several scripts, all of which need to be pointed to
# a configuration file; use "coral.conf.sample" as a template

### Required external data:
## bam file containing small RNA seq data
## annotation package (from CoRAL site): includes
#    .gff            genome annotation (known locations and types of RNA)
#    class_pri.txt   annotation priority - how to choose amongst annotations when they overlap
#    chromInfo.txt   lengths of chromosomes
## genome sequence (can be obtained from UCSC) (only needed for MFE computation)
#    requires all chrs concatenated into one FASTA file

### Set up some shell variables for clarity in this example
# input dataset
bam=data.bam
# CoRAL configuration file
conf=coral.conf.sample
# assume annotation package and genome sequence are here
annot=~/data/genomes/hsa19
# optionally specify a tmpdir if the default, /tmp, is not suitable
# export TMPDIR=~/data/tmp

### Running CoRAL
## call intervals corresponding to discrete small RNA-producing loci
# produces coral/loci.bed
call_smrna_loci.sh $bam $conf

## compute features
feature_lengths.sh $bam $conf
feature_antisense.sh $bam $conf
feature_entropy.sh $bam $conf $annot/chromInfo.txt
feature_nuc.sh $bam $conf
feature_mfe.sh $bam $conf $annot/hsa19.fa $annot/chromInfo.txt

## label the loci based on known annotation data - this is only needed for training
annotate_loci.sh coral/loci.bed $annot/hsa19.gff $annot/class_pri.txt

## generate data_x.txt and data_y.txt for input into the training and/or prediction
make_data_matrix.sh coral

## train a random forest classifier on the generated features for 3 classes
# the result will be in coral/run_xxxxx where xxxxx is a hash on the parameters used
# parameters used here: require 15 reads at a locus, use these three classes only
coral_train.R -r 15 -c "miRNA,snoRNA_CD,tRNA" \
    coral/data_x.txt coral/data_y.txt coral

## predict on the entire dataset with the model that was trained and written to
## "coral/run_*/", using the known data (data_y) to assess training performance
# places results in pred_out dir
coral_predict.R -r 15 -y coral/data_y.txt coral/data_x.txt \
    coral/run_* pred_out

### Output file descriptions
data_x.txt               # data matrix containing all locus and feature data
data_y.txt               # known classes based on the annotation
feat_*.txt               # individual feature data
loci.annot               # locus annotation data
loci.bed                 # called loci; chr,start,end,locus_id,read_count,strand

# run_xxxxx files:
class_performance.txt    # class-wise recall and ppv
class_sizes.txt          # number of loci in each class
feature_directions.txt   # difference in mean value of features within one class vs others
feature_importance.txt   # number of times features were selected by varSelRF for each class
overall_accuracy         # total performance for multi-class classifier
params.txt               # description of the parameters used for this run
data.Rdata               # the trained model
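Since training and prediction above are run with -r 15, it can be useful to know how many called loci would pass such a read-count cutoff before training. The sketch below counts them directly from loci.bed using the column order given in the output file descriptions (read_count is column 5). It is an illustration only: count_loci_min_reads is a made-up helper, the file is assumed to be tab-separated as BED files usually are, and treating the cutoff as "at least 15 reads" is an assumption about how -r is applied.

```shell
#!/bin/sh
# Hypothetical helper (not part of CoRAL): count loci in a loci.bed file
# that have at least MIN reads.
# Assumed columns (tab-separated): chr, start, end, locus_id, read_count, strand
# Usage: count_loci_min_reads FILE MIN
count_loci_min_reads() {
    awk -F '\t' -v min="$2" '$5 >= min { n++ } END { print n + 0 }' "$1"
}

# Example: report how many loci reach 15 reads, if the file exists.
if [ -f coral/loci.bed ]; then
    echo "loci with >= 15 reads: $(count_loci_min_reads coral/loci.bed 15)"
fi
```

If this count is very small for the classes of interest, a lower value for -r may be worth considering, at the cost of noisier features per locus.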
Wang Lab | Penn Center for Bioinformatics | University of Pennsylvania