HiPR: High-throughput Probabilistic inference of RNA structures

Estimate secondary structure and base pairing posteriors for a given RNA sequence from high-throughput structure-sensitive sequencing data

Figure: HiPR probabilistic model for RNA structure and structure-probing sequencing data
Figure: HiPR structure prediction

The HiPR (High-throughput Probabilistic RNA structure inference) algorithm predicts RNA secondary structure and base-pairing probabilities from experimental data produced by high-throughput structure-probing assays. It is based on Bayesian Markov chain Monte Carlo (MCMC) sampling.

Distinctive key features:

  1. Unlike previous approaches, HiPR explicitly models the generation of all possible sequencing reads as a function of the underlying (unknown) secondary structure and the experimental conditions.
  2. HiPR uses a Bayesian MCMC algorithm to estimate the base-pairing posterior probabilities and the secondary structure that best fit the observed sequencing reads, based on the distribution of read fragments along the locus.
  3. The modular framework can accommodate many structure-probing protocols: reverse transcription (RT) termination-based protocols (e.g., DMS-seq [Rouskin et al. 2014], Structure-seq [Ding et al. 2015]) and protocols based on RT mutational profiling (e.g., DMS-MaPseq [Zubradt et al. 2016]).
  4. HiPR overcomes inherent experimental biases (e.g., preferential modification of A or C nucleotides by DMS) by jointly modeling paired bases and all four unpaired bases.
  5. HiPR base-pairing probabilities and likelihood scores can be used in many downstream analyses, such as analysis of structural motifs and substructures, conservation analysis, or as constraints for other structure analysis methods.
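
To make the MCMC idea in feature 2 more concrete, here is a toy sketch that estimates per-site pairing posteriors with a Metropolis sampler. It is not HiPR's model: it treats each nucleotide independently, uses made-up modification rates (RATE_UNPAIRED, RATE_PAIRED), and takes hypothetical (modified, coverage) counts as input, whereas HiPR models full secondary structures and the generation of sequencing reads. All names below are illustrative.

```python
import math
import random

# Hypothetical modification rates, chosen only for illustration.
RATE_UNPAIRED = 0.30
RATE_PAIRED = 0.05


def log_likelihood(paired, modified, coverage):
    """Binomial log-likelihood (up to a constant) of one site's counts given its state."""
    rate = RATE_PAIRED if paired else RATE_UNPAIRED
    return modified * math.log(rate) + (coverage - modified) * math.log(1.0 - rate)


def metropolis_pairing_posterior(counts, n_iter=20000, burn_in=2000, seed=0):
    """Estimate P(paired) per site from (modified, coverage) counts.

    Uses a uniform prior and a symmetric proposal that flips one random site.
    """
    rng = random.Random(seed)
    state = [False] * len(counts)          # start with every site unpaired
    paired_tally = [0] * len(counts)
    kept = 0
    for it in range(n_iter):
        i = rng.randrange(len(counts))     # propose flipping site i
        modified, coverage = counts[i]
        delta = (log_likelihood(not state[i], modified, coverage)
                 - log_likelihood(state[i], modified, coverage))
        # Metropolis rule: accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            state[i] = not state[i]
        if it >= burn_in:                  # tally states after burn-in
            kept += 1
            for j, is_paired in enumerate(state):
                paired_tally[j] += is_paired
    return [tally / kept for tally in paired_tally]


# Made-up counts: heavily modified, barely modified, and low-coverage sites.
print(metropolis_pairing_posterior([(15, 50), (1, 50), (1, 10)]))
```

In this toy setting the heavily modified site is estimated as almost certainly unpaired, the barely modified site as almost certainly paired, and the low-coverage site falls in between, which is the kind of graded answer a posterior probability provides.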

Supported inputs and protocols:

  1. Structure-sensitive sequencing protocols (DMS-seq, DMS-MaPseq)
  2. Mapped reads (BAM), raw sequencing reads (FASTQ), or collapsed read format
  3. Genome-wide or user-provided target region analysis

Deliverables:

  1. Posterior secondary structure
  2. Base-pairing posterior probabilities for each nucleotide
  3. Posterior probabilities for base-pairing interactions
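
Deliverables 2 and 3 are related: in a secondary structure each nucleotide pairs with at most one partner, so the probability that a nucleotide is paired is the sum of the posterior probabilities of all interactions that involve it. A minimal sketch, assuming interaction probabilities are held in a dictionary keyed by 0-based (i, j) positions (this representation and the function name are hypothetical, not HiPR's output format):

```python
def per_nucleotide_pairing(pair_probs, length):
    """Sum interaction probabilities into one pairing probability per nucleotide."""
    paired = [0.0] * length
    for (i, j), prob in pair_probs.items():
        if not (0 <= i < j < length):
            raise ValueError(f"invalid pair ({i}, {j}) for length {length}")
        # Each interaction contributes to both of its endpoints.
        paired[i] += prob
        paired[j] += prob
    return paired


# Made-up interaction probabilities for a 6-nt sequence.
example = {(0, 5): 0.6, (0, 4): 0.3, (1, 4): 0.5}
print(per_nucleotide_pairing(example, 6))
```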

Software

HiPR software is freely available for download and use:

HiPR GitHub repository
Latest release: v1.1 (30 April 2018)

Download Data

Structure-mapping data

These data were used to test HiPR and other methods in the manuscript (in submission).

Raw sequencing data (DMS-Seq) GSE45803 [Rouskin et al. 2014]

Raw sequencing data (DMS-MaPseq) GSE84537 [Zubradt et al. 2016]

DMS-seq mapped data (human): in vivo DMS-seq BAM (H. sapiens) K562 GSM1297493 Rep 1-2 [40 GB]

DMS-seq mapped data (yeast): in vivo DMS-seq BAM (S. cerevisiae) Rep1-4 [18 GB]

DMS-MaPseq mapped data (human): in vivo DMS-MaPseq BAM (H. sapiens) HEK 293T Rep1-2 [6 GB]

Reference secondary structure models

Reference structures used in the manuscript for comparing HiPR and other structure prediction methods

Rfam structures

Rfam structures [Nawrocki et al. 2015]

Validated structures

Validated structures (H. sapiens) [Rouskin et al. 2014]
Validated structures (S. cerevisiae) [Rouskin et al. 2014]

Download Results

Predicted RNA secondary structures

RNA secondary structures predicted by HiPR and other methods
Predicted structures

Rfam structures were used as reference for structure accuracy computation.
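
The exact accuracy measure used in the manuscript is not stated here. A common choice when comparing a predicted structure with a reference is base-pair sensitivity and positive predictive value (PPV). The sketch below computes both, assuming the structures are given in dot-bracket notation without pseudoknots; that input format and the function names are assumptions made for illustration.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of 0-based (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {pos}")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def base_pair_accuracy(predicted, reference):
    """Return (sensitivity, PPV) of predicted base pairs against a reference."""
    pred = pairs_from_dot_bracket(predicted)
    ref = pairs_from_dot_bracket(reference)
    true_pos = len(pred & ref)
    sensitivity = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    return sensitivity, ppv


# Made-up structures: the prediction recovers one of two reference pairs.
print(base_pair_accuracy("(....)", "((..))"))
```

Sensitivity is the fraction of reference pairs that were predicted, and PPV is the fraction of predicted pairs that appear in the reference.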

References

  1. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., & Weissman, J. S. (2014). Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature, 505(7485), 701–705. http://doi.org/10.1038/nature12894
  2. Nawrocki, E. P., Burge, S. W., Bateman, A., Daub, J., Eberhardt, R. Y., Eddy, S. R., … Finn, R. D. (2015). Rfam 12.0: Updates to the RNA families database. Nucleic Acids Research, 43(D1), D130–D137. https://doi.org/10.1093/nar/gku1063
  3. Zubradt, M., Gupta, P., Persad, S., Lambowitz, A. M., Weissman, J. S., & Rouskin, S. (2016). DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nature Methods, 14, 75–82. https://doi.org/10.1038/nmeth.4057
  4. Kuksa, P. P., Li, F., Ryvkin, P., Kannan, S., Gregory, B. D., & Wang, L.-S. (2019). HiPR: High-throughput Probabilistic inference of RNA structure.

NAME

HiPR.sh - High-throughput Probabilistic inference of RNA secondary structures.

SYNOPSIS

    
HiPR.sh reads_file rates_file structure_file
  [-outDir OutputDir] [-locusName LocusName] [-n Niter]
  [-numCPU Ncpus] [-rmin MinReadLength] [-rmax MaxReadLength]

Mandatory arguments (need to be specified before optional arguments)
  reads_file -- File containing DMS-seq reads for RNA locus of interest (collapsed read format)
  rates_file -- File containing the initial estimates of per-nucleotide modification rates
  structure_file -- File containing the sequence and initial secondary structure of an RNA of interest

Recognized optional command line arguments (need to be specified after mandatory arguments)
  -outDir <string>  -- Set name of output folder (default=HiPR_output)
  -locusName <string>  -- Set name of locus (default=UnnamedLocus)
  -n <integer> -- Set maximum number of MCMC iterations (default=100000)
  -numCPU <integer> -- Set number of CPUs to use (default=16)
  -rmin <integer> -- Set minimum read length (default=15)
  -rmax <integer> -- Set maximum read length (default=40)
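
The argument order above can be sketched as a small helper that assembles the command line with the three mandatory files first and the optional flags after them. The helper name, the example file names, and the assumption that HiPR.sh is reachable on PATH are all hypothetical; only the flags themselves come from the synopsis.

```python
def build_hipr_command(reads_file, rates_file, structure_file, out_dir=None,
                       locus_name=None, n_iter=None, num_cpu=None,
                       rmin=None, rmax=None):
    """Return an argument list for HiPR.sh; options left as None use HiPR defaults."""
    if rmin is not None and rmax is not None and rmin > rmax:
        raise ValueError("rmin must not exceed rmax")
    # Mandatory arguments must come before any optional ones.
    cmd = ["HiPR.sh", reads_file, rates_file, structure_file]
    options = [("-outDir", out_dir), ("-locusName", locus_name), ("-n", n_iter),
               ("-numCPU", num_cpu), ("-rmin", rmin), ("-rmax", rmax)]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical input file names.
cmd = build_hipr_command("locus.reads.txt", "locus.rates.txt", "locus.structure.txt",
                         out_dir="HiPR_output", locus_name="MyLocus",
                         n_iter=50000, num_cpu=4)
print(" ".join(cmd))
# To launch it: import subprocess; subprocess.run(cmd, check=True)
```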

DESCRIPTION

Estimate secondary structure and base pairing posteriors for a given RNA sequence based on the distribution of read fragments along the locus.
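
The distribution of read fragments along the locus can be pictured as a table of fragment counts. The sketch below collapses mapped fragments into counts per (start, end) interval and applies a length filter analogous to -rmin and -rmax (defaults 15 and 40). It assumes 1-based inclusive coordinates and an in-memory list of fragments; HiPR's actual collapsed read file format is not described here, and the function name is hypothetical.

```python
from collections import Counter


def collapse_fragments(fragments, rmin=15, rmax=40):
    """Count identical (start, end) fragments whose length lies in [rmin, rmax]."""
    counts = Counter()
    for start, end in fragments:
        length = end - start + 1           # inclusive coordinates
        if rmin <= length <= rmax:
            counts[(start, end)] += 1
    return counts


# Made-up fragments: two are dropped for being too short or too long.
fragments = [(1, 20), (1, 20), (5, 30), (10, 12), (3, 60)]
for (start, end), count in sorted(collapse_fragments(fragments).items()):
    print(start, end, count)
```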

This program requires a file containing the sequence and initial secondary structure of an RNA of interest, a file containing DMS-seq reads, and a file containing the initial estimates of per-nucleotide modification rates. A Bayesian MCMC algorithm is then used to estimate the base pairing posterior that best fits the observed sequencing reads. The results and intermediate files are written to a directory (HiPR_output/ by default).

The output file HiPR_posterior.txt contains the base pairing posteriors at each nucleotide position, one entry per line.
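
A minimal sketch of reading HiPR_posterior.txt, assuming each non-empty line ends with the posterior probability as a plain number (any leading columns, such as a position, are ignored). The exact column layout is an assumption, as are the function name and the 0.5 cutoff used here to call a nucleotide paired.

```python
import io


def read_posteriors(lines):
    """Parse one base-pairing posterior per line; blank and '#' lines are skipped."""
    posteriors = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        prob = float(fields[-1])           # assumed: probability is the last field
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        posteriors.append(prob)
    return posteriors


# Made-up content standing in for HiPR_output/HiPR_posterior.txt.
sample = io.StringIO("0.91\n0.88\n0.12\n0.05\n")
posteriors = read_posteriors(sample)
# 1-based positions whose pairing posterior is at least 0.5.
likely_paired = [pos for pos, p in enumerate(posteriors, start=1) if p >= 0.5]
print(likely_paired)
```

For a real run, an open file handle can be passed in place of the StringIO object, since the function only iterates over lines.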

The output file HiPR_structure.txt contains the consensus secondary structure.

Requirements

The mandatory files must be specified before any optional arguments and must exist; otherwise an error or usage message is shown.

LICENSE

MIT License https://opensource.org/licenses/MIT

Copyright (c) 2016 University of Pennsylvania

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

AUTHOR

Pavel Kuksa <pkuksa@upenn.edu>, Fan Li <fanli.gcb@gmail.com>, Li-San Wang <lswang@upenn.edu>

HiPR software is developed by Wang lab members, Penn Neurodegeneration Genomics Center.

Comments or Questions: HIPR@lisanwanglab.org

Wang Lab