DEVELOPMENT OF A COMPUTATIONAL PIPELINE FOR THE IDENTIFICATION OF NON-CODING RNAs FROM NEXT GENERATION SEQUENCING DATA
Abstract
Recent advances in genomics have revealed the critical roles that non-coding RNAs play
in disease occurrence, progression, and population disparities in patient treatment
outcomes. With the evolution of Next Generation Sequencing (NGS) techniques and the
generation of genomic big data, researchers are increasingly able to explore the functions
of these non-coding RNAs. However, efficient
exploration requires user-friendly computational tools that can streamline and centralize
data analysis, particularly for identifying non-coding RNAs within large volumes of
NGS data. Current computational pipelines for non-coding RNA identification are often
limited to detecting only a single class of non-coding RNA and do not integrate the latest
standalone tools. Consequently, these pipelines are not workflow-efficient, as they
restrict the comprehensive analysis of diverse non-coding RNA classes within a single
framework. The aim of this study was to develop a computational pipeline for identifying
multiple classes of non-coding RNAs, namely microRNAs (miRNAs), long non-coding RNAs
(lncRNAs), and circular RNAs (circRNAs), from NGS data. This aim was achieved by developing scripts for the
selected software tools integrated into the pipeline and incorporating these scripts as
individual processes within a unified Nextflow script. The software tools integrated into
the pipeline include miRDeep2, mirnovo, and sRNAtoolbox for the identification of
miRNAs; CIRI and KNIFE for the identification of circRNAs; and PLEK and LncDC for
the identification of lncRNAs. Nextflow was used as the scientific workflow
management system and Docker was used for containerizing all the integrated tools and
their software dependencies for easy use and reproducibility across different computing
environments. The pipeline was then evaluated using the test data provided with each of the
individual software tools, and it successfully identified all the reported miRNAs,
lncRNAs, and circRNAs, demonstrating its effectiveness. Beyond the reduced execution
time, the pipeline offers a more efficient solution by streamlining the analysis of
non-coding RNAs and eliminating the need for separate software installation and
environment setup, thereby reducing the user's workload.
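
The tool grouping described above can be illustrated with a minimal sketch. It is written in Python rather than Nextflow and is not the pipeline's actual code: it only shows how a requested set of non-coding RNA classes might be expanded into one containerized step per tool. The tool-to-class mapping comes from the abstract; the registry `example.org/ncrna`, the image naming scheme, the `/data` mount point, and the names `Step`, `image_for`, `plan`, and `docker_command` are hypothetical. Tool-specific command-line arguments are deliberately left out.

```python
from dataclasses import dataclass

# Tool-to-class mapping as listed in the abstract.
TOOLS_BY_CLASS = {
    "miRNA": ["miRDeep2", "mirnovo", "sRNAtoolbox"],
    "circRNA": ["CIRI", "KNIFE"],
    "lncRNA": ["PLEK", "LncDC"],
}


@dataclass(frozen=True)
class Step:
    """One containerized tool run (hypothetical structure)."""
    rna_class: str
    tool: str
    image: str


def image_for(tool, registry="example.org/ncrna"):
    # Hypothetical naming scheme; Docker image names must be lowercase.
    return f"{registry}/{tool.lower()}:latest"


def plan(classes):
    """Expand the requested RNA classes into one step per integrated tool."""
    steps = []
    for rna_class in classes:
        if rna_class not in TOOLS_BY_CLASS:
            raise ValueError(f"unsupported non-coding RNA class: {rna_class}")
        for tool in TOOLS_BY_CLASS[rna_class]:
            steps.append(Step(rna_class, tool, image_for(tool)))
    return steps


def docker_command(step, data_dir):
    # Builds the argument list only; nothing is executed here.
    # The /data mount point is an assumption, and each tool's own
    # arguments would have to be appended by the caller.
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", step.image]


if __name__ == "__main__":
    for step in plan(["miRNA", "circRNA", "lncRNA"]):
        print(step.rna_class, step.tool, " ".join(docker_command(step, "/tmp/ngs")))
```

In a workflow manager such as Nextflow, each of these steps would correspond to a separate process with its own container, which is what removes the need for per-tool installation on the user's machine.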
Keywords
QA75 Electronic computers. Computer science