How to tackle large-scale NGS data processing with software and platform solutions

18. 01. 2023 | Laboratory digitalization


Next-generation sequencing (NGS) is a powerful tool for generating large amounts of DNA or RNA sequence data and has revolutionized many areas of the life sciences, including biology and medicine. Life scientists can choose from a variety of experimental setups, kits, sequencing providers, and data analysis options. In a nutshell, an NGS workflow comprises four steps: sample preparation, library preparation, sequencing, and data analysis with data exploration.

Figure 1: Four steps of an NGS workflow (Source: iRepertoire)

When it comes to library preparation and sequencing, users can choose from numerous kit and sequencing providers. Some of the key players in the sector are listed below:

  • Illumina
  • ThermoFisher
  • Qiagen
  • Agilent
  • Bio-Rad
  • PerkinElmer
  • LGC
  • Lexogen

But raw NGS data generation is only the first step. Once you have the raw data, it must be processed correctly to convert it into a human-readable form. For RNA sequencing data, a typical processing workflow comprises several steps, each handled by dedicated bioinformatic tools: read trimming, quality control (QC), alignment, quantification, post-alignment QC, differential gene expression (DGE) analysis, and data presentation.

Figure 2: RNA sequencing scheme
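The steps above can be sketched with a toy example in plain Python. This is purely illustrative: real pipelines use dedicated tools (e.g., fastp for trimming, STAR or HISAT2 for alignment, Salmon or featureCounts for quantification), and the reads, adapter sequence, reference, and gene coordinates below are all made up.

```python
# Toy sketch of the RNA-seq processing steps: trim -> QC -> align -> quantify.
# All sequences and coordinates are hypothetical; real tools operate on FASTQ
# files and genome indexes, not plain strings.

ADAPTER = "AGATCG"  # hypothetical 3' adapter sequence

def trim_adapter(read: str, adapter: str = ADAPTER) -> str:
    """Trimming step: cut off the adapter if it appears in the read."""
    idx = read.find(adapter)
    return read[:idx] if idx != -1 else read

def passes_qc(read: str, min_len: int = 8) -> bool:
    """Very crude QC: keep only reads above a minimum length."""
    return len(read) >= min_len

def align(read: str, reference: str) -> int:
    """Exact-match 'alignment': start position in the reference, -1 if unmapped."""
    return reference.find(read)

def quantify(positions, gene_regions):
    """Quantification: count reads whose start falls inside each gene region."""
    counts = {gene: 0 for gene in gene_regions}
    for pos in positions:
        for gene, (start, end) in gene_regions.items():
            if start <= pos < end:
                counts[gene] += 1
    return counts

reference = "TTGACCTGAAGGCATTCGATCCGGTAACGT"
genes = {"geneA": (0, 15), "geneB": (15, 30)}
raw_reads = ["GACCTGAAAGATCG", "CATTCGATC", "GGTAACGTAGATCG", "ACGT"]

trimmed = [trim_adapter(r) for r in raw_reads]
kept = [r for r in trimmed if passes_qc(r)]          # one short read is dropped
positions = [p for p in (align(r, reference) for r in kept) if p != -1]
counts = quantify(positions, genes)
print(counts)  # {'geneA': 2, 'geneB': 1}
```

The per-gene counts produced at the end are exactly the kind of matrix that downstream DGE tools take as input.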

Since NGS data is complex, numerous tools have been developed over the last two decades to process it, and different tools specialize in different steps: some align reads to a reference genome (e.g., STAR, HISAT2, Bowtie), while others identify genetic variants (e.g., GATK), quantify gene expression (e.g., Salmon, Kallisto), or test for differential expression (e.g., DESeq2).

It is also important to note that the majority of these tools are open source, so anyone can use them, and together they cover most use cases. However, tools for some specific use cases, or tools that are more efficient from a performance standpoint, may not be available as open source; examples include pipelines developed by biotech companies to cover their own needs and/or to differentiate themselves from the competition.

So how should one approach NGS data analysis?

Should you use open-source tools or software packages (e.g., a pipeline with some basic steps) and build a custom workflow to process data locally, or use an off-the-shelf solution from one of the licensed software or platform providers? The decision comes down to several factors: how much data you have to process (small scale vs. large scale), where the data (raw and processed) will be stored, the type of NGS data, and the level of expertise required to execute the analysis.

For internal small-scale data processing, for example, it often makes sense to use individual tools and software packages locally, especially if you are accustomed to using the command line. Because the tools are freely available (e.g., via GitHub), and you can even build your own pipelines, no licensed software is required to perform the analysis.

However, keep in mind that this requires skills and a deep understanding of bioinformatics: you have to know how to set parameters, validate outputs on test samples, extract information, and finally use visualization tools. Most life scientists are not trained to do this, so they usually rely on software and/or platform solutions from one of the sequencing providers or other biotech companies, whose built-in, validated pipelines enable more or less automated data processing and exploration.

In contrast, a software or platform solution is almost a necessity for large-scale NGS data processing, for several key reasons:

Processing speed
Some tools require a significant amount of computational power. Compared to local processing, cloud-based solutions can use distributed computing resources to speed up data processing. There are proprietary technologies specifically designed for high-throughput processing (e.g., Illumina DRAGEN Bio-IT Platform).
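The core idea behind the speed-up is to process samples concurrently instead of one after another. Cloud platforms scale this pattern across many machines (and DRAGEN uses dedicated hardware); the minimal local sketch below uses a thread pool as a stand-in for distributed workers, with a hypothetical `process_sample` function simulating an expensive per-sample step.

```python
# Sketch of concurrent per-sample processing. In a real setting,
# process_sample would run alignment or variant calling on one sample;
# here it is a cheap stand-in so the example is self-contained.
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample: str) -> tuple[str, int]:
    """Hypothetical per-sample step; returns (sample, fake checksum)."""
    return sample, sum(ord(c) for c in sample)

samples = [f"sample_{i:02d}" for i in range(8)]

# Four workers process the eight samples concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_sample, samples))

print(len(results))  # 8 samples processed
```

For CPU-bound bioinformatics work, real deployments would distribute samples across processes or cluster nodes rather than threads, but the mapping pattern is the same.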

Data storage
Data can quickly pile up into terabytes, which is problematic to store locally, especially since you keep both raw data and processed data (e.g., aligned reads in BAM files). Here, cloud-based platforms offer elegant storage solutions (e.g., Amazon AWS, Microsoft Azure, or Google Cloud).
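A back-of-envelope calculation shows how quickly raw data alone reaches terabytes. The sample counts, read numbers, and per-read overhead below are assumptions for illustration, not specifications of any particular sequencer.

```python
# Rough estimate of uncompressed FASTQ size. A FASTQ record has four lines:
# header, sequence, '+', and quality string; sequence and quality are each
# read_len bytes, plus an assumed ~50 bytes of header/separator overhead.
def fastq_bytes(n_reads: int, read_len: int) -> int:
    return n_reads * (2 * read_len + 50)

# Hypothetical study: 100 samples, 40 million paired-end 150 bp reads each.
reads_per_sample = 40_000_000 * 2   # paired-end: two reads per fragment
total = 100 * fastq_bytes(reads_per_sample, 150)
print(f"{total / 1e12:.1f} TB")  # 2.8 TB of raw data, before BAMs and intermediates
```

Compression (gzip, CRAM) shrinks this considerably, but aligned BAM files and intermediate outputs typically add a comparable amount on top of the raw data.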

Built-in pipelines
Platforms support processing NGS data with validated pipelines (e.g., RNA-seq, DNA-seq, WES, WGS, ATAC-seq, ChIP-seq). Most importantly, these pipelines are regularly updated and their default settings handle the majority of NGS data, while advanced users can adjust parameters and extract information that might otherwise be lost.

Data exploration
Interactive data exploration features are crucial for extracting information (e.g., gene function, differential gene expression, variants). Commonly used features include PCA plots, heatmaps, Venn diagrams, differential gene expression tables, and volcano plots. There are also options to query external databases for additional information about a specific gene, variant, or ontology term.
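As one concrete example, the logic behind a volcano plot is simple: each gene gets a log2 fold change (x-axis) and a p-value (plotted as -log10(p) on the y-axis), and is classified as up-regulated, down-regulated, or not significant. The toy DGE results and cutoffs below are made up for illustration.

```python
# Sketch of volcano-plot classification from DGE results.
import math

def classify(log2fc: float, pvalue: float,
             fc_cutoff: float = 1.0, p_cutoff: float = 0.05) -> str:
    """Label a gene by fold-change and significance cutoffs."""
    if pvalue < p_cutoff and log2fc >= fc_cutoff:
        return "up"
    if pvalue < p_cutoff and log2fc <= -fc_cutoff:
        return "down"
    return "ns"  # not significant

# Hypothetical (log2 fold change, p-value) pairs per gene.
dge = {
    "geneA": (2.3, 0.001),
    "geneB": (-1.8, 0.004),
    "geneC": (0.2, 0.700),
    "geneD": (1.5, 0.200),   # large change, but not significant
}

labels = {g: classify(fc, p) for g, (fc, p) in dge.items()}
# The y-axis of a volcano plot is -log10(p).
neg_log10_p = {g: -math.log10(p) for g, (_, p) in dge.items()}
print(labels)  # {'geneA': 'up', 'geneB': 'down', 'geneC': 'ns', 'geneD': 'ns'}
```

Platforms wrap exactly this kind of logic in interactive plots, letting users drag the cutoffs and click through to external gene annotations.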

Trends, solutions, and why big players have their proprietary platforms for NGS data processing

Current trends in NGS data analysis include cloud-based platforms and machine learning (ML) and artificial intelligence (AI) techniques. Cloud platforms offer multiple benefits, especially for storing and analyzing large amounts of data without expensive hardware and infrastructure, while ML and AI techniques can help identify patterns and correlations in the data that humans may not easily detect.

Some key players in this field that offer processing capabilities and/or machine-learning approaches are listed below.

Top 5 NGS solution providers

  • Illumina: A leading provider of NGS technology and services. They offer a range of software tools for NGS data processing, including BaseSpace Sequence Hub, the Illumina Genome Analyzer, and the Illumina DRAGEN Bio-IT Platform.
  • Thermo Fisher Scientific: The company offers a range of software tools for NGS data processing, including the Ion Torrent Genexus System.
  • Qiagen: Qiagen offers a range of software tools for NGS data processing, including QIAGEN CLC Genomics Workbench and Ingenuity Variant Analysis.
  • Agilent: The company offers software tools for NGS data processing, including GeneSpring and Agilent SureCall.
  • Bio-Rad: Bio-Rad offers software tools for NGS data processing, including the Bio-Rad CFX Manager Software and the Bio-Rad ddSEQ Single Cell Isolation System.

 

So why do the most prominent players have proprietary solutions for NGS data processing? There are several reasons why companies such as Illumina, Qiagen, Bio-Rad, and others have built their own:

  • First, NGS data processing software complements their NGS sequencing products and services.
  • Second, companies may want to protect their intellectual property and differentiate their products from competitors. 
  • Third, companies may have their specific expertise and capabilities in areas such as data analysis or machine learning. 

Therefore, they may want to leverage these strengths in their software solutions, especially if there is a need for customization and flexibility in NGS data processing where off-the-shelf software tools may not meet the specific needs or requirements of a given research project.

However, keep in mind the effort needed for custom software development, which includes not only development and maintenance costs but also broad interdisciplinary know-how. It is therefore very important to have clearly defined use cases (e.g., internal NGS data processing or project-related services) and a subsequently calculated ROI. This is also the reason why, for example, academia usually cannot afford custom software solutions.

What about the know-how? For software designed explicitly for NGS data processing, it is crucial to have an interdisciplinary team with hands-on software development experience and, at the same time, a profound understanding of biology and bioinformatics, to empower life scientists with the best possible solutions.

To sum up, what are the pros and cons of custom software development vs. off-the-shelf solutions?

Custom software development vs. off-the-shelf solutions

Custom software development

Pros:

– Differentiation on the market

– Tailored to specific needs (e.g., kits, sequencing machines, processing performance)

– High throughput

– All in one place

– IP ownership

– Cloud-based or local

– AI implementation

Cons:

– Expensive to build, with longer ROI

– Maintenance costs

– Requires interdisciplinary know-how

Off-the-shelf solutions

Pros:

– Validated pipelines for the majority of NGS data

– A one-time or yearly license

– No maintenance costs

– R&D ready

– Faster ROI

Cons:

– No differentiation in the market

– Not applicable for special cases

– Lower processing performance for non-standard data

 

But most importantly, having a proprietary platform for NGS data processing expands the product portfolio and allows companies to offer tailored, customized solutions to their customers and adapt their software to the changing market needs.

Therefore, if you decide to develop a custom NGS data processing software solution, choose a company with hands-on experience with the technology. Lastly, make sure to understand your requirements; it will be much easier for your software partner to provide cost and time estimates for the project.

BioSistemika scientists have vast domain knowledge and experience in NGS data processing.
If you need a custom NGS data processing software solution, speak with our experts.

If you’re interested in learning more about our scientists’ groundbreaking project DATANA, continue reading here.