GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

This paper introduces a new system called GenPIP that speeds up the process of analyzing genomes by combining two important steps (basecalling and read mapping) into one efficient process. This helps save time and energy while maintaining accuracy.

Key Takeaways
  1. Indexing is a preprocessing step that enables efficient queries to find matches between reference genome subsequences and reads.
  2. We are motivated to reject useless reads as soon as possible to reduce computation and memory overheads.
  3. We aim to quantitatively demonstrate the potential benefits of overcoming the limitations of prior works.
  4. We demonstrate System C to show the potential benefit of eliminating data movement between separate accelerators and CPUs.

Introduction

Long-read genome sequencing technologies have advanced genomic fields such as personalized medicine, forensic science, evolutionary biology, and infectious disease investigation. Oxford Nanopore Technology (ONT) is a widely used long-read sequencing technology.

ONT provides portable sequencing devices connected to a computer via a USB interface.

ONT devices generate long reads based on the organism’s DNA sequence.

Important Note

None of the pull-down circuits is active: the left circuit cannot drain current because of its high resistance, and the right circuit cannot because its transistor is off.

Important Note

The key idea of the chunk-mapping-based early rejection technique is that a read probably cannot be mapped to the reference genome if enough consecutive chunks in the read cannot be mapped to the reference genome.
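The rejection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each chunk is assumed to have already been labeled as mapped or unmapped (the summary does not say how that decision is made), and the function name `cmr_reject` and the parameter `max_unmapped_run` are hypothetical, not taken from the paper.

```python
from typing import Iterable


def cmr_reject(chunk_mapped: Iterable[bool], max_unmapped_run: int) -> bool:
    """Return True if the read should be rejected early.

    chunk_mapped: one flag per chunk, in read order (True = chunk mapped).
    max_unmapped_run: hypothetical threshold on consecutive unmapped chunks.
    """
    run = 0  # length of the current streak of unmapped chunks
    for mapped in chunk_mapped:
        run = 0 if mapped else run + 1
        if run >= max_unmapped_run:
            # Enough consecutive chunks failed to map: stop processing the read.
            return True
    return False
```

Because the check runs chunk by chunk, a read can be dropped as soon as the streak reaches the threshold, without waiting for its remaining chunks.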

Methodology

Reads are sent to a separate device for further analysis after basecalling. Basecalling and read mapping are the most time-consuming steps in the genome analysis pipeline due to computationally-intensive algorithms.

Study Design

Our goal is to provide effective in-memory acceleration of the genome analysis pipeline while minimizing data movement and useless computation.

We propose GenPIP, a fast and energy-efficient in-memory acceleration system for the genome analysis pipeline.

Results & Findings

  • A translated read is associated with a quality score for each base to reflect translation accuracy.
  • Basecalling commonly uses a deep neural network to ensure high accuracy.
  • Read mapping depends on dynamic programming-based algorithms to find matching locations in the reference genome.
  • Large execution time and energy consumption overheads ensue because a significant portion of basecalling output is unused.
  • We compare GenPIP with state-of-the-art software tools and a combination of state-of-the-art in-memory accelerators.

Practical Applications

Unfortunately, mapping short chunks yields too large a list of possible mapping locations. The seeding component sends a list of the possible match locations to the read mapping controller (4).

The read mapping controller then sends the chunk and its possible match locations to the DP units to perform chaining (5).

If the query string matches one reference string in the ReRAM-based CAM, the ReRAM-based CAM outputs the address (Addr.) to access the corresponding values (i.e., the possible match locations) stored inside the ReRAM-based RAM.
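The two-level lookup described above (match in the CAM, get an address, read the locations from the RAM) can be modeled in software as follows. This is a rough sketch, not the hardware design: a Python dict stands in for the CAM, a list of lists stands in for the RAM, and the names `build_index`, `lookup`, and the seed length `k` are assumptions made for illustration.

```python
def build_index(reference: str, k: int):
    """Index every length-k substring of the reference.

    Returns (cam, ram):
      cam maps a seed string to an address (software stand-in for the CAM),
      ram[address] lists the reference positions of that seed (stand-in for the RAM).
    """
    cam = {}
    ram = []
    for pos in range(len(reference) - k + 1):
        seed = reference[pos:pos + k]
        addr = cam.get(seed)
        if addr is None:
            # First time this seed is seen: allocate a new RAM row for it.
            addr = len(ram)
            cam[seed] = addr
            ram.append([])
        ram[addr].append(pos)
    return cam, ram


def lookup(cam, ram, query: str):
    """Return the possible match locations of query, or [] if the CAM has no match."""
    addr = cam.get(query)
    return [] if addr is None else ram[addr]
```

For example, with `cam, ram = build_index("ACGTACGA", 3)`, the call `lookup(cam, ram, "ACG")` returns both positions where that seed occurs, while a seed absent from the reference returns an empty list.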

Background and Motivation

This section analyzes the current genome analysis pipeline, identifies performance and energy bottlenecks, and sets the goals for the proposed GenPIP system.

Nanopore Genome Analysis Pipeline

Describes the ONT genome analysis pipeline, detailing the processes of basecalling and read mapping, and the importance of read quality control.

State-of-the-art Solutions

Reviews existing hardware accelerators for basecalling and read mapping, highlighting the advantages of non-volatile memory (NVM)-based processing in memory (PIM) accelerators.

Figures Explained

Figure 1: The genome sequencing and analysis pipeline. The basecalling step (➊) and the read mapping step (➌) are the two most time-consuming steps in the genome analysis pipeline. The read quality control step (➋) is a highly recommended but optional step to reduce the workload of read mapping by eliminating unnecessary computation. Dataset sizes and processing times are from [85].
Figure 2: The basic structure of an NVM-based PIM array designed for computing an MVM operation.
Figure 3: An example NVM-based CAM array for string matching.
Figure 4: Performance comparison between four different systems.
Figure 5: Conventional pipeline (a) vs. the chunk-based pipeline (CP) of GenPIP (b).
Figure 5(b) shows our CP design. As the figure shows, chunk-based basecalling, read quality control, and a part of read mapping (seeding and chaining) are pipelined. The chunk-based execution flow not only saves time via pipelined execution (by overlapping the execution of several steps), but also reduces the need for storing intermediate data as each pipeline step can quickly consume the small amount of output that is produced by the previous step.
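The chunk-based dataflow can be illustrated with lazy Python iterators. This sketch only shows the property that each chunk flows through all stages before the next chunk is produced, so no stage stores the full intermediate output. It does not model the overlapped (parallel) hardware execution, and the names `split_into_chunks` and `run_pipeline` and the stage functions passed in are hypothetical placeholders.

```python
from typing import Callable, Iterator, Sequence


def split_into_chunks(signal: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive chunks of the input, one at a time."""
    for start in range(0, len(signal), chunk_size):
        yield signal[start:start + chunk_size]


def run_pipeline(signal: Sequence, chunk_size: int,
                 stages: Sequence[Callable]) -> Iterator:
    """Chain the stages lazily over the chunk stream.

    Each stage consumes one chunk-sized item from the previous stage and
    produces one item for the next, so intermediate results are never
    collected for the whole read.
    """
    stream = split_into_chunks(signal, chunk_size)
    for stage in stages:
        stream = map(stage, stream)  # map is lazy: nothing runs until consumed
    return stream
```

Consuming the returned iterator drives the chunks through the stages one at a time; placeholder stages for basecalling, quality control, and seeding/chaining would be passed in as the `stages` list.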
Figure 6: Overview of the early rejection (ER) technique in the genome analysis pipeline (the green boxes ➋ and ➎ are the two early-rejection steps we introduce).
Figure 7: The quality scores of the chunks in two representative reads: (a) a low-quality read and (b) a high-quality read.
Quality-Score-based Rejection (QSR)
Input: the original read: read_original; length of the original read: N; chunk size: C; number of chunks needed for QSR: N_qs; quality score threshold: θ_qs
Output: rejection
  for i = 0; i < N_qs; i++ do
    // sum the quality scores of evenly-sampled chunks in the read
    sum_sample_score += quality score of the chunk located at ⌊i/(N_qs − 1)⌋ × ⌊N/C⌋ in read_original
  end
  average_score = sum_sample_score / N_qs
  if average_score < θ_qs then
    return rejection = TRUE
  else
    rejection = FALSE
  end
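The QSR listing above could be written in Python roughly as follows. This is a sketch under assumptions, not the authors' implementation: per-chunk quality scores are assumed to be precomputed and passed in as a list, and the sampled indices are spread evenly from the first to the last chunk, which is one reading of "evenly-sampled" and differs in form from the index expression in the listing. The names `sample_chunk_indices` and `qsr_reject` are hypothetical.

```python
from typing import Sequence


def sample_chunk_indices(num_chunks: int, num_samples: int) -> list[int]:
    """Pick chunk indices spread evenly from the first to the last chunk (assumed scheme)."""
    if num_chunks <= 0:
        raise ValueError("read must contain at least one chunk")
    num_samples = min(num_samples, num_chunks)
    if num_samples <= 1:
        return [0]
    last = num_chunks - 1
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


def qsr_reject(chunk_scores: Sequence[float], num_samples: int,
               threshold: float) -> bool:
    """Return True if the average score of the sampled chunks is below the threshold."""
    indices = sample_chunk_indices(len(chunk_scores), num_samples)
    average_score = sum(chunk_scores[i] for i in indices) / len(indices)
    return average_score < threshold
```

Only the sampled chunks need quality scores, so in a chunk-based flow the decision depends on a small number of non-consecutive chunks rather than on the whole read.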
Figure 8: Architecture overview of GenPIP. (a) The basecalling module. (b) The read mapping module. (c) The GenPIP controller.
Figure 9: Microarchitecture of the in-memory seeding accelerator.
Figure 10: Speedups of various systems normalized to CPU (300, 400, and 500 in the x-axis represent the three chunk sizes used in the evaluation).
Figure 11: Energy reduction of various systems normalized to CPU (300, 400, and 500 in the x-axis represent the three chunk sizes used in the evaluation).
Figure 12: Effect of the number of sampled chunks on ER-QSR's (a) rejection ratio and (b) false negative ratio.
Figure 13: Effect of the number of sampled chunks on ER-CMR's (a) rejection ratio and (b) false negative ratio.

Frequently Asked Questions

Each read has a length ranging from hundreds to millions of base pairs but has a high sequencing error rate. To store 1 or 0 in a CAM cell, the resistors are programmed to high and low resistance respectively.

We conclude that early rejection based on the quality score of chunks should sample a small number of non-consecutive chunks to accurately guess whether or not a read is low-quality. We also conclude that GenPIP is very effective at reducing energy.
