In 2003, after 13 years of painstaking work and nearly $3 billion in funding, an international team of scientists celebrated a milestone: the first sequencing of the human genome. For the first time, researchers had determined the order of about 90% of the chemical building blocks that make up human DNA.
Two decades later, sequencing a genome has become routine. Costs have fallen from hundreds of millions of dollars to less than $1,000, and scientists have mapped millions of genomes from humans, animals, plants and microbes.
But the revolution in genomics has created a new challenge: too much data.
The National Institutes of Health’s Sequence Read Archive (SRA), one of the world’s largest public repositories of genomic data, contained more than 36 petabytes of raw sequencing data in 2020. That’s 36 million gigabytes, or nearly 480 years of continuous HD video playback.
“The amount of data in the Sequence Read Archive has actually blown up by an order of magnitude compared to what it was supposed to be originally,” says Prashant Pandey, assistant professor of computer science at Northeastern University.
Pandey, who was recently awarded an NIH grant, is working to solve this problem. His goal is to build scalable systems and techniques that allow scientists in wet labs and hospitals to search the Sequence Read Archive efficiently, accelerating discoveries in biology and medicine.
Most of the sequencing data is stored in a raw, fragmented form. Before researchers can see the full picture, an assembled genome that can serve as a reference, these fragments must be pieced together.
“The assembly process is very expensive both algorithmically and to even execute it on machines because it takes a lot of time,” Pandey says.

Assembled genomes are stored in a public database that is easily searchable. However, there are no scalable computational techniques that would enable scientists to search through the bulk of the SRA. In other words, the finished genomes are easy to search, but the enormous raw DNA datasets are not.
“We have this treasure trove, this amazing and really insightful resource, which is just sitting around,” Pandey says. “We need the ability to search the raw sequencing data, all of it, at the petabyte scale.”
Building search capabilities like that, he says, requires innovation at every level.
“This requires innovating at all the levels of the stack, starting from new approximate indexing techniques, approximate data structures, building systems that can scale out in a distributed environment, hosting the whole thing in the cloud and making it publicly available for anyone to search,” Pandey says.
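To make “approximate data structures” concrete: tools in this space typically lean on compact membership structures, such as Bloom filters or the counting quotient filters that Mantis is built on, which answer “have we seen this DNA snippet before?” quickly and in little space while tolerating an occasional false positive. The toy Python sketch below illustrates the idea; it is not code from Pandey’s system, and every name and parameter in it is hypothetical.

```python
import hashlib

class BloomFilter:
    """Toy approximate-membership structure: small, fast, occasionally wrong."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from a single hash of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("ACGTACGTGGCTAGCTAGGAC")          # record a 21-letter DNA snippet
print("ACGTACGTGGCTAGCTAGGAC" in seen)     # True
print("TTTTTTTTTTTTTTTTTTTTT" in seen)     # almost certainly False
```

The trade-off is the whole point: by accepting a small, controllable error rate, the index stays compact enough to cover data at a scale where exact structures would not fit.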
When DNA is sequenced, the output is not one long strand but millions of short sequences, called reads. Each experiment, which could be run on a patient sample, a particular cell or tissue type, or a new species, produces massive collections of these reads.
When scientists discover something new, like a virus or a bacterium, they often ask: Have we seen this before? Has it appeared in any of those past experiments?
The complication is that the discovery often comes as a much longer piece of genetic material, known as a transcript.
“We want a technique in which we can take this longest transcript and find whether it appears in any of the experiments [in the SRA] or not,” Pandey says.
His team’s solution is to build an index. They break the short reads into small fixed-length snippets, called k-mers, and map them into a high-dimensional embedding, essentially a multidimensional map, creating a digital fingerprint for each experiment. These fingerprints are then stored in an index.
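As a rough illustration of that fingerprinting step, the hedged sketch below breaks a few toy reads into k-mers and compresses each “experiment” into a MinHash-style signature. The actual embedding used in Pandey’s system is not described here, so the k-mer length, sketch size and helper names are assumptions made for the example.

```python
import hashlib

K = 21            # length of each DNA snippet (k-mer); a common choice in practice
SKETCH_SIZE = 64  # number of slots in each experiment's fingerprint

def kmers(read, k=K):
    """Yield every length-k snippet of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def fingerprint(reads, sketch_size=SKETCH_SIZE):
    """Summarize an experiment's reads as a fixed-size signature:
    the smallest hash value observed in each slot."""
    sketch = [float("inf")] * sketch_size
    for read in reads:
        for kmer in kmers(read):
            h = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
            slot = h % sketch_size
            sketch[slot] = min(sketch[slot], h)
    return sketch

def similarity(a, b):
    """Fraction of occupied slots where two fingerprints agree:
    a rough proxy for shared k-mer content."""
    hits = sum(x == y for x, y in zip(a, b)
               if x != float("inf") and y != float("inf"))
    occupied = sum(1 for x, y in zip(a, b)
                   if x != float("inf") or y != float("inf"))
    return hits / occupied if occupied else 0.0

# Two toy "experiments", each just a handful of short reads.
experiment_1 = ["ACGTACGTGGCTAGCTAGGACTTACGGAT", "GGCTAGCTAGGACTTACGGATCCATGCAA"]
experiment_2 = ["TTTCCGGAATCGATCGTTAGCCGTAGGCA", "ATCGATCGTTAGCCGTAGGCATTTCCAAG"]

index = {"exp1": fingerprint(experiment_1), "exp2": fingerprint(experiment_2)}
```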
When a query transcript arrives, its fingerprint is generated and compared against the index. This narrows the search from millions of experiments to just a few hundred likely matches.
“If we go and search every possible experiment to find whether the query exists or not, it will take a lot of time and it is not computationally feasible,” Pandey says. “So this index … quickly helps us to prune down our search space.”
Traditional tools can then be used to double-check the smaller set of experiments to confirm whether the sequence is present.
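Putting the two stages together, the sketch below (reusing the toy `fingerprint`, `similarity` and `kmers` helpers from the previous example) first prunes experiments by fingerprint similarity and only then runs an exact check on the survivors. The threshold and data are illustrative, not values from the SRA index.

```python
def search(query_transcript, index, raw_reads_by_experiment, threshold=0.1):
    query_fp = fingerprint([query_transcript])

    # Stage 1: cheap, approximate pruning against the fingerprint index.
    candidates = [exp for exp, fp in index.items()
                  if similarity(query_fp, fp) >= threshold]

    # Stage 2: exact verification, run only on the pruned candidate set.
    query_kmers = set(kmers(query_transcript))
    confirmed = []
    for exp in candidates:
        exp_kmers = set()
        for read in raw_reads_by_experiment[exp]:
            exp_kmers.update(kmers(read))
        if query_kmers & exp_kmers:
            confirmed.append(exp)
    return confirmed

reads_by_exp = {"exp1": experiment_1, "exp2": experiment_2}
print(search("ACGTACGTGGCTAGCTAGGACTTACGGATCCATGCAA",
             index, reads_by_exp))   # -> ['exp1']
```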
Pandey and his collaborators Rob Patro at the University of Maryland and Michael Ferdman and Rob Johnson at Stony Brook University previously developed a tool called Mantis to index sequences. Mantis works well for thousands of experiments, but the SRA holds millions, a real scaling challenge.
Another obstacle is accessibility: not every lab has the resources to download and run a massive index locally.
To solve this, the team decided to build the index in a distributed way, across many machines.
“The query can be transmitted to all of these distributed machines, and we can collect the results from every machine, aggregate them and give the results back to the user and to make it easily usable,” Pandey says.
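The sketch below mimics that scatter-gather flow, with threads standing in for separate machines and the toy helpers from the earlier examples doing the matching. A production deployment would shard the index across real servers, but the basic pattern Pandey describes (send the query everywhere, collect partial results, merge them) is what this models.

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_id, shard_index, query_fp, threshold=0.1):
    """Runs on one 'machine': return the experiments in its shard of the
    index whose fingerprints look similar enough to the query's."""
    return [(shard_id, exp) for exp, fp in shard_index.items()
            if similarity(query_fp, fp) >= threshold]

def distributed_search(query_transcript, shards, threshold=0.1):
    query_fp = fingerprint([query_transcript])
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Scatter: the same query fingerprint goes to every shard.
        futures = [pool.submit(query_shard, sid, idx, query_fp, threshold)
                   for sid, idx in shards.items()]
        # Gather: collect and merge the partial results for the user.
        return [hit for f in futures for hit in f.result()]

# Two toy shards, each holding a slice of the full fingerprint index.
shards = {"machine-a": {"exp1": index["exp1"]},
          "machine-b": {"exp2": index["exp2"]}}
print(distributed_search("ACGTACGTGGCTAGCTAGGACTTACGGATCCATGCAA",
                         shards))   # -> [('machine-a', 'exp1')]
```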
They have also created a website where users can simply enter a sequence they want to query and get results.
“Like a google.com, but for the Sequence Read Archive,” he says.
Pandey emphasizes that usability is just as important as technical innovation. His team is collaborating with researchers at the Joint Genome Institute in California and the Utah Center for Genetic Discovery, which handle real-world datasets on human disease and plant genomes.
“The goal isn’t just to build a system and make it available,” Pandey says. “But also make sure that we work with these scientists in the field and actually help them enable their scientific discoveries.”
He’s excited to see that impact.