Key takeaways
- The AnVIL Data Explorer makes it easier for researchers to access and reuse high-value datasets, including those that focus on specific diseases like Alzheimer’s, cancer, and rare genetic disorders
- It currently organizes over 280 datasets from major NIH-supported consortia, with new data added quarterly
- By helping scientists combine information from multiple studies and build on existing resources, the Data Explorer will accelerate discovery, especially for rare and understudied conditions
Collecting high-quality genomic data is time-consuming, expensive, and often only possible through large-scale national efforts. Thankfully, a new tool developed by the UC Santa Cruz Genomics Institute’s Computational Genomics Lab is making existing datasets easier to find and use, ensuring that more researchers can build on these precious resources rather than starting from scratch.
The AnVIL Data Explorer, now live on the AnVIL platform, gives researchers fast, easy access to hundreds of human genomic datasets and lets them construct research-specific groupings, accelerating discoveries for health conditions ranging from cancer to rare diseases.
“This tool is about amplifying the impact of data and making the most of past public investments,” said Benedict Paten, professor of biomolecular engineering and director for computational genomics for the UC Santa Cruz Genomics Institute. “We’re making it easier for researchers to find the exact datasets they need, build cohorts, and start analyzing them right away. That means faster insights, more collaboration, and ultimately, better outcomes for human health.”
A human’s genetic sequence is billions of base pairs long, which means that studying even a single genome generates massive amounts of raw data. Multiply that by hundreds or even thousands of individuals in larger genome-wide studies, and the data volume quickly becomes enormous.
Traditional genomic research workflows require researchers to download these huge datasets to local servers, which is inefficient and costly. To address this problem, the National Human Genome Research Institute (NHGRI) created the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL), which lets researchers access and analyze data centrally in the cloud.
The Data Explorer is a central component of AnVIL. It connects users with hundreds of datasets contributed by NHGRI-funded consortia and makes it easy to search, request access, and organize data for analysis—all within a secure, scalable environment.
“The goal is to help scientists spend less time collecting and searching for data and more time making discoveries that could improve human health,” Paten said. “It also enables researchers to combine data across studies, opening the door to insights that wouldn’t be possible with smaller datasets alone.”
The AnVIL Data Explorer is currently a gateway to over 280 datasets, including contributions from major consortia like the 1000 Genomes Project, the Human Pangenome Reference Consortium, Telomere-to-Telomere, and the Center for Alzheimer’s and Related Dementias, among others. The collection is continually growing and supports studies in rare diseases, neurogenomics, aging, cancer, and beyond.
Although most datasets require access approval, users can browse core information for all of them before requesting access through a streamlined system tied to NIH credentials. Users can then analyze their data directly in Terra, which is AnVIL’s secure analysis platform built on Google Cloud.
The Explorer offers five views, organized by dataset, donor, biosample, activity, or file name, and includes a search function for quick navigation. In keeping with AnVIL’s commitment to open science and interoperability, data from the Explorer can also be accessed through other platforms on Google Cloud, such as the National Heart, Lung, and Blood Institute’s BioData Catalyst.
New users can create a free account by visiting anvilproject.org and clicking on “Launch Terra.” Once signed in, they can go directly to the AnVIL Data Explorer to browse datasets and follow simple prompts to request access through dbGaP or DUOS. The platform’s built-in help guides and documentation provide step-by-step guidance.
While AnVIL already supports a wide range of high-impact research, the platform is still in its early days. As with all collaborative science efforts, its potential for empowering discoveries will increase as more researchers jump on board. The Data Explorer team encourages users to submit feedback and suggestions via the “help” section of the platform.
“The platform is designed to grow and improve as more groups contribute data and provide feedback,” Paten said. “We’re eager to see the research community engage with it and help shape what comes next.”