Google’s VaultGemma AI Hoovers Up Your Data—Without Memorizing It

Training AI models on your data can provide powerful new insights, but it can also potentially result in them leaking sensitive information. Now Google has released a new model designed from the ground up to prevent these kinds of privacy breaches.

Large language models are a promising way to extract valuable information from the piles of unstructured data most companies are sitting on. But much of this data is full of highly sensitive details about customers, intellectual property, and company finances.

That’s a problem because language models tend to memorize some of the data they’re trained on and can occasionally spit it back out verbatim. That can make it very hard to ensure these models don’t reveal private data to the wrong people in the wrong context.

One potential workaround is an approach called differential privacy, which allows you to extract insights from data without revealing the specifics of the underlying information. However, it makes training AI models significantly less effective, requiring more data and computing resources to achieve a given level of accuracy.

Now, though, Google researchers have mapped the trade-offs between privacy guarantees, compute budgets, and data requirements to come up with a recipe for efficiently building privacy-preserving AI models. And they’ve used this playbook to create a 1-billion-parameter model called VaultGemma that performs on par with older models of similar size, showing privacy can be protected without entirely sacrificing capability.

“VaultGemma represents a significant step forward in the journey toward building AI that is both powerful and private by design,” the researchers write in a blog post.

Differential privacy involves injecting a small amount of noise, or random data, during the AI training process. This doesn’t change the overarching patterns and insights the model learns, but it obfuscates the contributions of particular data points. This makes it harder for the model to memorize specific details from the dataset that could later be regurgitated.
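In practice, this noise injection is usually done with a variant of differentially private stochastic gradient descent (DP-SGD): each training example’s gradient is clipped to a fixed norm and Gaussian noise is added before the model is updated. The toy sketch below, which trains a simple linear model with illustrative clipping and noise settings, shows the mechanics only; it is not Google’s training setup.

```python
# Minimal DP-SGD sketch for a toy linear model with squared-error loss.
# Clipping bound, noise multiplier, and learning rate are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: clip each example's gradient, then add Gaussian noise."""
    per_example_grads = []
    for xi, yi in zip(X, y):
        grad = 2 * (xi @ w - yi) * xi                 # gradient of squared error for one example
        norm = np.linalg.norm(grad)
        grad = grad / max(1.0, norm / clip_norm)      # clip to bound any single example's influence
        per_example_grads.append(grad)
    summed = np.sum(per_example_grads, axis=0)
    # Noise scaled to the clipping bound obscures each individual contribution.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * (summed + noise) / len(X)

# Toy data: learn y = X @ w_true despite the injected noise.
X = rng.normal(size=(256, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)
for _ in range(200):
    w = dp_sgd_step(w, X, y)
print(w)  # approximately recovers w_true despite the per-step noise
```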

However, the strength of the privacy guarantee this technique provides, which is quantified as a privacy budget, depends on how much noise is injected: more noise buys a tighter guarantee, but it also makes the training process less effective, so more data and compute are needed to reach the same accuracy. These three factors interact in complicated ways that make it tricky to figure out the most efficient way to build a model with specific privacy guarantees and performance.
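To get a feel for the numbers, the classic Gaussian-mechanism bound from the differential privacy literature, sigma ≥ sensitivity × √(2 ln(1.25/δ)) / ε for ε < 1, shows how fast the required noise grows as the budget ε shrinks. Real training runs track privacy with a more sophisticated accountant, so treat this back-of-the-envelope sketch as an illustration of the trade-off only.

```python
# Required Gaussian noise scale for a given privacy budget (epsilon, delta),
# using the classic bound sigma >= sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
import math

def gaussian_noise_scale(epsilon, delta=1e-5, sensitivity=1.0):
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

for eps in (0.1, 0.5, 0.9):
    print(f"epsilon={eps}: sigma ~ {gaussian_noise_scale(eps):.1f}")
# sigma scales as 1/epsilon: tightening the budget tenfold demands ten times the noise,
# which is why stronger privacy makes training hungrier for data and compute.
```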

So the Google team carried out a series of experiments with the company’s open-source Gemma family of models, varying these key parameters to discover how they interact. From this, they outlined a series of scaling laws, detailed in a pre-print on arXiv, that allowed them to predict how altering compute, data, and privacy budgets affects a model’s final performance.

One of their main insights was that ramping up compute during training doesn’t boost model accuracy unless the model is fed more data or privacy guarantees are loosened. They also found the optimal model size is roughly an order of magnitude smaller than models without differential privacy, suggesting it may be difficult to extend the approach to today’s largest models.

However, the scaling laws also predict the most compute-efficient training configuration for a particular dataset size and privacy budget. This allowed the researchers to cut computing requirements by 5 to 100 times compared with alternative configurations, while achieving similar accuracy.
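As a rough illustration of how such a law might be used, the hypothetical sketch below searches a handful of model-size and training-token combinations under a fixed compute budget and picks the one a made-up loss formula predicts will perform best. The functional form, coefficients, and numbers here are invented for demonstration; the fitted law itself is in the team’s preprint.

```python
# Hypothetical configuration search guided by a (made-up) scaling law.
from itertools import product

def predicted_loss(params_b, tokens_b, noise_multiplier):
    # Toy stand-in: loss falls with model size and data, rises with DP noise.
    return 3.0 + 8.0 / params_b**0.3 + 40.0 / tokens_b**0.3 + 0.5 * noise_multiplier

def training_flops(params_b, tokens_b):
    # The usual ~6 * N * D approximation for transformer training compute.
    return 6 * (params_b * 1e9) * (tokens_b * 1e9)

budget_flops = 5e20
noise = 2.0  # fixed here by an assumed privacy target
candidates = product([0.25, 0.5, 1.0, 2.0], [5, 10, 20, 40, 80])  # params (B), tokens (B)
feasible = [(p, t) for p, t in candidates if training_flops(p, t) <= budget_flops]
best = min(feasible, key=lambda pt: predicted_loss(*pt, noise))
print(best)  # with this toy law, a smaller model trained on more tokens wins
```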

The team used these insights to create VaultGemma, which performed comparably to the similarly sized GPT-2 model that OpenAI released in 2019. Given the pace of advances in AI, matching the performance of a model from six years ago is not an especially high bar, but the researchers say the scaling laws they’ve identified should help close that gap.

And in a technical report accompanying the model release, the team provides strong evidence that the approach prevents the model from memorizing training data. They took one million training data samples, each 100 tokens long, and fed the first 50 tokens to the model to see if it would complete the sample. While all three generations of Gemma models regurgitated some amount of this data, the researchers found no evidence that VaultGemma had memorized any of the samples.
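A probe along those lines could be scripted roughly as shown below. The Hugging Face model identifier, the exact-match criterion, and the source of the training samples are assumptions for illustration, not details taken from the report.

```python
# Sketch of a memorization probe: prompt with the first 50 tokens of a training
# sample and check whether the model reproduces the next 50 verbatim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"   # assumed identifier for the released checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def is_memorized(sample_text: str) -> bool:
    ids = tok(sample_text, return_tensors="pt").input_ids[0][:100]  # 100-token sample
    prefix, target = ids[:50], ids[50:]
    out = model.generate(prefix.unsqueeze(0), max_new_tokens=len(target), do_sample=False)
    continuation = out[0][prefix.shape[0]:]             # drop the prompt from the output
    return torch.equal(continuation.cpu(), target.cpu())

# training_samples would be drawn from the model's training corpus:
# memorized = sum(is_memorized(s) for s in training_samples)
```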

While VaultGemma remains an experimental model with no real practical value, it demonstrates that relatively sophisticated, privacy-preserving AI models are within reach. Hopefully, others can build on these scaling laws to push the field further in this direction.
