Researchers at the University of Pennsylvania and the Allen Institute for Artificial Intelligence have developed a groundbreaking tool that allows open-source AI systems to match or surpass the visual understanding capabilities of proprietary models like GPT-4V and Gemini 1.5 Flash, potentially reshaping the competitive landscape between open and closed AI development.
The tool, called CoSyn (Code-Guided Synthesis), addresses a critical bottleneck in AI development: the scarcity of high-quality training data for teaching machines to understand complex visual information like scientific charts, medical diagrams, and financial documents. Rather than scraping millions of images from the internet — a practice fraught with copyright and ethical concerns — CoSyn leverages the coding abilities of existing language models to generate synthetic training data.
“We lack such data to train the model, data like documents and charts with rich annotations to train a vision-language model to do question answering over those images,” explained Yue Yang, a recent Penn Engineering Ph.D. graduate and co-first author of the research, during an exclusive interview with VentureBeat. “Those images are actually more challenging to annotate, compared to natural photos like a picture of a dog, a cat or a house.”
The breakthrough comes as enterprises increasingly seek AI systems capable of understanding and reasoning about complex visual information — capabilities essential for everything from automated document processing to AI agents that can navigate digital interfaces independently. The work was conducted during Yang’s internship with the PRIOR team at the Allen Institute for AI and supported by the Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity, and the Defense Advanced Research Projects Agency.
How synthetic data generation solves AI’s biggest training challenge
The challenge of training AI to understand text-rich images has long plagued the field. Unlike natural photographs, scientific figures, charts, and documents require extensive annotation work that is both time-consuming and expensive. Traditional approaches have relied on harvesting images and their alt-text descriptions from the internet, but this method produces training data that is often superficial and legally problematic.
CoSyn takes a fundamentally different approach by recognizing that most text-rich images are originally created through code — Python scripts generate charts, LaTeX renders mathematical equations, HTML creates web interfaces. The research team’s insight was to reverse this process: use language models’ proven coding abilities to generate the underlying code, then execute that code to create realistic synthetic images.
“One intuition is that those images, like charts and documents, are rendered from programs, from code. We use Python to generate charts, and we use LaTeX or Word to write our documents,” Yang said. “So how about we go the reverse way and generate the code, because text-only language models have proven very good at writing code.”
Chris Callison-Burch, a computer science professor at Penn who co-advised the research, described the approach in simpler terms: “This is like taking a student who’s great at writing and asking them to teach someone how to draw, just by describing what the drawing should look like. We’re essentially transferring the strengths of open-source AI from text to vision.”
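In code terms, the recipe is straightforward, at least in outline. The sketch below is a minimal illustration of that reverse process, not CoSyn’s actual implementation: a text-only model is asked to write a Matplotlib script, the script is executed to render the image, and the same model then writes question-answer annotations grounded in the code it produced. The `llm_complete` helper is a hypothetical placeholder for whatever language-model API is available.

```python
# Minimal sketch of code-guided synthesis, in the spirit of CoSyn but not its actual code.
# `llm_complete` is a hypothetical stand-in for any text-only LLM API.
import json
import subprocess
from pathlib import Path

def llm_complete(prompt: str) -> str:
    """Hypothetical text-only LLM call; wire up a real client here."""
    raise NotImplementedError

def synthesize_chart_example(topic: str, out_dir: str = "synthetic") -> dict:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    # 1) Have the LLM write plotting code rather than draw an image directly.
    code = llm_complete(
        f"Write a self-contained Matplotlib script that saves a chart about "
        f"'{topic}' to chart.png, using plausible made-up data."
    )
    (out / "render.py").write_text(code)

    # 2) Execute the generated code to render a synthetic, copyright-free image.
    #    (In practice this should run in a sandbox, since the code is model-generated.)
    subprocess.run(["python", "render.py"], cwd=out, check=True, timeout=60)

    # 3) Because the code fully specifies the image, the LLM can write grounded
    #    question-answer annotations without ever "seeing" the rendered picture.
    qa = llm_complete(
        "Given this plotting code, write three question-answer pairs about the "
        f"rendered chart as a JSON list of {{'question', 'answer'}} objects:\n\n{code}"
    )
    return {"image": str(out / "chart.png"), "code": code, "qa": json.loads(qa)}
```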
CoSyn-trained models outperform GPT-4V and Gemini on key benchmarks
The results are striking. Using their synthetic dataset of 400,000 images and 2.7 million instruction pairs, models trained with CoSyn achieved state-of-the-art performance among open-source systems and surpassed proprietary models on seven benchmark tests measuring text-rich image understanding.
On average, their 7-billion parameter model scored 80.9% across the benchmark suite, outperforming the previous best open-source model (Llama 3.2 11B) by 3.9 percentage points. More remarkably, even their “zero-shot” model—trained without any examples from the evaluation datasets—outperformed most open and closed models, demonstrating the transferability of capabilities learned from synthetic data.
In one particularly compelling demonstration, the researchers created a new benchmark called NutritionQA, consisting of 100 questions about nutrition label photographs. Using just 7,000 synthetically generated nutrition labels for training, their model outperformed others trained on millions of real images. “Despite being trained on millions of images, we observe that open-source VLMs are not data-efficient and perform poorly on this novel task compared to GPT-4V,” the researchers wrote in their paper.
Yang emphasized the significance: “Those big companies have so many resources for collecting data and running a lot of experiments, but with open-source models we can give people access to everything: the model weights, the data we trained on, even the code and the training scripts, so developers can build upon it.”
Real companies are already using vision AI for quality control and automation
The technology is already finding real-world applications across industries. Callison-Burch cited an example from one of his teaching assistants whose company uses vision-language models for cable installation quality assurance: “They have the workers on site who are doing the installation take photographs of the process as they’re doing it, and they use that to automatically validate that each step has been followed properly.”
This type of specialized visual understanding could transform numerous enterprise workflows, from automated document processing in financial services to quality control in manufacturing. The ability to train models on specific visual tasks using synthetic data means companies can develop AI systems tailored to their particular needs without the massive data collection efforts traditionally required.
For enterprise decision makers, the research suggests a shift in how to approach AI data strategies. “I think synthetic data is a very promising way to remove the effort of human annotation. It costs less money, it can automatically generate data at large scale, and it can also avoid some copyright issues,” Yang noted.
The persona-driven approach that makes AI training data more diverse
One of CoSyn’s key innovations is its approach to ensuring data diversity. To prevent the repetitive outputs common in AI-generated content, the system employs what the researchers call a “persona-driven mechanism.” Each time CoSyn generates a synthetic example, it pairs the request with a randomly sampled persona—a short description like “a sci-fi novelist constantly bouncing off ideas for new alien worlds” or “a chemistry teacher preparing lab materials.”
“Every time we generate one piece of synthetic data, we pair it with a randomly sampled persona,” Yang explained. “This diversifies the content and styles of the examples we generate, because if I provide the persona of a Ph.D. student, it will generate something more scientific, something about academia.”
This approach enables the system to generate content across nine different categories: charts, documents, math problems, tables, diagrams, vector graphics, music sheets, electrical circuits, and chemical structures. The researchers used 11 different rendering tools, from Python’s Matplotlib for charts to LaTeX for mathematical expressions, supported by 20 specialized generation pipelines.
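To make the diversification step concrete, here is a hedged sketch of how a persona-conditioned prompt might be assembled: sample a persona and a category at random, then prepend them to the generation request so that repeated runs drift toward different topics and styles. The persona list, category names, and prompt wording are illustrative stand-ins, not CoSyn’s actual prompts or its 20 pipelines.

```python
# Illustrative sketch of persona-driven prompt construction (not CoSyn's actual prompts).
import random

PERSONAS = [
    "a sci-fi novelist constantly bouncing off ideas for new alien worlds",
    "a chemistry teacher preparing lab materials",
    "a financial analyst tracking quarterly earnings",
]

CATEGORIES = {
    "chart": "a Matplotlib script that saves a chart to chart.png",
    "document": "a LaTeX document that compiles to a one-page report",
    "circuit": "a schemdraw script that draws an electrical circuit",
}

def build_prompt(rng: random.Random) -> str:
    # Pairing each request with a randomly sampled persona pushes the model
    # toward different content, data, and styles across runs.
    persona = rng.choice(PERSONAS)
    category, renderer = rng.choice(list(CATEGORIES.items()))
    return (
        f"You are {persona}. Write {renderer} with content this persona "
        f"would plausibly create. Category: {category}."
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(build_prompt(rng), end="\n\n")
```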
Why this breakthrough could level the playing field between open source and Big Tech
The implications for the broader AI industry are significant. Major technology companies like OpenAI and Google have invested billions in developing their proprietary vision-language capabilities, creating systems whose training methods and data sources remain trade secrets. CoSyn offers a path for open-source alternatives to compete without requiring similar resource investments.
“Open-source models are still behind those closed-source models, but with all the efforts, all the resources from the open-source community, from everyone, we have more energy. So I think finally we can catch up,” Yang said.
The commitment to openness extends beyond just releasing the model. The complete CoSyn codebase, the 400,000-image dataset, and all training scripts are publicly available, enabling researchers and companies worldwide to build upon the work. “On the academic side, a lot of research is built upon openness. We need full access to the data, the code, everything, to discover new findings and support the claims in our papers,” Yang emphasized.
This transparency addresses growing concerns about the black-box nature of proprietary AI systems. “If you only rely on the APIs from, say, OpenAI, that may not be reliable for proving your scientific discoveries, because they may just change something in the back end and you never know,” Yang noted.
Beyond static image understanding, CoSyn is pioneering capabilities crucial for the next generation of AI agents—systems that can autonomously navigate digital interfaces and perform complex tasks. The researchers developed synthetic “pointing data” that teaches models exactly where to click on screenshots, a fundamental requirement for web-based automation.
Using 65,000 synthetic screenshots with click annotations, their model achieved state-of-the-art performance on ScreenSpot, a benchmark for click prediction, outperforming systems trained on 1.3 million real screenshots. “We only used several hundred thousand synthetic screenshots, and we can outperform previous models trained on millions of screenshots,” Yang said.
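The reason code-generated screenshots make pointing data cheap is that the renderer already knows where every element sits on the page. The sketch below illustrates that idea using Playwright to render a synthetic HTML page, take a screenshot, and read ground-truth click coordinates straight from the DOM; the tooling and the data format here are assumptions for illustration, not the paper’s actual pipeline.

```python
# Hedged sketch: deriving click-annotation ("pointing") data from synthetic HTML.
# Assumes Playwright is installed (`pip install playwright && playwright install chromium`);
# the actual CoSyn tooling may differ.
from playwright.sync_api import sync_playwright

SYNTHETIC_HTML = """
<html><body>
  <button id="submit">Submit order</button>
  <a id="help" href="#">Help center</a>
</body></html>
"""

def build_pointing_examples(html: str, shot_path: str = "screen.png") -> list[dict]:
    examples = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.set_content(html)
        page.screenshot(path=shot_path)

        # Because we generated the page ourselves, ground-truth click targets
        # come for free from the DOM instead of human annotation.
        for element in page.query_selector_all("button, a"):
            box = element.bounding_box()
            if box is None:
                continue
            examples.append({
                "image": shot_path,
                "instruction": f"Click on '{element.inner_text().strip()}'",
                "x": box["x"] + box["width"] / 2,
                "y": box["y"] + box["height"] / 2,
            })
        browser.close()
    return examples

if __name__ == "__main__":
    for example in build_pointing_examples(SYNTHETIC_HTML):
        print(example)
```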
This capability is essential as the industry moves toward AI agents that can perform knowledge work autonomously. “There are sort of two prevailing models in how you might go about implementing agents,” Callison-Burch explained. One approach uses specialized APIs, while the other relies on agents that “literally just use web browsing capabilities in the same way that you and I do.”
The vision-based approach, enabled by technologies like CoSyn, could prove more versatile: “You’re not just calling up a software function, which is relatively straightforward; you actually have to take screenshots of the current state of the web browser, reason about where to click, and navigate your mouse to that location to click.”
How synthetic data sidesteps the growing copyright crisis in AI training
The synthetic data approach also provides a potential solution to mounting legal challenges around AI training data. With ongoing litigation over whether training on copyrighted materials constitutes fair use, synthetic data generation offers an alternative path that sidesteps many intellectual property concerns.
Callison-Burch, who testified before Congress on AI and copyright in 2023, sees synthetic data as complementary to, rather than replacing, real-world training data: “I don’t think that synthetic data eliminates the need for having wide amounts of diverse training data; that’s still a core element of training AI systems. But it does allow you to extend their capabilities in really remarkable ways.”
The approach demonstrates how existing knowledge can be transferred to new applications without directly using copyrighted materials. “The underlying thing that we’re relying on here is a large language model that can write code, which is something it learned from its original data. We’re now applying that to a totally different application, which is the creation of new training data that is unlike any of the data it was trained on.”
The current limits of synthetic data and what comes next
Despite its promise, synthetic data generation faces important limitations. “One limitation is it may inherit the biases from the model that generates such synthetic data,” Yang acknowledged. The system can also struggle with diversity: “If you prompt a large language model to generate data across different runs, it may generate similar data.”
The current research focuses on text-rich images rather than natural photographs, limiting its immediate applicability to some domains. “What about real photos, other natural images? It is hard to generate synthetic data for those domains, or even for medical images like chest X-rays,” Yang noted, though she indicated ongoing efforts to extend the approach to medical imaging.
Looking ahead, Yang expects synthetic data generation to become standard practice: “In the future, in two or three years, synthetic data will be a very important component for teaching models different capabilities.” However, she emphasized that optimal results will likely require combining synthetic and real-world data: “Real-world data will reflect real-world distributions. Synthetic data can be large-scale and more controllable.”
Early adoption signals suggest the technology is already influencing industry practices. “I heard that companies like Meta, and also some teams at Amazon, are trying to use our data to train their models,” Yang revealed during the interview.
For startups and smaller companies, the cost advantages could be particularly significant. “For some startups, it is cheaper to host an open model on their own servers rather than just calling the APIs, which is less controllable,” Yang noted.
The research team’s decision to make everything open source reflects a broader philosophy about AI development. As Yang prepares to join the Allen Institute full-time after completing her Ph.D., the commitment to open science remains central to their mission. “Currently, those vision-language models are quite brittle; they just need the right data to get the right capabilities,” she said. “If you find the right data, you can improve a model’s capability on it, and it will benefit society.”
The vision for AI that acts, not just describes
As the research moves from academic laboratories to real-world applications, the implications extend far beyond improved benchmark scores. Yang and her colleagues are already looking toward applications that could transform how people with disabilities interact with technology, from AI that understands sign language for the hearing impaired to systems that can describe complex medical images for those with visual impairments.
“I have an idea to let the model know how to understand sign language, for people with hearing difficulties,” Yang said, describing potential future applications.
Callison-Burch sees even broader possibilities, particularly in robotics and scientific discovery: “Synthetic data opens up many possible applications that we don’t have naturally occurring data for. So one that Yang has also worked on at the Allen Institute is creating simulated training data for robots.”
The work represents more than just a technical achievement; it’s a demonstration that open-source AI development can compete with the well-funded efforts of major technology companies through innovative approaches to fundamental challenges. As Yang noted in reflecting on her decision to join the Allen Institute rather than accept higher-paying offers from companies like Meta: “I think it’s still a very early stage for those multimodal models, and there are not many open resources or much knowledge being shared with the community.”
The message is clear: in the race to build AI that can truly see and understand the world, the advantage may not always go to those with the deepest pockets, but to those with the most creative solutions.