Nearly a year ago, bioengineering professor Patrick Hsu asked: “Is DNA all you need?” Hsu, co-founder of the Arc Institute, posed that question while helping create Evo, an AI model trained on 2.7 million genomes of prokaryotes and phages. Built in pursuit of the grand hope of AI models that begin to understand and manipulate biology, Evo opened the possibility that training a model on vast amounts of DNA, and just DNA, may be a promising path forward. After all, Evo generated a novel CRISPR system from scratch based on the genomes it was trained on.
Hsu’s team is back to take that year-old question to the next level. Its newest model, called Evo 2, was trained on far more As, Gs, Cs, and Ts: roughly nine trillion base pairs of DNA from over 128,000 different species of bacteria, archaea, and eukaryotes. On a call with reporters, team members described Evo 2 as the largest publicly available AI biology model to date.
The team, which includes leaders from the Arc Institute — a nonprofit founded in 2021 to bankroll scientists — and the chipmaking giant Nvidia, open-sourced the model and on Wednesday posted a 65-page preprint describing how Evo 2 was built and some of the results. Evo 2 is a sweeping research collaboration, with roughly 50 authors listed on the paper, most of whom are affiliated with the Arc Institute, Nvidia, UC Berkeley and UCSF.
The makeup of the research consortium shows how many of the leaders building AI biology models come from outside the biopharma industry. The author list includes engineers affiliated with AI startups like Liquid AI and Goodfire. Greg Brockman, the co-founder and president of OpenAI, spent part of a sabbatical working on the piece of Evo 2’s design that allows it to consider an extra-long stretch of DNA code.
The term “foundation model” has become a ubiquitous buzzword in the AI world, but Evo 2 fits the narrower definition: a model trained on enormous amounts of data and general enough to handle many different tasks.
The new preprint described some of those uses, like predicting whether certain mutations to the BRCA1 gene could cause disease. The Evo 2 team reported that the model appeared to learn some biological concepts on its own, like transcription factor binding sites, protein structure elements and intron-exon boundaries.
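The variant-prediction idea is simple to sketch: a DNA language model assigns a likelihood to any sequence, so a mutation can be scored by how much it lowers the model’s likelihood of the surrounding sequence relative to the reference. The sketch below illustrates that delta-log-likelihood technique in general; the `model.log_likelihood` call is a hypothetical stand-in for illustration, not Evo 2’s actual interface.

```python
# Minimal sketch of zero-shot variant-effect scoring with a DNA language
# model. `model.log_likelihood` is a hypothetical stand-in, not Evo 2's
# actual API; the point is the delta-log-likelihood technique itself.

def mutate(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]

def variant_score(model, ref_seq: str, pos: int, alt: str) -> float:
    """Score a single-nucleotide variant as the change in model
    log-likelihood relative to the reference sequence. More negative
    means the model finds the variant less plausible, a rough proxy
    for functional disruption."""
    alt_seq = mutate(ref_seq, pos, alt)
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)

# Hypothetical usage: score a C>T change at position 1200 of a
# sequence window around a BRCA1 exon.
# score = variant_score(model, brca1_window, 1200, "T")
# print(f"delta log-likelihood: {score:.2f}")
```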
As with a true foundation model, it remains to be seen where Evo 2 could have the most impact. That contrasts with other top AI models that were designed to master one job, the way AlphaFold2 was built to predict the structure of proteins. Instead, Evo 2’s builders envision researchers building task-specific applications on top of the model, with Evo 2 acting like an operating system.
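One common pattern for building on top of a foundation model is to freeze it, extract its embeddings as features, and train a small task-specific classifier on those features. The sketch below shows that pattern under stated assumptions: `model.embed` is a hypothetical stand-in for whatever embedding interface a given model exposes, and the pathogenic-vs-benign task is just an example.

```python
# Sketch of the "operating system" pattern: keep the foundation model
# frozen, use its per-sequence embeddings as features, and fit a small
# task-specific head on top. `model.embed` is a hypothetical stand-in.

import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_sequences(model, seqs: list[str]) -> np.ndarray:
    """Stack one embedding vector per DNA sequence."""
    return np.stack([model.embed(s) for s in seqs])

def train_task_head(model, seqs: list[str], labels: list[int]) -> LogisticRegression:
    """Fit a lightweight classifier on frozen foundation-model features,
    e.g. pathogenic-vs-benign labels for variant windows."""
    X = embed_sequences(model, seqs)
    head = LogisticRegression(max_iter=1000)
    head.fit(X, labels)
    return head
```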
“From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet,” Arc Institute’s chief technology officer Dave Burke said in a statement.
Stanford assistant professor Brian Hie said he believes Evo 2, paired with other models, could open new ways of manipulating genomes.
“We kind of want to go beyond designing things at the molecular level because that’s all that we could control,” Hie said on the media call. “If we have a powerful model that lets us generate things at the scale of complete organisms, then this unlocks a lot of downstream tasks for which I don’t think a lot of people have even imagined what the potential use cases are.”