David Baker has been pushing the limits of designing proteins for decades.
The computational biologist started developing software, called Rosetta, to study and design proteins in the 1990s. He’s now the director of the University of Washington’s Institute for Protein Design.
The Baker lab has become a prolific producer of not just AI models and scientific articles, but of big-swinging startups. He has co-founded 21 companies, most notably Xaira Therapeutics, backed by $1 billion-plus to turn his lab’s research into drugs.
Baker won the Nobel Prize in Chemistry last week for his protein work. He sat down with Endpoints News to chat at our inaugural AI Day. This conversation has been substantially edited for clarity and length.
Andrew Dunn:
How’s it feel to add Nobel laureate to your list of titles?
David Baker:
It’s been a little crazy. I don’t think I’ve ever been so strung out in my life. I’m looking forward to things calming down a bit.
Dunn:
Why is designing proteins from scratch such a big deal?
Baker:
We know proteins can carry out an amazing array of functions, evolved over millions or billions of years to solve the problems life faced. You think about photosynthesis, or all the ion channels in our brains that mediate cognition. We work just amazingly well because we have amazing proteins.
There are new problems today. In medicine, we live longer, so there are new diseases. There’s always potential for new pandemic viruses. And outside of medicine, we’re heating up the planet and polluting it.
Some of these probably would be solved if there was evolutionary pressure and we had another 100 million years to wait. The promise of protein design is to be able to design new proteins that solve current problems as well as proteins in nature solve problems that were relevant during natural selection.
Dunn:
How’d you first get interested in protein design, particularly with the computational focus?
Baker:
I wasn’t the first. Bill DeGrado showed you could do de novo design, and Steve Mayo showed you could use computers to redesign the sequence of proteins. Brian Kuhlman, when he came to my lab as a postdoc, had the idea of doing flexible backbone protein design.
When I arrived at UW, we had been doing experiments to understand how proteins fold and incorporating what we learned in the first version of Rosetta, which was really focused on structure prediction. Brian had the idea of combining that with sequence design, kind of like Steve Mayo had done, to do flexible backbone protein design.
In 2003, he and Gautam Dantas had developed Top7, which was a protein with a brand-new structure, different from anything in nature. That opened the floodgates. Top7 didn’t do anything, but we thought, “Now we can design new proteins to have all sorts of new functions.”
Dunn:
Where is the field in connecting protein sequences to functions?
Baker:
A lot of what we’re doing, back to the binder design problem, is we start from the structure of a target. With AlphaFold, one can generate structures pretty accurately for a lot of proteins.
Function is important, too, because we want to make things that are going to cure disease. Biology is still really hard. What people don’t understand is that the problem of making a binder to a target is close to getting solved. The real question is what should you make binders to, and there’s a lot of different answers to that.
We’re also designing conditional therapeutics that only should be active at the right time and place in the body. We have all this mechanism design — binding design, conditionality — but the real question is what is the right pair to bring together, and what is the right target to block or agonize? The biology is still a very important part of this that goes beyond design.
Dunn:
If we went back a decade, and I asked you about what you just said — designing binders to targets as a solved problem — that would be astonishing.
Baker:
This all happened in the last five years.
If you look at our recent papers, we’re just about to upload one on making really large numbers of synthetic cytokines by bringing together different pairs of receptors in novel combinations. We have something like 25 binders, 11 of which were designed for that paper. Now it’s about making whole collections of binders to attack a certain problem.
We’ve been working on these physically based methods for designing binders for quite some time. Longxing Cao and Brian Coventry made this big breakthrough, where we showed we could design binders to 13 different targets. That was using the old Rosetta, physically based models.
A couple years later, after we developed RFdiffusion, we found we could make binders better and faster than before. We kind of solved the problem in two different ways.
Dunn:
The acceleration of progress in the last few years is remarkable. What are the key drivers of that?
Baker:
As the technologies get better, they feed on each other. My lab has kind of exploded; there are so many smart people trying to develop stuff. Someone will make an advance and another person will build on it.
Another part is the application of deep learning to protein design. RFdiffusion and ProteinMPNN, these tools are now being used around the world. I get emails all the time saying, “We made a great binder with your software.”
The other point that sometimes goes unsaid was there was this huge untapped resource in the Protein Data Bank. There were probably tens of billions of dollars put into it over 60 years. Now the AI methods are really kind of tapping what was in there.
Dunn:
When you think of lead optimization, getting something to look more like a drug, how optimistic or skeptical are you that an AI-first approach can tackle that?
Baker:
Let’s say you want to predict whether a compound was going to pass a clinical trial. If we had hundreds of thousands of trials, and we had the compounds, and knew exactly what happened in each of them, I imagine you could train a model that would be pretty effective. We obviously don’t have that data.
I think there are two paths forward. The first is to be clever and identify proxies that will correlate with long-term success, and then make those structural proxies that you can optimize, like aiming for a certain amount of surface hydrophobicity or something. It’s still going to be a human guess that that property will correlate with success down the line.
The second is to generate the relevant datasets. No entity on Earth can carry out 100,000 clinical trials and collect the data. Big pharma has a lot of internal data on where different compounds failed in the drug development pipeline. A really interesting thing now is training on that data.
The success of that is going to be determined by how extensive the datasets are. There are efforts within companies like Xaira to generate large internal datasets, and it’ll be interesting to see how those do in developing better drugs.
Dunn:
That’s maybe a holy grail for AI in biopharma. What is the Baker Lab’s holy grail, or holy grails, in the longer term?
Baker:
Being able to design molecular machines like chaperones or motors. This is really kind of futuristic, but I think it’s within reach. We can design catalysts, we can design binding, and now we’re trying to couple binding and catalysis, like designing site-specific proteases.
We’re working toward those now along with really precision medicine, making extremely conditional agonists that only act in very well-defined places. And novel agonists that generate new types of biological activities on specific cell types.
Dunn:
Why has your lab been able to stay so productive for such a long period of time?
Baker:
I have some very strong opinions on how to — well, I can just describe what my philosophy is in running a lab and institute.
It’s based on the idea of a communal brain. Sea slugs have one or two neurons and they can do really simple things. The human brain can do amazing stuff with all the neurons connected.
In my group, the emphasis is on frequent interaction, brainstorming and constant discussion. We have different types of free food every day of the week to try to get people together. If you think of each researcher as a neuron, we’re just trying to maximize the connections.
The second is recruiting people. There are a very large number of people who would like to come to the lab and the institute. That’s the primary selection criterion: You come, you talk to everyone for two days, and everyone votes.
The other thing is I don’t ever go anywhere. I am in every day, walking around, talking to people. I’m just trying to make sure everyone is maximizing connections.
Dunn:
How selective are those votes?
Baker:
Yeah, it’s pretty selective. If you write to me for, say, a postdoc, and you look like a superstar, then I ask you to talk to several people in the lab. If they think you’re a superstar, then you get invited.
Dunn:
Is it like getting into Harvard, like five percent?
Baker:
Oh, I don’t know. It’s probably harder. I haven’t really counted.
Dunn:
On startups, there’s a gold-rush mentality for everything AI among VCs. Is that helpful or hurtful to your research?
Baker:
It’s tremendously beneficial.
Dunn:
I imagine there could be tension between radical openness, of ‘Here’s everything I’m working on’ and swapping lab notebooks, versus if you’re starting a company, you might want to say, ‘This is my work and my territory.’
Baker:
That’s a really good point. In the lab, everything is totally open. But at some point if you’re going to develop a new product, you can’t really do that in academic environments.
It gets a bit more complicated for things like DeepMind, which has produced amazing software like AlphaFold. But they’re sort of a company too, so it gets confusing. They publish the work but don’t make the software available. It just creates all this tension. Having a totally open thing that’s hooked onto people starting companies is a nice way to do it.
Dunn:
On open science, your lab has led the way in open-sourcing and publishing constantly. That contrasts with the tech world. OpenAI has largely closed their work. DeepMind, with AlphaFold 3, has teetered on what is and isn’t publicly available. Is there a risk AI bio will grow too closed?
Baker:
I think it’s working pretty well. To DeepMind’s credit, they didn’t release the AlphaFold 3 code, but they published the paper.
The ecosystem is very healthy. I remember the CASP experiment, where protein structure prediction is evaluated, and after DeepMind’s first results were presented there was a worry that big tech was going to dominate structural biology from then on. That’s not what’s happened. There’s no one entity that’s dominating, and I think that’s a good thing.
Protein design really works well in an open environment. The closed environment of a startup or a pharma is really good for taking a slightly more mature idea and developing it to the point where it can really save lives.
Dunn:
It’s been fascinating to see not just DeepMind but stints in protein research from Salesforce, Meta and others. I don’t know if that’s where Salesforce should be spending its time.
Baker:
With tech companies, the large language models are really the thing. If you’re a big tech company and you were investing money in protein folding or protein design, maybe you were getting a little bit behind on the language models.
So now, the companies are retrenching and saying, “Here’s what we need to focus on to survive and the protein stuff was kind of fun, but it was a little bit peripheral.”
Dunn:
On large language models, do you think they’re overrated in biology specifically?
Baker:
There’s probably a little bit too much hype by people who don’t quite know what they’re talking about.
Dunn:
Someone recently put it to me as the “bio-naïve” folks.
Baker:
That’s right, like ChatGPT for biology. That doesn’t really make sense. You ask ChatGPT to program a robot to move around and walk, and it’s hopeless.
Dunn:
To turn to data generation, what are the biggest data gaps?
Baker:
On the pharma side, a lot more data from later in the drug development pipeline. If we go up the food chain, large binding affinity datasets that are really accurate, covering many different compounds binding to many different proteins. And data on each step in the drug development pipeline, showing where compounds failed.
Generating really good datasets is going to be critical for training ML models on them, and it’s going to take a lot of creativity to think about how to generate them.
Dunn:
What makes for a good grad or postgrad research project today in the Baker Lab?
Baker:
They all look a bit on the lunatic fringe. We’re trying to do things that haven’t been solved.
My basic formula is that it should be a really important, unsolved problem — but there’s a good chance that a really smart, creative, dedicated person could solve it in the next two or three years. We’re aggressively at the forefront.