From 10,000 Structures to 1.8 Billion Interactions: Breaking the Data Bottleneck to Engineer Efficacious Therapeutics - Part 2

For years, protein engineering has struggled with a fundamental bottleneck: machine learning models, powerful as they are, have been starved of meaningful data. Most rely on just 10,000 antibody-antigen structures from the Protein Data Bank—a drop in the ocean compared to the billions of parameters powering modern large-scale AI models. The result? Promising molecules often get stuck in preclinical limbo, not because the models aren’t smart, but because they keep hallucinating when trained on incomplete data.

In this episode from the Smart Biotech Scientist Podcast, David Brühlmann meets Troy Lionberger, Chief Business Officer at A-Alpha Bio, whose team has quietly shattered the data ceiling by measuring and curating more than 1.8 billion protein interactions.

Key Topics Discussed

  • Protein–protein interactions are essential to biology but have been historically hard to characterize, limiting traditional antibody discovery.
  • De novo antibody design complements conventional methods to overcome these constraints.
  • A-Alpha Bio’s high-throughput platform generates billion-scale protein interaction datasets, far exceeding public resources like the PDB.
  • Massive, diverse datasets enable machine learning models to avoid hallucinations and predict antibody behavior more reliably.
  • Energetic and validated positive/negative data are crucial for teaching models the underlying physics of binding.
  • A-Alpha Bio’s technology systematically improves antibody affinity, diversity, and developability, replacing artisanal approaches.
  • Synthetic epitopes expand training data, allowing ML models to learn from both natural and predicted interactions.
  • Data quality and scale, combined with AI, can accelerate biologic development when adopted strategically and with diligence.

Episode Highlights

  • Limitations of public data sources like the Protein Data Bank and their impact on current protein engineering approaches [03:11]
  • Why combining energetic (delta G) and structural data matters for building predictive protein engineering models [05:43]
  • A-Alpha Bio’s approach to generating 1.8 billion protein interaction measurements for machine learning—what this enables today and what’s possible next [06:30]
  • Examples of how A-Alpha Bio’s platform solves challenging therapeutic problems, such as optimizing molecules for 800+ HIV variants and engineering dual-specific antibodies [07:36]
  • The ongoing debate: What capabilities should biotech companies keep in-house, and what works best outsourced to service providers? [09:59]
  • The potential of synthetic epitopes as vital tools for training models beyond the Protein Data Bank—introducing the Synthetic Epitope Atlas [12:09]
  • Key takeaways for scientists: the importance of diligence amidst rapidly evolving AI claims, and advice for accelerating R&D with the right data [14:57]

In Their Words

Data is always going to be, I think, the constraint in industry—whether it's training your models or validating them. I think there's an increased appreciation that we need differentiated sources of data. A-Alpha Bio is hopefully going to play some part in that ecosystem, but we need more than just a handful of companies.

And I think the second thing I hope your audience walks away with is that, with the right data, AI can be an incredibly powerful tool in accelerating your work. That said, everyone would be forgiven for being confused by the claims of many, and so I would just encourage diligence as we watch this space unfold.

Episode Transcript: From 10,000 Structures to 1.8 Billion Interactions: Breaking the Data Bottleneck to Engineer Efficacious Therapeutics - Part 2

David Brühlmann [00:00:40]:
Welcome back to Part Two with Troy Lionberger from A-Alpha Bio. In the first half, we explored why protein–protein interactions are so fundamental yet historically difficult to characterize, the limitations of traditional antibody discovery methods, and where de novo design fits into the picture. Now we dive into the solution: how A-Alpha Bio’s platform generates massive datasets at unprecedented scale, what happens when you feed 1.7 billion measurements into machine learning models, and why this data revolution is transforming therapeutic development. Let’s jump back in.

Now I would like to move on to machine learning and data in general, because AI and machine learning are powerful, but they depend on good data. I always hear my former boss—he was quite skeptical about technologies—telling us, garbage in, garbage out. And that’s very true for modern technology. For AI, you need good data, and currently there’s a limitation because engineering models today are constrained by public-domain data. So can you paint us a picture of what that looks like and what the impacts are today on drug discovery?

Troy Lionberger [00:03:11]:
I think when we look at modern protein engineering models, many of them have been trained, as you said, on public data. In that case, it's typically the PDB, the Protein Data Bank, that is being referenced. And just to put a perspective on that, there are a little more than 10,000 structures of antibodies bound to their target antigens in the PDB today. So when you compare that number, 10,000, to how much data has been generated to train ChatGPT, for example, or other large language models, there's a 10-order-of-magnitude difference between the amount of data that has gone into generalizable large language models and the data available to brilliant protein engineering teams today to develop their next model.

The constraint is really felt when the biggest failure mode in many of these models is what's called hallucinating. If you have only one example of a solved structure that you're training your model on, it's actually quite difficult to then ask the model: if I make a mutation at one of these residues, will my molecule still bind? And if so, how tightly? Oftentimes these models hallucinate because they've seen the structure before; there's a minor modification to it, so the tendency—the bias, if you will—is to revert to the solution they've seen before.

So what's missing is the ability to generate data that gives these models many, many examples of both positive and negative data. So being able to definitively tell this model: no, that aspartic acid was changed to an alanine, and I know from my experiments it did not bind anymore. Now correct yourself so that you stop hallucinating, right? I mean, that's basically what we're trying to do with AlphaSeq: generate large amounts of data that introduce minor variations to these molecules and directly tell these models not only whether they still bind, but what the strength of that binding interaction is.
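To make the idea of paired positive and negative training data concrete, here is a hypothetical toy sketch in Python. The variant names, Kd values, and record layout are illustrative assumptions on our part, not A-Alpha Bio's actual data format:

```python
# Hypothetical toy training records of the kind described above: point-mutation
# variants of an antibody, each labeled with whether it still binds its antigen
# and, when it does, how tightly (dissociation constant Kd, in molar).
training_examples = [
    {"variant": "wild-type", "binds": True,  "kd_molar": 2e-9},
    {"variant": "D55A",      "binds": False, "kd_molar": None},  # Asp -> Ala abolishes binding
    {"variant": "S31T",      "binds": True,  "kd_molar": 8e-9},  # conservative change, mild affinity loss
]

# A model needs both kinds of examples: positives teach it what binding looks
# like, negatives teach it which single mutations break binding.
positives = [r for r in training_examples if r["binds"]]
negatives = [r for r in training_examples if not r["binds"]]
```

At platform scale the same labeling idea applies across millions of variant-antigen pairs rather than three.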

And one thing for your audience to, I think, appreciate is when we talk about making generalizable models, I think the grand ambition in the industry is that we can one day teach a machine learning model physics. If you want to teach a machine learning model physics, the missing ingredient is not just the structures, but physical parameters like Gibbs free energy (ΔG). And it just so happens that binding affinity is directly related to ΔG. So what the AlphaSeq platform is giving us is not just structural information, but also energetic information that is really going to be, I think, transformative in how these models ultimately become predictive. And the predictive nature is going to increase the better they get at understanding the underlying physics.
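The link Troy mentions between binding affinity and energetics is the standard thermodynamic relation ΔG° = RT·ln(Kd/c°), with c° = 1 M: tighter binders (smaller Kd) have a more negative free energy of binding. A minimal sketch of that relation, with function name and example values of our own choosing:

```python
import math

def binding_free_energy(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard Gibbs free energy of binding (kJ/mol) from a dissociation constant.

    Uses dG = R * T * ln(Kd / c0) with c0 = 1 M; tighter binders (smaller Kd)
    give a more negative dG.
    """
    R = 8.314462618e-3  # gas constant in kJ/(mol*K)
    return R * temp_k * math.log(kd_molar)

# A nanomolar antibody-antigen interaction is roughly -51 kJ/mol at 25 C:
dg_nanomolar = binding_free_energy(1e-9)
# A micromolar (weaker) interaction is less negative:
dg_micromolar = binding_free_energy(1e-6)
```

This is why affinity measurements carry energetic information: each measured Kd maps directly onto a ΔG value a model can learn from.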

David Brühlmann [00:06:07]:
Wow, that's powerful. Even physics, yeah. What can you do today with the technology, and what will be possible as you build your models? If I'm not mistaken, 1.7 billion protein interactions have been measured—I’m not sure if that was by your company or across the field. Where is this going to lead?

Troy Lionberger [00:06:30]:
I think there are multiple layers to that right now. I think the latest count is 1.8 billion, but what's 100 million between friends, I guess. So that data has been used—and we generated that data—in large part to train our internal models and also to do partnerships with therapeutic developers as well.

To answer the first part of your question, what are we doing today? Today we're offering access to our platform under fee-for-service arrangements. Someone comes to us with an antibody that they want to diversify, increase the affinity of, or make more developable. We have a very standard process now, such that three months later these customers are getting what they want, and it's very easy for us to do this work, so we're happy to do it.

Where I think things get more challenging are applications where you require multiple iterations of the AlphaSeq platform. So what I just described as a kind of fee-for-service—someone comes with a molecule, we’ll improve it for them—that requires one iteration, if you will, of our experimental platform.

For more challenging examples, I'll give you two. One would be work we did with Gilead, where we were able to essentially optimize biologics specific for HIV. So to make a neutralizing biologic for HIV, the challenge that was put in front of us is: can you make a molecule that can hit not just one or two versions of HIV variants that are present in the population, but over 800 variants? If you had asked me before I joined this company three months ago if it would have ever been possible for someone to optimize a molecule against more than 800 different variants of a target, I would tell you there's just no way. You might get one or two, but not 800. And the fact that they were able to simultaneously improve the efficacy of this molecule for virtually every variant of HIV available is really exciting.

And I think that, for clarity, took three turns of the crank. So that's how powerful I think things get when you iterate a few times on this.

Another example that I'll give you that I didn't think possible before three months ago is reliably and reproducibly making biparatopic molecules. So what I mean by that is in most cases, the parts of antibodies called the variable domains that ultimately interact with your target antigens typically will bind to a single antigen—so one variable domain, one target antigen.

What are called dual-specific antibodies are those where that same domain actually has the ability to bind to two different antigens. I've seen a few examples of this in the past where it was serendipitously discovered. And I think a lot of therapeutic discovery historically has been kind of serendipitous: bespoke and artisanal. We're getting to the point where these things are now tractable and systematic.

And this is one example where we can now almost arbitrarily introduce two different paratopes into this antibody and have them interact with two different antigens. And this opens up all kinds of unique modes of action and safety profiles for therapeutics that really haven't been possible before. But this is where this ability of AlphaSeq and AlphaBind to reproducibly engineer these molecules is, I think, quite exciting.

David Brühlmann [00:09:54]:
Yeah, definitely—that's quite exciting. And it's powerful to have all these tools at our disposal. And now this leads me to the next question. Because if you're developing your company, especially if it's a smaller company or a mid-sized company, you always face the question: Well, we do have very knowledgeable and successful CROs. We have service providers. We have all kinds of technologies. So what do we do in-house? What is the capability we're building in-house? What is the knowledge or the understanding we want to generate in-house versus sourcing out? So as you are working with different companies, what is your observation or your recommendation?

Troy Lionberger [00:10:33]:
David, I think you're right that many companies want to keep their core intellectual property and contributions to these molecules in-house. I absolutely understand the tendency to want to do that.

I think my response is: there's a reason that everyone decided to move to Microsoft Word instead of typewriters. I mean, at some point we realized that there are more efficient ways to move forward. And I think it just comes down to dollars and cents.

The reality is that these therapeutic developers don't want to be spending their time engineering molecules, if we're honest with ourselves. What they want to be spending their time on is de-risking molecules, not making them. And so the quicker they can get to de-risked molecules, the better for everybody.

And that is really where I would encourage those in the field to ask the question: Could you accomplish more, faster, and cheaper than service providers? If the answer is no, then everyone should use a service provider. That is really how we're operating now in many of our applications.

We're simply trying to accelerate these teams by making their work easier, faster, and cheaper—and more importantly, de-risking that entire process, because we're quite confident that we will be able to ultimately deliver a molecule. There isn't as much of a question anymore in terms of risk. I mean, we haven't yet had a project that we haven't been successful with. Another aspect of this is just mitigating the overall risk.

David Brühlmann [00:12:01]:
Troy, before we wrap up, what burning question haven't I asked that you are eager to share with our biotech community?

Troy Lionberger [00:12:09]:
If I'm honest, there’s something that we’re really excited about—it’s coming, and we’ve been talking about it publicly, so I’m happy to share it here.

But back to this question about being ultimately limited to the entirety of the PDB and only having about 10,000 structures to train your models on—what can we do about that? Crystallography is not going to solve that problem. We have around 10,000 structures, and only a few hundred are added each year. That’s not going to give us the scalability that we ideally need for these models.

And if I were to predict where the field is going—at least on the machine learning and protein engineering side—I think there’s going to be an appreciation across the industry that synthetic epitopes will become an indispensable tool for training these models.

This took me a bit to get my head around, so I’ll try to distill it. The basic idea is that we usually assume one antibody binds one target, but that isn’t necessarily true by physics. You could design other antigens—synthetic antigens—specifically to bind that same antibody.

So we’ve been working on something called the Synthetic Epitope Atlas. This is a program that generates predictions of synthetic binding partners for antibodies, with the purpose of validating that we can actually see a structure and measure the ΔG for that interaction. And even though these data are purely synthetic—these binders do not exist in nature—we’re showing that you can train a model almost as well as with the PDB.

You’re teaching these models structure, epitopic diversity, and feeding them ΔG and affinity measurements. That, for me, is really exciting to be a part of at these early stages. And increasingly, we’re starting to hear across the industry that more and more groups are pursuing synthetic epitope development specifically for machine learning and model training.

Just to put this into perspective: I mentioned that there are about 10,000 antibody structures available today to train models on. In just one of our experiments, we’ve shown that you can generate about 1,000 viable, validated structures, along with 29,000 confirmed non-binders—molecules with simple mutations that break affinity. That’s roughly 30,000 distinct measurements from a single experiment that can be used to train models. That’s pretty exciting when you think about how an alternative to the Protein Data Bank could ultimately grow.

David Brühlmann [00:14:51]:
Yeah, that’s exciting. With everything we’ve covered today, what is the most important takeaway?

Troy Lionberger [00:14:57]:
The most important takeaway I hope your audience walks away with is, first of all, that data is always going to be the constraint in this industry—whether it’s training models or validating them. There’s an increased appreciation that we need differentiated sources of data.

A-Alpha Bio is hopefully going to play some part in that ecosystem, but we need more than just a handful of companies. And the second thing I hope your audience takes away is that, with the right data, AI can be an incredibly powerful tool for accelerating your work.

That said, everyone would be forgiven for being confused by the claims being made today, and so I would just encourage diligence as we watch this space unfold.

David Brühlmann [00:15:41]:
Troy, this was fantastic. Thank you so much for sharing your insights and your perspective on where drug discovery is heading. Definitely exciting times ahead. Where can people get in touch with you and learn more about your technology?

Troy Lionberger [00:15:56]:
Yeah, our website is www.aalphabio.com—there’s a lot of information there. You’re also welcome to connect with me on LinkedIn. I’d be happy to connect with your audience.

David Brühlmann [00:16:10]:
Excellent. I’ll put the links in the show notes. Smart biotech scientists, please reach out to Troy. And Troy, once again, thank you so much for being on the show today.

Troy Lionberger [00:16:20]:
Thank you, David. It was great speaking with you.

David Brühlmann [00:16:22]:
What a conversation. From measuring billions of protein interactions to predicting the next generation of therapeutics, Troy Lionberger has given us a glimpse into the future of drug discovery. The integration of high-throughput biology and machine learning isn’t just incremental progress—it’s transformational.

If this episode sparked new ideas for your own work, share it with a colleague and leave a review on Apple Podcasts or wherever you listen. Thank you so much for tuning in. Until next time, keep innovating.

For additional bioprocessing tips, visit us at www.bruehlmann-consulting.com. Stay tuned for more inspiring biotech insights in our next episode. Until then, let's continue to smarten up biotech.

Disclaimer: This transcript was generated with the assistance of artificial intelligence. While efforts have been made to ensure accuracy, it may contain errors, omissions, or misinterpretations. The text has been lightly edited and optimized for readability and flow. Please do not rely on it as a verbatim record.

Next Step

Book a free consultation to help you get started on any questions you may have about bioprocess development: https://bruehlmann-consulting.com/call

About Troy Lionberger

Troy oversees Business Development and Alliance Management at A-Alpha Bio. He started his career in the lab after earning a PhD from the University of Michigan and completing postdoctoral training at UC Berkeley. During nearly a decade at Berkeley Lights, Troy held senior leadership roles in R&D, product strategy, and business development, driving commercial initiatives that secured over $200M in deals across biopharma, agriculture, and industrial biotech.

Before joining A-Alpha Bio, he was Chief Business Officer at Abbratech, guiding the company from stealth-mode antibody discovery to a partner-focused biotech. Troy combines deep technical expertise with a track record of scaling platform companies through strategic, transformative partnerships.

Connect with Troy Lionberger on LinkedIn.

David Brühlmann is a strategic advisor who helps C-level biotech leaders reduce development and manufacturing costs to make life-saving therapies accessible to more patients worldwide.

He is also a biotech technology innovation coach, technology transfer leader, and host of the Smart Biotech Scientist podcast—the go-to podcast for biotech scientists who want to master biopharma CMC development and biomanufacturing.  


Hear It From The Horse’s Mouth

Want to listen to the full interview? Go to Smart Biotech Scientist Podcast

Want to hear more? Visit the podcast page and check out other episodes.
Do you wish to simplify your biologics drug development project? Contact Us

Free Bioprocessing Insights Newsletter

Join 400+ biotech leaders for exclusive bioprocessing tips, strategies, and industry trends that help you accelerate development, cut manufacturing costs, and de-risk scale-up.


When you sign up, you'll receive regular emails with additional free content.
Most biotech leaders struggle to transform promising molecules into market-ready therapies. We provide strategic C-level bioprocessing expert guidance to help them fast-track development, avoid costly mistakes, and bring their life-saving biologics to market with confidence.
Contact
LinkedIn
Seestrasse 68, 8942 Oberrieden
Switzerland
Free Consultation
Schedule a call
© 2026 Brühlmann Consulting – All rights reserved