How to build a virtual cell and biology scaling laws


Markov Biosciences, a startup in San Francisco, is betting that biology is about to have its GPT moment. In this episode, founder Adam Green explains the "bitter lesson" for biology, the idea borrowed from Richard Sutton that large unbiased datasets and the right training objective tend to outcompete models with hard-coded rules and human priors. Green thinks, in particular, that the virtual cell field has overinvested in collecting expensive perturbation data. His argument is that data is not the limiting factor in training useful virtual cells; compute and the loss function are. By treating single-cell RNA-seq as a ranking problem rather than modeling raw counts (a century-old idea traceable to a 1927 psychophysics paper), Markov found that virtual cells pre-trained on plain observational data show clean scaling laws, getting monotonically better at predicting unseen perturbations as the models grow, and beating a state-of-the-art model built specifically for that task.

If you’d like to sponsor a future episode, please email nsmccarty3@gmail.com. To listen to this episode, search for “The New Biology” on your favorite podcast app.

Timestamps

00:00 — Cold open

01:58 — First prospective clinical predictions from a virtual cell model

05:38 — What is a "virtual cell"?

08:01 — The problems with single-cell RNA-seq

11:31 — The urns analogy

19:54 — Why RNA, and observational vs. perturbational data

23:29 — The bitter lesson for biology

29:06 — Generative ranking and geometric Plackett-Luce

38:27 — Ablations and loss function

47:23 — Cells as specimens

59:26 — The antibody-drug conjugate case study

1:11:16 — Will we ever understand biology?

Transcript

Adam Green (00:00) Yeah, I think we’re talking past each other.

Niko McCarty (00:02) Like what Markus Covert is doing—saying, I want to simulate a cell using mathematical equations. Is that kind of thing useful?

Adam Green (00:11) I think it’s a fun abstraction. The sort of unsupervised pre-training and scaling we saw in text, images, protein sequence modeling is not going to work in the same way when we bring it to single-cell biology, and therefore we need a new approach. If you said that in 2018, it’s insane. When I say something like that now in 2026 about biological world models, people think you’re insane. As you scale the model up and it saw more and more observational data—then you fine-tune a tiny bit of perturbational data, and then you evaluate it on perturbations it has not seen. It gets monotonically better at that task. So much so that it beats the current state-of-the-art model that was pre-trained on perturbational data with multiple injected knowledge sources, specifically for the task of perturbation prediction. Yeah, I think long term the ambition is solved biology.

Niko McCarty (01:03) Today’s guest is Adam Green. He’s the founder of Markov Biosciences, a company building a virtual cell for biology. And he has some viewpoints about how to train these models that differ from the mainstream. One of the things I really want to get at in this conversation is this idea of: will we ever develop a complete understanding of the cell? And if so, how will we do that? Will we do it using black-box models with sparse autoencoders, where we can interpret the outputs of that model? Or will we ever be able to build a bottom-up mechanistic understanding of the cell? But before we go there, I want to ask Adam about a recent paper that Markov put out, where they made very specific predictions about a class of drugs known as antibody-drug conjugates in cancer. So Adam, welcome.

Adam Green (01:58) Yeah, good to be here. So we put out a paper—if you could call it that, a Twitter article currently—on a particular antibody-drug conjugate. So antibody-drug conjugates are the hottest modality in oncology right now. There are hundreds of clinical trials ongoing. The basic concept is: small molecules are promiscuous, they’re hard to target any particular cell type with. But antibodies are quite specific. And so, what if you were to conjugate an antibody with a small molecule or some kind of payload, and use it as a kind of precision-guided payload delivery system to a cancer cell? People have been trying this for, I guess, two-plus decades. And we looked at one of the most popular targets for these ADCs, which is TROP2. TROP2 shows up on many of these epithelial tumors—lung, breast, bladder. And the surprising thing: we found no one really knew how the complex of the antibody bound to the receptor internalizes into the cell to deliver the payload. It’s something you’d think would obviously be known, given that thousands of patients have been dosed. There are already approvals for these ADCs. And so we took a virtual cell and we queried it, and we said: what is providing the ride for this receptor across the membrane, and then after that, how does it traffic inside the cell to reach its destination? I think our model came up with a pretty clear prediction. It’s falsifiable. It seems to converge with other lines of evidence from clinical pharmacokinetics, tumor expression. And what makes it interesting is, I think it is the first prospective prediction from a virtual cell with real clinical stakes and large sums of pharma revenue on the line. It could pan out, it might not pan out, but as a class of thing that can be done with virtual cells, I think it is unique and the first of its kind.

Niko McCarty (04:03) What were the actual predictions, and do you know of anybody testing these? Are you planning to fund experimental studies to do this, or are you just kind of hoping that the big pharma companies will test your predictions?

Adam Green (04:18) We’ve scoped experimental packages with CROs to test the predictions, starting from the initial mechanism, which we believe is the co-localization of this receptor with a particular tetraspanin, a special type of protein that we think organizes the trafficking, or the internalization, of this complex into the cell. And so we have a bunch of experiments looking at two different drugs: Datroway, which is AstraZeneca’s drug, and Gilead’s drug, Trodelvy, and at what we think explains the difference in their pharmacokinetics and their clinical outcomes.

Niko McCarty (05:03) And just to clarify: you made these predictions using a virtual cell model that Markov trained, and you did so without any underlying biological knowledge about ADCs. What I want to ask, basically, is: okay, ninety percent of drugs fail, and everybody talks about Eroom’s law. My question is, in which ways would virtual cell models, like the kind that Markov is training, actually increase that efficacy? Because we haven’t really seen evidence yet that virtual cell models actually improve clinical success rates of drugs.

Adam Green (05:38) Yeah, it’s a big question. Maybe it’d be useful to step back and define this term “virtual cell,” because it’s pretty nebulous. Personally, I’m not in favor of the term. I think it’s been so debased as to be beyond use. But generally, what people are gesturing at with this term is a machine learning model trained on some sort of biological data that does something. That’s not a very useful definition, but it’s such a big umbrella that that’s basically what it means. Two axes that you could parse this at are: the scope of the system you’re concerned with—so maybe it is actually at the level of the cell, as the name might imply; maybe it’s at the level of a spatial tumor biopsy; maybe it’s even at the level of clinical response. That’s one axis. And then the other axis is: how do we relate to these things as scientific objects? What do we expect of them? The distinction I like here is between simulators and specimens. A simulator, I think, is the dominant view of what a virtual cell is. It is a stand-in for experiments we’d otherwise have to run in the lab. And the thinking goes something like: experiments are really important for biology, for whatever reason. They’re slow, they’re noisy, they’re often costly. If you had a computational stand-in for these experiments that are costly, take a long time, and are really noisy, and you could run it at basically zero cost, it would somehow accelerate biomedical progress. The alternative view, and the one I think we subscribe to at Markov, is that virtual cells are going to be more useful as specimens. By this I mean: if you train a machine learning model on biological data the right way, it should learn—in making the loss go down—something about the nature of the underlying biological system you’re trying to model. The tough part is, how do you actually extract that understanding from the model and make it useful?

Niko McCarty (08:01) I want to understand how Markov trains a virtual cell. Obviously you have some belief that single-cell RNA-seq is the right sort of data to collect, but you feel that there are serious flaws with that data. So I want to go through all that, and then along the way kind of understand what other people are doing and how Markov differs. So let’s just start with a discussion of how people capture single-cell RNA-seq data, and what are the sources of bias in that data collection when somebody measures the transcripts within a cell?

Adam Green (08:38) So when you do this process of encapsulating a cell in a droplet and lysing its contents, there are a few sources of technical noise that can emerge. In the case of these poly(A) capture methods like 10x 3’, one step is just: how many of the transcripts do you capture? You can imagine a cell is quite literally a bag of molecules. It’s a bunch of RNAs, proteins, et cetera, floating in solution. And if you want to get an accurate representation of that cell, the thing you’d want to do is capture all of its contents. But due to quirks of the library chemistry, you usually only capture a subset. Initially this was quite low, under ten percent. With modern library prep techniques, you’re getting thirty-plus percent. So it’s better.

Niko McCarty (09:31) So how many transcripts does a typical human cell have floating around?

Adam Green (09:37) Yeah, it depends on the cell type—is it highly differentiated or not. But say, on the order of a hundred thousand is a rough estimate.

Niko McCarty (09:47) So you might capture twenty thousand of those transcripts.

Adam Green (09:50) The big question is, how many of the transcripts do you capture? And then downstream, what proportion of them get sequenced and show up? The field realized pretty quickly, by doing some really crafty studies, that we were not capturing all the transcripts. Now that we’ve improved capture rates pretty substantially, the following step is—

Niko McCarty (10:14) But sorry, where are we today? So it was twenty, thirty percent. How many transcripts do we capture today with the most current methods?

Adam Green (10:22) I think with the 10x GEM protocols it’s on the order of thirty-plus percent. The technical factor I think is more important is sequencing depth. Suppose I’ve captured thirty thousand transcripts—thirty thousand unique molecules. Now I need to read them out. Long story short, the amount of sequencing you do of the transcripts you capture directly determines the distribution of RNA counts. And the field knew this—

Niko McCarty (10:27) Okay. So not a huge boost.

Adam Green (10:51) —knew this in terms of expression and absence of expression: counts of zero versus ones or greater. This is called dropout. There are various terms for this, but the field was obsessed with dropout. They were saying, if I have a vocabulary of twenty thousand genes, why am I only seeing five thousand of them in any given cell? Why are fifteen thousand at zero? Is it because they aren’t expressed in that cell, or is it because I’m not capturing and sequencing them? And so that conversation, and the statistical models that followed trying to explain this, is the beginning of where our approach differs from all these other groups and how we train these models.

Niko McCarty (11:31) So I think one useful way to understand this distribution when you’re doing single-cell RNA-seq would be—you’ve told me before about this analogy about urns. If you imagine each cell is an urn, and it has balls of twenty thousand different colors inside of it, and each ball is present in a different amount corresponding to the RNA transcripts. Tell me that story. Tell me about that analogy, and how you think about what these distributions actually are when we do single-cell RNA-seq.

Adam Green (12:04) Yeah, it’s a vivid metaphor—colored balls and urns. It comes from probability theory, but I think it illustrates the problem well. So imagine you have an urn. It has twenty thousand different colors of balls—red, midnight blue, green, taupe, if you like—at varying concentrations. Just assume the urn is a hundred thousand balls. You don’t know the underlying proportions of the different colored balls. This is our cell. I ask you to draw, say, a thousand balls from the urn. You get, let’s say, a hundred red, five blue, et cetera, et cetera, zero taupe. You look at the taupe and you’re like, why? Does this urn not have taupe? Does it not like taupe? Or did I simply not get lucky enough to pull a taupe ball? You do this across replicates. So I have ten urns, they all look the same. They’re the same type of urn, they have the same urn behavior. And then for some of them I get taupe balls. So if you do this across enough replicates and you actually plot the distribution of counts you get for taupe from each urn—let’s say I have a thousand urns, I do a thousand draws from each, and I just plot how many taupe I get in each cell—what it’s going to look like, assuming taupe actually is a pretty low proportion in the cell, is I’m going to have this massive distribution at zero, and then some kind of right-skewed distribution of one, two, three, four, five, et cetera. And what the field tried to do is they said, well, what if we fit distributions to these count data? What do they appear to be? And so the nature of these statistical distributions—and in particular, do we see more zeros than we should expect?—was one of the major questions the single-cell biology field concerned themselves with. 
What the field did not realize, and this is somewhat surprising given that you’d think they’d concern themselves with such questions, is that the number of draws you take from the urn determines not only the rate of zeros you see, but the rate of all the other integer counts you see.
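The urn picture can be checked with a short simulation. The sketch below is a toy under stated assumptions: a single low-abundance color (taupe at a proportion of 0.1 percent, a made-up figure), draws made with replacement, and hypothetical names such as `taupe_counts`. It is not Markov's code. It draws from identical urns at two depths and compares the zero rate and the mean count.

```python
import random

TAUPE_PROPORTION = 0.001  # assumed for illustration only


def taupe_counts(n_draws, n_urns, proportion=TAUPE_PROPORTION, seed=0):
    """Count taupe balls in n_draws draws (with replacement) from each of n_urns identical urns."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_urns):
        # Each individual draw is taupe with probability `proportion`.
        counts.append(sum(1 for _ in range(n_draws) if rng.random() < proportion))
    return counts


def summarize(counts):
    """Return (fraction of urns with zero taupe, mean taupe count)."""
    zero_rate = sum(1 for c in counts if c == 0) / len(counts)
    return zero_rate, sum(counts) / len(counts)


shallow = summarize(taupe_counts(n_draws=1_000, n_urns=300))
deep = summarize(taupe_counts(n_draws=10_000, n_urns=300, seed=1))
print("1,000 draws:  zero rate %.3f, mean count %.2f" % shallow)
print("10,000 draws: zero rate %.3f, mean count %.2f" % deep)
```

With these assumed numbers, roughly a third of the shallow urns should show no taupe at all while almost none of the deep ones do, and the nonzero counts shift upward as well, even though every urn has the same contents.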

Niko McCarty (14:29) What do you mean by that? You mean that in some cells that gene is actually not expressed? What do you mean by the amount of zeros that you would expect to see?

Adam Green (14:38) So the field was wrestling with this question: we see a lot of zeros for many genes. Is this indicative of the gene not being expressed in these cells on average, or us just not capturing and sequencing it? Much of the field leaned toward the former explanation, and came up with increasingly convoluted statistical models to explain why we saw so many zeros. But what Valentine Svensson, and later Sarkar and Stephens, formalized in a really great model is that if you take the actual distribution of expression, where we should expect it to be, and you mix it with this measurement model—this model of what our drawing process is like from the urn, how do we take out the balls?—the resulting distribution explains pretty well the number of zeros we see, and the full tail of counts. And then the big unlock for us was realizing that these two key levers I’m talking about—transcript capture and sequencing depth—affect not only the zero rate, but the full distribution of counts. And where this gets to virtual cell models is: if I have two identical urns and I take a thousand draws from one and ten thousand draws from the other, even though the underlying expression proportions might be identical, my view of what’s going on inside those urns is going to be radically different. And that informs how you want to train models on these data.
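One way to write this down, as a sketch rather than the exact formulation in the work Green refers to: assume the observed count $x$ for a gene follows a Poisson measurement model whose rate is the number of draws $s$ times the gene's true proportion $\lambda$. The symbols $s$, $\lambda$, $\mu$, and $\phi$ are notation introduced here for illustration.

```latex
x \mid \lambda \sim \mathrm{Poisson}(s\lambda)
\quad\Longrightarrow\quad
P(x = k \mid \lambda) = \frac{(s\lambda)^{k}\, e^{-s\lambda}}{k!},
\qquad
P(x = 0 \mid \lambda) = e^{-s\lambda}.

% Expression model (assumed): \lambda varies across cells as a Gamma distribution
% with mean \mu and variance \phi\mu^{2}. The counts are then negative binomial, and
P(x = 0) = (1 + \phi\, s\, \mu)^{-1/\phi}.
```

The depth $s$ appears in every term, so changing the number of draws reshapes the whole count distribution and not only the zero rate. Under these assumptions no separate zero-inflation component is needed to get many zeros when $s\mu$ is small.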

Niko McCarty (16:05) Just to take a quick synopsis here. So you’re saying that the field collected single-cell RNA-seq data from hundreds of thousands of cells, hundreds of millions of cells—okay, so the single-cell RNA-seq field has data on hundreds of millions of cells. And you’re saying that some of those cells have different chemistries than others, which means that their capture rate is very different between cells.

Adam Green (16:15) Hundreds of millions.

Niko McCarty (16:34) And some of the cells have been sequenced much more shallowly than others, and some have been sequenced very deeply. And the big problem is that when people use all these data to train virtual cell models, that introduces biases, essentially. And we have to deal with those biases using new statistical frameworks.

Adam Green (16:53) Yeah. So the big question is: cells vary biologically, but they also vary due to some of these technical factors. Maybe the lab assistant was having a bad day, the mouse was having a bad day, it was warm out. And the question is, when you observe two cells, how do I know what is the signal and what is the noise?

Niko McCarty (17:16) People were treating all of them as just the same. So they were training these models using just aggregated data.

Adam Green (17:22) Right. So people tried to get around this in ways that were kind of naive. One thing you can do is say: rather than trying to model the individual counts, I can model frequencies. So maybe I only have a thousand transcripts I captured, but if I look at the rate—if I say I have five blue out of a thousand, call it a rate of point-five percent—maybe that’s more robust than modeling the individual counts. There are lots of reasons why this doesn’t work. Firstly, what do you normalize against? Secondly, as you get to the low-count regime of one to four, there are actually really great papers that come out of ecology showing why these sorts of transformations do not properly deal with the noise issue. Really, the crux was: the field tried, around 2023, to train some of these early virtual cell models, saying, given the expression of some genes in a cell, predict the other genes and their expression. The field tried that. And the conclusion they came to is that this approach is doomed to fail.
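A small calculation shows part of why frequencies do not rescue the low-count regime. Assuming simple binomial sampling (an assumption made here, not a model named in the conversation), dividing by the total leaves the relative noise untouched: it depends on the expected count, not on whether you report a count or a rate. The helper name `frequency_cv` is hypothetical.

```python
import math


def frequency_cv(proportion, n_draws):
    """Coefficient of variation of the estimated frequency (count / n_draws) under binomial sampling."""
    return math.sqrt((1 - proportion) / (n_draws * proportion))


for expected_count in (1, 5, 50, 500):
    p = expected_count / 1_000
    cv = frequency_cv(p, 1_000)
    print(f"expected count {expected_count:>3} in 1,000 draws: relative noise {cv:.2f}")
```

For the five-in-a-thousand example the relative noise is about 45 percent, and at an expected count of one it is about 100 percent.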

Niko McCarty (18:22) When was that? Twenty twenty-three?

Adam Green (18:48) Yeah. The Geneformer paper was accepted into Nature in 2023. scGPT came out maybe a bit later. But these were pretty naive. Geneformer was prescient in that it was trying to do some kind of rank-based prediction. But in general, the conclusion the field arrived at was that fundamentally there is something wrong with these data, because when these models were benchmarked—and there are a couple of bearish benchmarking publications that came out late 2023, early 2024—it looked like these models did no better than really naive methods of just predicting the mean expression in the cell, for tasks like cell type classification, whatever. And I think the field then conflated the failure of these models to scale with a failure—or a poverty, some kind of innate badness—of the data.

Niko McCarty (19:17) Biology could not have scaling laws because it’s just more complicated than other domains.

Adam Green (19:22) Or some kind of appeal to causality. A lot of these lines of thinking have not been made clear, and they’re pretty ill-formed. When you push people on them, there appear to be gaps. But the basic position that was arrived at is that the sort of unsupervised pre-training and scaling we saw in text, images, protein sequence modeling is not going to work in the same way when we bring it to single-cell biology, and therefore we need a new approach.

Niko McCarty (19:54) At some point, the field decided that a cell is a bag of RNA transcripts. And if we capture good data and train models on just those RNA transcripts, we’ll be able to understand the cell as a whole. In other words, RNA is the Goldilocks molecule to capture in the cell to build virtual cells. And of course, there’s a lot of discussion today around building multimodal models that incorporate proteins or spatial information. But at some point the field decided that single-cell RNA-seq was going to be the data on which to train these models that do much more than just predict gene expression. So I’m kind of curious about that philosophy. Why did the field decide that this was the right medium with which to train their models?

Adam Green (20:44) Yeah, I think the evidence that people subscribe to that view is that there are billions of dollars going both to for-profit startups and nonprofits to generate RNA readout data. That definitely seems to be the consensus. But I do actually think there are reasons to believe that RNA might be preferable to protein, even if you had similar technology for doing single-cell proteomics. And this basically gets back to this statistical view, the idea of compressed sensing. Like we said before, you have these urns; in any cell, maybe you have on the order of a hundred thousand balls, and you do some draws from them. If you need to look at the proteome, you are talking two to three, maybe four orders of magnitude more balls floating around, just because every RNA transcript can be translated into many copies of a protein. Meaning that if you want to get an accurate representation of what’s going on in that cell, you need to do more draws. But I think the claim would be—and this is the strong claim I subscribe to, and I think our white paper gestured at it—that cells encode their state in RNA, in the epigenome, in the proteome, and if you learn a good enough model of a modality that has enough signal in it, it will converge to a shared statistical reality that allows you to predict something like the proteome, and maybe even subcellular localization of proteins, as we showed in our paper. Does this mean that RNA is the modality to rule them all in the limit? Probably not. There are lots of reasons why it’s suboptimal. But the field has converged on RNA as being the thing they’re collecting.
And then the key distinction becomes—and this relates to the earlier conversation about the nature of these data—do you just collect observational data of cells? Just cells doing their thing, hanging out? Or do you need perturbational data, where I have a cell, I apply some kind of exogenous intervention (a small molecule, a gene knockout, an overexpressed gene) and see what happens, in order to learn a world model of the cell, or a virtual cell? Given that we’ve converged on RNA as being the modality, the question is, which of these two data sources do you need? Is observational sufficient, with a little bit of perturbational sprinkled on top? Or do you actually need massive amounts of perturbational data to learn a so-called causal model of the cell?

Niko McCarty (23:29) What is the thesis? So Arc Institute, of course, is training their virtual cell models using perturbation data, right? The pre-training is done using perturbation data. What is their thesis as to why that’s the right framework? Because my naive guess, as an outsider, would be to say that of course we should pre-train the model on just observational transcripts, and then fine-tune for whatever task we want that virtual cell to do. And so I’m genuinely curious as to why such a big segment of the field in virtual cells is training specifically on perturbation data.

Adam Green (24:07) Yeah. I won’t speak about specific actors, but I’ll speak about the camp as a whole—and this is the dominant camp. So I’m going to give an extended analogy to the field of NLP, natural language processing, because I think it has already undergone this kind of transition that the biological world models field, of which virtual cells are a subset, is currently undergoing. And in retrospect, why people went down this path will become clear. Imagine it’s the mid-2000s, you’re working in NLP, you want to build machine learning models for language. Say you’re really concerned with translation: I want to build a model that can translate from English to French. Naively, you might think, well, how about I go talk to a bunch of translators, collect a bunch of parallel texts, and then train a model directly on that? Or you might think, language has grammar and syntax, and we can decompose it into these Chomskian trees—maybe I need to go collect a bunch of that data to teach my model what proper syntax and grammar are. And so this was really the dominant mode of NLP for many decades. You might call it big parallel corpora—dating back all the way to the Rosetta Stone—and big treebanks, these trees decomposing sentences into their structure. But what people started to attempt in the early 2010s is, they said, maybe we can get good general-purpose representations of language, and then fine-tune them on a tiny bit of data to do these tasks we care about, like translation. And so there’s a long history of these unsupervised language learning approaches.
But basically the watershed moment was GPT-1, by Radford et al. in 2018, where they showed that if you take a large corpus of data—BookCorpus—you take a general-purpose model, the Transformer, which was invented by Google a year prior, and you do this basic prediction task of “predict the next token,” the model develops general-purpose representations that can then be transferred to downstream tasks such as translation, allowing you to get away with far less translation data to be good at translation. Now we look back on this, now that everyone is scaling-pilled—and here I’m referring to Richard Sutton’s bitter lesson. The potted version is that the history of machine learning shows that approaches that try to build in how humans think things should work ultimately end up losing to approaches that leverage computation. The ML field has completely swallowed and assimilated the bitter lesson pill, and now quite directly believes that as you scale these models up on a big observational or unsupervised dataset with the right objective—in this case, with Radford et al., predicting the next token—they will develop arbitrarily capable representations useful for downstream tasks. If you said that in 2018, it’s insane. When I say something like that now in 2026 about biological world models, people think you’re insane.

Niko McCarty (27:32) What is the bitter lesson for biology, in your worldview? You’re saying the field is training these virtual cells on paired data, these perturbation data, but instead we need what?

Adam Green (27:45) Our belief is that biological world models—this idea of training machine learning models on data collected from biological systems—is going to undergo, or already is undergoing, the same evolution that NLP underwent, or image modeling, or protein sequence modeling, which I’d say is distinct from biological world models. And there’s going to be this sloughing off of inductive biases, of human priors on what the model should know. The things we care about will ultimately be learned through a bunch of unsupervised pre-training, followed by a tiny bit of post-training on the things you care about, like perturbation data. And so Yann LeCun’s metaphor of the cake is quite useful here. He says, if you’re making a cake, the big thing is the sponge cake—stack it up; the icing on top is secondary; and then the cherry is tertiary, on top of that. But you don’t have a cake if you don’t have the spongy layers. And so the field’s just going for straight icing. We’re saying, no, in principle you can bake a cake using these data. You just have to understand the data-generating process and match it to the right objective—just as in the history of NLP, image modeling, you had to find the right objective for the data.

Niko McCarty (29:06) And what is the right objective? Okay, let’s talk about that. What is generative ranking? This has to do with how you deal with the zeros, right?

Adam Green (29:09) Generative ranking. The zeros, and the whole distribution of counts. So to return to the colored balls and urns: imagine I have two identical urns, they have the same content inside them, the same number of balls. I do two draws of a thousand each from the two urns. I look at the data, and I’m like, okay, this one has one of this color, this is zero of that color, and so on. I know this is due to technical artifacts. How do I train a model across such data? Our claim is that if you want to train in the manner of a GPT-1-style model—large, unsupervised, learns with more data—you need a loss objective that is able to abstract away from the noise in these data, the randomness in two sets of draws from the urns, and pay attention to the signal. And so the punchline is that the ordinal structure—that is, the ranking structure of “gene A is expressed more than gene B, more than gene C, more than gene D”—is more robust to these library chemistry technical artifacts.
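The ordering claim can be illustrated with a toy urn. This is a minimal sketch under assumptions chosen for clarity: four colors with made-up, well-separated proportions, draws with replacement, and hypothetical helper names. It draws from the same urn at two depths and compares the raw counts with the resulting rankings.

```python
import random

PROPORTIONS = [0.60, 0.25, 0.10, 0.05]  # assumed toy urn with four colors


def draw_counts(n_draws, seed):
    """Draw n_draws balls with replacement and return the count of each color."""
    rng = random.Random(seed)
    counts = [0] * len(PROPORTIONS)
    for color in rng.choices(range(len(PROPORTIONS)), weights=PROPORTIONS, k=n_draws):
        counts[color] += 1
    return counts


def ranking(counts):
    """Color indices ordered from most to least abundant."""
    return sorted(range(len(counts)), key=lambda color: -counts[color])


shallow = draw_counts(2_000, seed=0)
deep = draw_counts(20_000, seed=1)
print("counts:", shallow, deep)
print("ranks: ", ranking(shallow), ranking(deep))
```

The raw counts differ by about tenfold between the two draws, while the ordering stays the same. Real cells have around twenty thousand genes, many of them tied at zero or one, so the ordering there is far noisier than in this toy; the sketch only shows why depth, by itself, does not change the expected order.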

Niko McCarty (30:20) You mean more robust than either counts or frequency of counts? It’s better to just say gene A is the highest, gene B is the second highest.

Adam Green (30:29) Yes. And so in our paper we test this pretty rigorously, doing these ablations, giving the best possible chance to these other types of models at different loss functions, and we isolate pretty clearly that the loss function appears to be the missing ingredient that unlocks scaling on unsupervised data and allows you to bake this cake. Now, the way we actually got there is a really long story, and it was kind of accidental, but—

Niko McCarty (30:55) Tell me that story. How did you come up with this approach? It’s called geometric Plackett-Luce.

Adam Green (31:02) Maybe to set it up with the balls and urns, and to make it more concrete, the task is something like this. I have a hundred million urns. I do draws of varying depths from them. Now I have this task: I want to learn a general model of “urn-ness.” Like, in this type of urn, I tend to see this color ball; these two colors tend to covary across different types of urns. This is a machine learning problem. The challenge you run into is, again, the underlying signal for learning this is really noisy. So maybe in one urn, if I pay attention to the zeros, I’m just paying attention to technical noise, and that’s going to dilute my signal. Maybe there’s variation purely in who’s doing the draws from the urns, and I pick up on that. Or which lab it’s being done in. And so the task becomes: given a set of draws from an urn—let’s say I draw a thousand balls, I put them behind a veil, I don’t tell you the colors. I reveal to you the first color is red. I’m going to say red is in the middle of the ranking distribution. Given that, I want you to predict the rest of the balls and their relative orderings. So when you do this, what does your model have to learn? It’s not predicting the true underlying proportions. You can’t see those unless you draw all the balls out of the urn. But it’s trying to predict, given something I know about the library chemistry, given I know how many balls I drew from the urn, what does this type of urn tend to look like? Which balls does it tend to express? If I now know that it expresses this ball at high rank, it gives me information about the rest of the balls in the urn. 
So the way you actually parameterize this is the result of an evolution of the loss function that took multiple years of empirical iteration. All credit here is due to Glenn, the brilliant ML engineer who figured all of this out. And the funny thing is, in the end we converged on what turned out to be a hundred-year lineage in ordinal ranking, dating all the way back to this 1927 psychophysics paper by Thurstone. He’s asking this question of inter-rater reliability. If I have balls of different weights, or sounds of different pitch, and I ask person A which ball is heavier, or I ask person B which sound is higher-pitched, there’s going to be variation in that process. But generally speaking, these people are going to agree on which balls are heavier, which sounds are higher-pitched. Whereas if you ask them to tell me the weight in grams of this ball, or tell me the absolute pitch of that tone, there’s going to be much more noise.

Niko McCarty (33:57) I see. So you eliminate noise by asking people for just ordinal rankings.

Adam Green (34:03) Yeah. I wouldn’t say you eliminate noise, but you’re more invariant to it. And so this is a 1927 paper from psychophysics. I didn’t actually know about it until very recently. But it turns out there’s an extremely rich lineage pursuing this idea of ordinal ranking, scales of measurement, types of measurement in psychology, for instance, that eventually converged on this model we ended up using, calling geometric Plackett-Luce. So there’s this really rich, nearly century-long history, dating all the way back to Thurstone, on ordinal comparisons and modeling them—preferences, Likert scale data, “rate this on a scale of one to seven,” “what is the severity of your symptoms?” And ultimately, what it converged on, in the model that we use, is geometric Plackett-Luce, which has found most recent application in ranking chocolate puddings and golfers. So it’s not a very popular concept. It’s pretty niche. But it allows you a way to model this kind of ranking problem and express direct likelihoods of “my model predicts here’s the actual permutation of rankings of colors we expect to see,” including with ties—which is a big leap in what makes geometric Plackett-Luce workable.
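
As a rough illustration of why rankings are less sensitive to sequencing depth than raw counts, the sketch below converts counts to ranks over expressed genes only, letting equal counts share a rank. The gene names, the `dense_ranks` helper, and the idealized "three times deeper" cell are all assumptions made for this example. Real depth changes do not scale counts exactly, and this is not the geometric Plackett-Luce model itself.

```python
def dense_ranks(counts):
    """Map each expressed gene (count >= 1) to a dense rank, 1 = highest count.

    Genes with equal counts share the same rank (a tie); zeros are ignored.
    """
    expressed = {gene: c for gene, c in counts.items() if c >= 1}
    levels = sorted(set(expressed.values()), reverse=True)
    level_rank = {c: i + 1 for i, c in enumerate(levels)}
    return {gene: level_rank[c] for gene, c in expressed.items()}


# Hypothetical cell: GENE_B and GENE_C are tied, GENE_D was not observed.
shallow = {"GENE_A": 5, "GENE_B": 2, "GENE_C": 2, "GENE_D": 0}
# Idealized version of the same cell sequenced three times deeper.
deep = {gene: 3 * c for gene, c in shallow.items()}

shallow_ranks = dense_ranks(shallow)
deep_ranks = dense_ranks(deep)
```

The raw counts differ by a factor of three, but both calls return the same ranks, which is the sense in which a ranking view is "more invariant" to depth.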

Niko McCarty (35:30) I see. And just to clarify, Markov—so when you train a virtual cell, you’re taking existing public single-cell RNA-seq data, you’re applying this geometric Plackett-Luce thing to all the data, and then you’re doing pre-training on that. And I guess my question is just, tell me about the scaling laws that you’re seeing. Does it seem like biology has enough data? Or do we actually need more data of this kind? What are the existing flaws in the amount of data we have, and the structure of that data, that you’re observing?

Adam Green (36:12) So if you go back a couple of years, when I started training these models—I don’t have a background in ML or biology. I tried to figure out how to do this, I cracked open the PyTorch textbook, I’m hand-coding reshapes on tensors. You just try the naive thing of, maybe let’s predict raw counts, or maybe let’s do this ranking thing, but in a much more naive way of just predicting the names of the genes. And at that point in time, it wasn’t clear if these scaling laws would emerge. And so what has transpired—and again, all credit to Glenn, who has figured this out—is that once you find the right loss objective to match the data, they do appear to obey clear scaling laws. And there is plenty of juice to be squeezed out of these observational data. They learn general-purpose representations that transfer to downstream tasks. And this is—again, to map back to the NLP space—GPT-1 showed that in principle you can train a model on a bunch of these data and it transfers to these downstream tasks. What GPT-2 then showed is that as you scale up the amount of pre-training and the size of the model—the more knobs that are available to learn—the model gets better at these downstream tasks. So we looked at this for our model as we scaled it up to a billion parameters plus, and we said, how does it do on downstream tasks like perturbation prediction, which is really the motivating task that most people think about when they think about virtual cells? And what we found, quite excitingly, is that as you scale the model up and it saw more and more observational data, then you fine-tune a tiny bit of perturbational data, and then you evaluate it on perturbations it has not seen in a given cell type—it gets monotonically better at that task. So much so that it beats the current state-of-the-art model that was pre-trained on perturbational data with multiple injected knowledge sources, specifically for the task of perturbation prediction.

Niko McCarty (38:27) So to clarify: when you were initially training virtual cell models, you tried to train them using just count data, you tried frequency data—and you’re saying that you did not see scaling laws with these other ways of treating the data?

Adam Green (38:45) Not as robust. And so in the paper we do a really clean ablation, where we say, let’s keep the compute budget the same, the data the same, but only vary the loss function. And then we train negative binomial models, which is a way of parameterizing the counts; geometric models, which is the simpler case of negative binomial proportions; mean squared error over the counts; and then geometric Plackett-Luce. And then you can say, given a fixed compute budget—number of gene tokens it’s seen—how well does the model then do on some held-out evaluation set? So I think it was healthy lung, or we had a couple of cell types we held out from pre-training. And then you ask, say, I show you fifty percent of the genes in the cell, predict the remaining fifty. So that’s a binary prediction task of, is it a one or is it a zero? We also looked at a ranking evaluation using Spearman loss. And what you see is that the geometric Plackett-Luce scales the best monotonically. It increases on the Spearman metric. It gets better and better as you scale it up with more pre-training. Surprisingly, the mean squared error and the proportional do pretty well. I wouldn’t have expected that beforehand, but they don’t scale as well as the geometric Plackett-Luce. And then I think the truly surprising fact for the field is that the models that did worse were those that explicitly parameterize the count distribution. So geometric collapsed mid-training at multiple scales. And then negative binomial, which provides more distributional flexibility—there are more knobs you can fit—collapsed even more catastrophically. It’s somewhat puzzling, because the field has been climbing this distributional complexity ladder of “let’s try to fit the integer count distributions better and better and better.” But our claim, what we think the paper shows, is that no—that is simply the wrong abstraction layer to operate at. You’re climbing a ladder on the wrong wall. 
In fact, the right objective is going to be fitting the ranking structure, not the raw counts. And when you do that in the right way, the model seems to scale. So that gave me conviction that there’s something there. Now, did we know earlier on? Yeah, we saw with earlier models and earlier ranking approaches that there’s some kind of scaling. But this is so robust, and scaled into the multi-billion-parameter range, that it led me to believe that this truly might be the final approach for this sort of noisy multi-set modeling.
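
The Spearman metric mentioned above is presumably the Spearman rank correlation between predicted and observed expression on held-out genes. A minimal stdlib sketch, assuming average ranks for ties, is shown here. The function names and the toy values are illustrative and are not taken from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(predicted, observed):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Toy held-out genes: the predictions get the order right but not the scale.
rho_same_order = spearman([0.1, 0.4, 0.9], [1, 5, 20])
rho_reversed = spearman([0.9, 0.4, 0.1], [1, 5, 20])
```

Because only the ordering matters, a model can score perfectly here while being far off on absolute counts, which is why this metric pairs naturally with a ranking objective.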

Niko McCarty (41:23) So just to clarify, you’re saying the industry norm for training these models is still to include all the zeros that we see. So if an urn has twenty thousand balls and you only pulled out five thousand of them, when they do the pre-training, they keep the fifteen thousand zeros.

Adam Green (41:41) In most cases, yes. So if you’re predicting proportions, you’d say, I have gene A, I’m going to predict some proportion and then compute a loss. If it’s a zero, compute a loss against zero. Or if I have some kind of parameterized count model like geometric, I’m going to predict my single parameter for the geometric, and then I say, what is the likelihood of observing a zero given this distribution—and that goes into your loss. But yes, no one has taken ranking seriously and actually extended it to a full closed-form likelihood. Geneformer was a preliminary attempt, our earlier models were a preliminary attempt, but this is the first time it has been done performantly at scale and shown to work.

Niko McCarty (42:27) Okay. So you put out this white paper, basically saying, here’s how we should treat our data using geometric Plackett-Luce. Do you have any evidence that the other people in this space are now adopting that to train their own models? Or is Markov kind of doing it on its own?

Adam Green (42:47) As far as I know, we’re doing it on our own. The implementation isn’t easy. We try to provide some guidance in the paper—it’s quite a long supplement. But I don’t think it’s been adopted.

Niko McCarty (42:58) Presumably people are taking what you say seriously. When you put out this white paper, you have evidence that people in the field are reading it and talking about it.

Adam Green (43:07) Yeah, I know the paper’s being read. I think some interpret the views expressed as hostile to the perturbation prediction crowd, but I think they’re complementary. But—

Niko McCarty (43:25) The only thing you’re really changing that is presumably antagonistic to the existing field is, you’re saying we should pre-train on just observational data, and then fine-tune on the perturbation data if you want your cell to do perturbation prediction. But presumably people should not take exception to the geometric Plackett-Luce stuff, because that’s just how you treat the data, right?

Adam Green (43:49) Yeah, I don’t think people take issue with the loss function. But on the perturbational data—and again, these issues are kind of related, in that what motivated collecting all this perturbational data was, in part, the assumption that training observational models will not work. And so if you show that, no, in fact there’s signal in these data that you can extract, that slightly impugns the motivations for perturbational data, saying maybe we don’t need to be allocating so much money to generating these data. And there’s a lot riding on them. Nonprofits are putting hundreds of millions of dollars into generating these data. There are startups that have raised a billion dollars-plus and built bespoke models for these data.

Niko McCarty (44:37) And tell me what their thesis is. We don’t have to name anybody, but the field is collecting perturbation data because—the way I understand it—they think that drugs, for example, are perturbations. So if we could figure out how a drug changes the transcriptome of a cell, maybe we could figure out ways to change the state of that cell back into a healthy condition. And so their assumption must be that in pre-training on perturbation data, we’ll be able to make drugs better. In silico, we’ll be able to predict which drugs are more likely to push a target cell into a particular state. Is that kind of the thesis of these players?

Adam Green (45:28) Presumably. And it gets to this specimen-versus-simulator dichotomy. But I think where that point of view—that if we had a better simulator it would accelerate drug discovery—goes wrong is when you actually play this out and you think about what leads to a drug candidate getting sent down the pipeline and succeeding. What are the actual decisions being made? And is experimental velocity the rate-limiter currently? Or is it something more like understanding, or how quickly you can use experimentation in the service of navigating through drug space? A lot of these things are blended together. And again, it’s hard to figure out people’s real motivations. They appear to be of the belief that more experimentation will lead to better drugs faster. I am of the belief that—

Niko McCarty (46:08) It—

Adam Green (46:23) —you could hand me a black-box simulator of “here’s what would happen if you take this cell state and you perturb it with this drug,” and it wouldn’t meaningfully move the needle on actual drug success rates currently.

Niko McCarty (46:35) What do you mean by “search the space”?

Adam Green (46:37) Yeah. Imagine I have all possible cell states—thousands, tens of thousands of different cell types and different microstates—and I apply, of the space of all possible small molecules, the set that are drug-like, and I can predict what would happen. Does that actually get you to drugs that make patients healthier? Well, if you can only model a single cell, probably not. If you want to extrapolate to larger and larger systems, you need some notion of mechanism. So I think the era of virtual cells as simulators will be short-lived, and probably not contribute all that much to downstream drug R&D productivity.

Niko McCarty (47:23) So your thesis, just to clarify one final time, is that by pre-training on just observational data and then building sparse autoencoders on top of it—which I’ll ask you about—we’ll be able to build up a deeper understanding of cells. So they won’t only be simulators of drug perturbations; they’ll actually be world models, or cell models—so not simulators, but what you’re calling specimens, almost like a model organism in silico. So the end goal of what you’re doing is not necessarily to make better drugs—that’s a possible outcome if it works. It’s more to understand the mechanism of how cells work.

Adam Green (48:04) Yeah, I think long term the ambition is to solve biology. Short term, make useful world models of biology that can be applied to everything from discovering better drug targets, figuring out the mechanism of action of existing drugs, patient stratification, biomarker selection. But the core philosophical belief undergirding specimens—and this is going to sound kind of highfalutin or philosophical—is that cells are biological agents. They are embedded in their environment, and they’re each doing some kind of task. This is encoded in the distribution of molecules inside them. So you’ve got a neighbor, he’s butting up against you, you get sent a paracrine signal, you send one back. But these are fundamentally agentic systems—not in the sense of reasoning agents, but technically agents that are trying to minimize free energy. So we can talk about the free energy principle. But they themselves have some model of the world encoded in the distribution of molecules. And if you can recover, using machine learning, a model of these molecules, you can in some sense recover what the cell believes, has hopes, dreams, aspirations—maybe. Probably not. And then with that, if you are able to dissect that black-box model, that’s a proxy for the cell’s internal model of its environment. Then you can figure out the best ways to poke and prod it to get it to behave how you want.

Niko McCarty (49:36) So I think what would be helpful is if you just walked me through the full steps by which you train your virtual cell. Where does the data come from? How do you treat the data, and so on from there?

Adam Green (49:48) Yeah. So back when I started training these models, and other people started training them, like 2023, data is extremely messy. Luckily, now there are these massive standardized datasets like scBaseCount from the Arc Institute, CELLxGENE from the Biohub, that give you pretty easy access to these data. So you get these data, and your individual data object is a cell and an associated set of counts for all the genes in the vocabulary. And then you have some metadata, maybe—the patient had this disease, this is their sex, this is the tissue of origin, maybe cell type labels. And then to do the actual generative ranking under the hood, it’s just a standard transformer architecture. And then the question is, how do you feed in the data? And how do you loss it? What is the way you extract the signal from it? So the basic setup is: I have a cell, I’m going to look at its genes, I’m going to randomly permute them. Let’s say I’m going to randomly permute the expressed genes—the genes that have counts of one or greater—and I’m going to show you one of them. I’m going to tell you its rank. And then I want the model to predict the rankings of all the other genes, given this first gene. And so you can actually set this up in an autoregressive manner to make it nice and fast. And then after I see that, I make my prediction. And then I calculate my loss using this geometric Plackett-Luce objective: given the strength parameters, the thetas predicted by my model, what is the likelihood of observing the actual gene expression ranking? And then you reveal another random gene, you do that, and then you do that successively across hundreds of millions of cells. And in so doing, the model learns a surprising amount.
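
To make "the likelihood of observing the actual gene expression ranking" concrete, here is a sketch of the classic Plackett-Luce log-likelihood for a full ordering without ties. This is a simplification chosen for illustration. The geometric variant with ties described in the conversation is not specified in enough detail here to reproduce, and the function names, the log-strength `scores`, and the gene labels are assumptions.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (highest rank first) under Plackett-Luce.

    scores: dict mapping gene -> log-strength (log theta) from a model.
    At each step the next gene is chosen among those not yet placed,
    with probability proportional to exp(score).
    """
    total = 0.0
    for k in range(len(ranking)):
        remaining = [scores[g] for g in ranking[k:]]
        total += scores[ranking[k]] - logsumexp(remaining)
    return total


# Hypothetical log-strengths for three genes.
scores = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.0}
ll_consistent = plackett_luce_log_likelihood(scores, ["GENE_A", "GENE_B", "GENE_C"])
ll_reversed = plackett_luce_log_likelihood(scores, ["GENE_C", "GENE_B", "GENE_A"])

# With equal strengths every one of the 3! orderings is equally likely.
uniform = {"GENE_A": 0.0, "GENE_B": 0.0, "GENE_C": 0.0}
ll_uniform = plackett_luce_log_likelihood(uniform, ["GENE_B", "GENE_A", "GENE_C"])
```

A training loss in this style would be the negative of this log-likelihood, averaged over cells. An ordering that agrees with the predicted strengths scores higher than one that contradicts them.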

Niko McCarty (51:41) And how do you do the fine-tuning? I guess it depends on what you want to achieve, right?

Adam Green (51:45) The only fine-tuning task we look at—you have this model, it’s learned representations, you can do a lot of different things with them. Perturbation prediction, as I said, is the main task the field has concerned itself with. It’s the one they really care about for benchmarking. Personally, I don’t care that much about perturbation prediction, but if you want to show your model’s the best, you have to do it. So what you do—people abuse language here, they mean lots of different things by what is out of distribution, what is in distribution—but the basic task for cross-cell transfer is saying, in the Replogle dataset, I have four cell types. Okay, I’ve seen roughly 2,000 perturbations in each cell type. And then per perturbation, per cell type, I’ve seen 100 of that—I’ve seen 100 cells, they’re HepG2, they’ve been perturbed with gene X, knocked it down. Now I’m going to show my model that. So I’m going to show it knockdown of gene X in HepG2, another cell type, another cell type. And then I’m going to ask you to predict what would happen if you knock down this gene in a cell type where you haven’t seen that perturbation. And so the basic fine-tuning, in this case—we want to go back to counts so we can compare our model against these other models on the same sorts of metrics. So you just take a new head at the end of the model, and instead of trying to predict ranks, it just predicts counts. And so you fine-tune it on these data, and then you evaluate it on the held-out perturbations in the cell type you’re evaluating on, then see how that performance improves as you take bigger and bigger models and do this fine-tuning.
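
The cross-cell transfer setup described above can be sketched as a split over (cell type, perturbation) pairs: the model sees a knockdown in other cell types, and is tested on that knockdown in a cell type where it was withheld. The helper name, the placeholder `cell_type_B`, and the gene labels are hypothetical, and the exact split used on the Replogle benchmark may differ.

```python
def cross_cell_split(observations, target_cell_type, held_out_perturbations):
    """Split (cell_type, perturbation) pairs for cross-cell transfer.

    Pairs in the target cell type whose perturbation is held out go to test;
    everything else, including the same perturbation in other cell types,
    stays available for fine-tuning.
    """
    held_out = set(held_out_perturbations)
    train, test = [], []
    for cell_type, perturbation in observations:
        if cell_type == target_cell_type and perturbation in held_out:
            test.append((cell_type, perturbation))
        else:
            train.append((cell_type, perturbation))
    return train, test


# Hypothetical miniature dataset: two cell types, two knockdowns.
observations = [
    ("HepG2", "GENE_X"), ("HepG2", "GENE_Y"),
    ("cell_type_B", "GENE_X"), ("cell_type_B", "GENE_Y"),
]
train, test = cross_cell_split(observations, "HepG2", ["GENE_X"])
```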

Niko McCarty (53:19) And so what is the current state of the art on perturbation prediction with a virtual cell? And you’re saying that you actually beat that with the new Markov model?

Adam Green (53:28) The previous state of the art—it came out a couple of months ago on this particular Replogle benchmark—was Excel, from Xaira. I mentioned this model before, but it’s quite cool. They tried to scale the model up. It’s a really cutting-edge diffusion architecture. It does have these injected knowledge sources; they pre-trained on perturbation data. But when evaluated on this exact task—we matched the way they set the task up and the way they evaluated other models—we found that on the metric of mean absolute error, which is the most basic metric you can think of (what is your average L1 error in predicting gene expression in the held-out cell-by-perturbation pairs), our model beat state of the art.
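
For reference, mean absolute error is just the average L1 distance between predicted and observed expression values. The sketch below uses made-up numbers and a hypothetical function name, and does not reproduce the benchmark's preprocessing or aggregation over cell-by-perturbation pairs.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Toy expression values for three genes in one held-out condition.
mae = mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```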

Niko McCarty (54:19) And so you’re training this model solely on single-cell RNA-seq data, but it’s learning things about the cell that are not RNA-based, right? It’s learning more than just gene expression data. So tell me about that. What is the model actually learning from the single-cell RNA-seq data?

Adam Green (54:39) Yeah. This is, I think, the most exciting thing, and really the driving motivation for me since I first started believing in this kind of agentic view of biological systems—that they themselves are modeling their environment, and if you can learn a model of that, you’ll learn something about what they are computing internally. So first we just looked at basic regulatory genomics. This has been done before. You can collect data on, say, the binding of a transcription factor—a specialized protein—upstream of a gene on the DNA. Or you can say, if I look at the DNA and I look at these flanking regions around genes, do I see certain motifs that match the transcription factors? There are nucleotide patterns that are predictive of whether a transcription factor will bind to that region, and therefore whether that transcription factor will regulate that gene. So we looked at the ENCODE dataset, which is pretty old—it came out in 2012, I think—which collects some of these functional genomic readouts. And the simplest task you can look at is: I have a set of transcription factors, I have a set of target genes; which of the transcription factors actually regulate which genes, as indicated by binding of the transcription factor to regions around that gene? So we evaluated our model on this. Using a really naive method, you just take the embeddings, the learned embeddings of the genes, you do a cosine similarity between them—between the transcription factor and the target gene—and then you do an AUROC, trying to find the threshold that separates the target genes versus the non-target genes. And surprisingly, our model did decently on this. It got the highest AUROC out of any of the models we tested. This is nowhere near, I should say, the performance of models built specifically for this task. And if you talk to the functional genomics people, they’ll say that’s not impressive, 0.57. 
But to my mind, it’s the proof of principle that a model never trained on this task was capable of doing it—that is most exciting.
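
The probing recipe described above (cosine similarity between gene embeddings, then AUROC against known regulator-target pairs) can be sketched as follows. AUROC is computed here as the probability that a true pair scores higher than a non-pair, so no single threshold has to be chosen. The two-dimensional embeddings, gene labels, and function names are invented for the example and carry no biological meaning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auroc(scores, labels):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical embeddings: one transcription factor and four candidate genes.
tf_embedding = [1.0, 0.0]
gene_embeddings = {
    "TARGET_1": [1.0, 0.1],
    "TARGET_2": [0.9, 0.2],
    "NON_TARGET_1": [0.0, 1.0],
    "NON_TARGET_2": [-1.0, 0.5],
}
is_target = {"TARGET_1": 1, "TARGET_2": 1, "NON_TARGET_1": 0, "NON_TARGET_2": 0}

genes = list(gene_embeddings)
similarities = [cosine(tf_embedding, gene_embeddings[g]) for g in genes]
labels = [is_target[g] for g in genes]
toy_auroc = auroc(similarities, labels)
```

An AUROC of 0.5 corresponds to chance, which puts the 0.57 quoted in the conversation in context: above chance, but far from a purpose-built model.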

Niko McCarty (57:02) But your thesis presumably is that with more scaling—more compute, more data—the AUROC on these tasks that are not, like, regulatory tasks will just continue to improve. But nobody has tested that yet? Are you seeing that these improve with scaling laws? Are you getting better and better at these other tasks?

Adam Green (57:21) Yeah, where this departs from the perturbation prediction stuff—where we see really clear monotonic scaling—is that when we’re probing the model here, we’re just looking at the earliest layer, the embedding layer. What is the general representation of each gene? And surprisingly, you do a simple cosine similarity—do these vectors point in the same direction?—and that is predictive of this. But as you scale the models up, I was somewhat surprised to see this—though there’s precedent in NLP—it seems like that approach does worse and worse. Just looking at the embeddings—the knowledge seems to migrate into further layers of the model.

Niko McCarty (57:56) So you’ve written in prior essays about this idea of biocompute. My question is, are we biocompute-limited or data-limited? What do we actually need to build better virtual cell representations? Is it just that we need to start integrating multimodal data? What is your thesis about the new frontier of virtual cell models?

Adam Green (58:18) Yeah, my somewhat contrarian opinion is that we are, on the current margin, limited by engineering and compute, and there’s tons of juice we can squeeze out of these models if you train them the right way.

Niko McCarty (58:33) You mean with the existing data?

Adam Green (58:34) With the existing data. And there are billions of cells coming online by the end of 2026. I don’t think we’re lacking for data anytime soon. Now, the space of questions you can actually ask with a virtual cell model is a very small part of the space of broader questions about biology we want to ask, or that are clinically relevant. But what I think the ADC biomarker nomination—and using our model to predict how TROP2 is internalized and trafficked—led me to believe is that if you circumscribe the problem cleanly enough and you make certain assumptions, and you try to operate in a regime where you can parcel out the effects of certain variables, then there are many clinically relevant questions you can ask and get interesting answers to, with current models and current data budgets.

Niko McCarty (59:26) And that’s a good segue. Let’s talk about the antibody-drug conjugate stuff. To clarify: you built a model, you pre-trained on just observational single-cell RNA-seq. And then—I don’t know how you did the fine-tuning, if any—you made these predictions. Tell me about that.

Adam Green (59:46) Yeah, no fine-tuning. So the setup was, we were looking at: does our model know where drug targets localize in the cell? Does it know the functional coupling of different signaling molecules? And then we just asked—I mean, the original motivation of this was commercial. I think pharma partners would care about this. They don’t just want to see a cool toy. I think they’d care about antibody-drug conjugates. I didn’t know what those were; I had to look it up about a month ago. And so I found—okay, there are a lot of drugs targeting these things, lots of clinical trials. Surely we know how they work, right? So, to rehash, what is the basic process? You have an antibody, it’s coupled with a linker—a chemical tether—to a payload, or multiple of that payload, and then the goal is for the antibody to bind a receptor on the surface of the tumor. It gets internalized, endocytosed. And then, dot dot dot, endosomal biology makes its way to deliver the payload, in most cases to the lysosome. And so I looked at TROP2. I was like, okay, I didn’t find any literature explaining how it worked. For HER2, we know how it works pretty well—it’s clathrin-mediated endocytosis. Sorry—

Niko McCarty (1:01:07) Just to clarify, what is its role in a cancer cell, for example?

Adam Green (1:01:11) So TROP2 is—you could call it an adhesion molecule. It’s kind of a receptor. It shows up on a lot of these epithelial tumors—lung, bladder, breast—and it’s a desirable drug target because it’s pretty specific to tumors. If you deliver a payload to a specific cell type, it’s good to choose a target that’s specific to that cell type, expressed at sufficient density, shows up on the surface.

Niko McCarty (1:01:38) So these companies were building antibodies that bind to TROP2, and they’re carrying a payload that then kills those cells right after they get internalized. Okay. So you’re interrogating what about TROP2?

Adam Green (1:01:50) Yeah. So you can take TROP2 and you can say, which other genes in my model are functionally coupled to it—using this really simple method of, what is their cosine similarity? And again, the motivation was, I looked in the literature, I’m like, surely people knew how this works. How does this internalize? Apparently they didn’t. You look at the slides from the pharma companies, like AstraZeneca and Gilead, who are building incredible drugs here targeting this. Do they know how this works? You look at the slides, and it’s an empty box labeled “internalization” and “delivery to the lysosome.” And I’m like, that’s curious. Okay, so what does our model have to say about it? You look at its functional neighbors, and the ones that appear to show up are related to what you might call lipid rafts, or even caveolar—that’s not quite the right word—but the neighbors appear to be things on the surface of the cell that are in these small domains called tetraspanin microdomains. And in particular, there is this tetraspanin, this special type of glycoprotein I think, TM4SF1, that showed up. We said, okay, that’s interesting. What is this protein doing there? So you look at what this protein is, and it turns out it was discovered 40 years earlier in the same sorts of epithelial tumors. The literature shows this protein plays a role in organizing these little vesicles that internalize in the cell. So it’s like a little organizational, structural protein that shepherds cargo inside. So our model said this is the top tetraspanin. That’s one line of evidence, right?

Niko McCarty (1:03:47) My question—and this is a detour, more about the business case—but if these big companies like AstraZeneca have already taken these antibody-drug conjugates to phase three trials—

Adam Green (1:03:58) Some are approved.

Niko McCarty —okay, some are approved. Why is it important that they know about the tetraspanin that associates with TROP2?

Adam Green (1:04:08) The motivating mystery is that you have a few of these TROP2-targeting ADCs. You have Gilead’s Trodelvy—and these are just the brand names—AstraZeneca’s Datroway, and I believe Merck has one as well; it might be approved. But what you see is a fascinating clinical double dissociation. So in the case of TROP2 ADCs, you see something quite interesting. If you restrict what you’re looking at to breast cancer, and you look at HR-positive breast cancer—which is the most common type of breast cancer, it has hormone receptors—versus triple-negative breast cancer, highly malignant basal-type cancer, and you look at the trials for these drugs and you try to match along lines of therapy, are they immunotherapy-eligible: you see that Trodelvy, Gilead’s drug, and Datroway, AstraZeneca’s drug, both do decently well in triple-negative breast cancer in terms of overall survival rate hazard ratios. But in the HR-positive cancer, Gilead’s drug does really well, and AstraZeneca’s drug doesn’t seem to do very well at all. So they’re targeting the same TROP2 on basically the same cancers—you can look at the trials and try to match for which line of therapy, the other characteristics of the patients. The question is, why? And then you look at the drugs and you say, how do they differ? And the main way they differ is in their chemistry, and the downstream pharmacokinetics.

Niko McCarty (1:05:41) And so you’re saying one of these drugs is not doing well and one is doing well, and your model can figure out maybe why that is?

Adam Green (1:05:49) Yeah. One of the drugs is doing well in certain patients. The other drug is doing well in two types of patients. But the trade-off you have is efficacy versus toxicity. So Gilead’s drug is highly labile. It is this Goldilocks linker that releases its payload in plasma—it’ll do it once it gets into early endosomes, it’ll definitely do it if it gets into late endosomes and into the lysosome. I’m reading this, I’m trying to say, does our model explain this in some way? What is the difference between HR-positive cancer, where Datroway doesn’t seem to do well but Gilead’s highly labile drug does, versus triple-negative breast cancer, where both seem to do pretty well? And then we asked our model about that.

Niko McCarty (1:06:36) The question is, can your model actually say something about how we should alter the drug to make it work against this other class of cancer? Are you making concrete predictions about the strength of the bond in that linker and its efficacy against different types of tumors?

Adam Green (1:06:55) I think our model is suggestive toward that. The thing I’d focus on first, and what we focus on in the piece, is not how should we alter the linker chemistry, but: can we find a biomarker that predicts whether a patient will be responsive to one drug or the other? Because there is an efficacy-toxicity trade-off here. Trodelvy has a high rate of toxicity events. It is a brutal drug to take. Datroway is more of a scalpel. When it hits, it hits very hard, but you’ve got to find the patients where that happens.

Niko McCarty (1:07:27) And so your thesis is that this biomarker might be this tetraspanin associated with TROP2?

Adam Green (1:07:33) I’m not committed to TM4SF1 as the biomarker that is going to explain all TROP2 ADC efficacy. I think what I was trying to do with the piece is say that virtual cells, biological world models, can be treated as specimens and then make predictions about biological wiring with clinical implications. TM4SF1 is part of that. It’s part of the internalization equation. What the pharmacokinetics get at in the linker chemistry is the downstream trafficking. So imagine the plasma membrane is like a border. Maybe you have someone who shepherds you across the border. You get inside. Okay, well, there are a lot of places you could go. Where do you end up? If you look at the tumors in which these drugs are being applied, many of them are epithelial tumors in apically polarized cells. All of them express this other protein at high levels that our model nominated, RAB25. And so this is the second axis of ADC biomarkers, which is trafficking. So once you get past the border, the question is, where do you end up? This is endosomal biology. This is really well-trodden territory. We know a lot about how this works. And the basic idea is, you have this vesicle that comes inside, and it needs to make a series of membrane fusions with endosomes—which just means “soma inside the cell,” like inner body—to eventually make its way to the lysosome. RAB25 is, I believe, a GTPase, this kind of marker that tells where the initial vesicle should go—which kind of endosome you should end up in.

Niko McCarty (1:09:32) Okay. You’ve nominated a target. You’ve shown that your virtual cell can do this sort of thing. Pharma companies might care about that in terms of investigating mechanisms of which populations are treatable by their—

Adam Green (1:09:47) Yeah, maybe I’ll ground it like this. When you look at a big patient breast cell atlas and you plot the different subtypes of cancer along these two axes, you look at expression of TM4SF1, which we argued is this organizer protein that helps shepherd TROP2, the ADC target, into the cell. And then you look at RAB25, which I’m arguing is a marker of: are you going to get to the lysosome, or are you going to get recycled back to the surface? You find that HR-positive versus triple-negative breast cancer patients are in opposite quadrants. HR-positive patients tend to be low TM4SF1, they tend to be high RAB25—there is high recycling going on. If you look at triple-negative breast cancer patients, they are low on RAB25, they are high on TM4SF1. They have a lot of the organizing protein to shepherd the ADC into the cell, and not a lot of the marker that tells it to recycle back out. So presumably they’re getting more directly routed to the lysosome. We believe this explains the clinical dissociation you see between Gilead’s drug and AstraZeneca’s drug. And our claim is that there are tons of trials that are failing. We think there are subsets of patients in these populations who will be responsive to these drugs, because they are, say, lower in RAB25, higher in the tetraspanin required to actually internalize the drug.
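A minimal sketch of the two-axis stratification Adam describes here, in which each sample is placed in a quadrant by TM4SF1 (internalization) and RAB25 (trafficking and recycling) expression. The function name, the cutoffs, and the sample values are all hypothetical and chosen only for illustration. They are not taken from Markov's model or from any real atlas.

```python
# Illustrative sketch only: expression values and cutoffs are made up
# for demonstration and do not come from any real dataset.

def quadrant(tm4sf1, rab25, tm4sf1_cut=1.0, rab25_cut=1.0):
    """Place a sample on the two axes from the conversation.

    tm4sf1: expression of the proposed internalization organizer.
    rab25:  expression of the proposed recycling/trafficking marker.
    Both cutoffs are hypothetical; in practice they would be chosen from data.
    """
    entry = "high TM4SF1" if tm4sf1 >= tm4sf1_cut else "low TM4SF1"
    routing = "high RAB25" if rab25 >= rab25_cut else "low RAB25"
    return f"{entry} / {routing}"

# Hypothetical samples: (label, TM4SF1 expression, RAB25 expression)
samples = [
    ("sample_A", 2.3, 0.4),
    ("sample_B", 0.3, 2.1),
]

for name, tm4sf1, rab25 in samples:
    print(name, "->", quadrant(tm4sf1, rab25))
```

In the picture Adam describes, triple-negative samples would tend to land in the high-TM4SF1, low-RAB25 quadrant and HR-positive samples in the opposite one.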

Niko McCarty (1:11:16) Okay, so I want to zoom out a little bit. Clearly there’s a business case here, that these virtual cell models might be useful tools for digging into mechanisms of biology, or at least proposing hypotheses of mechanistic interrogations. You’re proposing actual experiments that people can do, related to how these antibodies get internalized. But my understanding about Markov, and your motivation—and you keep using this term “specimen”—is that you seem to have this thesis that virtual cell models, these agentic models, are the way by which we will understand all of biology. The goal is to solve biology. Whereas the view of a lot of classical biophysicists, and the history of molecular biology, has been that the best way to build understanding is to make often reductionist observations and then try to piece together our reductionist observations into higher principles. So there are kind of two approaches people are taking to build virtual cell models. What I would call the top-down—what you’re doing—which is to say, let’s train these large statistical models and then interrogate them to understand what they’re doing. And then the other approach, like by Markus Covert at Stanford, which is, let’s understand how the cell works, model it as a series of equations. So my question is, do you think that approach is wrong? Do you think it will be impossible to understand biology from bottom-up, reductionist observations?

Adam Green (1:13:03) Yeah, it’s a good question, and the one that’s motivated me for the past five years. Short answer, I think it’s going to be a hybrid between the two. The journey I’ve taken has been this kind of thesis, antithesis, synthesis toward: how do you extract mechanism from these black-box top-down models? Initially, when I got into the field and knew nothing about it, I became pretty skeptical of what was going on—just the rate of progress in the field, in this kind of infinite treadmill of “if we keep collecting data and recovering the mechanistic wiring of how these systems work, eventually that’s going to translate to real progress.” But you look at, for instance, what Francis Collins said thirty years ago and what he said ten years ago, and—

Niko McCarty (1:13:35) What do you mean by that?

Adam Green (1:14:03) —the field just never delivered on the progress it promised. And so I was somewhat puzzled by this.

Niko McCarty (1:14:09) Despite more and more data, right? Tons and tons of data, but you’re saying not necessarily a deeper understanding. And Eroom’s law continues, and most drugs fail in clinical trials.

Adam Green (1:14:20) Yeah. And the question was, if you only care about controlling biological systems toward salutary ends—like making people healthy—do you need understanding, or is some kind of black-box control sufficient? And so back in 2022, after reading hundreds of papers getting up to speed in biology, I came to the conclusion that the current approach is not going to work. It is not bearing the fruit we were promised. We need to go completely in the ML scaling direction. I became very bitter-lesson-pilled, and rejected what I called the mechanistic mind, which is this research ethos that pervades how most people think about biology—saying we want to carve nature at the joints, understand necessity, sufficiency. There are ideas like causality, if you believe in that. And people claimed that the more and more mechanistic understanding we had, eventually it would help us design better drugs and reduce cancer mortality rates. On the opposite pole is this black-box, “just throw more data at it,” top-down, as you called it, approach. I had not fleshed that out. But in pursuing this idea and learning more about how drug discovery and development is actually done, I’m like, okay, we have a lot of data, the models are decent, but I don’t know—maybe mechanism matters. You don’t just want a function that takes in patient state and desired resulting state and spits out a drug. For a lot of reasons, but primarily because drug discovery and development is iterative. You have to convince other people—like the FDA—of how these drugs work. And critically, we did not have such black-box models that bat out perfect drug candidates. So I’m like, okay, how do I reconcile that with my commitment to black-box, top-down machine learning models? And I came across mechanistic interpretability, which is this subfield of machine learning that says, we train a large language model, it learns to do interesting stuff—how does it do that? 
Can you actually crack open the black box, shine a light on it, and figure out the internal circuits it is using to compute the answer to a question? And when the ML people did this a few years ago, they found interesting concepts—like the Golden Gate Bridge is a concept inside the mind of one of these LLMs. And I thought, okay, what if we apply that to our model? And so we did that in late 2024. We applied this technique called sparse autoencoders, which says, when the model’s trying to predict gene expression, it’s moving around all these vectors of numbers—can you find a dictionary of features, or directions in space, that explain all these different vectors you see? And when we did that on one of these earlier models, it discovered what were plausible biological features related to gene regulatory modules, or, like I mentioned before, plasma cell differentiation state. And it even seemed to know some things about what transcription factors were predictive of one differentiation trajectory or the other. So that was pretty suggestive, I think—that this specimen approach, that these models recover something about biology, and if you know how to extract it from the black box, you can get a lot out of these systems. Now, where does that leave us? Are we going to—is that going to be the solution, just purely top-down probing these models? Or is there going to be some kind of complementary bottom-up approach?
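A minimal sketch of the sparse autoencoder idea Adam mentions: a model's internal activation vector is encoded into a larger set of non-negative features, then reconstructed from them, with a penalty that keeps most features at zero. This uses a common generic formulation (ReLU encoder, linear decoder, L1 sparsity penalty), which is an assumption and not necessarily what Markov used. The dimensions, weights, and input vector are random placeholders, and training is omitted.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec):
    """Encode an activation vector into features, then reconstruct it."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)             # non-negative feature activations
    x_hat = matvec(w_dec, features)  # reconstruction from the dictionary
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Placeholder sizes and random weights (hypothetical, untrained).
random.seed(0)
d_model, n_features = 4, 8
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
w_dec = [[random.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]

x = [0.2, -1.0, 0.5, 0.3]  # stand-in for one internal activation vector
features, x_hat = sae_forward(x, w_enc, b_enc, w_dec)
print("active features:", sum(f > 0 for f in features), "of", n_features)
print("loss:", sae_loss(x, x_hat, features))
```

After training on many activation vectors, each column of `w_dec` plays the role of one "direction in space," and the interpretability work consists of checking which inputs switch each feature on.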

Niko McCarty (1:18:01) What are the classes of things where you think we need the bottom-up approaches?

Adam Green (1:18:07) Yeah, I’d say it’s more, how do you combine priors from bottom-up approaches, or other models, with these top-down approaches?

Niko McCarty (1:18:14) So it’s not purely unsupervised. The winning solution would be to have an unsupervised—well, I don’t know the terms, but you’re saying in some way we need to feed in our priors.

Adam Green (1:18:25) I’d say you want to feed in information about other modalities, like maybe functional genomics, protein structural priors, and combine them with these predictions or priors you get from the black-box world models—to, again, point your evidentiary apparatus toward the interesting parts of biological space to run the experiments. The metaphor I like is: you’re in a fishing boat. One way to fish is to trawl a big net and catch everything and brute-force the space. Another way is to use sonar or whatever and try to find the interesting spots where the fish are swimming, and then go spearfishing there.

Niko McCarty (1:19:06) Just to try to summarize what you’re saying: we might have these black-box models that make initial, weak predictions, that we then guide using known biophysics or known biochemistry to refine the predictions. And that might be the way we make discoveries in the future. Is that kind of what you’re saying?

Adam Green (1:19:29) I think it’s the near-term approach. But in the data and compute limit, the biological world models become arbitrarily accurate representations of the true underlying biology, and it’s going to be one model to rule them all.

Niko McCarty (1:19:45) A biological world model. And then we’re going to have smaller models that complement this, right? Like we’ll have AlphaFold, we’ll have these other things that augment the predictions of the world model.

Adam Green (1:19:57) Probably not. No. My prediction would be there’s going to be a singular, unified model—we can debate what language it’s going to be trained in—that is going to be able to read out these different modalities we might care about. Like we showed in our paper, we train on mRNA, it learned something about protein, it learned something about DNA motif enrichment. This is going to be the generator of biological hypotheses. Maybe downstream, if you’re trying to select which of the possible experiments to run, you can bring to bear these other models about structural biology and get a different kind of prior. But I think the primary specimen that is going to be delivering not only new clinical hypotheses, but also basic biology knowledge, is going to be a unified biological world model.

Niko McCarty (1:20:50) Okay. And so you think the mechanistic guidance, your bottom-up—initially when I asked this question, you said you think it’ll be both, top-down and bottom-up. But you’re talking about bottom-up approaches of a specific kind. These are very narrow bottom-up tools, right? Like structural prediction tools. Are you against the vision of modeling an entire cell from the bottom up using probabilistic equations and hundreds of differential equations? You’re kind of saying that you think that won’t work out?

Adam Green (1:21:30) I just don’t think it’s useful. And I don’t think structural models or protein sequence models are bottom-up in this sense.

Niko McCarty (1:21:37) But why is the mechanistic model not useful?

Adam Green (1:21:40) Yeah, I think we’re talking past each other.

Niko McCarty (1:21:44) Like what Markus Covert is doing, saying, I want to simulate a cell using mathematical equations. Is that kind of thing useful?

Adam Green (1:21:53) I think it’s a fun abstraction. Ontologically, what is going on at the lowest level in systems is this kind of bottom-up, right? Molecules collide, proteins fold. Sure, that’s happening. Can you design a system of differential equations that, if you run it forward in time, accurately predicts what a cell will do? Maybe. Would it be useful? How much compute does it require? Do you end up with a—

Niko McCarty (1:21:55) What do you mean by that?

Adam Green (1:22:23) —a one-to-one map of the territory? I don’t know. If you take that for, like, molecular dynamics simulation, sure, that’s probably pretty useful. Maybe there are some processes where it is impossible to develop a good machine learning model of the dynamics, and you just need to run it bottom-up. And if that system’s governed by a well-known set of equations, then just run it forward in time. We have a pretty good idea of how quantum mechanics works. I think biology at the cell level and above is so much more complex and contingent that it is not expressible in a set of differential equations. The dynamics of biology are expressible in a very, very large neural network. But even to capture something as simple as a conditional, or an AND gate, using differential equations—certain things are just beyond. You cannot express them in these terms. And so I am against mechanism insofar as it restricts the set of tools we use to try to express biology in. I’m not against mechanism per se. I think it’s important for making these low-inductive-bias, very general models useful—again, we don’t just want a black box, that’s not useful. But I think, and this is what I argued in the 2022 essay, human legibility—our ability to understand a system—is a hard constraint on our models. And I think it has been limiting us. It has gotten us quite a lot. We know a tremendous amount about the cell, a tremendous amount about biology, but to truly accelerate biomedical progress, we may need to discard these assumptions about legibility, and—man, I really want to know what’s going on inside it—in favor of something that is a bit more black-box.

Niko McCarty (1:24:08) Yeah. Or we only illuminate very specific parts at a given time, you’re saying. You know, I think the main motive of the mechanistic approach is that it guides experiments. You can understand your knowledge base; you can actually run a simulation with known parameters and known equations and see if it matches experiment. And then that tells you something about the next experiment you might have to do, or the next measurement you might have to take, to decrease the error between these two things. But presumably, if you have good ways to interrogate a black box using sparse autoencoders to guide those experiments, then it’s trying to do the same thing from a different approach.

Adam Green (1:24:54) Yeah. And then the question becomes, will our ability to hold all of this mech-interp knowledge derived from the black box in our heads at once reach its limit? I think we’re in this period of what I’d call liminal legibility, where we are going to crack open these systems of, say, a single cell, we’re going to discover—

Niko McCarty (1:25:08) And what do you think about that?

Adam Green (1:25:22) —laws of biological dynamics that make sense to us. We can map them to existing paradigms we have of, like, this is how functional genomics works, or this is how protein folding works. But eventually the complexity of the dynamics is going to exceed our capacity to hold them in our heads. And we’re probably going to be offloading a lot of this to AI agents—not models of biological agents, but reasoning agents, which are quite in vogue. And they will be the ones using the mech-interp toolkit on these models, being the AI scientist, if you will, doing the research. And so maybe this is three years, five years, but at some point we’ll just relinquish control and admit that our attempts to understand these biological systems were a stopgap, and they truly exceed our comprehension—especially as you go beyond the scale of the single cell, to bigger and bigger systems.

Niko McCarty (1:26:24) And your vision is that we’re going to have some kind of autonomous lab with agents, interrogating our black-box models and then designing experiments, in some kind of distant future?

Adam Green (1:26:35) Why not the not-so-distant future?

Niko McCarty (1:26:36) You’re just kind of agnostic about the experimental component, right?

Adam Green (1:26:40) I don’t know about agnostic. I think one interesting question is, how much can you get through just reasoning alone, searching the literature, versus how much do you need world models? If I’m trying to pick the next experiment to run, can I just run an AI scientist in a data center in Texas for long enough, and will it spit out the right answer that you then go validate in the lab? Or is it going to be more of this iterative process, where probably biological world models will play an important role? But verification is going to be necessary regardless.

Niko McCarty (1:27:20) I think another related question, on this issue of what is sufficient for understanding, would be to ask: if we had a magic tool—and some people are actually trying to build this tool—that we could put into a cell and it would measure everything inside of it. So it would sequence the genome and tell you everything about which genes are active, which RNA transcripts are present, which proteins and proteoforms are present. Imagine we could just quantify every molecule in a cell and its position. Would that be sufficient for understanding? If we had a magic molecular sensor, if we could read everything, is that sufficient to then understand the cell?

Adam Green (1:28:07) I don’t think to understand it, but to model it—given sufficient data and enough capacity to learn about the dynamics that govern those data, yeah. But again, I think imposing this requirement of understanding is too much, because it’s a hard inductive bias on what we expect stuff to look like. If you say understanding is cashing something out in the language of differential equations, then you’re going to capture a lot of it—gene A regulates gene B, I can model that—but there’s a lot of the structure that you’re going to lose. And so I think understanding is a bit too high of a bar to shoot for.

Niko McCarty (1:28:51) I see. But isn’t that your bar?

Adam Green (1:28:54) No, my bar is control.

Niko McCarty (1:28:55) So, predictability—yeah, what do you mean by that? What does control mean in this context?

Adam Green (1:29:00) I have a system, I want it to do something—how do you do that? Now, it might sound like I’m arguing for the simulator-based approach. But I’m not. I’m saying that understanding the dynamics of the system is instrumental to this purpose of control. But I’m not wedded to understanding. If you gave me a black-box model that spat out drug candidates that caused some desired shift in state—

Niko McCarty (1:29:09) Of perturbations.

Adam Green —I think that’d be pretty great.

Niko McCarty Well, yeah, I think that’s a good place to end. So thank you so much, Adam.

Adam Green (1:29:32) Yeah, it was fun. Thanks.

source & further reading

letter.nikomc.com — original article Why Are Cells Small?
