Biology is hard, and it’s very easy to accidentally produce misleading results. This post aims to help us improve together and get closer to the dream of understanding biology, not to call out individual methods or authors.
UPDATE: Others have reached the same conclusion; Constantin et al. just published this.
It’s becoming more important than ever to keep your benchmarks healthy!
Foundation models in biology are a great concept.
The core idea is to show the AI what cells can look like – which “allowed” transcriptomic states (combinations of gene expression levels) describe actual cells, and which “disallowed” states don’t describe a working cell.
Why is any state disallowed in the first place?
A live cell is a very carefully regulated environment. Changing its protein levels is like going in and reorganizing the gears of a steam engine – done randomly, the engine will most likely blow up. The underlying assumption is that for an AI to learn this shared surface of allowed cell states, it needs to build an internal representation of gene regulation – a gene regulatory network (GRN) hidden inside the Transformer’s clockwork: the attention weights.
This is an idea that might just work; but does it?
As with most large AI models, theoretical reasoning only gets you so far; the proof of the pudding is in the eating. Is there a task that would be infeasible to do well without knowing the “true” human GRN? It turns out there is: perturbation prediction. If the model can figure out how cells respond to a gene (or a combination of genes) getting knocked out, it must have built a useful internal understanding of biology.
Fortunately, the authors of scGPT[1] did pose this question to the model; here are the results:
scLLM model performance comparison on the Adamson et al.[2] dataset.
Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs. predicted.
Are we done? These results seem to imply that we can predict the differential expression of genes for new perturbations with a correlation of 0.6+. That seems respectable, and while not perfect, probably already useful.
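To make the metric concrete, here is a minimal sketch of how such a score can be computed. Correlating true and predicted DE across genes for each held-out perturbation, then averaging, is my reading of the setup; the function and array names are illustrative, not taken from the scGPT evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr

def de_correlation(true_de: np.ndarray, pred_de: np.ndarray) -> float:
    """true_de, pred_de: (n_perturbations, n_genes) matrices of differential
    expression (e.g. log-fold change vs. control).
    Returns the Pearson r computed across genes, averaged over perturbations."""
    scores = [pearsonr(t, p)[0] for t, p in zip(true_de, pred_de)]
    return float(np.mean(scores))
```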
Let’s add a simple bias predictor that predicts only the mean expression of each gene over the training set.
scLLM model performance comparison on the Adamson et al.[2] dataset, with the train mean predictor’s performance added.
Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs. predicted.
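A minimal sketch of such a “train mean” baseline, assuming the differential expressions are already computed and scoring with the same function as above (names and shapes are again illustrative):

```python
import numpy as np

def train_mean_baseline(train_de: np.ndarray, n_test: int) -> np.ndarray:
    """train_de: (n_train_perturbations, n_genes) matrix of differential
    expressions. Returns an (n_test, n_genes) prediction that is identical
    for every held-out perturbation."""
    mean_de = train_de.mean(axis=0)       # per-gene mean over the training perturbations
    return np.tile(mean_de, (n_test, 1))  # the perturbation identity is ignored entirely

# Scored with the metric sketched earlier:
# score = de_correlation(test_de, train_mean_baseline(train_de, len(test_de)))
```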
Has our previously respectable performance just been outmatched by a simple mean predictor? Let’s dive into the data to understand what’s going on. Calculating how well the differential expressions correlate between all pairs of samples yields the following plot. (Note that the samples have been pseudo-bulked, which is an interesting lesson on its own – despite having tens of thousands of cells, you may still only have a few dozen samples’ worth of information, as shown below.)
Pairwise DE correlations of sample pairs in the Adamson et al.[2] dataset.
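For reference, here is a rough sketch of how such a pairwise correlation matrix can be computed from an AnnData object. The “perturbation” and “control” labels, and the assumption that adata.X already holds normalized expression, are mine and not the actual Adamson et al. annotations.

```python
import numpy as np
import pandas as pd

def pairwise_de_correlation(adata, pert_key="perturbation", control="control"):
    """Pseudo-bulk each perturbation, compute DE vs. the control pseudo-bulk,
    and return the perturbation-by-perturbation Pearson correlation matrix."""
    # adata.X is assumed to be (log-)normalized expression, cells x genes.
    x = adata.X if isinstance(adata.X, np.ndarray) else adata.X.toarray()
    expr = pd.DataFrame(x, index=adata.obs[pert_key].values, columns=adata.var_names)
    pseudobulk = expr.groupby(level=0).mean()                      # one profile per perturbation
    de = pseudobulk.drop(index=control) - pseudobulk.loc[control]  # DE relative to control
    return de.T.corr()                                             # pairwise Pearson r
```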
Reading the plot gives us a clue as to why the mean predictor performs so well: most of the responses are very similar! Samples #0, #1, #57 and #58 form a clear outlier group, while most of the other samples are extremely correlated. Although you can make out a few separate response mechanisms in the bulk of the samples, the correlation is 0.5+ even across response types (and easily 0.8+ between different samples within the same response group).
The responses are probably so highly correlated because all of the perturbations target different proteins within the same growth pathway.
Data this biased is not ideal in any dataset, but in biology especially you want outliers; they are the bread and butter of your benchmark. When your AI gets used in the real world, these are the shiny sparks that lead scientists to the golden ore of selective patient populations. (You might even want to focus your scoring function on finding outliers, but that is a story for another time.)
Our findings above tell us that this is not a useful benchmark set, but not that the method itself is bad.
The Replogle et al.[3] dataset indeed gives us a much better picture:
Pairwise DE correlations of sample pairs in the Replogle et al.[3] dataset.
While we still have correlated rows, there are many more unique responses in this dataset. Indeed, the performance of the mean predictor drops significantly: to 0.35 from the previous 0.7 on the Adamson set. Unfortunately, every model’s performance drops below the mean predictor’s as well; scGPT, in particular, drops to 0.24.
scLLM model performance comparison on the Replogle et al.[3] dataset, with the train mean predictor’s performance added. Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs. predicted.
And even here, we’re still only working with a single cell line, K562! A good benchmark should also measure how well your model works on cell lines it was never trained on, because, frankly, each new patient will behave like a cell line you’ve never seen before.
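As an illustration, a group-aware split along the lines below would keep entire cell lines out of training; the cell_line column is hypothetical, since the benchmarks discussed above contain only K562.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def cell_line_holdout(obs: pd.DataFrame, group_col: str = "cell_line",
                      test_size: float = 0.2, seed: int = 0):
    """Split sample indices so that no cell line appears in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(obs, groups=obs[group_col]))
    return train_idx, test_idx
```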
If there is one lesson here, it’s this: check your benchmarks! Having good benchmarks is harder than it looks. Let me end by plugging our EFFECT paper[4], which can give you some considerations to start with.
Most of the actual work underlying this post was done by Gerold Csendes and Bence Szalai – thanks for your work!
Kris Szalay
CTO@Turbine
[1]: Cui et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI”
[2]: Adamson et al. “A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response”
[3]: Replogle et al. “Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq”
[4]: Szalai et al. “The EFFECT benchmark suite: measuring cancer sensitivity prediction performance – without the bias”