The EFFECT benchmark suite

A standard set of benchmarks for computational biology models predicting cell response

  • About
  • Usage
  • FAQ
  • Roadmap
  • License

About

With the availability of large preclinical datasets on cancer drug sensitivity and gene essentiality, computational biology models for predicting cancer sensitivity are gaining popularity. However, comparing these models is challenging: with so many published models and methods available, it is difficult to make meaningful comparisons without reproducing each one on your own data.

Armed with the experience of benchmarking our own models at Turbine, we publish the EFFECT benchmark suite. This carefully composed benchmark set focuses on models’ ability to identify biologically applicable predictions. While the benchmark is not entirely foolproof and can potentially be overfit with enough attempts, we have made substantial efforts to ensure its resilience.

Our approach revolves around three key principles: 

  • True holdout train/test splits: We prioritize results based on true holdout train/test splits. Unlike random splits, we believe that cell-, gene-, and drug-exclusive splits offer more meaningful insights in real-life scenarios for predicting cancer sensitivity. Emphasizing these splits enables researchers to evaluate model performance in situations that closely resemble practical applications. 
  • Selective performance: Instead of solely identifying universally ineffective or harmful drugs across all cells, we measure per target node performance. This approach requires passing predictors to demonstrate selectivity. In other words, a successful predictor must exhibit the ability to discern the specific contexts in which drugs are beneficial or detrimental for predicting cancer sensitivity. 
  • Bias detection and mitigation: To identify biases orthogonal to the measured metrics, we employ a so-called Bias Detector. For instance, models driven by general drug sensitivity of cell lines may pass the selective performance threshold, but they can only identify trivial drug-cell line associations. Our Bias Detector framework helps identify such bias-driven models. 

By adhering to these principles, our aim is to provide a benchmark that facilitates fair and meaningful comparisons of computational biology models in predicting cancer sensitivity. We encourage researchers to develop robust and selective predictors that transcend the limitations of bias and demonstrate their utility in real-world scenarios. 
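As a concrete illustration of the bias-detection principle, here is a minimal sketch (not the actual Bias Detector, which ships as bias-detector.ipynb) of the kind of trivial baseline it is designed to flag: a “predictor” that only uses each cell line’s mean training sensitivity and ignores the perturbation entirely. A model whose predictions closely track this baseline has learned general drug sensitivity of cell lines, not selective drug-cell associations.

```python
import numpy as np

def cell_mean_baseline(train_pairs, train_y, test_pairs):
    """Trivial bias-driven 'predictor': every test pair gets its cell
    line's mean training sensitivity (global mean for unseen cells).

    train_pairs / test_pairs are (cell_line, perturbation) tuples;
    train_y holds the corresponding measured sensitivities.
    """
    per_cell = {}
    for (cell, _), y in zip(train_pairs, train_y):
        per_cell.setdefault(cell, []).append(y)
    means = {c: float(np.mean(v)) for c, v in per_cell.items()}
    fallback = float(np.mean(train_y))  # for cells never seen in training
    return np.array([means.get(cell, fallback) for cell, _ in test_pairs])
```

Comparing a candidate model’s predictions against this kind of baseline is one simple way to see whether it adds information beyond per-cell-line bias.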

 

Usage

This data is not intended to be a competition set; it is designed to be a resource for your own projects. To make it easy to use, we have made the test data publicly available. At the same time, this means that anyone attempting to misuse the data could potentially overfit it with enough attempts.

You can access the train/test data from the Releases section. The splits and target metrics are provided in separate JSON files, categorized into “ko” for gene essentiality and “drug” for drug sensitivity. 

For gene essentiality predictions, we have created the following splits based on DepMap data (https://depmap.org/portal/, https://www.nature.com/articles/ng.3984):

  • TRAIN: This is the training set. 
  • RND: The random test set consists of cell lines and perturbations that each appear in the training set, but in cell line-perturbation pairings that do not. 
  • CEX: The cell-exclusive test set enables prediction of the effects of training perturbations on new cell lines. 
  • GEX: The gene-exclusive test set allows prediction of the effects of new perturbations on the training cell lines. 
  • AEX: The all-exclusive test set facilitates prediction of the effects of new perturbations on new cell lines. 
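The holdout semantics above can be sanity-checked programmatically. The sketch below assumes each split has been flattened to a list of (cell_line, perturbation) pairs; the actual JSON layout may differ.

```python
def check_split_exclusivity(train, cex, gex, aex):
    """Verify the holdout semantics of the splits.

    Each argument is a list of (cell_line, perturbation) pairs:
    CEX must contain no training cell lines, GEX no training
    perturbations, and AEX neither.
    """
    train_cells = {c for c, _ in train}
    train_perts = {p for _, p in train}
    assert not train_cells & {c for c, _ in cex}, "CEX cells leak from TRAIN"
    assert not train_perts & {p for _, p in gex}, "GEX perturbations leak from TRAIN"
    assert not train_cells & {c for c, _ in aex}, "AEX cells leak from TRAIN"
    assert not train_perts & {p for _, p in aex}, "AEX perturbations leak from TRAIN"
    return True
```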

In our benchmark, we have used only a subset of the gene essentiality data (a subset of genes) to keep the dataset more balanced. However, we provide the EXT_GEX and EXT_AEX splits to explore performance on the genome-wide DepMap data.

Each JSON file contains the following information about the samples: 

  • cell_line: The perturbed cell. 
  • perturbation: The perturbed (CRISPR KOd) gene. 
  • gene_effect: The target variable for predictions, representing the gene effect from DepMap. A gene effect of 0 indicates no fitness effect of the KO, while negative and positive gene effects represent negative and positive fitness effects, respectively. Typically, a gene effect of -0.5 indicates a “significant” viability reduction. 
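Assuming each JSON file deserializes to a list of records carrying the fields above (the actual file layout may differ), loading a split and flagging KOs past the -0.5 rule of thumb could look like this:

```python
import json

def load_ko_split(path):
    """Load one KO split file; assumes it deserializes to a list of
    records with cell_line / perturbation / gene_effect fields."""
    with open(path) as f:
        return json.load(f)

def significant_hits(records, threshold=-0.5):
    """Return (cell_line, perturbation) pairs whose gene_effect falls
    below the ~-0.5 'significant viability reduction' rule of thumb."""
    return [(r["cell_line"], r["perturbation"])
            for r in records if r["gene_effect"] < threshold]
```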

Similarly, for drug sensitivity prediction, we have created the following splits based on GDSC2 data (https://www.cancerrxgene.org/, https://www.cell.com/cell/fulltext/S0092-8674(16)30746-2): 

  • TRAIN: This is the training set. 
  • RND: The random test set consists of cell lines and perturbations that each appear in the training set, but in cell line-perturbation pairings that do not. 
  • CEX: The cell-exclusive test set enables prediction of the effects of training perturbations on new cell lines. 
  • DEX: The drug-exclusive test set allows prediction of the effects of new perturbations (drugs) on the training cell lines. 
  • AEX: The all-exclusive test set facilitates prediction of the effects of new perturbations on new cell lines. 

Each JSON file contains the following information about the samples: 

  • cell_line: The perturbed cell. 
  • perturbation: The PubChem ID of the drug. 

We have three different target metrics: 

  • LN_IC50: The natural logarithm of the half-maximal inhibitory concentration (IC50). 
  • z-score: The drug-wise normalized version of LN_IC50. 
  • AUC: The area under the drug response curve. 
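The z-score metric is defined by drug-wise normalization of LN_IC50. A minimal sketch of that computation (using the population standard deviation, NumPy’s default):

```python
import numpy as np

def drugwise_zscore(ln_ic50, drugs):
    """Normalize LN_IC50 per drug: for each drug, subtract its mean
    and divide by its standard deviation across cell lines."""
    ln_ic50 = np.asarray(ln_ic50, dtype=float)
    z = np.empty_like(ln_ic50)
    for d in set(drugs):
        mask = np.array([x == d for x in drugs])
        vals = ln_ic50[mask]
        z[mask] = (vals - vals.mean()) / vals.std()
    return z
```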

We performed this splitting exercise three times to allow gauging how much each model depends on a specific test set. The file behind the “all train/test sets” button contains all three “split variants”. The primary test set is the test set of split 0.

Evaluation scripts: 

You can use our evaluation metrics and bias detector to correctly evaluate your predictions.

The downloadable zip file includes example files (example.json for targets and example.npy for predictions), precalculated cell and perturbation biases, and two notebooks for running the evaluation scripts (eval_script.ipynb) and the bias detector (bias-detector.ipynb).
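As a hypothetical example of wiring the example files together (the official metrics live in eval_script.ipynb; Pearson correlation is used here only as a stand-in overall score, and the record layout of example.json is assumed):

```python
import json
import numpy as np

def evaluate(targets_path, preds_path):
    """Compare a predictions .npy against a targets .json.

    Assumes the JSON is a list of records with a gene_effect field and
    that the .npy array is in the same sample order. Pearson r is shown
    as one plausible overall score; the official notebooks define the
    actual metrics.
    """
    with open(targets_path) as f:
        records = json.load(f)
    y_true = np.array([r["gene_effect"] for r in records])
    y_pred = np.load(preds_path)
    assert y_pred.shape == y_true.shape, "prediction/target length mismatch"
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```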

FAQ

Q: I have a set of models trained separately for each drug / KO! Can I use this set?
A: Of course! Just leave the perturbation-exclusive (GEX/DEX) and AEX splits out – the CEX results will still be valid.

Q: Should I run all split variants?
A: It makes sense to train/test all split variants once, so you can ensure you’re not overfitting any specific split. But generally, results on split 0 are fine on their own – the downloadable primary test set is actually just split 0’s test.

Q: Where are the rest of the genes?
A: We’ve only included genes for which we could generate node2vec embeddings from Omnipath data.

Roadmap

Planned for future versions:

  • Harmonized drug and gene dependency train / test sets so a single model can use both without data leaks.
  • RNAi tests
  • Synthetic lethality and combination tests

License

The evaluation scripts and sample models are released under a CC-BY-SA 4.0 license. In a nutshell, feel free to use them in your projects – even commercial ones, as long as you don’t resell the datasets themselves.

If you publish results using or derived from the EFFECT benchmark, please cite the following article:
https://www.biorxiv.org/content/10.1101/2023.10.02.560281

The drug sensitivity dataset is based on GDSC data, so the GDSC license (largely similar in its terms) also applies.

The gene dependency dataset is based on DepMap, which is published under CC-BY-4.0 – don’t forget to attribute them as well!

RELEASES

Release v1.0

Initial release containing two independent datasets: one for gene dependency model training and prediction (based on DepMap Achilles data), and another for benchmarking drug sensitivity capabilities (based on GDSC2 data).

An important caveat: don’t use the drug training sets to train for the gene dependency test or the other way around! It will leak data into the holdout sets, invalidating your results.

Also, if you assemble your own training sets to test against these benchmark targets, make sure the drugs’ targets don’t overlap with any genes in the GEX test set, and vice versa: genes in your training set shouldn’t overlap with the targets of drugs in the DEX test set.
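A sketch of such an overlap check, assuming a hypothetical `drug_targets` mapping ({drug: set of target genes}) that you would build from your own annotation source:

```python
def find_target_leaks(train_genes, test_drugs, drug_targets):
    """Flag drugs in a test set whose targets appear among your
    training genes (a cross-dataset leak).

    drug_targets is a hypothetical {drug: set-of-genes} mapping; it is
    not shipped with the benchmark.
    """
    leaks = {}
    for d in test_drugs:
        overlap = set(train_genes) & set(drug_targets.get(d, ()))
        if overlap:
            leaks[d] = overlap  # these drugs should be excluded or retrained
    return leaks
```

The same function applied in the other direction (drug-training targets vs. GEX genes) covers the reverse check.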

Statistics:

Drug dataset:

Cell lines in TRAIN (& DEX): 555
Cell lines in CEX (& AEX): 139
Drugs in TRAIN (& CEX): 117
Drugs in DEX (& AEX): 18

Total set sizes:

          split 0 (primary)   split 1     split 2
TRAIN     46,038              42,896      42,654
CEX       14,334              13,460      13,427
DEX        9,479              13,579      13,821
AEX        2,424               3,396       3,489

CRISPR KO dataset:

Cell lines in TRAIN (& GEX): 803
Cell lines in CEX (& AEX): 201
Genes in TRAIN (& CEX): 1036
Genes in GEX (& AEX): 258
Genes in extended GEX: 6052

Total set sizes:

          split 0 (primary)   split 1     split 2
TRAIN     665,430             662,953     657,971
CEX       208,196             211,308     217,520
GEX       207,150             206,364     204,828
AEX        51,850              52,620      54,172
EXT_GEX   4,859,048           4,840,880   4,804,580
EXT_AEX   1,216,216           1,234,368   1,270,684
All rights reserved ©2024 Turbine Ltd.