Moshkov N, Becker T, Yang K, Horvath P, Dančik V, Wagner BK, Clemons P, Singh S, Carpenter AE, Caicedo J. Predicting compound activity from phenotypic profiles and chemical structures. Nature Communications. 2023.

Recent advances in deep learning enable using chemical structures and phenotypic profiles to accurately predict assay results for compounds virtually, reducing the time and cost of screens in the drug-discovery process. We evaluate the relative strength of three high-throughput data sources—chemical structures, images (Cell Painting), and gene-expression profiles (L1000)—to predict compound activity using a sparse historical collection of 16,186 compounds tested in 314 assays for a total of 679,819 readouts. All three data modalities can predict compound activity with high accuracy in 7-8% of assays tested; replacing million-compound physical screens with computationally prioritized smaller screens throughout the pharmaceutical industry could yield major savings. Furthermore, the three profiling modalities are complementary, and in combination they can predict 18% of assays with high accuracy, and up to 59% if lower accuracy is acceptable for some screening projects. Our study shows that, for many assays, predicting compound activity from phenotypic profiles and chemical structures could accelerate the early stages of the drug-discovery process.

Bhate SS, Seigal A, Caicedo J. Deciphering causal genomic templates of complex molecular phenotypes. bioRxiv. 2022.

We develop a mathematical theory proposing that complex molecular phenotypes (CMPs, e.g., single-cell gene expression distributions and tissue organization) are produced from templates in the genome. We validate our theory using a procedure termed Causal Phenotype Sequence Alignment (CPSA). CPSA finds a candidate template of a CMP by aligning – without using genetic variation or biological annotations – a phenotypic measurement (e.g., a tissue image) with a reference genome. Given any edit to the CMP (e.g., changing cellular localization), CPSA outputs the genomic loci in the alignment corresponding to the edit. We confirm that three CMPs (single-cell gene expression distributions of the immune system and of embryogenesis, and tissue organization of the tumor microenvironment) have templates: the loci output by CPSA for therapeutically significant edits of these CMPs reveal genes, regulatory regions and active-sites whose experimental manipulation causes the edits. Our theory provides a systematic framework for genetically redesigning CMPs.

Moshkov N, Bornholdt M, Benoit S, Smith M, McQuin C, Goodman A, Senft R, Han Y, Babadi M, Horvath P, et al. Learning representations for image-based profiling of perturbations. bioRxiv. 2022.

Measuring the phenotypic effect of treatments on cells through imaging assays is an efficient and powerful way of studying cell biology, and requires computational methods for transforming images into quantitative data that highlights phenotypic outcomes. Here, we present an optimized strategy for learning representations of treatment effects from high-throughput imaging data, which follows a causal framework for interpreting results and guiding performance improvements. We use weakly supervised learning (WSL) for modeling associations between images and treatments, and show that it encodes both confounding factors and phenotypic features in the learned representation. To facilitate their separation, we constructed a large training dataset with Cell Painting images from five different sources to maximize experimental diversity, following insights from our causal analysis. Training a WSL model with this dataset successfully improves downstream performance, and produces a reusable convolutional network for image-based profiling, which we call Cell Painting CNN. We conducted a comprehensive evaluation of our strategy on three publicly available Cell Painting datasets, discovering that representations obtained by the Cell Painting CNN can improve performance in downstream analysis up to 25% with respect to classical features, while also being more computationally efficient.

Caicedo JC, Arevalo J, Piccioni F, Bray M-A, Hartland CL, Wu X, Brooks AN, Berger AH, Boehm JS, Carpenter A, et al. Cell Painting predicts impact of lung cancer variants. Molecular Biology of the Cell. 2022;(6).

Most variants in most genes across most organisms have an unknown impact on the function of the corresponding gene. This gap in knowledge is especially acute in cancer, where clinical sequencing of tumors now routinely reveals patient-specific variants whose functional impact on the corresponding genes is unknown, impeding clinical utility. Transcriptional profiling was able to systematically distinguish these variants of unknown significance as impactful vs. neutral in an approach called expression-based variant-impact phenotyping. We profiled a set of lung adenocarcinoma-associated somatic variants using Cell Painting, a morphological profiling assay that captures features of cells based on microscopy using six stains of cell and organelle components. Using deep-learning-extracted features from each cell’s image, we found that cell morphological profiling (cmVIP) can predict variants’ functional impact and, particularly at the single-cell level, reveals biological insights into variants that can be explored at our public online portal. Given its low cost, convenient implementation, and single-cell resolution, cmVIP profiling therefore seems promising as an avenue for using non–gene specific assays to systematically assess the impact of variants, including disease-associated alleles, on gene function.

Doron M, Mozannar H, Sontag D, Caicedo JC. Machine Teaching with Generative Models for Human Learning. In: The International Conference on Machine Learning. 2021.

Experimental scientists face an increasingly difficult challenge: while technological advances allow for the collection of larger and higher quality datasets, computational methods to better understand and make new discoveries in the data lag behind. Existing explainable AI and interpretability methods for machine learning focus on better understanding model decisions, rather than understanding the data itself. In this work, we tackle a specific task that can aid experimental scientists in the era of big data: given a large dataset of annotated samples divided into different classes, how can we best teach human researchers what is the difference between the classes? To accomplish this, we develop a new framework combining machine teaching and generative models that generates a small set of synthetic teaching examples for each class. This set will aim to contain all the information necessary to distinguish between the classes. To validate our framework, we perform a human study in which human subjects learn how to classify various datasets using a small teaching set generated by our framework as well as several subset selection algorithms. We show that while generated samples succeed in teaching humans better than chance, subset selection methods (such as k-centers or forgettable events) succeed better in this task, suggesting that real samples might be better suited than realistic generative samples. We suggest several ideas for improving human teaching using machine learning.

Pratapa A, Doron M, Caicedo J. Image-based cell phenotyping with deep learning. 2021.

A cell’s phenotype is the culmination of several cellular processes through a complex network of molecular interactions that ultimately result in a unique morphological signature. Visual cell phenotyping is the characterization and quantification of these observable cellular traits in images. Recently, cellular phenotyping has undergone a massive overhaul in terms of scale, resolution, and throughput, which is attributable to advances across electronic, optical, and chemical technologies for imaging cells. Coupled with the rapid acceleration of deep learning–based computational tools, these advances have opened up new avenues for innovation across a wide variety of high-throughput cell biology applications. Here, we review applications wherein deep learning is powering the recognition, profiling, and prediction of visual phenotypes to answer important biological questions. As the complexity and scale of imaging assays increase, deep learning offers computational solutions to elucidate the details of previously unexplored cellular phenotypes.

PMID: 34023800

Caicedo J, Roth J, Goodman A, Becker T, Karhohs K, Broisin M, Molnar C, McQuin C, Singh S, Theis F, et al. Evaluation of Deep Learning Strategies for Nucleus Segmentation in Fluorescence Images. Cytometry Part A. 2019;95(9):952–965.

Identifying nuclei is often a critical first step in analyzing microscopy images of cells and classical image processing algorithms are most commonly used for this task. Recent developments in deep learning can yield superior accuracy, but typical evaluation metrics for nucleus segmentation do not satisfactorily capture error modes that are relevant in cellular images. We present an evaluation framework to measure accuracy, types of errors, and computational efficiency; and use it to compare deep learning strategies and classical approaches. We publicly release a set of 23,165 manually annotated nuclei and source code to reproduce experiments and run the proposed evaluation methodology. Our evaluation framework shows that deep learning improves accuracy and can reduce the number of biologically relevant errors by half.

PMID: 31313519 / PMCID: PMC6771982

Caicedo JC, Goodman A, Karhohs KW, Cimini BA, Ackerman J, Haghighi M, Heng C, Becker T, Doan M, McQuin C, et al. Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nature Methods. 2019;16(12):1247–1253.

Segmenting the nuclei of cells in microscopy images is often the first step in the quantitative analysis of imaging data for biological and biomedical applications. Many bioimage analysis tools can segment nuclei in images but need to be selected and configured for every experiment. The 2018 Data Science Bowl attracted 3,891 teams worldwide to make the first attempt to build a segmentation method that could be applied to any two-dimensional light microscopy image of stained nuclei across experiments, with no human interaction. Top participants in the challenge succeeded in this task, developing deep-learning-based models that identified cell nuclei across many image types and experimental conditions without the need to manually adjust segmentation parameters. This represents an important step toward configuration-free bioimage analysis software tools.

PMID: 31636459 / PMCID: PMC6919559