Doron M, Mozannar H, Sontag D, Caicedo JC. Machine Teaching with Generative Models for Human Learning. In: The International Conference on Machine Learning. 2021.

Experimental scientists face an increasingly difficult challenge: while technological advances allow for the collection of larger and higher quality datasets, computational methods to better understand and make new discoveries in the data lag behind. Existing explainable AI and interpretability methods for machine learning focus on better understanding model decisions, rather than understanding the data itself. In this work, we tackle a specific task that can aid experimental scientists in the era of big data: given a large dataset of annotated samples divided into different classes, how can we best teach human researchers the differences between the classes? To accomplish this, we develop a new framework combining machine teaching and generative models that generates a small set of synthetic teaching examples for each class. This set aims to contain all the information necessary to distinguish between the classes. To validate our framework, we perform a human study in which human subjects learn how to classify various datasets using a small teaching set generated by our framework as well as several subset selection algorithms. We show that while generated samples succeed in teaching humans better than chance, subset selection methods (such as k-centers or forgettable events) perform better at this task, suggesting that real samples might be better suited than realistic generative samples. We suggest several ideas for improving human teaching using machine learning.

Caicedo J, Arevalo J, Piccioni F, Bray M-A, Hartland CL, Wu X, Brooks AN, Berger AH, Boehm JS, Carpenter AE, et al. Cell Painting predicts impact of lung cancer variants. bioRxiv. 2021.

Most variants in most genes across most organisms have an unknown impact on the function of the corresponding gene. This gap in knowledge is especially acute in cancer, where clinical sequencing of tumors now routinely reveals patient-specific variants whose functional impact on the corresponding gene is unknown, impeding clinical utility. Transcriptional profiling was able to systematically distinguish these variants of unknown significance (VUS) as impactful vs. neutral in an approach called expression-based variant-impact phenotyping (eVIP). We profiled a set of lung adenocarcinoma-associated somatic variants using Cell Painting, a morphological profiling assay that captures features of cells based on microscopy using six stains of cell and organelle components. Using deep-learning-extracted features from each cell’s image, we found that cell morphological profiling (cmVIP) can predict variants’ functional impact and, particularly at the single-cell level, reveals biological insights into variants which can be explored in our public online portal. Given its low cost, convenient implementation, and single-cell resolution, cmVIP profiling therefore seems promising as an avenue for using non-gene-specific assays to systematically assess the impact of variants, including disease-associated alleles, on gene function.

Caicedo J, Moshkov N, Becker T, Yang K, Horvath P, Dančik V, Wagner BK, Clemons P, Singh S, Carpenter AE. Predicting compound activity from phenotypic profiles and chemical structures. bioRxiv. 2021.

Recent advances in deep learning enable using chemical structures and phenotypic profiles to accurately predict assay results for compounds virtually, reducing the time and cost of screens in the drug-discovery process. We evaluate the relative strength of three high-throughput data sources—chemical structures, images (Cell Painting), and gene-expression profiles (L1000)—to predict compound activity using a sparse historical collection of 16,186 compounds tested in 314 assays for a total of 679,819 readouts. All three data modalities can predict compound activity with high accuracy in 7-8% of assays tested; replacing million-compound physical screens with computationally prioritized smaller screens throughout the pharmaceutical industry could yield major savings. Furthermore, the three profiling modalities are complementary, and in combination they can predict 18% of assays with high accuracy, and up to 59% if lower accuracy is acceptable for some screening projects. Our study shows that, for many assays, predicting compound activity from phenotypic profiles and chemical structures could accelerate the early stages of the drug-discovery process.

Pratapa A, Doron M, Caicedo J. Image-based cell phenotyping with deep learning. Nature Methods. 2021.

A cell’s phenotype is the culmination of several cellular processes through a complex network of molecular interactions that ultimately result in a unique morphological signature. Visual cell phenotyping is the characterization and quantification of these observable cellular traits in images. Recently, cellular phenotyping has undergone a massive overhaul in terms of scale, resolution, and throughput, which is attributable to advances across electronic, optical, and chemical technologies for imaging cells. Coupled with the rapid acceleration of deep learning–based computational tools, these advances have opened up new avenues for innovation across a wide variety of high-throughput cell biology applications. Here, we review applications wherein deep learning is powering the recognition, profiling, and prediction of visual phenotypes to answer important biological questions. As the complexity and scale of imaging assays increase, deep learning offers computational solutions to elucidate the details of previously unexplored cellular phenotypes.

PMID: 34023800


Caicedo J, Roth J, Goodman A, Becker T, Karhohs K, Broisin M, Molnar C, McQuin C, Singh S, Theis F, et al. Evaluation of Deep Learning Strategies for Nucleus Segmentation in Fluorescence Images. Cytometry Part A. 2019;95(9):952-965.

Identifying nuclei is often a critical first step in analyzing microscopy images of cells and classical image processing algorithms are most commonly used for this task. Recent developments in deep learning can yield superior accuracy, but typical evaluation metrics for nucleus segmentation do not satisfactorily capture error modes that are relevant in cellular images. We present an evaluation framework to measure accuracy, types of errors, and computational efficiency; and use it to compare deep learning strategies and classical approaches. We publicly release a set of 23,165 manually annotated nuclei and source code to reproduce experiments and run the proposed evaluation methodology. Our evaluation framework shows that deep learning improves accuracy and can reduce the number of biologically relevant errors by half.

PMID:31313519 / PMCID: PMC6771982

Caicedo JC, Goodman A, Karhohs KW, Cimini BA, Ackerman J, Haghighi M, Heng C, Becker T, Doan M, McQuin C, et al. Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nature Methods. 2019;16(12):1247-1253.

Segmenting the nuclei of cells in microscopy images is often the first step in the quantitative analysis of imaging data for biological and biomedical applications. Many bioimage analysis tools can segment nuclei in images but need to be selected and configured for every experiment. The 2018 Data Science Bowl attracted 3,891 teams worldwide to make the first attempt to build a segmentation method that could be applied to any two-dimensional light microscopy image of stained nuclei across experiments, with no human interaction. Top participants in the challenge succeeded in this task, developing deep-learning-based models that identified cell nuclei across many image types and experimental conditions without the need to manually adjust segmentation parameters. This represents an important step toward configuration-free bioimage analysis software tools.

PMID: 31636459 / PMCID: PMC6919559


Caicedo J, McQuin C, Goodman A, Singh S, Carpenter A. Weakly Supervised Learning of Single-Cell Feature Embeddings. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018:9309-9318.

We study the problem of learning representations for single cells in microscopy images to discover biological relationships between their experimental conditions. Many new applications in drug discovery and functional genomics require capturing the morphology of individual cells as comprehensively as possible. Deep convolutional neural networks (CNNs) can learn powerful visual representations, but require ground truth for training; this is rarely available in biomedical profiling experiments. While we do not know which experimental treatments produce cells that look alike, we do know that cells exposed to the same experimental treatment should generally look similar. Thus, we explore training CNNs using a weakly supervised approach that uses this information for feature learning. In addition, the training stage is regularized to control for unwanted variations using mixup or RNNs. We conduct experiments on two different datasets; the proposed approach yields single-cell embeddings that are more accurate than the widely adopted classical features, and are competitive with previously proposed transfer learning approaches.

PMID: 30918435 / PMCID: PMC6432648



Caicedo J, Singh S, Carpenter A. Applications in image-based profiling of perturbations. Current Opinion in Biotechnology. 2016;39:134-142.

A dramatic shift has occurred in how biologists use microscopy images. Whether experiments are small-scale or high-throughput, automatically quantifying biological properties in images is now widespread. We see yet another revolution under way: a transition towards using automated image analysis to not only identify phenotypes a biologist specifically seeks to measure ('screening') but also as an unbiased and sensitive tool to capture a wide variety of subtle features of cell (or organism) state ('profiling'). Mapping similarities among samples using image-based (morphological) profiling has tremendous potential to transform drug discovery, functional genomics, and basic biological research. Applications include target identification, lead hopping, library enrichment, functionally annotating genes/alleles, and identifying small molecule modulators of gene activity and disease-specific phenotypes.

PMID: 27089218