Publications by Year: 2021


Doron M, Mozannar H, Sontag D, Caicedo JC. Machine Teaching with Generative Models for Human Learning. In: The International Conference on Machine Learning. 2021.

Experimental scientists face an increasingly difficult challenge: while technological advances allow for the collection of larger and higher quality datasets, computational methods to better understand and make new discoveries in the data lag behind. Existing explainable AI and interpretability methods for machine learning focus on better understanding model decisions, rather than understanding the data itself. In this work, we tackle a specific task that can aid experimental scientists in the era of big data: given a large dataset of annotated samples divided into different classes, how can we best teach human researchers what the difference between the classes is? To accomplish this, we develop a new framework combining machine teaching and generative models that generates a small set of synthetic teaching examples for each class. This set aims to contain all the information necessary to distinguish between the classes. To validate our framework, we perform a human study in which subjects learn how to classify various datasets using a small teaching set generated by our framework as well as by several subset selection algorithms. We show that while generated samples succeed in teaching humans better than chance, subset selection methods (such as k-centers or forgettable events) succeed better in this task, suggesting that real samples might be better suited than realistic generative samples. We suggest several ideas for improving human teaching using machine learning.

Caicedo J, Moshkov N, Becker T, Yang K, Horvath P, Dančik V, Wagner BK, Clemons P, Singh S, Carpenter AE. Predicting compound activity from phenotypic profiles and chemical structures. bioRxiv. 2021.

Recent advances in deep learning enable using chemical structures and phenotypic profiles to accurately predict assay results for compounds virtually, reducing the time and cost of screens in the drug-discovery process. We evaluate the relative strength of three high-throughput data sources—chemical structures, images (Cell Painting), and gene-expression profiles (L1000)—to predict compound activity using a sparse historical collection of 16,186 compounds tested in 314 assays for a total of 679,819 readouts. All three data modalities can predict compound activity with high accuracy in 7-8% of assays tested; replacing million-compound physical screens with computationally prioritized smaller screens throughout the pharmaceutical industry could yield major savings. Furthermore, the three profiling modalities are complementary, and in combination they can predict 18% of assays with high accuracy, and up to 59% if lower accuracy is acceptable for some screening projects. Our study shows that, for many assays, predicting compound activity from phenotypic profiles and chemical structures could accelerate the early stages of the drug-discovery process.

Pratapa A, Doron M, Caicedo J. Image-based cell phenotyping with deep learning. 2021.

A cell’s phenotype is the culmination of several cellular processes through a complex network of molecular interactions that ultimately result in a unique morphological signature. Visual cell phenotyping is the characterization and quantification of these observable cellular traits in images. Recently, cellular phenotyping has undergone a massive overhaul in terms of scale, resolution, and throughput, which is attributable to advances across electronic, optical, and chemical technologies for imaging cells. Coupled with the rapid acceleration of deep learning–based computational tools, these advances have opened up new avenues for innovation across a wide variety of high-throughput cell biology applications. Here, we review applications wherein deep learning is powering the recognition, profiling, and prediction of visual phenotypes to answer important biological questions. As the complexity and scale of imaging assays increase, deep learning offers computational solutions to elucidate the details of previously unexplored cellular phenotypes.

PMID: 34023800