We develop a mathematical theory proposing that complex molecular phenotypes (CMPs, e.g., single-cell gene expression distributions and tissue organization) are produced from templates in the genome. We validate our theory using a procedure termed Causal Phenotype Sequence Alignment (CPSA). CPSA finds a candidate template of a CMP by aligning – without using genetic variation or biological annotations – a phenotypic measurement (e.g., a tissue image) with a reference genome. Given any edit to the CMP (e.g., changing cellular localization), CPSA outputs the genomic loci in the alignment corresponding to the edit. We confirm that three CMPs (single-cell gene expression distributions of the immune system and of embryogenesis, and tissue organization of the tumor microenvironment) have templates: the loci output by CPSA for therapeutically significant edits of these CMPs reveal genes, regulatory regions, and active sites whose experimental manipulation causes the edits. Our theory provides a systematic framework for genetically redesigning CMPs.
Publications by Year: 2022
Measuring the phenotypic effect of treatments on cells through imaging assays is an efficient and powerful way of studying cell biology, and requires computational methods for transforming images into quantitative data that highlight phenotypic outcomes. Here, we present an optimized strategy for learning representations of treatment effects from high-throughput imaging data, which follows a causal framework for interpreting results and guiding performance improvements. We use weakly supervised learning (WSL) for modeling associations between images and treatments, and show that it encodes both confounding factors and phenotypic features in the learned representation. To facilitate their separation, we constructed a large training dataset with Cell Painting images from five different sources to maximize experimental diversity, following insights from our causal analysis. Training a WSL model with this dataset successfully improves downstream performance, and produces a reusable convolutional network for image-based profiling, which we call Cell Painting CNN. We conducted a comprehensive evaluation of our strategy on three publicly available Cell Painting datasets, discovering that representations obtained by the Cell Painting CNN can improve performance in downstream analysis by up to 25% with respect to classical features, while also being more computationally efficient.
Most variants in most genes across most organisms have an unknown impact on the function of the corresponding gene. This gap in knowledge is especially acute in cancer, where clinical sequencing of tumors now routinely reveals patient-specific variants whose functional impact on the corresponding genes is unknown, impeding clinical utility. Transcriptional profiling was able to systematically distinguish these variants of unknown significance as impactful vs. neutral in an approach called expression-based variant-impact phenotyping. We profiled a set of lung adenocarcinoma-associated somatic variants using Cell Painting, a morphological profiling assay that captures features of cells based on microscopy using six stains of cell and organelle components. Using deep-learning-extracted features from each cell’s image, we found that cell morphological variant-impact phenotyping (cmVIP) can predict variants’ functional impact and, particularly at the single-cell level, reveals biological insights into variants that can be explored at our public online portal. Given its low cost, convenient implementation, and single-cell resolution, cmVIP therefore seems promising as an avenue for using non-gene-specific assays to systematically assess the impact of variants, including disease-associated alleles, on gene function.