Hit calling#

scmorph leverages single-cell morpholigical profiles for hit calling. This has the advantage that it does not rely on averaging or summarizing a potentially diverse cell population. Instead it proceeds in these steps

Create PCA of single-cell profiles
Per treatment and plate (where applicable), calculate how far treated cells are from control cells.
Calculate the KS statistic and test for each treatment whether its KS is higher than that of a background distribution. This background distribution is created by testing all untreated wells against each other.

In this way, scmorph derives a hit score that is based entirely on single-cell populations.

Let’s see this in practice!

import numpy as np
import pandas as pd
import scmorph as sm
import seaborn as sns
import matplotlib.pyplot as plt


# Convenience function for loading in example data
def load_example_data():
    """# Load in a subset of the Rohban data"""
    # To load in the full dataset, you may use sm.datasets.rohban2017()
    # but note that this will download the dataset from the internet,
    # occupying ~8Gb of disk space.
    adata = sm.datasets.rohban2017_minimal()

    # Fill in information about untreated cells. For more information, check out the
    # batch effect correction tutorial.
    adata.obs["TARGETGENE"] = adata.obs["TARGETGENE"].astype(str).replace("nan", "UNTREATED")

    # Ensure plate names are strings
    adata.obs["Image_Metadata_Plate"] = adata.obs["Image_Metadata_Plate"].astype(str)

    # Remove cells with missing values
    sm.pp.drop_na(adata)

    # standardize the data per plate
    sm.pp.scale_by_batch(
        adata, batch_key="Image_Metadata_Plate", treatment_key="TARGETGENE", control="UNTREATED"
    )

    # Randomise order of samples in adata to help with plotting
    adata = adata[np.random.permutation(adata.obs.index), :]
    return adata

# Convenience function for plotting the KS statistics
# Skip this on first read, but you can use it as reference for plotting
# the results of the KS test discussed later in this tutorial
def plot_ks_statistics(ref_ks, treat_ks):
    """Plot output of `get_ks`"""
    # Combine reference and treatment KS statistics for plotting
    ref_ks["group"] = "DMSO"
    treat_ks["group"] = "CEPBA"
    combined_ks = pd.concat([ref_ks, treat_ks])

    ### Plotting

    # Create the violin plot
    plt.figure(figsize=(4, 4), dpi=250)
    sns.violinplot(x="group", y="ks_stat", data=combined_ks, inner=None)

    # Overlay with swarm plot
    sns.swarmplot(
        x="group",
        y="ks_stat",
        data=combined_ks,
        hue="is_significant_0.05",
        palette={True: "white", False: "black"},
        edgecolor="black",
        linewidth=1,
        size=6,
    )

    # Add labels and title
    plt.xlabel("")
    plt.ylabel("KS Statistic")
    plt.title("KS Statistic Distribution")
    plt.legend(title="Significant\n(5% FDR)", loc="upper left")
    plt.show()

Note that this tutorial uses only a small dataset, comprising five plates and only one treatment (CEBPA). If you would like to see this with a bigger dataset, you can use sm.datasets.rohban2017() in the code above, but note that this will download the dataset from the internet, occupying ~8Gb of disk space.

Let’s see how strongly CEBPA affects single-cell morphological profiles in PCA. For this, we first normalize the data to the controls on each plate. Then, we will look at the single-cell PCA to see if there’s a visual difference between the two groups.

# Step 1:
adata = load_example_data()

# Compute PCA
sm.pp.pca(adata)

We can see the separation of CEBPA from DMSO treated cells in PCA as per below. This is a good indicator the treatment is having a strong effect.

sm.pl.pca(adata, color=["TARGETGENE", "Image_Metadata_Plate"], ncols=1)

../../_images/9ed2ad3a3e0517c78f888a921dbb1c4ec03cfdde3a07d82f037d2c6e20a54596.png

scmorph also allows you to look at the separation of treatment and control in PCA with cumulative distributions. For example, the below plots show us that CEBPA and DMSO treated cells differ in PC1 and 2, but not so much in PC3 and 4.

sm.pl.cumulative_density(
    adata, [0, 1, 2, 3], layer="pca", color="TARGETGENE", xlim=(-25, 25), xlabel="PC value", n_col=2
)

../../_images/9ddafbb306a36017503a8578e1b8f18cd06a6ee2fb479193d00b838b0ea31574.png

<seaborn.axisgrid.FacetGrid at 0x179a86f50>

Note that we are creating the above PCA for visualisation purposes only. When performing hit calling with scmorph, most of these steps happen automatically in. The only exception is per-plate normalization, which must be applied before. We can perform hit calling with the sm.tl.get_ks function, where we can specify all relevant metadata and perform hit calling.

This function returns two things: a background dataframe, where control wells are tested for differences with all the other control wells on the same plate, and a treatment dataframe, where each treatment has been tested against all controls on the same plate.

# Step 1, 2 and 3: Per treatment and plate (where applicable), calculate how
# far treated cells are from control cells.
adata = load_example_data()

ref_ks, treat_ks = sm.tl.get_ks(
    adata,
    treatment_key="TARGETGENE",
    control="UNTREATED",
    well_key="Image_Metadata_Well",
    batch_key="Image_Metadata_Plate",
    control_wells=None,
)
treat_ks

Building negative control p-value distribution
Computing treatment p-values

plate	control	treatment	ks_stat	ks_pval	ks_qval	is_significant_0.05	is_significant_0.1
41744	UNTREATED	CEBPA	0.480886	7.612527e-119	3.806263e-118	True	True
41754	UNTREATED	CEBPA	0.441858	5.484131e-106	9.140219e-106	True	True
41755	UNTREATED	CEBPA	0.354889	6.734590e-71	8.418237e-71	True	True
41756	UNTREATED	CEBPA	0.441083	2.915800e-112	7.289501e-112	True	True
41757	UNTREATED	CEBPA	0.294381	1.015384e-41	1.015384e-41	True	True

CEBPA tested as significantly different from control on each of the five plates, as indicated by the is_significant_0.05 column, which indicates that the treatment was significant at the 5% FDR threshold. To see this visually, we can combine the results in a plot like below.

plot_ks_statistics(ref_ks, treat_ks)

../../_images/6de8cfc79a3155e92a45cb5f28d737ac6dff6207bb33f2e9da1d67830e09d4d7.png

Note that in the above plot, each point in the DMSO violin indicates a one-to-many comparison (one DMSO well vs. all the other DMSO wells on the same plate), whereas the CEBPA points indicate a many-vs-many comparison (all CEBPA wells on a plate vs all DMSO wells on that plate).

We can redo this analysis to have the groups on equal footing: by testing both wells of CEBPA separately against all DMSO wells, for example. Note that this is similar to the the above, and only differs in how samples are pooled.

# Assign treatment and control groups, with CEBPA replicates as separate groups
adata = load_example_data()
adata.obs["group"] = "control"
adata.obs.loc[adata.obs["Image_Metadata_Well"] == "f02", "group"] = "CEBPA_rep1"
adata.obs.loc[adata.obs["Image_Metadata_Well"] == "f04", "group"] = "CEBPA_rep2"

# Repeat above analysis with separate groups for CEBPA replicates
ref_ks, treat_ks = sm.tl.get_ks(
    adata,
    treatment_key="group",
    control="control",
    well_key="Image_Metadata_Well",
    batch_key="Image_Metadata_Plate",
    control_wells=None,
)

Building negative control p-value distribution
Computing treatment p-values

Having separated technical replicates, we can see that each of the CEBPA wells tests significantly different to control. In the below, each CEBPA point indicates one technical replicate of CEBPA.

plot_ks_statistics(ref_ks, treat_ks)

../../_images/38039616a6b994d21e756e580eaae2cdbc425e961a3990ec2ece2680a167f184.png

In this tutorial, we covered how to perform hit calling with scmorph and interpret the results of the KS test. If you want to further interpret the outputs, consider ranking treatments by how often they were considered significant across replicates, for example.