How to Compare Gene Signatures on Polly

Shruti Malavade, Shrushti Joshi
May 17, 2023
Gene signature comparison against available datasets has proven to be a powerful technique for biopharma R&D teams working on drug discovery, biomarker identification and development, and personalized medicine.

This technique allows researchers to analyze the expression levels of large numbers of genes in samples from individuals with a particular condition or disease and to distill them into a gene signature: a conserved cluster of genes whose expression levels are most strongly associated with that condition.

This gene signature can then be used to search public databases of gene expression data for drugs or compounds that revert the disease signature, which indicates a potential therapeutic effect.

However, extracting associated signatures from public databases can be challenging because sources differ in their processing pipelines, syntaxes, schemas, and metadata annotations. We address these challenges through Polly’s RNA-Seq OmixAtlas.

This blog discusses how users can compare signatures using Polly's RNA-Seq OmixAtlas.

What is Polly?

Polly is a biomedical data platform for life sciences R&D, primarily delivering bulk and single-cell RNA-seq data, along with 24 other data types. It delivers 155 TB of FAIR and ML-ready biomedical data from ~30 different public and proprietary sources to customers. Polly’s RNA-Seq OmixAtlas (OA) contains curated RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource gives researchers a solid base for finding datasets with transcriptional profiles similar to their gene sets of interest.

All datasets on Polly are:

  • Consistently Processed
    End-to-end data processing (identifier mapping, QC, normalization, and alignment) is orchestrated through the Kallisto pipeline. Consistent processing across the entire Atlas allows samples to be reliably combined into cohorts and used to develop RNA-Seq signatures.
  • Enriched with Metadata
    All datasets are enriched with over 21 searchable metadata fields (disease, gene, tissue, drug, control, etc.) at the dataset, sample, and feature levels. This means that users can quickly run SQL queries to find datasets with normal-versus-disease comparisons and define cohorts of their choice.

Our Approach

Generate Query Signature(s)

The first step in comparing gene signatures is to create a query with which the genes of interest can be searched against a dataset to identify a closely associated gene cluster. Generating a query signature involves the following steps:

  1. Define your query: Define the biological process or condition of interest for which you want to generate a gene signature.
  2. Choose a dataset: Select a dataset containing gene expression data for samples relevant to your query.
  3. Pre-process the data: Normalize the data, filter out low-expression genes, and correct for batch effects.
  4. Identify differentially expressed genes: Use statistical methods such as limma or DESeq2 to identify genes that are differentially expressed between your query and control groups.
  5. Construct the gene signature: Combine the list of differentially expressed genes into a gene signature using a method such as gene set enrichment analysis (GSEA), principal component analysis (PCA), or a support vector machine (SVM).
  6. Validate the gene signature: Use independent datasets or experimental validation to confirm its relevance to your query.
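
Steps 3 and 4 can be illustrated with a minimal sketch in plain Python. It is not Polly's pipeline: the function names, the counts-per-million (CPM) normalization, the expression filter, and the pseudocount are assumptions chosen for illustration, and the counts are toy values. It only computes log2 fold changes; a real analysis would rely on limma or DESeq2 for the statistics and for batch correction.

```python
import math

def cpm(counts):
    """Scale one sample's raw counts to counts per million."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def mean(values):
    return sum(values) / len(values)

def log2_fold_changes(control, disease, min_cpm=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes passing a simple filter.

    control, disease: lists of {gene: raw_count} dicts, one per sample.
    min_cpm and pseudocount are illustrative defaults, not recommendations.
    """
    ctrl = [cpm(sample) for sample in control]
    dis = [cpm(sample) for sample in disease]
    result = {}
    for gene in ctrl[0]:
        c = mean([sample[gene] for sample in ctrl])
        d = mean([sample[gene] for sample in dis])
        if max(c, d) < min_cpm:  # drop genes that are low in both groups
            continue
        result[gene] = math.log2((d + pseudocount) / (c + pseudocount))
    return result

# Toy counts for four hypothetical genes, two samples per group.
control = [
    {"A": 100, "B": 400, "C": 0, "D": 500},
    {"A": 120, "B": 380, "C": 0, "D": 500},
]
disease = [
    {"A": 400, "B": 100, "C": 0, "D": 500},
    {"A": 380, "B": 120, "C": 0, "D": 500},
]
lfc = log2_fold_changes(control, disease)
for gene, value in sorted(lfc.items()):
    print(gene, round(value, 2))
```

In this toy input, gene A goes up in disease, gene B goes down, gene D is unchanged, and gene C is filtered out because it is not expressed in either group.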

Alternatively, Polly experts can work with your scientists to customize these steps as needed, capture transcriptome profiles, and generate queries (gene signature vectors) that run on Polly’s signature database. The query consists of gene clusters that were significantly differentially expressed in the experiment, along with their Log Fold Change, p-values, and adjusted p-values.

Example of a query: Given an input gene set and its Log Fold Change values, search for all datasets whose differential expression results show the highest cosine similarity to the input.
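
A query of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Polly's implementation: signatures are represented as plain dictionaries of gene to Log Fold Change, genes missing from one signature are treated as 0, and the comparison names and values are made up.

```python
import math

def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two {gene: log fold change} signatures.

    Genes absent from one signature contribute 0 (an assumption).
    """
    dot = sum(v * sig_b.get(g, 0.0) for g, v in sig_a.items())
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_datasets(query, database):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query, sig) for name, sig in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy query and database; values are illustrative only.
query = {"IL6": 2.0, "TNF": 1.5, "ALB": -1.0}
database = {
    "comparison_1": {"IL6": 1.8, "TNF": 1.2, "ALB": -0.9},   # similar profile
    "comparison_2": {"IL6": -2.0, "TNF": -1.5, "ALB": 1.0},  # reversed profile
    "comparison_3": {"GAPDH": 0.5, "ACTB": -0.4},            # no shared genes
}
ranking = rank_datasets(query, database)
for name, score in ranking:
    print(name, round(score, 3))
```

Scores close to 1 indicate a similar profile, scores close to -1 a reversed profile, and scores near 0 little or no relationship.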

Creating a Signature Database Derived from Data on Polly

  • Experiment designs of all RNA-Seq datasets on Polly are evaluated. Datasets containing control and perturbation samples are then extracted from this collection.
  • A differential gene expression analysis is performed on these cohorts of control and perturbed samples.
  • The resulting statistical output includes a distinct ID for each differential comparison, gene names, and their Log Fold Change, p-values, and adjusted p-values.
  • These results and metadata, such as perturbations, controls, disease, drug, or genotype, are indexed on Polly in query-able .gct files.
  • Simultaneously, a database of gene signature vectors is created based on your choice of cut-offs for Log Fold Change and adjusted p-value.
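
The last step, turning differential expression results into signature vectors, might look like the sketch below. The row layout, the comparison IDs, the values, and the default cut-offs (absolute Log Fold Change of at least 1, adjusted p-value of at most 0.05) are all assumptions for illustration; they are not the actual schema of Polly's .gct files.

```python
from collections import defaultdict

# Hypothetical rows from a differential expression table:
# (comparison_id, gene, log_fold_change, p_value, adjusted_p_value)
de_rows = [
    ("cmp_001", "IL6", 2.4, 0.0001, 0.002),
    ("cmp_001", "TNF", 0.3, 0.2000, 0.400),   # fails both cut-offs
    ("cmp_001", "ALB", -1.6, 0.0010, 0.010),
    ("cmp_002", "IL6", -1.2, 0.0020, 0.030),
    ("cmp_002", "MYC", 1.1, 0.0300, 0.200),   # fails the adjusted p-value cut-off
]

def build_signature_db(rows, min_abs_lfc=1.0, max_padj=0.05):
    """Group significant genes into one {gene: log fold change} vector per comparison."""
    db = defaultdict(dict)
    for comparison_id, gene, lfc, _p_value, padj in rows:
        if abs(lfc) >= min_abs_lfc and padj <= max_padj:
            db[comparison_id][gene] = lfc
    return dict(db)

signature_db = build_signature_db(de_rows)
print(signature_db)
```

Looser cut-offs yield longer, noisier signature vectors; stricter ones yield shorter vectors that may share few genes with a query.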

Identifying Datasets Similar to the Query Signature

This signature database can now be queried to identify datasets with similar transcriptional profiles to the Query Signature. For instance, users can run complex SQL queries to identify:

  • Datasets where diseased samples are compared to normal and are similar to the query signature.
  • Datasets where a particular disease is treated with some drug and shows a reverse profile to the query signature.
  • Datasets where genes are differentially expressed in cancer cells compared to normal cells and can be targeted by drugs.
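
To make the first two kinds of query concrete, here is a small sketch that uses Python's built-in sqlite3 module with an in-memory table. The table name, column names, dataset IDs, drug names, and similarity scores are all hypothetical; Polly's real indexes and query interface will differ, so treat this only as an illustration of the filtering logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per differential comparison, with a
# precomputed similarity score against the query signature.
conn.execute(
    """CREATE TABLE comparisons (
           comparison_id TEXT PRIMARY KEY,
           dataset_id    TEXT,
           disease       TEXT,
           drug          TEXT,
           control_type  TEXT,
           similarity    REAL)"""
)
conn.executemany(
    "INSERT INTO comparisons VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("cmp_001", "dataset_a", "psoriasis", None, "normal", 0.82),
        ("cmp_002", "dataset_b", "psoriasis", "drug_x", "vehicle", -0.71),
        ("cmp_003", "dataset_c", "asthma", None, "normal", 0.10),
        ("cmp_004", "dataset_d", "psoriasis", "drug_y", "vehicle", 0.35),
    ],
)

# Disease-versus-normal comparisons that resemble the query signature.
similar = conn.execute(
    "SELECT comparison_id FROM comparisons "
    "WHERE control_type = 'normal' AND similarity >= 0.5 "
    "ORDER BY similarity DESC"
).fetchall()

# Drug-treated comparisons for one disease that reverse the query signature.
reversed_by_drug = conn.execute(
    "SELECT comparison_id, drug FROM comparisons "
    "WHERE disease = ? AND drug IS NOT NULL AND similarity <= -0.5 "
    "ORDER BY similarity ASC",
    ("psoriasis",),
).fetchall()

print(similar)
print(reversed_by_drug)
```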

Ranking the Output

  • Our experts work with your team to identify your preferred method for finding similar or dissimilar transcriptome profiles in the database and ranking them. We can employ standard scores from the literature, such as the Jaccard index, cosine similarity, or the concordance/discordance ratio.
  • We also create a random query gene signature with the same number of differentially expressed genes and use the distribution of its similarity scores as a background distribution. This helps identify significant similarity scores for the given query signature. The results can be downloaded as an Excel file or copied in .txt or .doc format.
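
The idea of a background distribution can be sketched as follows. This is a simplified variation, not the procedure used on Polly: it draws several random signatures (rather than one) of the same size as the query, scores each against a single target signature, and reports an empirical p-value. The gene universe, the Gaussian Log Fold Change values, the number of random draws, and the seed are assumptions, and the Jaccard index on gene sets is included only as a second example score.

```python
import math
import random

def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two {gene: log fold change} dicts (missing genes count as 0)."""
    dot = sum(v * sig_b.get(g, 0.0) for g, v in sig_a.items())
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_index(sig_a, sig_b):
    """Overlap of the two gene sets, ignoring direction and magnitude."""
    genes_a, genes_b = set(sig_a), set(sig_b)
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0

def empirical_p_value(query, target, universe, n_random=200, seed=0):
    """Compare the observed score with scores of random same-size signatures."""
    rng = random.Random(seed)
    observed = cosine_similarity(query, target)
    background = []
    for _ in range(n_random):
        genes = rng.sample(universe, len(query))
        random_sig = {g: rng.gauss(0.0, 1.0) for g in genes}
        background.append(cosine_similarity(random_sig, target))
    n_extreme = sum(1 for score in background if score >= observed)
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (n_extreme + 1) / (n_random + 1)

# Toy data: 100 hypothetical genes, a 10-gene query, and a closely matching target.
universe = [f"GENE{i}" for i in range(100)]
query = {f"GENE{i}": (1.0 if i % 2 == 0 else -1.0) for i in range(10)}
target = {gene: 0.9 * lfc for gene, lfc in query.items()}
target["GENE50"] = 0.5

observed, p_value = empirical_p_value(query, target, universe)
overlap = jaccard_index(query, target)
print(round(observed, 3), round(p_value, 4), round(overlap, 3))
```

A small empirical p-value means that random signatures of the same size rarely score as high as the query does against that target.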

Case Study: Predicting Synergistic Drug Combination for COVID-19

Methodology

We used signature reversal and multivariate gene expression signatures to identify potential drug combinations for COVID-19. To do this, publicly available transcriptomics data from COVID-19 studies and drug signatures from LINCS were compiled, processed, and curated. All datasets were ingested through Polly's proprietary curation pipeline, enriched with ontology-backed metadata, and engineered to a query-able .gct format.
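
As a rough illustration of signature reversal, the sketch below scores each drug by how many shared genes move in the opposite direction to the disease signature. It is not the scoring method used in the case study: the score definition, the drug names, and all Log Fold Change values are assumptions made up for this example.

```python
def reversal_score(disease, drug):
    """(discordant - concordant) / shared genes, ranging from -1 to 1.

    1 means every shared gene moves in the opposite direction (full reversal);
    -1 means every shared gene moves in the same direction.
    """
    shared = [g for g in disease if g in drug and disease[g] != 0 and drug[g] != 0]
    if not shared:
        return 0.0
    discordant = sum(1 for g in shared if (disease[g] > 0) != (drug[g] > 0))
    concordant = len(shared) - discordant
    return (discordant - concordant) / len(shared)

# Toy signatures: {gene: log fold change}. Values are illustrative only.
disease_signature = {"IL6": 2.1, "CXCL10": 1.7, "ACE2": -1.2, "IFIT1": 1.4}
drug_signatures = {
    "drug_a": {"IL6": -1.5, "CXCL10": -0.8, "ACE2": 0.9, "IFIT1": -1.1},
    "drug_b": {"IL6": 1.0, "CXCL10": 0.5, "ACE2": 0.4, "IFIT1": -0.2},
    "drug_c": {"IL6": 0.7, "CXCL10": 1.1, "ACE2": -0.3},
}

scores = {name: reversal_score(disease_signature, sig)
          for name, sig in drug_signatures.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In this toy example, drug_a reverses every shared gene, drug_b reverses half of them, and drug_c mimics the disease signature.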

Results

  • Thirty-seven reference drug candidates based on the similarity between drugs and disease profiles were identified on Polly. Drugs with low drug-disease similarity across most disease profiles were shortlisted.
  • Drug combinations were evaluated based on similarity to reference drugs and disease signature reversal. Twenty-eight combinations with low reference drug similarity and high disease signature reversal were prioritized.
Want to perform gene signature comparisons effectively? Talk to us!

FAQs

What are the key benefits of using Polly for gene target prioritization in patient stratification?

  • Data-Driven Target Selection: Polly integrates multi-omics data to identify key genes relevant to patient subgroups.
  • Accelerated Drug Discovery: The platform prioritizes targets based on disease associations and biomarker relevance, expediting the discovery and validation process.
  • Improved Reproducibility: Harmonized datasets ensure reliable and reproducible findings for target validation.

How does Polly help in training classifier models for patient stratification?

Polly provides pre-processed, harmonized datasets that enable AI/ML model training for patient classification. It supports feature selection, dimensionality reduction, and validation workflows to build robust predictive models for precision medicine applications.

How does Polly assist in defining genetic signatures for different stages of cell differentiation?

Polly analyzes both single-cell and bulk multi-omics data to identify stage-specific genetic markers. By applying machine learning algorithms to detect patterns in gene expression, Polly helps researchers map lineage differentiation and gain insights into disease progression.

What is the process of creating a disease-specific atlas using Polly’s harmonization engine?

Polly builds disease-specific atlases by:

  1. Aggregating multi-omics datasets from curated sources.
  2. Harmonizing data using standardized ontologies.
  3. Annotating datasets with clinical metadata.
  4. Structuring the information into disease-specific cohorts for targeted biomarker and therapeutic research.

How does Polly integrate multiple data types for more reliable patient stratification?

Polly integrates genomics, transcriptomics, proteomics, and clinical data into a unified, multi-dimensional view of patient populations. This helps researchers uncover complex biological relationships and enhances predictive modeling for patient subgroups.

Can Polly handle data quality issues and unstructured data from public repositories?

Yes, Polly automatically processes raw, unstructured data from public sources, addressing missing values, batch effects, and inconsistencies. Its machine learning–driven pipelines filter out noise and standardize data, ensuring higher-quality datasets for seamless analysis.

How does Polly harmonize multi-omic datasets to improve the quality of patient stratification?

Polly's harmonization engine normalizes, processes, and integrates diverse datasets using standard ontologies and metadata frameworks. This ensures consistency, removes batch effects, and enhances the reliability of downstream analyses for precise patient classification.

How does Elucidata's Polly help in overcoming the challenges of patient stratification?

Polly streamlines patient stratification by:

  • Harmonizing and Integrating Multi-omics Data: Polly standardizes data across different sources, making it analysis-ready.
  • Curating High-quality Datasets: The platform ensures datasets are clean, structured, and well-annotated, thereby improving the reliability of downstream analyses.
  • Enabling AI-driven Insights: Polly applies machine learning models to uncover patterns and classify patients effectively.
  • Ensuring Reproducibility and Scalability: Automated pipelines and version-controlled workflows allow for efficient scaling to large datasets while maintaining detailed records of each analysis step, making it easier to reproduce or modify results.

What challenges do researchers face when performing patient stratification using multi-omics data?

Researchers encounter several challenges, including:

  • Data Heterogeneity: Multi-omics data come from different platforms, making integration complex.
  • Data Quality Issues: Public datasets often contain missing values, noise, or inconsistencies.
  • Computational Complexity: Large-scale multi-omics data require significant computational power and expertise to process.
  • Interpretability: Even with powerful analytical methods, extracting clear and meaningful biological insights from high-dimensional data remains a significant challenge.

What is patient stratification, and why is it important for precision medicine?

Patient stratification is the process of categorizing patients into subgroups based on genetic, molecular, or clinical characteristics. This approach is crucial for precision medicine because it identifies which patient populations are most likely to respond to specific treatments, thereby improving therapeutic outcomes and reducing the risk of adverse effects.

What are the key advantages of using Polly for transcriptome profiling and biomarker identification?

Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.

What methodologies does Polly use to identify synergistic drug combinations?

Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.

How does Polly rank datasets similar to a gene signature query?

Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.

What steps are involved in creating a query gene signature on Polly?

Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, construct the gene signature, and validate it. Polly’s platform streamlines this process with expert support and ML-ready datasets.

How does Polly's RNA-Seq Atlas simplify gene signature analysis?

Polly's RNA-Seq Atlas addresses challenges in extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers to find datasets with similar transcriptional profiles to their gene sets of interest.

What is gene signature comparison, and why is it important in drug discovery?

Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.
