BDI-Viz Demo

Problem

Fenyo Lab Use Case 1

In biomedical research, datasets from diverse studies often need to be integrated into a unified schema, such as the Genomic Data Commons (GDC). However, schema matching is time-consuming, error-prone, and requires expert domain knowledge, especially given the complexity and scale of biomedical datasets.

Challenges

  • Manual schema matching processes, commonly used by researchers, are slow and struggle to scale with large datasets.

  • Automatic methods often lack the required precision, producing errors and inconsistencies that demand expert intervention. Some biomedical schemas are so similar that their correct correspondences are nearly impossible to infer, even for state-of-the-art matching methods.

Prerequisites

Installation

Before starting this demo, install the bdi-viz package from PyPI by running the following command:

pip install bdi-viz
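
To confirm that the install succeeded before running the cells below, the standard library can report the installed version. This is a minimal sketch; `installed_version` is a hypothetical helper written for this demo, not part of bdi-viz.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "bdi-viz" is the PyPI distribution name; the module imported below is "bdiviz".
print(installed_version("bdi-viz"))
```

A result of `None` means the package is not visible to the current Python environment, which usually points to pip having run against a different interpreter.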

[1]:
import pandas as pd

from bdiviz import BDISchemaMatchingHeatMap
from bdikit import match_schema
from bdikit.mapping_algorithms.column_mapping.algorithms import TwoPhaseSchemaMatcher

Access to Source and Target Data

GDC Metadata Validation Services: https://docs.gdc.cancer.gov/Data_Dictionary/gdcmvs/

Data Dictionary Viewer: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/


Dataset: link
Paper: link

Proteogenomic Characterization of Endometrial Carcinoma

Dou et al.

We undertook a comprehensive proteogenomic characterization of 95 prospectively collected endometrial carcinomas, comprising 83 endometrioid and 12 serous tumors. This analysis revealed possible new consequences of perturbations to the p53 and Wnt/β-catenin pathways, identified a potential role for circRNAs in the epithelial-mesenchymal transition, and provided new information about proteomic markers of clinical and genomic tumor subgroups, including relationships to known druggable pathways. An extensive genome-wide acetylation survey yielded insights into regulatory mechanisms linking Wnt signaling and histone acetylation. We also characterized aspects of the tumor immune landscape, including immunogenic alterations, neoantigens, common cancer/testis antigens, and the immune microenvironment, all of which can inform immunotherapy decisions. Collectively, our multi-omic analyses provide a valuable resource for researchers and clinicians, identify new molecular associations of potential mechanistic significance in the development of endometrial cancers, and suggest novel approaches for identifying potential therapeutic targets.

[2]:
source = pd.read_csv("dou_bdiviz.csv")
target = "gdc"

source
[2]:
| | Country | BMI | Gender | Ethnicity | Race | Tumor_Focality | FIGO_stage | Age | Histologic_Grade_FIGO | Path_Stage_Primary_Tumor-pT | Path_Stage_Reg_Lymph_Nodes-pN | Clin_Stage_Dist_Mets-cM | Path_Stage_Dist_Mets-pM | tumor_Stage-Pathological | Histologic_type | Tumor_Site | Tumor_Size_cm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 38.88 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 64.0 | FIGO grade 1 | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Endometrioid | Anterior endometrium | 2.9 |
| 1 | United States | 39.76 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 58.0 | FIGO grade 1 | pT1a (FIGO IA) | pNX | cM0 | Staging Incomplete | Stage IV | Endometrioid | Posterior endometrium | 3.5 |
| 2 | United States | 51.19 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 50.0 | FIGO grade 2 | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Endometrioid | Other, specify | 4.5 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Carcinosarcoma | NaN | NaN |
| 4 | United States | 32.69 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 75.0 | FIGO grade 2 | pT1a (FIGO IA) | pNX | cM0 | No pathologic evidence of distant metastasis | Stage I | Endometrioid | Other, specify | 3.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | Ukraine | 29.40 | Female | NaN | NaN | Unifocal | IA | 75.0 | FIGO grade 3 | pT1a (FIGO IA) | pNX | cM0 | Staging Incomplete | Stage I | Endometrioid | Other, specify | 4.2 |
| 100 | Ukraine | 35.42 | Female | NaN | NaN | Unifocal | II | 74.0 | FIGO grade 2 | pT2 (FIGO II) | pN0 | cM0 | Staging Incomplete | Stage II | Endometrioid | Other, specify | 1.5 |
| 101 | United States | 24.32 | Female | Not-Hispanic or Latino | Black or African American | Unifocal | II | 85.0 | NaN | pT2 (FIGO II) | pN0 | Staging Incomplete | Staging Incomplete | Stage II | Serous | Other, specify | 3.8 |
| 102 | Ukraine | 34.06 | Female | NaN | NaN | Unifocal | IA | 70.0 | NaN | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Serous | Other, specify | 5.0 |
| 103 | Ukraine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Serous | NaN | NaN |

104 rows × 17 columns
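
Some rows in the preview are almost entirely empty (rows 3 and 103, for example), so it can help to check how sparse each column is before matching. The following is a minimal sketch with pandas. It uses a small hand-built frame, `sample` (a hypothetical stand-in, not the real dou_bdiviz.csv), with three columns and four rows copied from the preview.

```python
import pandas as pd

# Hypothetical stand-in for `source`: rows 0, 3, 99 and 103 of the preview,
# restricted to three columns to keep the example short.
sample = pd.DataFrame(
    {
        "Country": ["United States", None, "Ukraine", "Ukraine"],
        "BMI": [38.88, None, 29.40, None],
        "Histologic_type": ["Endometrioid", "Carcinosarcoma", "Endometrioid", "Serous"],
    }
)

# Fraction of missing values in each column.
missing_per_column = sample.isna().mean()

# Rows in which at most one column carries a value.
sparse_rows = sample[sample.notna().sum(axis=1) <= 1]

print(missing_per_column)
print(sparse_rows.index.tolist())
```

Running the same two expressions on `source` instead of `sample` would give the per-column sparsity of the full dataset.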

BDI-Viz Heatmap

  • Interactive Heatmap: Click on a heatmap cell to select candidates.

  • Candidate Manipulation:

    • Use the Accept Match, Reject Match, and Discard Column buttons to manage matching candidates.

    • Undo or redo actions with the Undo and Redo buttons.

  • Value Comparisons: Explore similar candidate values in detail.

  • Detailed Analysis: Examine value distributions and schema descriptions for selected attributes.

  • Filtering Options:

    • Filter by candidate data type using the Candidate Type Selector.

    • Identify similar source columns with the Similar Sources Slider (based on embeddings).

    • Adjust threshold values with the Candidate Threshold Slider to refine matches.

Note: These results are generated by a fine-tuned language model. BDI-Viz can also be used with other matching methods, as long as they provide the top-k candidates for each source column.
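
To make the combination of a score threshold and a top-k limit concrete, the sketch below filters a small candidate list the way the Candidate Threshold Slider and the `top_k` setting suggest. It is an illustration only and does not reproduce BDI-Viz internals: the `filter_candidates` helper and the scores are hypothetical, and only the column names come from this demo.

```python
import pandas as pd

# Hypothetical candidate scores for one source column. The target names appear
# elsewhere in this demo; the scores are made up for illustration.
candidates = pd.DataFrame(
    {
        "source": ["Age"] * 4,
        "target": ["weight", "age_at_diagnosis", "age_at_index", "tumor_depth"],
        "score": [0.62, 0.58, 0.55, 0.20],
    }
)


def filter_candidates(df, threshold, top_k):
    """Keep candidates scoring at least `threshold`, then the best `top_k` per source."""
    kept = df[df["score"] >= threshold]
    kept = kept.sort_values(["source", "score"], ascending=[True, False])
    return kept.groupby("source", sort=False).head(top_k).reset_index(drop=True)


shortlist = filter_candidates(candidates, threshold=0.5, top_k=2)
print(shortlist)
```

With these made-up scores the correct targets rank below `weight`, which is the situation where reviewing more than the top-1 candidate in the heatmap pays off.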

[3]:
heatmap_manager = BDISchemaMatchingHeatMap(
    source,
    target=target,
    top_k=20
)

heatmap_manager.plot_heatmap()
[3]:

Passing Curated Results to BDI-Kit

BDI-Viz works as an extension of BDI-Kit. Passing the results from BDI-Viz to BDI-Kit preserves every manually updated candidate for further processing.

If you notice that some columns are still not properly aligned, you can return to the BDI-Viz Heatmap and continue refining the matching candidates as needed.

[4]:
two_phase_viz = TwoPhaseSchemaMatcher(top_k_matcher=heatmap_manager)
column_mappings = match_schema(source, target=target, method=two_phase_viz)
column_mappings
[4]:
source target
0 Country country_of_birth
1 BMI demographics
2 Gender gender
3 Ethnicity ethnicity
4 Race race
5 Tumor_Focality tumor_focality
6 FIGO_stage irs_stage
7 Age weight
8 Histologic_Grade_FIGO histologic_progression_type
9 Path_Stage_Primary_Tumor-pT margin_distance
10 Path_Stage_Reg_Lymph_Nodes-pN peripancreatic_lymph_nodes_tested
11 Clin_Stage_Dist_Mets-cM inrg_stage
12 Path_Stage_Dist_Mets-pM masaoka_stage
13 tumor_Stage-Pathological ajcc_pathologic_t
14 Histologic_type history_of_tumor_type
15 Tumor_Site tumor_shape
16 Tumor_Size_cm tumor_depth

Groundtruth Lookups

| Source Column | Target Column (GDC Schema) | Matching Type | Notes |
|---|---|---|---|
| Country | country_of_birth | Exact match in values, semantically similar names | |
| BMI | bmi | Exact match in name | |
| Gender | gender | Exact match in name and values | |
| Ethnicity | ethnicity | Exact match in name and values | |
| Race | race | Exact match in name and values | |
| Tumor_Focality | tumor_focality | Exact match in name and values | |
| FIGO_stage | figo_stage | Exact match in name, semantically similar values | (e.g., “IA” to “Stage IA”) |
| Age | age_at_diagnosis / age_at_index | Semantically similar names | “weight” incorrectly returned as top result by matcher; unit mismatch (days vs. years) |
| Histologic_Grade_FIGO | tumor_grade | Semantically similar names, values | (e.g., “FIGO grade 1” to “G1”) |
| Path_Stage_Primary_Tumor-pT | ajcc_pathologic_t | Semantically similar names, values | (e.g., “pT1 (FIGO I)” to “T1”) |
| Path_Stage_Reg_Lymph_Nodes-pN | ajcc_pathologic_n | Semantically similar names, values | (e.g., “pNX” to “NX”) |
| Clin_Stage_Dist_Mets-cM | ajcc_clinical_m | Semantically similar names, values | (e.g., “cM0” to “cM0 (i+)”) |
| Path_Stage_Dist_Mets-pM | ajcc_pathologic_m | Semantically similar names, values | (e.g., “pM1” to “M1”) |
| tumor_Stage-Pathological | ajcc_pathologic_stage | Semantically similar names, exact values | (e.g., “Stage I” to “Stage I”); requires knowledge of use-case-specific standards |
| Histologic_type | primary_diagnosis | Unrelated names, semantically similar values | (e.g., “Serous” to “Serous adenocarcinofibroma”) |
| Tumor_Site | site_of_resection_or_biopsy | Unrelated names, semantically similar values | (e.g., “Posterior endometrium” to “Endometrium”) |
| Tumor_Size_cm | tumor_largest_dimension_diameter | Semantically similar names | |
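
The ground truth makes it possible to measure how far the matcher output is from a finished mapping. The sketch below is a plain-Python check, not part of BDI-Viz or BDI-Kit: `predicted` copies the mapping printed in cell [4], and `ground_truth` copies the table above, with either listed target accepted for Age.

```python
# Matcher output, copied from the column_mappings printed in cell [4].
predicted = {
    "Country": "country_of_birth",
    "BMI": "demographics",
    "Gender": "gender",
    "Ethnicity": "ethnicity",
    "Race": "race",
    "Tumor_Focality": "tumor_focality",
    "FIGO_stage": "irs_stage",
    "Age": "weight",
    "Histologic_Grade_FIGO": "histologic_progression_type",
    "Path_Stage_Primary_Tumor-pT": "margin_distance",
    "Path_Stage_Reg_Lymph_Nodes-pN": "peripancreatic_lymph_nodes_tested",
    "Clin_Stage_Dist_Mets-cM": "inrg_stage",
    "Path_Stage_Dist_Mets-pM": "masaoka_stage",
    "tumor_Stage-Pathological": "ajcc_pathologic_t",
    "Histologic_type": "history_of_tumor_type",
    "Tumor_Site": "tumor_shape",
    "Tumor_Size_cm": "tumor_depth",
}

# Acceptable targets per source column, copied from the ground-truth table.
ground_truth = {
    "Country": {"country_of_birth"},
    "BMI": {"bmi"},
    "Gender": {"gender"},
    "Ethnicity": {"ethnicity"},
    "Race": {"race"},
    "Tumor_Focality": {"tumor_focality"},
    "FIGO_stage": {"figo_stage"},
    "Age": {"age_at_diagnosis", "age_at_index"},
    "Histologic_Grade_FIGO": {"tumor_grade"},
    "Path_Stage_Primary_Tumor-pT": {"ajcc_pathologic_t"},
    "Path_Stage_Reg_Lymph_Nodes-pN": {"ajcc_pathologic_n"},
    "Clin_Stage_Dist_Mets-cM": {"ajcc_clinical_m"},
    "Path_Stage_Dist_Mets-pM": {"ajcc_pathologic_m"},
    "tumor_Stage-Pathological": {"ajcc_pathologic_stage"},
    "Histologic_type": {"primary_diagnosis"},
    "Tumor_Site": {"site_of_resection_or_biopsy"},
    "Tumor_Size_cm": {"tumor_largest_dimension_diameter"},
}

# Source columns whose predicted target is not an accepted ground-truth target.
mismatches = {
    src: tgt for src, tgt in predicted.items() if tgt not in ground_truth[src]
}
correct = len(predicted) - len(mismatches)

print(f"{correct} of {len(predicted)} columns matched correctly")
for src, tgt in mismatches.items():
    print(f"  {src}: got {tgt}, expected one of {sorted(ground_truth[src])}")
```

The columns listed as mismatches are the ones to revisit in the heatmap before passing the curated result back to BDI-Kit.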