BDI-Viz Demo
Problem
In biomedical research, datasets from diverse studies often need to be integrated into a unified schema, such as that of the Genomic Data Commons (GDC). However, schema matching is time-consuming and error-prone, and it requires expert domain knowledge, especially given the complexity and scale of biomedical datasets.
Challenges
Manual schema matching processes, commonly used by researchers, are slow and struggle to scale with large datasets.
Automatic methods often lack the required precision, producing errors and inconsistencies that demand expert intervention. Some biomedical schemas are so similar that correct matches are nearly impossible to infer, even for state-of-the-art matching methods.
Prerequisites
Installation
Before starting this demo, ensure that the bdi-viz package is installed from PyPI. You can do this by running the following command:
pip install bdi-viz
[1]:
import pandas as pd
from bdiviz import BDISchemaMatchingHeatMap
from bdikit import match_schema
from bdikit.mapping_algorithms.column_mapping.algorithms import TwoPhaseSchemaMatcher
Access to Source and Target Data
GDC Metadata Validation Services: https://docs.gdc.cancer.gov/Data_Dictionary/gdcmvs/
Data Dictionary Viewer: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/
Proteogenomic Characterization of Endometrial Carcinoma
Dou et al.
We undertook a comprehensive proteogenomic characterization of 95 prospectively collected endometrial carcinomas, comprising 83 endometrioid and 12 serous tumors. This analysis revealed possible new consequences of perturbations to the p53 and Wnt/β-catenin pathways, identified a potential role for circRNAs in the epithelial-mesenchymal transition, and provided new information about proteomic markers of clinical and genomic tumor subgroups, including relationships to known druggable pathways. An extensive genome-wide acetylation survey yielded insights into regulatory mechanisms linking Wnt signaling and histone acetylation. We also characterized aspects of the tumor immune landscape, including immunogenic alterations, neoantigens, common cancer/testis antigens, and the immune microenvironment, all of which can inform immunotherapy decisions. Collectively, our multi-omic analyses provide a valuable resource for researchers and clinicians, identify new molecular associations of potential mechanistic significance in the development of endometrial cancers, and suggest novel approaches for identifying potential therapeutic targets.
[2]:
source = pd.read_csv("dou_bdiviz.csv")
target = "gdc"
source
[2]:
| | Country | BMI | Gender | Ethnicity | Race | Tumor_Focality | FIGO_stage | Age | Histologic_Grade_FIGO | Path_Stage_Primary_Tumor-pT | Path_Stage_Reg_Lymph_Nodes-pN | Clin_Stage_Dist_Mets-cM | Path_Stage_Dist_Mets-pM | tumor_Stage-Pathological | Histologic_type | Tumor_Site | Tumor_Size_cm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 38.88 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 64.0 | FIGO grade 1 | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Endometrioid | Anterior endometrium | 2.9 |
| 1 | United States | 39.76 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 58.0 | FIGO grade 1 | pT1a (FIGO IA) | pNX | cM0 | Staging Incomplete | Stage IV | Endometrioid | Posterior endometrium | 3.5 |
| 2 | United States | 51.19 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 50.0 | FIGO grade 2 | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Endometrioid | Other, specify | 4.5 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Carcinosarcoma | NaN | NaN |
| 4 | United States | 32.69 | Female | Not-Hispanic or Latino | White | Unifocal | IA | 75.0 | FIGO grade 2 | pT1a (FIGO IA) | pNX | cM0 | No pathologic evidence of distant metastasis | Stage I | Endometrioid | Other, specify | 3.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | Ukraine | 29.40 | Female | NaN | NaN | Unifocal | IA | 75.0 | FIGO grade 3 | pT1a (FIGO IA) | pNX | cM0 | Staging Incomplete | Stage I | Endometrioid | Other, specify | 4.2 |
| 100 | Ukraine | 35.42 | Female | NaN | NaN | Unifocal | II | 74.0 | FIGO grade 2 | pT2 (FIGO II) | pN0 | cM0 | Staging Incomplete | Stage II | Endometrioid | Other, specify | 1.5 |
| 101 | United States | 24.32 | Female | Not-Hispanic or Latino | Black or African American | Unifocal | II | 85.0 | NaN | pT2 (FIGO II) | pN0 | Staging Incomplete | Staging Incomplete | Stage II | Serous | Other, specify | 3.8 |
| 102 | Ukraine | 34.06 | Female | NaN | NaN | Unifocal | IA | 70.0 | NaN | pT1a (FIGO IA) | pN0 | cM0 | Staging Incomplete | Stage I | Serous | Other, specify | 5.0 |
| 103 | Ukraine | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Serous | NaN | NaN |
104 rows × 17 columns
BDI-Viz Heatmap
Interactive Heatmap: Click on a heatmap cell to select candidates.
Candidate Manipulation:
Use the Accept Match, Reject Match, and Discard Column buttons to manage matching candidates.
Undo or redo actions with the Undo and Redo buttons.
Value Comparisons: Explore similar candidate values in detail.
Detailed Analysis: Examine value distributions and schema descriptions for selected attributes.
Filtering Options:
Filter by candidate data type using the Candidate Type Selector.
Identify similar source columns with the Similar Sources Slider (based on embeddings).
Adjust threshold values with the Candidate Threshold Slider to refine matches.
Note: This result is generated by a fine-tuned language model. BDI-Viz can also be used with other matching methods, as long as they provide the top-k candidates.
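To make the idea of top-k candidates and the threshold filter concrete, here is a minimal sketch of a stand-in matcher. It is not the fine-tuned model used in this demo and not part of the BDI-Viz API: the function `top_k_candidates` is hypothetical, and it scores column names with plain string similarity from Python's standard library. The target column names are taken from tables in this notebook.

```python
from difflib import SequenceMatcher


def top_k_candidates(source_columns, target_columns, k=3, threshold=0.0):
    """Rank target columns for each source column by name similarity.

    Hypothetical helper for illustration only; BDI-Viz itself relies on
    a fine-tuned language model rather than string similarity.
    """
    candidates = {}
    for src in source_columns:
        scored = [
            (tgt, SequenceMatcher(None, src.lower(), tgt.lower()).ratio())
            for tgt in target_columns
        ]
        # Drop candidates below the threshold, as a threshold slider would.
        scored = [(tgt, score) for tgt, score in scored if score >= threshold]
        # Keep only the k best-scoring candidates.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        candidates[src] = scored[:k]
    return candidates


source_columns = ["Gender", "FIGO_stage", "Tumor_Size_cm"]
target_columns = [
    "gender",
    "figo_stage",
    "irs_stage",
    "tumor_depth",
    "tumor_largest_dimension_diameter",
]

result = top_k_candidates(source_columns, target_columns, k=2, threshold=0.3)
for src, ranked in result.items():
    print(src, ranked)
```

Name similarity alone handles the easy cases (`Gender`, `FIGO_stage`) but gives little help for columns whose names are unrelated to their targets, which is where value-based and embedding-based signals become necessary.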
[3]:
heatmap_manager = BDISchemaMatchingHeatMap(
    source,
    target=target,
    top_k=20,
)
heatmap_manager.plot_heatmap()
[3]:
Passing Curated Results to BDI-Kit
BDI-Viz seamlessly integrates as an extension of BDI-Kit. By passing results from BDI-Viz to BDI-Kit, you can retain and utilize all manually updated candidates for further processing.
If you notice that some columns are still not properly aligned, you can return to the BDI-Viz Heatmap and continue refining the matching candidates as needed.
[4]:
two_phase_viz = TwoPhaseSchemaMatcher(top_k_matcher=heatmap_manager)
column_mappings = match_schema(source, target=target, method=two_phase_viz)
column_mappings
[4]:
| | source | target |
|---|---|---|
| 0 | Country | country_of_birth |
| 1 | BMI | demographics |
| 2 | Gender | gender |
| 3 | Ethnicity | ethnicity |
| 4 | Race | race |
| 5 | Tumor_Focality | tumor_focality |
| 6 | FIGO_stage | irs_stage |
| 7 | Age | weight |
| 8 | Histologic_Grade_FIGO | histologic_progression_type |
| 9 | Path_Stage_Primary_Tumor-pT | margin_distance |
| 10 | Path_Stage_Reg_Lymph_Nodes-pN | peripancreatic_lymph_nodes_tested |
| 11 | Clin_Stage_Dist_Mets-cM | inrg_stage |
| 12 | Path_Stage_Dist_Mets-pM | masaoka_stage |
| 13 | tumor_Stage-Pathological | ajcc_pathologic_t |
| 14 | Histologic_type | history_of_tumor_type |
| 15 | Tumor_Site | tumor_shape |
| 16 | Tumor_Size_cm | tumor_depth |
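To gauge how much curation the result above still needs, one could score it against the curated mappings listed in the Groundtruth Lookups section. The sketch below hard-codes both sides as plain dictionaries so that it runs on its own. The names `predicted`, `ground_truth`, and `mismatches` are illustrative and not part of BDI-Kit; in the notebook, `predicted` could presumably be built from the DataFrame with `dict(zip(column_mappings["source"], column_mappings["target"]))`.

```python
# Matcher output, copied from the table above.
predicted = {
    "Country": "country_of_birth",
    "BMI": "demographics",
    "Gender": "gender",
    "Ethnicity": "ethnicity",
    "Race": "race",
    "Tumor_Focality": "tumor_focality",
    "FIGO_stage": "irs_stage",
    "Age": "weight",
    "Histologic_Grade_FIGO": "histologic_progression_type",
    "Path_Stage_Primary_Tumor-pT": "margin_distance",
    "Path_Stage_Reg_Lymph_Nodes-pN": "peripancreatic_lymph_nodes_tested",
    "Clin_Stage_Dist_Mets-cM": "inrg_stage",
    "Path_Stage_Dist_Mets-pM": "masaoka_stage",
    "tumor_Stage-Pathological": "ajcc_pathologic_t",
    "Histologic_type": "history_of_tumor_type",
    "Tumor_Site": "tumor_shape",
    "Tumor_Size_cm": "tumor_depth",
}

# Curated targets; a set allows more than one acceptable answer (see Age).
ground_truth = {
    "Country": {"country_of_birth"},
    "BMI": {"bmi"},
    "Gender": {"gender"},
    "Ethnicity": {"ethnicity"},
    "Race": {"race"},
    "Tumor_Focality": {"tumor_focality"},
    "FIGO_stage": {"figo_stage"},
    "Age": {"age_at_diagnosis", "age_at_index"},
    "Histologic_Grade_FIGO": {"tumor_grade"},
    "Path_Stage_Primary_Tumor-pT": {"ajcc_pathologic_t"},
    "Path_Stage_Reg_Lymph_Nodes-pN": {"ajcc_pathologic_n"},
    "Clin_Stage_Dist_Mets-cM": {"ajcc_clinical_m"},
    "Path_Stage_Dist_Mets-pM": {"ajcc_pathologic_m"},
    "tumor_Stage-Pathological": {"ajcc_pathologic_stage"},
    "Histologic_type": {"primary_diagnosis"},
    "Tumor_Site": {"site_of_resection_or_biopsy"},
    "Tumor_Size_cm": {"tumor_largest_dimension_diameter"},
}

# Columns whose predicted target is not among the accepted targets.
mismatches = {
    src: tgt for src, tgt in predicted.items() if tgt not in ground_truth[src]
}
correct = len(predicted) - len(mismatches)
print(f"{correct} of {len(predicted)} columns matched")  # 5 of 17 columns matched
for src, tgt in mismatches.items():
    print(f"  {src}: got {tgt}, expected one of {sorted(ground_truth[src])}")
```

The columns listed as mismatches are the ones worth revisiting in the heatmap.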
Groundtruth Lookups
| Source Column | Target Column (GDC Schema) | Matching Type | Notes |
|---|---|---|---|
| Country | country_of_birth | Exact match in values, semantically similar names | |
| BMI | bmi | Exact match in name | |
| Gender | gender | Exact match in name and values | |
| Ethnicity | ethnicity | Exact match in name and values | |
| Race | race | Exact match in name and values | |
| Tumor_Focality | tumor_focality | Exact match in name and values | |
| FIGO_stage | figo_stage | Exact match in name, semantically similar values | (e.g., “IA” to “Stage IA”) |
| Age | age_at_diagnosis / age_at_index | Semantically similar names | “weight” incorrectly returned as top result by matcher; unit mismatch (days vs. years) |
| Histologic_Grade_FIGO | tumor_grade | Semantically similar names, values | (e.g., “FIGO grade 1” to “G1”) |
| Path_Stage_Primary_Tumor-pT | ajcc_pathologic_t | Semantically similar names, values | (e.g., “pT1 (FIGO I)” to “T1”) |
| Path_Stage_Reg_Lymph_Nodes-pN | ajcc_pathologic_n | Semantically similar names, values | (e.g., “pNX” to “NX”) |
| Clin_Stage_Dist_Mets-cM | ajcc_clinical_m | Semantically similar names, values | (e.g., “cM0” to “cM0 (i+)”) |
| Path_Stage_Dist_Mets-pM | ajcc_pathologic_m | Semantically similar names, values | (e.g., “pM1” to “M1”) |
| tumor_Stage-Pathological | ajcc_pathologic_stage | Semantically similar names, exact values | (e.g., “Stage I” to “Stage I”); requires knowledge of use-case-specific standards |
| Histologic_type | primary_diagnosis | Unrelated names, semantically similar values | (e.g., “Serous” to “Serous adenocarcinofibroma”) |
| Tumor_Site | site_of_resection_or_biopsy | Unrelated names, semantically similar values | (e.g., “Posterior endometrium” to “Endometrium”) |
| Tumor_Size_cm | tumor_largest_dimension_diameter | Semantically similar names | |
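Several notes in the ground truth describe value-level differences that remain after the columns are matched. As a rough illustration, the sketch below applies a few of those conversions by hand. All function names are hypothetical and not part of BDI-Kit or BDI-Viz. The age conversion assumes the source reports years and the target expects days, with 365.25 days per year as an approximation. The other rules cover only the example pairs listed in the table.

```python
import math
import re

DAYS_PER_YEAR = 365.25  # assumption: approximate conversion factor


def age_years_to_days(age_years):
    """Convert an age in years to whole days; missing values stay missing."""
    if age_years is None or math.isnan(age_years):
        return None
    return round(age_years * DAYS_PER_YEAR)


def figo_stage_to_target(value):
    """Map a bare FIGO stage such as "IA" to "Stage IA"."""
    return f"Stage {value}"


def figo_grade_to_target(value):
    """Map "FIGO grade 1" to "G1"; return None for unrecognized values."""
    match = re.fullmatch(r"FIGO grade (\d)", value)
    return f"G{match.group(1)}" if match else None


def pathologic_tnm_to_target(value):
    """Map "pT1a (FIGO IA)" to "T1a" and "pNX" to "NX".

    Drops the leading "p" and any trailing parenthetical; values that do
    not look like a pathologic T, N, or M category return None.
    """
    match = re.fullmatch(r"p([TNM]\S*)(?:\s*\(.*\))?", value)
    return match.group(1) if match else None


print(age_years_to_days(64.0))
print(figo_stage_to_target("IA"))
print(figo_grade_to_target("FIGO grade 1"))
print(pathologic_tnm_to_target("pT1a (FIGO IA)"))
print(pathologic_tnm_to_target("Staging Incomplete"))
```

Values that return `None`, such as “Staging Incomplete”, still need a decision from a domain expert about which target value, if any, applies.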