Enforce pre-defined validation constraints¶
In a previous guide, you defined validation constraints ad-hoc when initializing Curator
objects.
Often, you want to enforce a pre-defined set of validation constraints, like, e.g., the CELLxGENE curator (Curate AnnData based on the CELLxGENE schema).
This guide shows how to subclass Curator
to enforce pre-defined constraints.
Define a custom curator¶
Consider the example of electronic health records (EHR). We want to ensure that
every record has the fields
disease
,phenotype
,developmental_stage
, andage
values for these fields map against specific versions of pre-defined ontologies
The following implementation achieves the goal by subclassing DataFrameCurator
.
import bionty as bt
import pandas as pd
from lamindb.core import DataFrameCurator, Record, logger
from lamindb.core.types import UPathStr, FieldAttr
__version__ = "0.1.0"
# Curate these columns against the specified fields
DEFAULT_CATEGORICALS = {
"disease": bt.Disease.name,
"phenotype": bt.Phenotype.name,
"developmental_stage": bt.DevelopmentalStage.name,
}
# If columns or values are missing, we substitute with these defaults
DEFAULT_VALUES = {
"disease": "normal",
"development_stage": "unknown",
"phenotype": "unknown",
}
# Map values onto the following ontology versions
DEFAULT_SOURCES = {
"disease": bt.Source.get(
entity="bionty.Disease", name="mondo", version="2023-04-04"
),
"developmental_stage": bt.Source.get(
entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10"
),
"phenotype": bt.Source.get(
entity="bionty.Phenotype", name="hp", version="2023-06-17", organism="human"
),
}
class EHRCurator(DataFrameCurator):
"""Custom curation flow for electronic health record data."""
def __init__(
self,
data: pd.DataFrame | UPathStr,
categoricals: dict[str, FieldAttr] = DEFAULT_CATEGORICALS,
*,
defaults: dict[str, str] = None,
sources: dict[str, Record] = DEFAULT_SOURCES,
organism="human",
):
self.data = data
if defaults:
for col, default in defaults.items():
if col not in self.data.columns:
self.data[col] = default
else:
self.data[col].fillna(default, inplace=True)
super().__init__(
df=self.data, categoricals=categoricals, sources=sources, organism=organism
)
def validate(self, organism: str | None = None) -> bool:
"""Validates the internal EHR standard."""
missing_columns = {"disease", "phenotype", "developmental_stage", "age"} - set(
self.data.columns
)
if missing_columns:
logger.error(
f"Columns {', '.join(map(repr, missing_columns))} are missing but required."
)
return False
return DataFrameCurator.validate(self, organism)
Use the custom curator¶
!lamin init --storage ./subclass-curator --schema bionty
→ connected lamindb: testuser1/subclass-curator
import lamindb as ln
import bionty as bt
import pandas as pd
from ehrcurator import EHRCurator
ln.track("2XEr2IA4n1w40000")
→ connected lamindb: testuser1/subclass-curator
→ notebook imports: bionty==0.52.0 ehrcurator lamindb==0.76.14 pandas==2.2.3
→ created Transform('2XEr2IA4'), started new Run('TmYZQ4V4') at 2024-10-21 07:29:35 UTC
# create example DataFrame that has all mandatory columns but one ('patient_age') is wrongly named
data = {
'disease': ['Alzheimer disease', 'Diabetes mellitus', 'Breast cancer', 'Hypertension', 'Asthma'],
'phenotype': ['Cognitive decline', 'Hyperglycemia', 'Tumor growth', 'Increased blood pressure', 'Airway inflammation'],
'developmental_stage': ['Adult', 'Adult', 'Adult', 'Adult', 'Child'],
'patient_age': [70, 55, 60, 65, 12],
}
df = pd.DataFrame(data)
df
Show code cell output
disease | phenotype | developmental_stage | patient_age | |
---|---|---|---|---|
0 | Alzheimer disease | Cognitive decline | Adult | 70 |
1 | Diabetes mellitus | Hyperglycemia | Adult | 55 |
2 | Breast cancer | Tumor growth | Adult | 60 |
3 | Hypertension | Increased blood pressure | Adult | 65 |
4 | Asthma | Airway inflammation | Child | 12 |
ehrcurator = EHRCurator(df)
ehrcurator.validate()
Show code cell output
✓ added 3 records with Feature.name for columns: 'disease', 'phenotype', 'developmental_stage'
✗ Columns 'age' are missing but required.
False
# Fix the name of wrongly spelled column
df.columns = df.columns.str.replace("patient_age", "age")
ehrcurator.validate()
Show code cell output
• saving validated records of 'disease'
✓ added 4 records from public with Disease.name for disease: 'Alzheimer disease', 'diabetes mellitus', 'breast cancer', 'asthma'
• saving validated records of 'phenotype'
✓ added 3 records from public with Phenotype.name for phenotype: 'Mental deterioration', 'Hyperglycemia', 'Increased blood pressure'
• saving validated records of 'developmental_stage'
• mapping disease on Disease.name
! 1 terms is not validated: 'Hypertension'
→ fix typos, remove non-existent values, or save terms via .add_new_from('disease')
• mapping phenotype on Phenotype.name
! 2 terms are not validated: 'Tumor growth', 'Airway inflammation'
→ fix typos, remove non-existent values, or save terms via .add_new_from('phenotype')
• mapping developmental_stage on DevelopmentalStage.name
! 2 terms are not validated: 'Adult', 'Child'
→ fix typos, remove non-existent values, or save terms via .add_new_from('developmental_stage')
False
# Use lookup objects to curate the values
disease_lo = bt.Disease.public().lookup()
phenotype_lo = bt.Phenotype.public().lookup()
developmental_stage_lo = bt.DevelopmentalStage.public().lookup()
df["disease"] = df["disease"].replace({"Hypertension": disease_lo.hypertensive_disorder.name})
df["phenotype"] = df["phenotype"].replace({
"Tumor growth": phenotype_lo.neoplasm.name,
"Airway inflammation": phenotype_lo.bronchitis.name}
)
df["developmental_stage"] = df["developmental_stage"].replace({
"Adult": developmental_stage_lo.adolescent_stage.name,
"Child": developmental_stage_lo.child_stage.name
})
ehrcurator.validate()
Show code cell output
• saving validated records of 'disease'
• saving validated records of 'phenotype'
• saving validated records of 'developmental_stage'
✓ disease is validated against Disease.name
✓ phenotype is validated against Phenotype.name
✓ developmental_stage is validated against DevelopmentalStage.name
True
Show code cell content
!rm -rf subclass-curator
!lamin delete --force subclass-curator
• deleting instance testuser1/subclass-curator