Jun 17, 2025 - 5 MIN READ

General Delta Table Processing

A Python OOP approach for creating and updating Databricks Delta Tables — building a reusable, metadata-driven processing class based on the four pillars of object-oriented programming.

Radek Řezáč

Python supports object-oriented programming, including its four main pillars: encapsulation, inheritance, polymorphism, and abstraction. Applying these principles to Databricks Delta table processing yields a framework that is reusable, testable, and easy to extend, without duplicating code for every new data object.

Why OOP for Delta Processing?

Ad-hoc Delta Lake scripts are the path of least resistance — write a notebook, read a Parquet file, run a merge. But as the number of objects grows, you end up maintaining dozens of near-identical notebooks with subtle differences that are impossible to govern or test systematically.

An OOP approach replaces one-off scripts with a composable framework:

Encapsulation — each class owns its own state and behaviour; internal implementation can change without affecting callers
Inheritance — a base processing class provides common behaviour; subclasses override only what differs
Polymorphism — the same interface (e.g., process()) behaves appropriately for different object types (full load, incremental, CDC)
Abstraction — callers work with a high-level interface; they don't need to know whether the underlying write is an overwrite, append, or MERGE
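
The four pillars can be illustrated without Spark. The following is a minimal sketch only: the class names (LoadStrategy, FullLoad, IncrementalLoad, CdcMerge) and the plain Python list standing in for a Delta table are illustrative assumptions, not part of the framework described in this article.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]


class LoadStrategy(ABC):
    """Abstraction: callers only see apply(); the write semantics stay hidden."""

    def __init__(self, key: str = "id"):
        self._key = key  # encapsulation: internal state owned by the class

    @abstractmethod
    def apply(self, target: List[Row], batch: List[Row]) -> List[Row]:
        """Return the table contents after applying the batch."""


class FullLoad(LoadStrategy):  # inheritance: only apply() differs
    def apply(self, target, batch):
        return list(batch)  # replace everything


class IncrementalLoad(LoadStrategy):
    def apply(self, target, batch):
        return target + list(batch)  # append only


class CdcMerge(LoadStrategy):
    def apply(self, target, batch):
        merged = {row[self._key]: row for row in target}
        merged.update({row[self._key]: row for row in batch})  # upsert by key
        return list(merged.values())


# Polymorphism: the same call works for every strategy.
table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
results = {
    type(strategy).__name__: strategy.apply(table, batch)
    for strategy in (FullLoad(), IncrementalLoad(), CdcMerge())
}
```

Here each strategy is its own class, whereas the framework below selects between private methods of one class. Both are valid ways to hide the write mode behind a single call.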

Core Architecture

A general Delta table processing framework built on these principles consists of four cooperating classes:

MetadataClass

Responsible for all communication with the Metadata database. Loads:

System-level configuration — environment settings, connection parameters, storage paths
Object-level metadata — table definitions, schema expectations, processing rules
File-level metadata — newly ingested Parquet files awaiting processing

Also writes audit records back to the Metadata database on completion: processing status, row counts, error messages, timestamps.
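
The article does not show MetadataClass itself, so the following is a sketch under stated assumptions. An in-memory SQLite database stands in for the Metadata database, and the table names (object_metadata, audit_log), their columns, the sample path, and the extra conn constructor argument are all hypothetical. Only the attribute names (file_path, processing_mode, target_table, key_column) and the write_audit() call match what the other classes in this article use.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


class MetadataClass:
    def __init__(self, object_id: str, conn: sqlite3.Connection):
        self.object_id = object_id
        self._conn = conn
        # Load the object-level metadata for this data object
        row = conn.execute(
            "SELECT file_path, processing_mode, target_table, key_column "
            "FROM object_metadata WHERE object_id = ?",
            (object_id,),
        ).fetchone()
        if row is None:
            raise KeyError(f"No metadata registered for {object_id}")
        (self.file_path, self.processing_mode,
         self.target_table, self.key_column) = row

    def write_audit(self, status: str, rows: Optional[int] = None,
                    error: Optional[str] = None) -> None:
        # One audit row per run: status, row count, error text, UTC timestamp
        self._conn.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
            (self.object_id, status, rows, error,
             datetime.now(timezone.utc).isoformat()),
        )
        self._conn.commit()


# Hypothetical schema and sample record, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_metadata (object_id TEXT PRIMARY KEY, file_path TEXT, "
    "processing_mode TEXT, target_table TEXT, key_column TEXT)"
)
conn.execute(
    "CREATE TABLE audit_log (object_id TEXT, status TEXT, row_count INTEGER, "
    "error TEXT, logged_at TEXT)"
)
conn.execute(
    "INSERT INTO object_metadata VALUES "
    "('SALES_ORDERS', '/landing/sales_orders/', 'CDC', 'sales.orders', 'order_id')"
)

meta = MetadataClass("SALES_ORDERS", conn)
meta.write_audit(status="SUCCESS", rows=120)
audit = conn.execute("SELECT object_id, status, row_count FROM audit_log").fetchall()
```

Passing the connection in, rather than opening it inside the class, is a design choice made here so that the class can be exercised against a test database.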

FileLoaderClass

Reads the Parquet files identified by MetadataClass into a Spark DataFrame.

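from pyspark.sql import DataFrame, SparkSession

class FileLoaderClass:
    def __init__(self, metadata):
        self.metadata = metadata

    def load(self) -> DataFrame:
        # On Databricks a session already exists; getOrCreate() returns it
        spark = SparkSession.builder.getOrCreate()
        return spark.read.parquet(self.metadata.file_path)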

Abstracts file access from processing logic. Changes to source format or storage layout are isolated here.

DeltaProcessorClass

Contains the core write logic. Selects and executes the appropriate strategy based on metadata configuration:

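Full Load: truncates and reloads the target Delta table entirely
Incremental Load: appends new records based on a watermark or partition key
CDC Merge: merges changes using MERGE INTO; the example below applies inserts and updates, while deletes would need an additional whenMatchedDelete clause driven by a change-type column in the source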

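from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession

class DeltaProcessorClass:
    def __init__(self, metadata, df: DataFrame):
        self.metadata = metadata
        self.df = df
        self.rows_written = 0  # reported to the audit record by ProcessingClass

    def process(self):
        mode = self.metadata.processing_mode
        if mode == "FULL":
            self._full_load()
        elif mode == "INCREMENTAL":
            self._incremental_load()
        elif mode == "CDC":
            self._cdc_merge()
        else:
            # Fail loudly rather than silently skipping an unknown mode
            raise ValueError(f"Unsupported processing mode: {mode!r}")
        # Rows in the incoming batch; note that count() triggers a Spark job
        self.rows_written = self.df.count()

    def _full_load(self):
        # Replace the whole target table with the incoming data
        self.df.write.format("delta").mode("overwrite").saveAsTable(self.metadata.target_table)

    def _incremental_load(self):
        # Append the incoming rows to the target table
        self.df.write.format("delta").mode("append").saveAsTable(self.metadata.target_table)

    def _cdc_merge(self):
        # Upsert by key: update matching rows, insert new ones (no delete handling)
        spark = SparkSession.builder.getOrCreate()
        key = self.metadata.key_column
        target = DeltaTable.forName(spark, self.metadata.target_table)
        (
            target.alias("t")
            .merge(self.df.alias("s"), f"t.{key} = s.{key}")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )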

ProcessingClass

The top-level orchestrator and single entry point for pipeline execution. Composes the other classes:

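class ProcessingClass:
    def __init__(self, object_id: str):
        self.metadata = MetadataClass(object_id)
        self.loader = FileLoaderClass(self.metadata)
        self.processor = None  # built in run() so that load errors are audited too

    def run(self):
        try:
            # Load inside the try block: a failed read is recorded as FAILED
            self.processor = DeltaProcessorClass(self.metadata, self.loader.load())
            self.processor.process()
            self.metadata.write_audit(status="SUCCESS", rows=self.processor.rows_written)
        except Exception as e:
            # Record the failure, then re-raise so the job itself is marked as failed
            self.metadata.write_audit(status="FAILED", error=str(e))
            raise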

To customise behaviour for a specific object type, subclass ProcessingClass and override only the relevant methods — without touching the base framework.
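
As a rough illustration of that subclassing pattern, the sketch below uses a Spark-free stand-in for ProcessingClass so that it runs anywhere. The subclass name ValidatedProcessingClass, the events list, and the prefix rule are hypothetical and exist only to show an override that reuses the base run().

```python
class ProcessingClass:
    """Spark-free stand-in for the orchestrator (illustration only)."""

    def __init__(self, object_id: str):
        self.object_id = object_id
        self.events = []

    def run(self):
        self.events.append(f"processed {self.object_id}")


class ValidatedProcessingClass(ProcessingClass):
    """Hypothetical subclass: adds a pre-run check, inherits everything else."""

    allowed_prefix = "SALES_"  # hypothetical rule for this object type

    def run(self):
        if not self.object_id.startswith(self.allowed_prefix):
            raise ValueError(f"{self.object_id} is not a sales object")
        self.events.append("validated")
        super().run()  # base behaviour is reused, not copied


pipeline = ValidatedProcessingClass("SALES_ORDERS")
pipeline.run()
```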

Running the Pipeline

# Invoke for a specific data object
pipeline = ProcessingClass(object_id="SALES_ORDERS")
pipeline.run()

Adding a new data object requires only a metadata record in the Metadata database — no new code.
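
One way to take advantage of this is a small driver that runs every registered object in turn. The sketch below is an assumption about how such a driver could look: run_all and StubPipeline are hypothetical names, the stub replaces ProcessingClass so the example runs without Spark, and the list of IDs is hard-coded where a real driver would read it from the Metadata database.

```python
from typing import Callable, Dict, Iterable


class StubPipeline:
    """Stand-in for ProcessingClass so the sketch runs without Spark."""

    def __init__(self, object_id: str):
        self.object_id = object_id

    def run(self):
        if self.object_id == "BROKEN_OBJECT":  # simulated failure
            raise RuntimeError("source file missing")


def run_all(object_ids: Iterable[str],
            make_pipeline: Callable[[str], StubPipeline]) -> Dict[str, str]:
    """Run one pipeline per object; a failure in one does not stop the rest."""
    outcomes = {}
    for object_id in object_ids:
        try:
            make_pipeline(object_id).run()
            outcomes[object_id] = "SUCCESS"
        except Exception as exc:
            outcomes[object_id] = f"FAILED: {exc}"
    return outcomes


# Hard-coded here; in practice the IDs would come from the Metadata database
outcomes = run_all(["SALES_ORDERS", "CUSTOMERS", "BROKEN_OBJECT"], StubPipeline)
```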

Key Benefits

Zero-touch onboarding — new objects are registered in the Metadata DB; the framework handles the rest
Full audit trail — every run writes status, row counts, and timestamps to the Metadata DB
Isolated concerns — bugs in file loading don't affect processing logic, and vice versa
Extensible — new processing modes are added by extending DeltaProcessorClass, not modifying it
Consistent — all objects follow the same execution pattern, simplifying monitoring and operations
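
To make the extensibility point concrete, here is a sketch of adding a mode by subclassing. The base class is a Spark-free stand-in that only mirrors the mode dispatch of DeltaProcessorClass, and the SNAPSHOT mode, SnapshotProcessor, and _snapshot_load are hypothetical names chosen for illustration.

```python
class DeltaProcessorClass:
    """Spark-free stand-in that mirrors the dispatch in process()."""

    def __init__(self, processing_mode: str):
        self.processing_mode = processing_mode
        self.executed = None

    def process(self):
        if self.processing_mode in ("FULL", "INCREMENTAL", "CDC"):
            self.executed = self.processing_mode.lower()
        else:
            raise ValueError(f"Unsupported processing mode: {self.processing_mode}")


class SnapshotProcessor(DeltaProcessorClass):
    """Hypothetical extension: handles one new mode, delegates the rest."""

    def process(self):
        if self.processing_mode == "SNAPSHOT":
            self._snapshot_load()
        else:
            super().process()  # existing modes keep their original behaviour

    def _snapshot_load(self):
        self.executed = "snapshot"  # a real version would write to a Delta table


new_mode = SnapshotProcessor("SNAPSHOT")
new_mode.process()
old_mode = SnapshotProcessor("CDC")
old_mode.process()
```

The base class is left untouched, so objects already configured with FULL, INCREMENTAL, or CDC are unaffected by the new mode.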
