General Delta Table Processing
A Python OOP approach for creating and updating Databricks Delta Tables — building a reusable, metadata-driven processing class based on the four pillars of object-oriented programming.
Radek Řezáč
Python is an object-oriented programming language, and it supports the four main pillars of OOP: encapsulation, inheritance, polymorphism, and abstraction. Applying these principles to Databricks Delta table processing results in a framework that is reusable, testable, and easy to extend — without duplicating code for every new data object.
Why OOP for Delta Processing?
Ad-hoc Delta Lake scripts are the path of least resistance — write a notebook, read a Parquet file, run a merge. But as the number of objects grows, you end up maintaining dozens of near-identical notebooks with subtle differences that are impossible to govern or test systematically.
An OOP approach replaces one-off scripts with a composable framework:
- Encapsulation — each class owns its own state and behaviour; internal implementation can change without affecting callers
- Inheritance — a base processing class provides common behaviour; subclasses override only what differs
- Polymorphism — the same interface (e.g.,
process()) behaves appropriately for different object types (full load, incremental, CDC) - Abstraction — callers work with a high-level interface; they don't need to know whether the underlying write is an overwrite, append, or MERGE
Core Architecture
A general Delta table processing framework built on these principles consists of four cooperating classes:
MetadataClass
Responsible for all communication with the Metadata database. Loads:
- System-level configuration — environment settings, connection parameters, storage paths
- Object-level metadata — table definitions, schema expectations, processing rules
- File-level metadata — newly ingested Parquet files awaiting processing
Also writes audit records back to the Metadata database on completion: processing status, row counts, error messages, timestamps.
FileLoaderClass
Reads the Parquet files identified by MetadataClass into a Spark DataFrame.
class FileLoaderClass:
def __init__(self, metadata):
self.metadata = metadata
def load(self) -> DataFrame:
return spark.read.parquet(self.metadata.file_path)
Abstracts file access from processing logic. Changes to source format or storage layout are isolated here.
DeltaProcessorClass
Contains the core write logic. Selects and executes the appropriate strategy based on metadata configuration:
- Full Load — truncates and reloads the target Delta table entirely
- Incremental Load — appends new records based on a watermark or partition key
- CDC Merge — merges changes (inserts, updates, deletes) using
MERGE INTO
class DeltaProcessorClass:
def __init__(self, metadata, df: DataFrame):
self.metadata = metadata
self.df = df
def process(self):
mode = self.metadata.processing_mode
if mode == "FULL":
self._full_load()
elif mode == "INCREMENTAL":
self._incremental_load()
elif mode == "CDC":
self._cdc_merge()
def _full_load(self):
self.df.write.format("delta").mode("overwrite").saveAsTable(self.metadata.target_table)
def _incremental_load(self):
self.df.write.format("delta").mode("append").saveAsTable(self.metadata.target_table)
def _cdc_merge(self):
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, self.metadata.target_table)
target.alias("t").merge(
self.df.alias("s"),
f"t.{self.metadata.key_column} = s.{self.metadata.key_column}"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
ProcessingClass
The top-level orchestrator and single entry point for pipeline execution. Composes the other classes:
class ProcessingClass:
def __init__(self, object_id: str):
self.metadata = MetadataClass(object_id)
self.loader = FileLoaderClass(self.metadata)
self.processor = DeltaProcessorClass(self.metadata, self.loader.load())
def run(self):
try:
self.processor.process()
self.metadata.write_audit(status="SUCCESS", rows=self.processor.rows_written)
except Exception as e:
self.metadata.write_audit(status="FAILED", error=str(e))
raise
To customise behaviour for a specific object type, subclass ProcessingClass and override only the relevant methods — without touching the base framework.
Running the Pipeline
# Invoke for a specific data object
pipeline = ProcessingClass(object_id="SALES_ORDERS")
pipeline.run()
Adding a new data object requires only a metadata record in the Metadata database — no new code.
Key Benefits
- Zero-touch onboarding — new objects are registered in the Metadata DB; the framework handles the rest
- Full audit trail — every run writes status, row counts, and timestamps to the Metadata DB
- Isolated concerns — bugs in file loading don't affect processing logic, and vice versa
- Extensible — new processing modes are added by extending
DeltaProcessorClass, not modifying it - Consistent — all objects follow the same execution pattern, simplifying monitoring and operations
Dynamic Management Views (DMVs)
How to use DMVs to create documentation for a Power BI data model — querying tables, columns, measures, relationships, and dependencies in DAX Studio and SSMS.
Microsoft Fabric as an All-in-One Analytics Solution
A deep dive into Microsoft Fabric — OneLake, Workspaces, Lakehouse, Warehouse, Apache Spark, and how it compares to Data Mesh and decentralised approaches.