Reliable data replication in the face of schema drift
This text discusses techniques for keeping data replication reliable in the face of schema drift. Schema drift occurs when an application's data model evolves and its columns, tables, and data types change. Reliable data replication means preserving original data values and moving them from source to destination without interruption even as the source schema changes. Two methods for this are net-additive data integration and live updating. Net-additive data integration avoids pipeline breakages while faithfully reproducing data in the destination: columns and tables are never removed from the destination, whatever changes the source schema undergoes. Live updating keeps the data model in the destination matched to the data model in the source and does not retain old schema elements. Another technique discussed is history mode, which tracks changes to row values by retaining the current version and all previous versions of every row in a table. The text also covers data type changes, which are accommodated by assigning a column a new data type inclusive enough to hold both its old and new values. Reliable data replication is crucial for maintaining a functional data pipeline, especially when data volumes are large and historical data must be examined.
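The four techniques summarized above can be illustrated with a minimal Python sketch. It is not taken from the article and does not reflect how any particular product implements them. Schemas are modeled as plain dictionaries mapping column names to type names, and everything named here is a hypothetical assumption made for illustration: the `TYPE_ORDER` hierarchy, the function names, and the `start`, `end`, and `active` bookkeeping columns.

```python
# Hypothetical type hierarchy, ordered from narrowest to widest (an assumption).
TYPE_ORDER = ["boolean", "integer", "float", "string"]


def widen_type(old, new):
    """Return whichever of the two types is wider, so it can hold both old and new values."""
    return max(old, new, key=TYPE_ORDER.index)


def net_additive(dest, source):
    """Merge the source schema into the destination without ever dropping a column."""
    merged = dict(dest)
    for col, typ in source.items():
        # Existing columns are widened if needed; new columns are added.
        merged[col] = widen_type(merged[col], typ) if col in merged else typ
    return merged


def live_update(dest, source):
    """Mirror the source schema exactly; `dest` is ignored, so old columns disappear."""
    return dict(source)


def apply_history(rows, key, new_values, now):
    """History mode: close the active version of row `key`, then append the new version."""
    for row in rows:
        if row["id"] == key and row["active"]:
            row["active"] = False
            row["end"] = now
    rows.append({"id": key, **new_values, "start": now, "end": None, "active": True})
    return rows


# The source dropped `legacy_code`, added `sku`, and changed `price` to float.
dest = {"id": "integer", "price": "integer", "legacy_code": "string"}
source = {"id": "integer", "price": "float", "sku": "string"}

additive = net_additive(dest, source)  # keeps legacy_code, adds sku, widens price
live = live_update(dest, source)       # matches the source, legacy_code is gone

history = []
apply_history(history, 1, {"price": 10}, now="2022-03-01")
apply_history(history, 1, {"price": 12}, now="2022-03-03")  # both versions retained
```

In this sketch the net-additive result still contains `legacy_code`, while the live-updated result does not. The history table holds two versions of row 1, of which only the latest is marked active.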
Company
Fivetran
Date published
March 3, 2022
Author(s)
Meel Velliste
Word count
851
Language
English
Hacker News points
None found.