Slowly Changing Dimensions in Data Science
In this blog post, Michael Kaminsky discusses a common pitfall in data science when using cloud data warehouses for machine learning tasks. The issue arises with untracked slowly changing dimensions, which can lead to poor real-world predictive accuracy of models. An example is provided where the goal is to predict customer churn in a subscription business. Two main pitfalls are identified: incorrect setup and the problem of slowly changing dimensions. To address these issues, Kaminsky suggests using a log format for data storage that tracks all changes made to different values in the database. Fivetran's "history mode" feature is mentioned as a tool to automatically record and capture changes to slowly changing dimensions, allowing for better model generalization from training to production.
Company
Fivetran
Date published
Feb. 26, 2021
Author(s)
Michael Kaminsky
Word count
1568
Language
English
Hacker News points
None found.