ML Troubleshooting Is Too Hard Today (But It Doesn’t Have To Be That Way)
The stakes for model performance are higher than ever as teams deploy more models into production, and the cost of mistakes grows with them. A modern approach to ML troubleshooting is needed: one that shifts teams from no monitoring at all to full stack ML observability.

Monitoring, at its core, requires data on system performance, and that data must be stored, made accessible, and displayed. To monitor model performance, one must begin with a prediction and an actual (the ground truth) and compare them using the right metric. The correct metric depends on the use case: recall or false negative rate for a fraud model, for example, and mean absolute percentage error or mean squared error for a demand forecasting model. Establishing thresholds is just as important, since teams need to know when an acceptable accuracy rate has degraded into an unacceptable one. Because "good" is rarely absolute, machine learning practitioners must rely on relative metrics and establish a baseline performance that defines what counts as good enough.

Monitoring alone is not enough, however. Teams also need a modern approach to assessing and troubleshooting model performance: full stack ML observability with ML performance tracing.
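To make the prediction-versus-actual comparison concrete, here is a minimal Python sketch of the flow described above: computing use-case-appropriate metrics and checking the current value against a baseline with a relative threshold. The sample arrays, the helper names (`false_negative_rate`, `mape`, `degraded`), and the 10% degradation tolerance are illustrative assumptions, not any particular platform's API.

```python
import numpy as np
from sklearn.metrics import recall_score, confusion_matrix, mean_squared_error

def false_negative_rate(y_true, y_pred):
    """Share of actual positives (e.g., fraud cases) the model missed."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tp)

def mape(y_true, y_pred):
    """Mean absolute percentage error, a common forecasting metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Fraud model (classification): compare predictions with actuals via recall / FNR.
fraud_actuals = np.array([0, 1, 1, 0, 1, 0, 0, 1])
fraud_preds   = np.array([0, 1, 0, 0, 1, 0, 1, 1])
print("recall:", recall_score(fraud_actuals, fraud_preds))
print("false negative rate:", false_negative_rate(fraud_actuals, fraud_preds))

# Demand forecasting model (regression): compare with actuals via MAPE / MSE.
demand_actuals = np.array([120.0, 95.0, 150.0, 80.0])
demand_preds   = np.array([110.0, 100.0, 160.0, 70.0])
print("MAPE:", mape(demand_actuals, demand_preds))
print("MSE:", mean_squared_error(demand_actuals, demand_preds))

# Relative threshold against a baseline: flag the model when the current
# window's metric degrades more than `tolerance` relative to the baseline.
def degraded(baseline, current, tolerance=0.10, higher_is_better=True):
    change = (current - baseline) / baseline
    return change < -tolerance if higher_is_better else change > tolerance

baseline_recall = 0.92  # measured on a healthy reference window (assumed value)
current_recall = recall_score(fraud_actuals, fraud_preds)
if degraded(baseline_recall, current_recall):
    print("ALERT: recall dropped more than 10% below baseline")
```

The key design point is in the last few lines: rather than hard-coding an absolute "good" score, the check is expressed relative to a baseline window, which is what makes the threshold meaningful across different models and use cases.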