Sentry for Data is a new initiative by Abhijeet Prasad and Mike Clarke that aims to bridge the gap between error monitoring and observability tooling for data tools, on the premise that logs alone are insufficient for quick, efficient debugging. Having already built monitoring solutions with Sentry for popular data tools like Apache Beam and Apache Airflow, the team is now focused on integrating Sentry with PySpark, the Python API for Apache Spark. The integration produces full-context events in Sentry that can be tracked, assigned, and grouped, carrying metadata and breadcrumbs that help isolate the source of an error.

The PySpark integration works out of the box across a variety of execution environments and can be customized to fit a given setup. To get started, install the Sentry Python SDK and initialize it with the SparkIntegration before creating a SparkContext or SparkSession. To gain comprehensive insight into where errors occur, both the driver and the workers should be instrumented, as sketched below.
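A minimal sketch of the driver-side setup, based on the Sentry Python SDK's Spark integration; the DSN is a placeholder and the app name is arbitrary:

```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration
from pyspark.sql import SparkSession

# Initialize Sentry *before* creating the SparkContext/SparkSession,
# so the integration can hook into the driver's error handling.
sentry_sdk.init(
    dsn="https://<your-public-key>@o0.ingest.sentry.io/<project-id>",  # placeholder DSN
    integrations=[SparkIntegration()],
)

spark = (
    SparkSession.builder
    .appName("sentry-example")  # arbitrary app name for illustration
    .getOrCreate()
)
```

Instrumenting the workers follows a similar pattern: Sentry's documentation describes shipping a small daemon module to each worker that initializes the SDK with the SparkWorkerIntegration and then hands control to PySpark's standard worker daemon. A sketch, assuming a file named `sentry_daemon.py` distributed to the workers (e.g. via `spark-submit --py-files sentry_daemon.py`):

```python
# sentry_daemon.py -- runs on each worker before Python tasks start.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

sentry_sdk.init(
    dsn="https://<your-public-key>@o0.ingest.sentry.io/<project-id>",  # placeholder DSN
    integrations=[SparkWorkerIntegration()],
)

if __name__ == "__main__":
    # Delegate to PySpark's normal worker daemon once Sentry is set up.
    original_daemon.manager()
```

Spark is then pointed at this module through its configuration, with `spark.python.use.daemon` set to `true` and `spark.python.daemon.module` set to `sentry_daemon`.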