THE CHALLENGE
At the time, our banking customer had no central ETL monitoring and management solution for their data warehouse. They did not have visibility into the jobs that were scheduled or underway or which jobs had failed. They had no method for automatically reinitiating jobs that had failed. They were also unable to pinpoint the reason a job failed.
They needed a centralized dashboard to support the business side, as well as the ability to monitor and manage hundreds of attributes around the various jobs, including job status, job flows, errors and failures, successful completions, failure reports, ticket handling, and more. Our customer also needed the ability to retrigger failed jobs on an ad-hoc basis and visibility into process lineage.
THE SOLUTION
EveryIT created a centralized ETL monitoring and management solution, including multiple MVPs, within just seven months. The solution enables our customer to monitor incoming jobs from several data pipelines from multiple source systems in different formats. The solution delivers insights on job status, including run time, success or failure status, reason for failure, and other key measures.
EveryIT also created a tool to automatically handle job failures. When a job fails, the tool creates a failure report with in-depth details including who initiated the job, how long the job had been running, the reason for failure, and other details. The solution then automatically generates a ticket that is escalated to the appropriate party.
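The failure-handling flow described above can be sketched in a few lines of Java. This is a minimal illustration only, not the delivered implementation: the class names (JobFailure, FailureReport, Ticket, TicketSink, FailureHandler), their fields, and the assignee value are all assumptions made for the example, and the ticketing system is represented by a simple interface.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical event emitted when an ETL job fails; field names are illustrative. */
final class JobFailure {
    final String jobId;
    final String initiatedBy;
    final Instant startedAt;
    final Instant failedAt;
    final String reason;

    JobFailure(String jobId, String initiatedBy, Instant startedAt, Instant failedAt, String reason) {
        this.jobId = jobId;
        this.initiatedBy = initiatedBy;
        this.startedAt = startedAt;
        this.failedAt = failedAt;
        this.reason = reason;
    }
}

/** Failure report holding the details named in the text: initiator, run time, reason. */
final class FailureReport {
    final String jobId;
    final String initiatedBy;
    final Duration runTime;
    final String reason;

    FailureReport(String jobId, String initiatedBy, Duration runTime, String reason) {
        this.jobId = jobId;
        this.initiatedBy = initiatedBy;
        this.runTime = runTime;
        this.reason = reason;
    }
}

/** Hypothetical ticket; a real ticketing system would carry many more fields. */
final class Ticket {
    final String summary;
    final String assignee;

    Ticket(String summary, String assignee) {
        this.summary = summary;
        this.assignee = assignee;
    }
}

/** Stand-in for whatever ticketing system receives the escalation (assumed). */
interface TicketSink {
    void open(Ticket ticket);
}

final class FailureHandler {
    private final TicketSink sink;
    private final String assignee;

    FailureHandler(TicketSink sink, String assignee) {
        this.sink = sink;
        this.assignee = assignee;
    }

    /** Builds the report; run time is the span between job start and failure. */
    static FailureReport buildReport(JobFailure f) {
        Duration runTime = Duration.between(f.startedAt, f.failedAt);
        return new FailureReport(f.jobId, f.initiatedBy, runTime, f.reason);
    }

    /** Creates the report, turns it into a ticket, and hands it to the sink. */
    Ticket handle(JobFailure f) {
        FailureReport report = buildReport(f);
        String summary = "Job " + report.jobId + " failed after "
                + report.runTime.toMinutes() + " min: " + report.reason;
        Ticket ticket = new Ticket(summary, assignee);
        sink.open(ticket);
        return ticket;
    }
}
```

In this sketch the routing rule is a single fixed assignee. Escalating to "the appropriate party" would in practice depend on ownership metadata for each job, which is not described here.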
ETL monitoring and management solution architecture overview
THE RESULTS
The ETL monitoring and management solution provides our customer with a central dashboard for monitoring, as well as in-depth visibility and control through error detection, error enrichment, event triggering services, a unified monitoring service, unified metadata, SLA impact, and recovery. Our customer also benefits from enhancements to their existing components, such as Airflow, Job Server, Ingestion, Computes, and DaaS. The solution from EveryIT delivered the following benefits:
- Increased productivity by 43%.
- Reduced losses from job failures by 76%.
- Cut required headcount in half, from 16 to 8.
- Increased observability of the entire process by 87%.
TECHNOLOGIES USED
Java 1.8, Spring Boot, Spring Cloud, Spring MVC, Spring Data, Spring Security, Kafka, ELK, Filebeat, Heartbeat, Kubernetes, Docker, Jenkins, React.js, and Kibana.