THE CHALLENGE
The Imagine Communications Research and Development data team challenge was twofold. First, they had to reduce prohibitive overhead and operational costs associated with developing a new product on Azure using Microsoft SQL Server. Second, they needed to deliver a new analytics platform built to leverage their existing analytic tools, provide instant elasticity, integrate multiple data sources, and reduce time to deliver new products and services.
This analytics system transforms TV ratings data provided by Nielsen and ComScore spanning more than 200 geographic markets in a single serverless Data Lake.
THE SOLUTION
After a thorough consultation and evaluation, the EveryIT Big Data team proposed a solution built on AWS cloud-native services, comprised of Athena, Glue, Redshift, and Redshift Spectrum. The project also included the easy deployment of Redshift clusters and event-driven pipelines to allow Imagine Communications to move large amounts of daily generated data to a Redshift cluster.
EveryIT also implemented tuning and revision of the former workflow. Changes included:
- Historical data and real-time data provided by Nielsen and ComScore, the rating agencies, is now stored in S3, consisting of more than 1 million files and more than 50 TB of uncompressed data.
- The input data comes in a mainframe-like format, so a Lambda/Python function was used to convert to S3 in JSON format. Lambda functions were used for high concurrency data processing for S3 files and were invoked more than 1 million times to process more than 800,000 files in one hour.
- To save money, the S3/JSON data was converted to Parquet format using Spark which runs in Fargate containers.
- The Glue Catalog has more than 2,000 tables that is queried by Athena.
- The Parquet-formatted data stored in S3 was moved to Redshift Spectrum and Redshift Clusters via staging tables.
- The Redshift cluster, which is a 2-node RA3 configuration, houses more than 20TB of raw data for 4 years.
- The ETL into the Redshift cluster is done by Python/SQL using Fargate containers.
- The Reporting UI now communicates to the Redshift spectrum and Redshift cluster through C# API.
By implementing the AWS managed services stack and architectural advancements, the EveryIT Big Data team successfully delivered a scalable, truly elastic, and highly performant data solution that also met Imagine Communications’ stringent success criteria. By moving to the AWS cloud platform, the overall cost to operate the new solution dropped from an average of $200,000 per month to an average of $5,000 per month.
KNOWLEDGE TRANSFER
During the engagement with Imagine Communications, the EveryIT team also provided education and training, so that the Imagine Communications Data Analytics Team could take ownership of the newly developed environment. As a practice, the EveryIT Big Data Team developed and delivered a runbook, so that Imagine Communications could not only manage the newly provided data platform, but also add to the solution in the future.
BENEFITS
Migrating from Microsoft SQL Server on Azure to Redshift on Amazon Web Services provided Imagine Communications with increased performance, high scalability, and faster data processing by leveraging a sophisticated data lake solution.