adesso Blog

In the age of digital innovation and data-driven decision-making, the efficient processing and analysis of data has become a decisive competitive advantage. Especially in dynamic environments where real-time information can make the difference between success and failure, big data streaming has become an indispensable tool.

Amazon Web Services (AWS) offers a wide range of services and tools that enable companies to collect, process and analyse data in real time. From capturing large amounts of data to generating valuable insights in real time, AWS offers a robust infrastructure and scalable solutions for streaming use cases.

In this blog post, we'll take a closer look at architectural design for streaming with AWS with a focus on big data. We will explore the various components and services that contribute to developing a powerful and reliable streaming architecture and discuss best practices and recommendations for optimising your streaming solution.

Big Data

Big data refers to the vast amounts of data generated in our modern, digitised world. This data includes not only structured information that can be easily organised in traditional databases, but also unstructured data such as text, images, videos and audio files. The special thing about big data is not only the sheer volume of data, but also the variety of sources from which it originates.

Big data is often characterised by the 3Vs:

  • Volume refers to the large amount of data that is generated daily.
  • Velocity describes the speed at which this data is generated, collected and processed.
  • Variety refers to the different types of data that come from different sources.

These characteristics make big data a challenge, but also a great opportunity for companies and organisations, as it can provide valuable insights that are often not possible with traditional analysis methods.

Streaming

Streaming data is an essential component in the age of real-time communication and analysis. Unlike static data sets that are processed in batch systems, streaming data is generated, transmitted and processed continuously and in real time. These data streams can come from a variety of sources, including IoT devices, sensors, social media, web applications and more. The ability to process and analyse this data in real time allows companies to gain instant insights into rapidly evolving events and trends. By utilising streaming data, companies can react faster and make more informed decisions.

Big data streaming use cases

Big data streaming offers a variety of use cases in different industries:

  • Network monitoring and security: continuously analysing network data in real time makes it possible to detect suspicious activity, identify security threats and respond quickly.
  • Monitoring of financial transactions: In the financial industry, Big Data streaming enables real-time monitoring of transactions to detect and prevent fraudulent activity by immediately identifying and investigating suspicious patterns.
  • Real-time analysis of customer data in retail: By analysing streaming data on purchasing behaviour, retailers can understand their customers' behaviour in real time and make personalised offers and recommendations to increase customer satisfaction and boost sales.

  • Real-time log analysis and error detection: By analysing log data in real time, companies can identify potential problems and errors in applications and systems, implement solutions quickly and maximise uptime.

Streaming architecture with AWS services

In order to make streaming data usable for analysis, it must first be collected, stored and, if necessary, transformed. A variety of options and technologies are available for this, from open source products to platform-as-a-service and software-as-a-service solutions, and they are not limited to the cloud: depending on the use case, data protection requirements and expected costs, it may make more sense to process streaming data on on-premises platforms. Nevertheless, cloud providers such as AWS offer considerable advantages in terms of availability, scalability and the range of managed services, and AWS in particular provides numerous services that cover the individual process steps involved in handling streaming data. In the following sections, I present an example architecture for this purpose. It is just that, an example, and should not be considered the final solution for every streaming use case; there are often several candidate services for each step, and careful consideration should be given to which are best suited to the desired outcome.

Below is the example architecture shown as a diagram. Four AWS Managed Services are used here. These services and the associated process steps are explained in more detail in the following sections.


The example architecture as a diagram, source: AWS.amazon.com

Kinesis - Data acquisition

In order to make streaming data accessible for analysis purposes, it must first be captured from the corresponding source systems. In this context, the AWS Kinesis Data Firehose service proves to be a valuable tool. Firehose is a powerful service that captures, transforms and delivers streaming data easily and seamlessly to various AWS storage and analytics destinations. Users can capture data from a variety of sources in real time without having to worry about infrastructure management or scalability. The sources can be other AWS services such as CloudWatch, EventBridge, Kinesis Data Streams or Lambda. In addition, the Firehose API can be addressed directly via an SDK to connect external software and infrastructure such as IoT or network devices.
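To give an impression of how such an external producer can write events into Firehose via the SDK, the following minimal sketch uses the AWS SDK for Python (boto3). The delivery stream name, region and event payload are purely illustrative assumptions and not part of the reference architecture.

	import json
	import boto3
	
	# Firehose client; region and stream name are placeholders
	firehose = boto3.client("firehose", region_name="eu-central-1")
	
	# Example event as it might arrive from an IoT device
	event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2023-03-26T09:15:00Z"}
	
	# Send one record; Firehose buffers it and delivers it to the configured destination
	response = firehose.put_record(
	    DeliveryStreamName="example-event-stream",
	    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
	)
	print(response["RecordId"])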

Due to its seamless scalability, Kinesis Firehose is also ideal for processing large amounts of data. In addition, the service makes it easy to configure data deliveries to various AWS storage destinations such as Amazon S3, Amazon Redshift and Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service), making the data immediately available for analyses, reports and other business applications.

In our example architecture, the data is captured by AWS Kinesis Data Firehose and stored as a JSON file per incoming event in an S3 bucket. Storage is time-partitioned, with the time of data receipt determining the object key. This makes it easier to process the individual events retrospectively. The Hive-style partitioning convention that is widespread in the Hadoop ecosystem is often used for the storage layout:

	/year=2023/month=03/day=26/hour=09.json
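One way to arrive at exactly this layout is to configure the delivery stream's S3 prefix using Firehose's timestamp expressions, which Firehose evaluates at delivery time. The following boto3 sketch is illustrative only; the stream name, bucket ARN and IAM role ARN are placeholders that would have to exist in the target account.

	import boto3
	
	firehose = boto3.client("firehose", region_name="eu-central-1")
	
	firehose.create_delivery_stream(
	    DeliveryStreamName="example-event-stream",
	    DeliveryStreamType="DirectPut",
	    ExtendedS3DestinationConfiguration={
	        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
	        "BucketARN": "arn:aws:s3:::example-landing-zone",
	        # !{timestamp:...} expressions produce the time-based partition folders
	        "Prefix": "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
	        # separate prefix for records Firehose could not deliver
	        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
	    },
	)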

S3 - Storage

As mentioned above, the Amazon Simple Storage Service (S3) is used for storage. S3 is a highly scalable cloud storage service from AWS that stores large amounts of data securely and cost-effectively. S3 provides a reliable infrastructure that allows users to upload and store any number of objects of almost any size. This service is ideal for big data and streaming data as it offers high availability and durability and integrates easily with other AWS services.

For big data applications, S3 offers the ability to efficiently store and access large amounts of data, which is an important prerequisite for analysing and processing data. Seamless integration with AWS Kinesis Services allows streaming data to be stored directly in S3.

There are various architectural principles for storing large volumes of unstructured data, as is the case with big data. One of the most widespread is the concept of data lakes. A data lake is a centralised storage environment that enables companies to store a variety of structured and unstructured data in its native format without having to transform it beforehand.

In addition, it makes sense to divide a data lake into different zones according to the degree of data refinement. The following three zones are usually used:

  • 1. Landing zone (capture zone): In this zone, raw data is collected from various sources and stored in its native format. The data is stored here unchanged, without any structural changes or transformations being made. The landing zone serves as a gateway for new data and enables quick and easy data capture.
  • 2. Staging zone (preparation zone): In this zone, the raw data from the landing zone is temporarily stored and processed in order to prepare it for analyses and other processing. Here, data can be cleansed, structured and enriched with metadata to improve its quality and accessibility.
  • 3. Data consumption zone: This zone is the main area where data analysts, scientists and other users can access the prepared data and use it for analyses, reports, machine learning and other business applications. Here, the data is presented in a formatted and user-friendly state to maximise its value and support informed decision-making.

When storing big data streaming data, the data lake concept and the three-zone architecture described above are particularly useful. The AWS S3 service offers various features for implementing these concepts: for example, a separate bucket can be created for each zone to keep them cleanly separated. In our example, the data from Kinesis Data Firehose is stored in the landing zone, where it can now be processed and enriched by other services.
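As a small illustration of working with this zone separation, the following sketch lists the objects Firehose delivered for one day in a hypothetical landing-zone bucket and loads them for inspection; the bucket name and prefix are assumptions that follow the partition layout shown earlier.

	import json
	import boto3
	
	s3 = boto3.client("s3")
	
	bucket = "example-landing-zone"          # one bucket per zone
	prefix = "year=2023/month=03/day=26/"    # time-based partition written by Firehose
	
	# Iterate over all event files of that day and parse them
	paginator = s3.get_paginator("list_objects_v2")
	for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
	    for obj in page.get("Contents", []):
	        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
	        event = json.loads(body)
	        print(obj["Key"], event)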

Glue - Transformation

In order to make the raw data from the landing zone usable for further analysis steps, it must be transformed. Transformation is the step in which the extracted data is converted into the desired target format or schema in order to prepare it for analyses, reports or other applications. Various types of transformations can be carried out, such as cleansing data, merging data from different sources, aggregating data, converting data formats or enriching data with additional information.

Various services are available in the AWS Cloud for this step, one of the most comprehensive being AWS Glue. With Glue, users can extract data from various sources, transform it and load it into various AWS storage and analysis targets without having to worry about provisioning and managing the infrastructure. The service offers pre-built connectors for common data sources and targets as well as a visual user interface for creating and executing ETL jobs.

Glue offers a variety of functions to simplify data processing. Among other things, Glue crawlers can be used to build data catalogues of the data on S3 in order to obtain an overview of the available data. Python and PySpark scripts can be written for the transformation and executed serverlessly as Glue jobs, which can also be scheduled. Glue Studio provides a graphical interface that simplifies the creation, execution and monitoring of extract, transform and load (ETL) jobs in AWS Glue. Here, data transformation workflows can be created visually and executed seamlessly on AWS Glue's serverless, Apache Spark-based ETL engine.
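The following PySpark script is a minimal sketch of such a Glue job, assuming the hypothetical landing-zone and staging-zone bucket names used above; it reads the raw JSON events, keeps only the fields needed downstream and writes them as Parquet to the staging zone. The field names are illustrative.

	import sys
	
	from awsglue.context import GlueContext
	from awsglue.job import Job
	from awsglue.utils import getResolvedOptions
	from pyspark.context import SparkContext
	
	args = getResolvedOptions(sys.argv, ["JOB_NAME"])
	glue_context = GlueContext(SparkContext.getOrCreate())
	job = Job(glue_context)
	job.init(args["JOB_NAME"], args)
	
	# Read the raw JSON events from the landing zone
	raw = glue_context.create_dynamic_frame.from_options(
	    connection_type="s3",
	    connection_options={"paths": ["s3://example-landing-zone/"], "recurse": True},
	    format="json",
	)
	
	# Example transformation: keep only the fields needed downstream
	cleaned = raw.select_fields(["device_id", "temperature", "ts"])
	
	# Write the result as Parquet into the staging zone
	glue_context.write_dynamic_frame.from_options(
	    frame=cleaned,
	    connection_type="s3",
	    connection_options={"path": "s3://example-staging-zone/events/"},
	    format="parquet",
	)
	job.commit()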


Glue Studio, source: AWS.amazon.com

For the reference architecture, it makes sense to create the transformation jobs with Glue Studio. These jobs are saved in AWS and can be run on a schedule or triggered by events, for example when new data arrives. Monitoring information is stored in AWS Glue and is available for analysis purposes. After the transformation, the data is stored in the second (staging) or third (consumption) zone in S3.
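Both variants can be set up with a few API calls. The sketch below creates a scheduled Glue trigger for an hourly run and shows how the same job could be started on demand from an event-driven component; the trigger name, cron expression and job name are placeholders.

	import boto3
	
	glue = boto3.client("glue", region_name="eu-central-1")
	
	# Time-controlled execution: run the transformation job a few minutes past every hour
	glue.create_trigger(
	    Name="hourly-staging-refresh",
	    Type="SCHEDULED",
	    Schedule="cron(5 * * * ? *)",
	    Actions=[{"JobName": "example-landing-to-staging"}],
	    StartOnCreation=True,
	)
	
	# Event-controlled execution: start the same job on demand,
	# for example from a Lambda function reacting to new objects in the landing zone
	glue.start_job_run(JobName="example-landing-to-staging")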

QuickSight - Evaluation

Once the streaming data has been prepared for analysis by the AWS Glue job and stored in a suitable format in the consumption zone, the QuickSight analysis and dashboard service can be used to gain relevant insights from the data. QuickSight enables users to create interactive dashboards and reports to analyse and present data from different sources without the need for complex coding or queries. The service offers a wide range of visualisation options, including charts, tables, maps and much more, to present data in an appealing and meaningful way.


AWS Quicksight, source: https://www.joulica.io/blog/realtime-analytics-with-amazon-quicksight-an-amazon-connect-use-case

Thanks to the seamless integration with other AWS services such as S3, RDS and Redshift, users can import data directly from these sources and analyse it in QuickSight. The transformed streaming data can be visualised in dashboards using this service. If the underlying data changes in real time, the dashboards are automatically adjusted and users can be informed of changes or KPI breaches via generated alerts.
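If the dataset is imported into SPICE rather than queried directly, a refresh can also be triggered programmatically once a Glue job has written new data to the consumption zone. The following is a minimal sketch; the AWS account ID and dataset ID are placeholders.

	import uuid
	import boto3
	
	quicksight = boto3.client("quicksight", region_name="eu-central-1")
	
	# Start a SPICE ingestion so dashboards built on the dataset pick up the new data
	quicksight.create_ingestion(
	    AwsAccountId="123456789012",
	    DataSetId="example-streaming-dataset",
	    IngestionId=str(uuid.uuid4()),
	)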

Conclusion

The reference architecture for a big data streaming use case presented here illustrates the extensive possibilities of AWS services for efficiently processing large volumes of real-time data and making it available for analysis purposes without companies having to worry about managing and scaling the underlying infrastructure. Reliability, scalability and seamless integration with other AWS services are key benefits of AWS Managed Services. However, building such an architecture based exclusively on AWS services also harbours the risk of vendor lock-in. Companies that opt for these services may find it difficult to migrate their infrastructure to other cloud platforms, which could limit their flexibility and create a long-term dependency on AWS.

Would you like to find out more about exciting topics from the world of adesso? Then take a look at our previous blog posts.



Author Yannik Rust

Yannik Rust is a Data Engineer in the Line of Business Data and Analytics at adesso.
