Replayability
Replayability, in data streaming, refers to the capability to reprocess or re-ingest data that has already been processed. It’s like having a “do-over” for your data, allowing you to handle situations where the initial processing failed, or when you need to adapt to changes and ensure data integrity over time.
Why is Replayability Crucial?
- Error Handling: When unexpected issues occur during data processing (e.g., network glitches, software bugs, downstream service unavailability), replayability allows you to retry the processing of the affected data once the issue is resolved, ensuring no data is lost or incompletely processed.
- Data Consistency: In complex data streaming pipelines, maintaining data consistency across different stages and systems can be challenging. Replayability helps to synchronize and correct data discrepancies, ensuring that the final processed data is accurate and uniform.
- Schema Evolution and Changes: Data schemas can evolve over time. When changes occur, you might need to reprocess historical data with the updated schema or apply new transformations. Replayability makes this adaptation process efficient.
- Testing and Development: Developers can leverage replayability to test new features, bug fixes, or different processing logic on real data without impacting the live production environment. By replaying existing data through the modified pipeline, they can verify its correctness and stability.
Technical Implementation Strategies for Replayability:
- Idempotent Operations: Designing your processing logic to be idempotent is crucial for replayability. An idempotent operation produces the same result whether it’s executed once or multiple times. This prevents unintended side effects (like duplicate entries or incorrect state updates) when replaying data. Examples include setting a value rather than incrementing it, or checking for the existence of a record before creating it.
- Logging and Auditing: Maintaining detailed logs of processed data, timestamps, and actions taken is essential for tracking the flow of data and identifying points of failure. Audit trails help in understanding what happened to your data and when, making it easier to diagnose issues and plan for reprocessing.
- Checkpointing: This technique involves periodically saving the current state or progress of data processing. If a failure occurs, the system can resume processing from the last saved checkpoint rather than starting from the beginning of the data stream. This significantly reduces the amount of data that needs to be reprocessed.
- Backfilling: In scenarios where historical data needs to be updated or processed with new logic, backfilling allows you to go back to the original data source and re-ingest and reprocess the data as needed.
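The idempotency and checkpointing strategies above can be sketched in a few lines of Python. This is only an in-memory illustration: the record layout (`seq`, `key`, `value`) and the `ReplayableProcessor` class are hypothetical and not part of any AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayableProcessor:
    """Toy processor: idempotent writes plus a checkpoint (all in memory)."""
    state: dict = field(default_factory=dict)  # key -> latest value
    checkpoint: int = -1                       # highest sequence number processed

    def process(self, records):
        for rec in records:
            # Skip anything at or before the checkpoint: it was already applied.
            if rec["seq"] <= self.checkpoint:
                continue
            # Idempotent write: SET the value instead of incrementing a counter,
            # so applying the same record twice leaves the same state.
            self.state[rec["key"]] = rec["value"]
            self.checkpoint = rec["seq"]


records = [
    {"seq": 0, "key": "sensor-a", "value": 21.5},
    {"seq": 1, "key": "sensor-b", "value": 19.0},
    {"seq": 2, "key": "sensor-a", "value": 22.1},
]

processor = ReplayableProcessor()
processor.process(records)
first_pass = dict(processor.state)

# Replay with the checkpoint intact: every record is skipped.
processor.process(records)

# Replay after "losing" the checkpoint: records are re-applied,
# but the SET-style writes produce the same final state.
processor.checkpoint = -1
processor.process(records)
```

The checkpoint limits how much work a replay repeats, while the idempotent write is what keeps the replay safe when records are applied more than once.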
AWS data streaming services, particularly Amazon Kinesis, are designed with replayability in mind.
Amazon Kinesis for Streaming Data
Let’s delve into Amazon Kinesis, a powerful suite of services designed for handling various aspects of data streaming within AWS. Think of Kinesis as your comprehensive toolkit for managing data that’s in continuous motion.
Kinesis is not a single service but a collection of services, each addressing a specific need in the data streaming pipeline:
- Amazon Kinesis Data Streams (KDS) - The Ingestion Engine: This service is primarily responsible for capturing and ingesting high-volume, real-time streaming data. It’s designed to be highly scalable and dynamic, capable of handling fluctuating data rates and massive throughput. We’ll explore Kinesis Data Streams in detail in the upcoming lecture.
- Amazon Kinesis Firehose (now Amazon Data Firehose): Kinesis Firehose is a fully managed service focused on reliably delivering streaming data to various destinations. These destinations can include S3 for data lakes, Amazon Redshift for data warehousing, Amazon OpenSearch Service (formerly Elasticsearch Service) for search and analytics, and even other AWS services or third-party HTTP endpoints. Firehose takes the raw streaming data and ensures it lands where you need it for storage, batch processing, or further analysis.
- Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) - The Real-Time Analyst: This service allows you to analyze streaming data in real time using standard SQL queries or Apache Flink for more advanced stream processing. It enables you to perform complex analytics, create real-time dashboards, and trigger alerts based on patterns and insights derived from the live data flowing through Kinesis Data Streams or being delivered by Kinesis Firehose.
Given its focus on streaming data, Amazon Kinesis is ideal for a wide range of real-time data processing scenarios:
- Real-time Analytics: Analyzing data as it arrives, enabling immediate insights and decision-making. This is crucial for applications like monitoring website activity, tracking application performance, or understanding sensor data.
- Internet of Things (IoT) Data Ingestion and Processing: Capturing and processing the continuous streams of data generated by connected devices in real time for analysis, control, and automation.
- Security and Fraud Detection: Analyzing streaming data of transactions or system logs in real time to identify anomalies and potential fraudulent activities, allowing for immediate responses.
- Real-time Customer Behavior Analysis: Tracking user interactions on websites or applications to understand behavior patterns and enable immediate personalization or recommendations.
In all these use cases, the ability to handle and process data in real time is paramount. Amazon Kinesis provides the tools to capture, transport, analyze, and act upon these continuous data streams effectively.
Amazon Kinesis Data Streams
Let’s break down Amazon Kinesis Data Streams (KDS) – the foundational service for capturing and ingesting real-time data streams.
Producers: The Data Originators
Producers are the sources that generate and write data to your Kinesis Data Stream. These can be various devices, applications, or services producing a continuous flow of data. AWS offers several ways for producers to interact with KDS:
- AWS SDKs: Provide the most flexibility, allowing you to build highly customized producers. You have fine-grained control over aspects like batching, error handling, and retry mechanisms.
- Kinesis Producer Library (KPL): Designed for high-throughput applications, the KPL optimizes data writing to Kinesis. It handles complex tasks like batching, compression, and retries efficiently, often leading to better performance than using the raw SDK.
- Kinesis Agent: A pre-built, easy-to-configure application ideal for collecting log data from servers and streaming it to Kinesis without requiring any custom coding.
Writing to the Data Stream
When producers send data to a Kinesis Data Stream, the following happens:
- Data Records: Producers format the data into data records. Each record is a unit of data (up to 1 MB in size) and can be structured (e.g., JSON) or unstructured.
- Partition Key: Each data record includes a partition key, which is a string used by Kinesis to determine which shard the record will be written to. The partition key ensures that records with the same key are consistently routed to the same shard, maintaining order within that key’s sequence.
- Shards - The Processing Units: The Kinesis Data Stream is composed of one or more shards. Shards are the fundamental units of throughput and parallelism in KDS.
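To make the routing rule concrete, here is a minimal sketch of how a partition key can be mapped to a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and writes the record to the shard whose hash key range contains that value. The sketch assumes the hash space is split evenly across shards, which is an assumption for illustration (a resharded stream can have uneven ranges); the function names are hypothetical.

```python
import hashlib

MAX_HASH = 2**128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """128-bit integer derived from the MD5 digest of the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_index(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose (assumed evenly split) hash range holds the key."""
    range_size = MAX_HASH // shard_count
    # The min() guards the last range when 2**128 is not divisible by shard_count.
    return min(hash_key(partition_key) // range_size, shard_count - 1)


for key in ["PartitionKey", "PartitionKey", "PartitionKey2", "anotherkey"]:
    print(key, "->", shard_index(key, shard_count=2))
```

Because the mapping depends only on the key and the shard layout, the same key always prints the same index, which is why ordering is preserved per key.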
Shards: The Heart of Kinesis Data Streams
- Processing Capacity: Each shard provides a fixed capacity:
  - In-throughput: 1 MB per second or 1,000 records per second (whichever limit is reached first).
  - Out-throughput: 2 MB per second.
- Scalability: The total capacity of your Kinesis Data Stream is determined by the number of shards it contains. More shards mean higher throughput for both ingestion and processing.
- Parallel Processing: Shards enable parallel processing of data. Records with different partition keys can be processed by different shards concurrently.
- Data Retention: You can configure the retention period for data in your Kinesis Data Stream, ranging from a minimum of 24 hours up to a maximum of 365 days. The default is 24 hours.
- Durability: Data in Kinesis Data Streams is replicated across multiple Availability Zones (AZs) to ensure high availability and fault tolerance.
- Immutability: Once data is written to a Kinesis Data Stream, it cannot be deleted or altered within the retention period.
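The per-shard limits above translate directly into a sizing calculation. The following sketch estimates how many shards a provisioned stream needs; the function name and the sample workload numbers are made up for illustration.

```python
import math


def estimate_shards(write_mb_per_s: float, records_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies every per-shard limit."""
    by_write_bytes = math.ceil(write_mb_per_s / 1.0)   # 1 MB/s in per shard
    by_write_records = math.ceil(records_per_s / 1000)  # 1,000 records/s in per shard
    by_read_bytes = math.ceil(read_mb_per_s / 2.0)      # 2 MB/s out per shard
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 3.5 MB/s in, 2,500 records/s, 7 MB/s out.
print(estimate_shards(write_mb_per_s=3.5, records_per_s=2500, read_mb_per_s=7.0))  # prints 4
```

In this example the write bytes and the read bytes each require 4 shards, while the record rate alone would need only 3, so the stream needs 4.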
Scaling Kinesis Data Streams
Kinesis Data Streams offers two primary ways to manage capacity:
- Manual Shard Management (Provisioned Mode): You manually specify the number of shards for your stream. You can dynamically add or remove shards as your throughput requirements change. You are billed hourly based on the number of shards provisioned.
- Auto Scaling (On-Demand Mode): Kinesis Data Streams automatically scales capacity up or down based on your stream's traffic, accommodating up to double the peak write throughput observed over the last 30 days. You don’t need to provision shards manually. You are billed based on the actual data ingested and retrieved. A new on-demand stream starts with a write capacity of 4 MB/s and 4,000 records per second.
Consumers: Processing the Stream Data
Consumers are applications or services that read and process data from a Kinesis Data Stream. AWS provides several options for building consumers:
- AWS SDKs: Allow you to build custom consumer applications with fine-grained control over data retrieval and processing.
- Kinesis Client Library (KCL): Simplifies the development of robust and scalable consumer applications. The KCL handles complex tasks like shard discovery, load balancing across multiple consumer instances, and checkpointing.
- Amazon Data Firehose: A fully managed service that can act as a consumer, taking data from a Kinesis Data Stream and delivering it to various destinations (S3, Redshift, etc.) efficiently and reliably.
- Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): Enables real-time analytics on Kinesis Data Streams using SQL or Apache Flink applications.
- AWS Lambda: You can configure a Lambda function to be triggered by new records arriving in a Kinesis Data Stream. This provides a serverless and automatically scaling way to perform lightweight, event-driven processing of streaming data. We will demonstrate this in an upcoming lecture.
Throughput and Latency
Let’s break down the crucial concepts of throughput and latency in the context of data streaming, particularly with Amazon Kinesis.
Throughput: The Data Volume
Throughput refers to the amount of data that is successfully ingested into (write throughput) or retrieved from (read throughput) your data stream over a specific period, typically measured in megabytes per second (MB/s) or records per second.
- Real-World Measurement: Throughput is an actual, observed rate of data flow within your stream. It reflects the real-world capacity being utilized.
- Proportional to Shards (in Kinesis Data Streams): In Kinesis Data Streams, the total throughput of your stream is directly proportional to the number of shards you have provisioned. Each shard provides a specific ingest and egress capacity. Increasing the number of shards increases the overall throughput capacity of the stream, allowing it to handle higher data volumes.
- Scaling for High Volume: When your data volume increases, you’ll likely need to increase the number of shards in your Kinesis Data Stream (either manually in provisioned mode or automatically in on-demand mode) to maintain sufficient throughput and avoid exceeding the capacity of your stream.
- Contrast with Bandwidth: It’s important to distinguish throughput from bandwidth. Bandwidth often refers to the theoretical maximum data transfer rate of a network connection or a component. Throughput, on the other hand, is the actual rate achieved, which can be limited by various factors beyond just bandwidth, such as processing capacity, application logic, and network conditions.
Latency: The Data Delay
Latency, in general terms, is the delay between the initiation of an action and the availability of its result. In the context of Amazon Kinesis, the relevant metric is propagation delay or end-to-end latency.
- Kinesis Propagation Delay: This is the time it takes for a data record to be written to the Kinesis Data Stream by a producer and subsequently become available for processing by a consumer application.
- Influenced by Consumer Polling Interval: The primary factor influencing latency in Kinesis Data Streams is how frequently the consumer application checks (polls) each shard for new data.
- AWS Recommendation: AWS recommends that consumers poll each shard in the stream at least once per second.
- Kinesis Client Library (KCL) Default: The Kinesis Client Library (KCL), a common way to build Kinesis consumers, defaults to polling each shard every second. This default setting typically ensures an average propagation delay of under one second, aligning with AWS best practices for low latency.
- Reducing Latency by Increasing Polling Frequency: For applications with very strict low-latency requirements, you can potentially reduce the propagation delay by increasing the polling frequency of the consumer (i.e., polling more often than once per second). This allows the consumer to discover and process new records more quickly.
- Considerations for Increased Polling: While increasing polling frequency can lower latency, it’s crucial to manage this adjustment carefully. GetRecords is limited to 5 calls per second per shard, and that limit is shared by all standard consumers reading the shard, so polling too aggressively leads to throttled requests. You need to ensure your consumer application and the Kinesis stream can handle the increased polling frequency without performance degradation or exceeding service limits.
In summary, throughput measures the volume of data flow, and scaling Kinesis Data Streams (by adjusting the number of shards) is the primary way to manage it. Latency measures the delay in data availability for consumers, and it’s primarily influenced by the consumer’s polling frequency, with the default and recommended setting aiming for sub-second latency. Understanding and optimizing these two factors are critical for designing and operating efficient and responsive data streaming applications using Amazon Kinesis.
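As a rough illustration of the polling trade-off, the sketch below computes the shortest polling interval each standard consumer can use before the shard's GetRecords limit of 5 calls per second is exceeded. It assumes every consumer polls at the same rate; the function name is hypothetical.

```python
GET_RECORDS_CALLS_PER_SECOND_PER_SHARD = 5  # limit shared by all standard consumers


def min_poll_interval_seconds(consumer_count: int) -> float:
    """Shortest per-consumer polling interval that stays within the shard limit."""
    if consumer_count < 1:
        raise ValueError("consumer_count must be at least 1")
    return consumer_count / GET_RECORDS_CALLS_PER_SECOND_PER_SHARD


for consumers in (1, 2, 5):
    print(consumers, "consumer(s):", min_poll_interval_seconds(consumers), "s")
```

With a single consumer there is room to poll up to five times per second, but with five standard consumers on the same shard the once-per-second default already uses the whole budget.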
Creating a Data Stream (Hands-on)
Let’s walk through the steps of creating a Kinesis Data Stream in the AWS console. We want to:
- set up a data stream
- put some records into it
- see how those records can be consumed.
Later, we will also process the records directly with a Lambda function.
Here are the steps:
- Access Amazon Kinesis: Search for “Kinesis” in the AWS Management Console and select Amazon Kinesis.
- Navigate to Data Streams: In the Kinesis dashboard, on the left-hand menu, click on Data Streams.
- Create Data Stream: Click the Create data stream button.
- Configure Stream Details:
  - Data stream name: Enter a descriptive name for your stream (e.g., `my-first-data-stream`). Remember the naming conventions: uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).
  - Capacity mode: You have two options:
    - On-demand: This is suitable for unpredictable or variable data streaming rates. Kinesis automatically scales capacity based on demand.
    - Provisioned: You manually specify the number of shards your stream will have. This is better when you have a more predictable data flow and want to control costs and capacity. For this demonstration, let’s choose Provisioned.
  - Number of shards: Enter the initial number of shards you want for your stream. For demonstration purposes, let’s use 2 shards. You can use the Shard estimator to get a recommendation based on your expected throughput (number of records and average record size).
- Review and Create: Review your settings. You’ll see information about pricing (per shard hour and per PUT payload unit). Click Create data stream.
- Monitor Stream Creation: You’ll be redirected to the list of your data streams. The status of your newly created stream will initially be `Creating`. Once it becomes `Active`, your stream is ready for use. This usually takes a few minutes.
- Explore Stream Details: Once the stream is active, click on its name (`my-first-data-stream`) to view its details. You’ll see:
  - Applications tab:
    - Overview: General information about the stream, its ARN, and status.
    - Producers: Information and tools for sending data to the stream (we’ll explore this later).
    - Consumers: Information and options for reading and processing data from the stream (we’ll also explore this).
  - Monitoring tab: Metrics and graphs showing the activity of your stream (ingestion, consumption, errors, etc.). You won’t see any data here yet as we haven’t sent any.
  - Configuration tab: Settings like capacity mode, number of shards, and data retention period. You can edit these settings even after the stream is created.
  - Enhanced fan-out tab: A feature for optimizing data consumption by multiple consumers (we’ll discuss this later).
You have now successfully created a Kinesis Data Stream. The next steps will involve putting data into this stream and then consuming and processing that data.
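The same stream can also be created programmatically. The following is a minimal sketch, assuming a boto3 Kinesis client is passed in as `kinesis`; the helper name `create_provisioned_stream` is hypothetical, while `create_stream`, the `stream_exists` waiter, and `describe_stream_summary` are boto3 Kinesis client operations.

```python
def create_provisioned_stream(kinesis, name: str, shard_count: int) -> str:
    """Create a provisioned stream, wait until it is active, and return its status."""
    kinesis.create_stream(
        StreamName=name,
        ShardCount=shard_count,
        StreamModeDetails={"StreamMode": "PROVISIONED"},
    )
    # The "stream_exists" waiter polls DescribeStream until the stream is ACTIVE.
    kinesis.get_waiter("stream_exists").wait(StreamName=name)
    summary = kinesis.describe_stream_summary(StreamName=name)
    return summary["StreamDescriptionSummary"]["StreamStatus"]
```

With boto3 installed and credentials configured, it could be called as `create_provisioned_stream(boto3.client("kinesis"), "my-first-data-stream", 2)`, mirroring the console choices above.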
Enhanced Fan-Out for Kinesis Consumers
Let’s break down the concept of enhanced fan-out for Amazon Kinesis Data Streams consumers, starting with the terminology:
- Fan-Out: Refers to the scenario where a single data source (in our case, a Kinesis Data Stream) distributes data to multiple independent consumers or destinations. The goal is to enable multiple applications to process the same stream of data for different purposes.
- Fan-In: Involves aggregating or combining data from multiple sources (e.g., multiple Kinesis streams, IoT devices) into a single destination for consolidated processing or analysis.
Our focus here is on fan-out and how to optimize it for Kinesis Data Streams.
Traditionally, when multiple consumers read from a Kinesis Data Stream, they share the available read throughput of 2 MB per second per shard. This can become a bottleneck as the number of consumers increases. Each consumer competes for the same limited read capacity, potentially leading to:
- Reduced Throughput per Consumer: As more consumers are added, each gets a smaller slice of the shared throughput, slowing down their individual processing.
- Scalability Issues: Adding more consumers can degrade the performance of all existing consumers.
- Increased Latency: Competition for read throughput can increase the time it takes for consumers to receive and process new data.
Enhanced fan-out is a feature in Amazon Kinesis Data Streams designed to address these limitations. It works by establishing a dedicated, real-time HTTP/2 connection between the data stream and each registered consumer. Key benefits of this feature:
- Dedicated Read Throughput: Each consumer registered with enhanced fan-out receives its own dedicated read throughput of 2 MB per second per shard. This throughput is independent of the number of other consumers reading from the same stream. You can have up to 20 registered consumers, and each will get its full dedicated throughput.
- Improved Scalability: Adding new consumers no longer impacts the read performance of existing consumers. Each new consumer gets its own dedicated capacity.
- Reduced Latency: Enhanced fan-out pushes data to consumers over persistent HTTP/2 connections, resulting in significantly lower latency (around 70 milliseconds) compared to the standard polling mechanism (around 200 milliseconds). Data is delivered to consumers as soon as it’s available in the shard.
- Simplified Application Development: Developers don’t need to implement complex logic for sharing and managing read throughput among multiple consumers. The dedicated throughput per consumer simplifies application design.
Enhanced fan-out is particularly beneficial in scenarios with:
- A High Number of Concurrent Consumers: When you have multiple independent applications or services that need to process the same Kinesis data stream concurrently.
- High Throughput Requirements per Consumer: If individual consumers need to process data at a high rate without being limited by shared throughput.
- Low Latency Requirements: Applications that require near real-time processing of streaming data will benefit from the reduced latency offered by enhanced fan-out.
While enhanced fan-out offers significant performance and scalability benefits, it also comes with additional costs. You are charged per consumer registered for enhanced fan-out and for the data retrieved by these consumers. Therefore, it’s important to weigh the benefits against the cost, especially if you have a large number of consumers or high data retrieval volumes.
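The difference between shared and dedicated read throughput can be shown with simple arithmetic. This sketch assumes standard consumers split a shard's 2 MB/s evenly, which is a simplification (real consumers compete rather than share fairly); the function name is hypothetical.

```python
SHARD_READ_MB_PER_S = 2.0  # read capacity of one shard


def per_consumer_read_mb_per_s(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput available to each consumer."""
    total = shards * SHARD_READ_MB_PER_S
    if enhanced_fan_out:
        return total          # each registered consumer gets the full rate
    return total / consumers  # standard consumers share the same capacity


for consumers in (1, 2, 4):
    shared = per_consumer_read_mb_per_s(2, consumers, enhanced_fan_out=False)
    dedicated = per_consumer_read_mb_per_s(2, consumers, enhanced_fan_out=True)
    print(f"{consumers} consumer(s): shared={shared} MB/s, dedicated={dedicated} MB/s")
```

On a 2-shard stream, four standard consumers get roughly 1 MB/s each, while four enhanced fan-out consumers each keep the full 4 MB/s.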
Pull and Consume Data from Stream (Hands-on) (TODO: redo this lesson)
Let’s walk through the steps of manually putting records into your Kinesis Data Stream using the AWS Command Line Interface (CLI) and then pulling and consuming those records.
Prerequisites:
- AWS CLI Installed and Configured: Ensure you have the AWS CLI installed and configured with your AWS credentials and the correct region where you created your Kinesis Data Stream.
- Stream Name: You’ll need the exact name of your Kinesis Data Stream (e.g., `my-first-data-stream`).
Step 1: Put Records into the Kinesis Data Stream (Manual)
We’ll use the aws kinesis put-record command to manually add records to your stream. Remember that each record requires a StreamName, a PartitionKey, and Data (which must be base64 encoded).
- Put Record 1 (Partition Key: `PartitionKey`, Data: `Data Entry 1`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key "PartitionKey" --data $(echo "Data Entry 1" | base64)

  You should see output that includes the `ShardId` and `SequenceNumber`.

- Put Record 2 (Partition Key: `PartitionKey`, Data: `Data Entry 1`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key "PartitionKey" --data $(echo "Data Entry 1" | base64)

  This goes to the same `ShardId` as record 1 because it uses the same partition key.

- Put Record 3 (Partition Key: `PartitionKey2`, Data: `Data Entry 2`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key "PartitionKey2" --data $(echo "Data Entry 2" | base64)

  This might go to a different `ShardId` (e.g., `shardId-000000000000`), depending on how the partition key hashes.

- Put Record 4 (Partition Key: `PartitionKey3`, Data: `Data Entry 4`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key "PartitionKey3" --data $(echo "Data Entry 4" | base64)

  `PartitionKey3` is a new key, so it goes to whichever shard its hash maps to, which is not necessarily the shard used by record 3.

- Put Record 5 (Partition Key: `anotherkey`, Data: `data entry five`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key anotherkey --data $(echo "data entry five" | base64)

  This key is hashed independently and can land on either shard.

- Put Record 6 (Partition Key: `key1`, Data: `data entry six`)

      aws kinesis put-record --stream-name my-first-data-stream --partition-key key1 --data $(echo "data entry six" | base64)

  `key1` has not been used before, so its shard again depends on its hash. To see same-key routing once more, reuse `PartitionKey`, which always maps to the shard used by records 1 and 2.
Pull and Consume Data from the Stream
To read data from the stream, we first need to get a shard iterator. A shard iterator specifies the position in a shard from which to start reading data.
- Get a shard iterator for shard `shardId-000000000000` (using `TRIM_HORIZON` to start from the oldest available data):

      aws kinesis get-shard-iterator --stream-name my-first-data-stream --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON

  The output will contain a `ShardIterator` string. Copy this value.

- Get records from shard `shardId-000000000000` using the shard iterator:

      aws kinesis get-records --shard-iterator "YOUR_SHARD_ITERATOR_VALUE"

  Replace `"YOUR_SHARD_ITERATOR_VALUE"` with the actual iterator you copied. The output will contain a `Records` array. Each record has a `Data` field, which is base64 encoded. You’ll also see `NextShardIterator`.

- Decode the Base64 Data: You’ll need to decode the `Data` field from base64 to see the original data. You can use a tool like `base64 -d` on Linux/macOS or an online base64 decoder. For example:

      echo "BASE64_ENCODED_DATA" | base64 -d

- Get a shard iterator for shard `shardId-000000000001` (using `LATEST` to get only new data):

      aws kinesis get-shard-iterator --stream-name my-first-data-stream --shard-id shardId-000000000001 --shard-iterator-type LATEST

  Copy the `ShardIterator`. Any new data put into the stream after this iterator was created will be retrieved.

- Get records from shard `shardId-000000000001` using the shard iterator:

      aws kinesis get-records --shard-iterator "YOUR_NEW_SHARD_ITERATOR_VALUE"

  If you put more records after getting this `LATEST` iterator, they will appear in the output.
Key Observations:
- Records with the same `--partition-key` go to the same shard (as long as the stream is not resharded), which preserves order within that key’s sequence.
- Different `PartitionKey` values can result in records being written to different shards, enabling parallelism.
- You need a `ShardIterator` to start reading data from a specific point in a shard.
- The `Data` in the records is base64 encoded.
- The `get-records` response provides a `NextShardIterator`, which you use in subsequent calls to continue reading data from the shard. Shard iterators have a limited lifespan (around 5 minutes).
This manual interaction demonstrates the fundamental mechanics of putting data into and pulling data from a Kinesis Data Stream. In a real-world scenario, you would typically have automated producers and consumers interacting with the stream programmatically. In the next step, we’ll explore how to create a Lambda function to consume data from this stream in real time.
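As a bridge to programmatic consumers, here is a minimal sketch of the same read loop in Python. It assumes a boto3 Kinesis client is passed in as `kinesis`; the helper name `read_shard` and its `max_polls` and `poll_interval` parameters are hypothetical. Note that boto3 returns `Data` as raw bytes, so no manual base64 decoding is needed, unlike with the CLI output.

```python
import time


def read_shard(kinesis, stream_name, shard_id, max_polls=5, poll_interval=1.0):
    """Read a shard from its oldest record, following NextShardIterator."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    payloads = []
    for _ in range(max_polls):
        if iterator is None:  # the shard was closed and has been fully read
            break
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            # boto3 hands back the payload as bytes (already base64-decoded).
            payloads.append(record["Data"].decode("utf-8"))
        # An empty Records list is normal; keep following the next iterator.
        iterator = response.get("NextShardIterator")
        time.sleep(poll_interval)  # stay near one poll per second per shard
    return payloads
```

With boto3 installed and credentials configured, `read_shard(boto3.client("kinesis"), "my-first-data-stream", "shardId-000000000000")` would return the text of the records written to that shard earlier.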
Calling a Lambda function from Amazon Kinesis (Hands-on)
Let’s walk through the steps to connect a Lambda function to your Kinesis Data Stream, enabling real-time processing of incoming data:
- Create an IAM Role for the Lambda Function:
  - Navigate to the IAM service in the AWS console.
  - Click on Roles and then Create role.
  - Select AWS service as the trusted entity and choose Lambda as the service that will use this role.
  - Attach the following policies:
    - AmazonKinesisReadOnlyAccess: Allows the Lambda function to read data from your Kinesis Data Stream.
    - AmazonS3FullAccess: Grants the Lambda function permission to write data to S3 buckets. In a production environment, it’s recommended to restrict this to only the necessary permissions on the specific bucket.
  - Give the role a descriptive name (e.g., `LambdaKinesisRole`) and create it.
- Create the Lambda Function:
  - Navigate to the Lambda service in the AWS console.
  - Click on Create function and choose Author from scratch.
  - Give your function a name (e.g., `myKinesisFunction`).
  - Select a runtime (e.g., Python 3.12).
  - Under Permissions, choose Use an existing role and select the IAM role you created in the previous step (`LambdaKinesisRole`).
  - Create the function.
- Add a Kinesis Trigger to the Lambda Function:
  - In the Lambda function editor, in the Function overview section, click Add trigger.
  - Select Kinesis as the trigger source.
  - Choose your Kinesis Data Stream from the Kinesis stream dropdown.
  - Leave the Consumer field blank for now.
  - You can adjust the Batch size if needed. The default is usually fine.
  - For Starting position, choose Latest to process only new records that arrive in the stream after the trigger is created.
  - Add the trigger.
- Add the Lambda Function Code:
  - In the Lambda function editor, in the Code source section, replace the default code with your Python code (set `BUCKET_NAME` to a bucket that exists in your account):

        import base64
        from datetime import datetime, timezone

        import boto3

        s3 = boto3.client('s3')
        BUCKET_NAME = 'out-first-bucket-66543name'


        def lambda_handler(event, context):
            for record in event['Records']:
                # Kinesis data is base64 encoded, so decode it here
                payload = base64.b64decode(record['kinesis']['data'])
                print(f"Decoded payload: {payload}")

                # Build a filename from the partition key, the processing time (UTC),
                # and the sequence number, so that records handled within the same
                # second do not overwrite each other
                partition_key = record['kinesis']['partitionKey']
                sequence_number = record['kinesis']['sequenceNumber']
                timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
                filename = f'{partition_key}-{timestamp}-{sequence_number}.txt'

                # Put the data into the S3 bucket
                s3.put_object(Bucket=BUCKET_NAME, Key=filename, Body=payload)

- Deploy the Lambda Function:
  - Click the Deploy button in the Lambda function editor to save your code and configurations.
- Test the Integration:
  - Use the AWS CLI (for example from CloudShell, via the CloudShell button in the top bar) to put some records into your Kinesis Data Stream, as we did in the previous lesson. For example, run the following command 3 times to put 3 records:

        aws kinesis put-record --stream-name my-first-data-stream --partition-key "PartitionKey" --data $(echo "Data Entry 6" | base64)

  - Check your S3 bucket. You should see files appearing in the bucket, each containing the data from one Kinesis record. Each filename is built from the partition key, the time at which the Lambda function processed the record, and the record’s sequence number.
This setup demonstrates how to create a Lambda function that is triggered by new data arriving in your Kinesis Data Stream. The Lambda function processes the data (in this case, decoding it and writing it to S3), enabling real-time stream processing.
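Before deploying, the decoding logic can be tried locally against a hand-built event. The sketch below mimics the shape of the event that Lambda passes to the handler for a Kinesis trigger (`Records` → `kinesis` → `partitionKey`, `sequenceNumber`, `data`); the helper name `decode_kinesis_event` and the sequence number value are placeholders for illustration.

```python
import base64


def decode_kinesis_event(event):
    """Return (partition_key, sequence_number, text) for each record in the event."""
    decoded = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        text = base64.b64decode(kinesis["data"]).decode("utf-8")
        decoded.append((kinesis["partitionKey"], kinesis["sequenceNumber"], text))
    return decoded


# Hand-built sample event; the sequence number is a made-up placeholder.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:kinesis",
            "kinesis": {
                "partitionKey": "PartitionKey",
                "sequenceNumber": "0001",
                # echo adds a trailing newline, so the payload includes it
                "data": base64.b64encode(b"Data Entry 6\n").decode("ascii"),
            },
        }
    ]
}

print(decode_kinesis_event(sample_event))
```

Testing with a synthetic event like this keeps S3 and Kinesis out of the loop, so parsing mistakes surface before anything is deployed.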
Common Issues & Troubleshooting
Let’s go through some of the common issues you might encounter when working with Amazon Kinesis Data Streams, along with their causes and troubleshooting steps. These points are important for understanding the operational aspects and for potential exam scenarios.
1. Slow Write Rates / Throughput Exceptions (Producer Side)
- Issue: Producers are unable to write data to the stream at the desired rate, leading to throttled requests and slow ingestion.
- Cause:
  - Exceeding the stream’s provisioned write capacity (MB/s or records/s per shard).
  - Hitting API rate limits for stream-level operations (e.g., `CreateStream` and `DeleteStream` have lower limits than `PutRecord`).
- Troubleshooting:
  - Monitor Throughput Exceptions: Check the `WriteProvisionedThroughputExceeded` metric in CloudWatch, or compare `PutRecord.Success` with `PutRecords.ThrottledRecords`, to confirm throttling.
  - Increase Shard Count: If you’re using Provisioned capacity mode, increase the number of shards in your stream to increase its overall write capacity.
  - Optimize Producer Batching: For high-throughput scenarios, batch records together (using the `PutRecords` API or the KPL) to reduce the number of API calls, which can help stay within API rate limits and improve efficiency.
  - Consider On-Demand Mode: If your workload is unpredictable, switch to On-Demand capacity mode to automatically scale capacity.
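The batching advice can be sketched as follows. This assumes a boto3 Kinesis client passed in as `kinesis` and relies on the `PutRecords` limit of 500 records per call; the helper name `put_in_batches` and its parameters are hypothetical, and a real producer would also add backoff between attempts.

```python
def put_in_batches(kinesis, stream_name, records, batch_size=500, max_attempts=3):
    """Send (partition_key, data_bytes) pairs with PutRecords.

    Returns the entries that still failed after max_attempts.
    """
    unsent = []
    for start in range(0, len(records), batch_size):
        entries = [
            {"PartitionKey": key, "Data": data}
            for key, data in records[start:start + batch_size]
        ]
        for _ in range(max_attempts):
            response = kinesis.put_records(StreamName=stream_name, Records=entries)
            if response["FailedRecordCount"] == 0:
                entries = []
                break
            # Results come back in request order; failed ones carry an ErrorCode.
            entries = [
                entry
                for entry, result in zip(entries, response["Records"])
                if "ErrorCode" in result
            ]
        unsent.extend(entries)
    return unsent
```

`PutRecords` can succeed partially, so checking `FailedRecordCount` and resending only the failed entries matters: a throttled record is otherwise silently dropped.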
2. Uneven Data Distribution / Hot Shards (Producer Side)
- Issue: Data is not evenly distributed across shards, leading to one or more “hot shards” that receive a disproportionately high volume of data, causing them to hit their capacity limits while other shards are underutilized.
- Cause: Poor choice of partition keys, leading to many records having the same (or very few unique) partition keys.
- Troubleshooting:
- Effective Partition Key Strategy: Choose a partition key that ensures an even distribution of data across all shards. This often means using a high-cardinality attribute (e.g., user ID, device ID, or a composite key).
- Monitor Shard-Level Metrics: Regularly check shard-level metrics in CloudWatch to identify hot shards (e.g., `IncomingBytes`, `IncomingRecords` per shard).
- Resharding: If hot shards are identified, you might need to perform resharding (splitting a hot shard or merging underutilized shards) to redistribute the data.
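To see why partition-key cardinality matters, the sketch below mimics how Kinesis places records: the partition key is hashed with MD5 into a 128-bit integer, and the record goes to the shard whose hash-key range contains that value. The helper names are hypothetical, and the code assumes the shards split the hash space evenly, which is not guaranteed after resharding.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 output interpreted as a 128-bit integer


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard that would receive this key,
    assuming `shard_count` shards that divide the hash space evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def shard_distribution(partition_keys, shard_count):
    """Count how many records would land on each shard index."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# One repeated key sends every record to the same shard (a hot shard).
hot = shard_distribution(["region-a"] * 1000, shard_count=4)
# A high-cardinality key such as a device ID spreads records across shards.
spread = shard_distribution([f"device-{i}" for i in range(1000)], shard_count=4)
```

Comparing `hot` and `spread` shows the first concentrated on a single shard index and the second distributed over all four.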
3. Slow Read Rates (Consumer Side)
- Issue: Consumers are unable to read data from the stream at the desired rate, causing them to fall behind.
- Cause:
- Exceeding the shard’s read capacity (2 MB/s per shard) when using standard consumers.
- Too low a `Limit` value in `GetRecords` API calls (maximum number of records to return).
- Troubleshooting:
- Increase Shard Count: Similar to write issues, if the aggregate read demand exceeds the stream’s total read capacity, increasing shards will help.
- Increase `GetRecords` Limit: Ensure your consumer applications are requesting a sufficient number of records per `GetRecords` call (up to 10,000 records or 10 MB, whichever limit is reached first). Revert to system defaults if you’ve set it too low.
- Optimize Processing Logic: Ensure the consumer’s processing logic is efficient and not causing bottlenecks.
- Use Enhanced Fan-Out: For multiple concurrent consumers needing high throughput, enable Enhanced Fan-Out. This provides each registered consumer with its own dedicated 2 MB/s per shard read throughput, eliminating competition.
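The difference between shared and dedicated read throughput is simple arithmetic. The sketch below uses the 2 MB/s per-shard figure from the text and assumes, for illustration only, that every consumer reads evenly from all shards; the function name is hypothetical.

```python
SHARD_READ_MBPS = 2.0  # per-shard read throughput quoted above


def read_mbps_per_consumer(shard_count, consumer_count, enhanced_fan_out=False):
    """Estimate the read throughput available to each consumer of a stream."""
    stream_total = shard_count * SHARD_READ_MBPS
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/s per shard.
        return stream_total
    # Standard consumers share each shard's 2 MB/s between them.
    return stream_total / consumer_count


shared = read_mbps_per_consumer(shard_count=4, consumer_count=4)
dedicated = read_mbps_per_consumer(shard_count=4, consumer_count=4, enhanced_fan_out=True)
```

With four shards and four standard consumers, each consumer is limited to about 2 MB/s in total, while Enhanced Fan-Out gives each of them the full 8 MB/s.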
4. GetRecords Returns Empty Array
- Issue: A `GetRecords` API call returns an empty `Records` array, even when you expect data.
- Cause:
- No Data in Shard: There might genuinely be no new data written to that shard since the last `GetRecords` call.
- Retention Period: Data might have passed its retention period and is no longer available in the stream.
- Shard Iterator Position: The `ShardIterator` might be pointing to a position where there’s no data immediately available. Subsequent calls with the `NextShardIterator` will eventually pick up data if it arrives.
- Troubleshooting:
- This is generally not an error. Consumers should continuously poll for new records using the `NextShardIterator` returned in each `GetRecords` response.
- Kinesis Client Library (KCL): The KCL automatically handles these empty responses and continues polling, simplifying consumer development. If you’re using the AWS SDK directly, you’ll need to implement this polling logic yourself.
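If you use the SDK directly, the polling logic described above might look like the sketch below. It assumes `client` behaves like a boto3 Kinesis client; `poll_shard`, `handle_records`, and the sleep and poll-count defaults are hypothetical names and values chosen for the example.

```python
import time


def poll_shard(client, shard_iterator, handle_records, max_polls=10, idle_sleep=1.0):
    """Poll one shard, following NextShardIterator even when a call returns no records.

    Returns the iterator to use for the next call (None once the shard is closed).
    """
    for _ in range(max_polls):
        if shard_iterator is None:  # a closed shard has no further iterator
            break
        response = client.get_records(ShardIterator=shard_iterator, Limit=1000)
        records = response["Records"]
        if records:
            handle_records(records)
        else:
            time.sleep(idle_sleep)  # an empty result is normal: wait, then keep polling
        shard_iterator = response.get("NextShardIterator")
    return shard_iterator
```

The important detail is that the iterator is advanced on every response, including empty ones, so the consumer never reuses a stale iterator.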
5. Skipped Records
- Issue: Some data records are missed by the consumer application.
- Cause: Typically, unhandled exceptions within the consumer’s `processRecords` logic (especially when using KCL or custom consumers). If an exception occurs, the consumer might not checkpoint its progress correctly, leading to it re-processing from an earlier point or missing records if the iterator advances too far.
- Troubleshooting:
- Robust Exception Handling: Implement comprehensive `try-except` blocks around your processing logic to gracefully handle errors.
- Correct Checkpointing: Ensure that consumers properly checkpoint their progress (sequence numbers) after successfully processing a batch of records. This allows them to resume from the last successfully processed point.
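The two recommendations above combine naturally: handle each record inside a `try`/`except`, and checkpoint only after the batch has been dealt with. The sketch below is framework-agnostic. `process_record`, `checkpoint`, and `on_failure` are hypothetical callables you would supply (for instance a KCL checkpointer wrapper and a dead-letter writer); none of them is a KCL API.

```python
def process_batch(records, process_record, checkpoint, on_failure):
    """Process records in order and checkpoint only what has been handled.

    A record whose processing raises is handed to `on_failure` (for example,
    to store it in a dead-letter location) instead of being silently skipped.
    Returns the sequence number that was checkpointed, or None for an empty batch.
    """
    last_sequence = None
    for record in records:
        try:
            process_record(record)
        except Exception as error:  # broad on purpose: one bad record must not abort the batch
            on_failure(record, error)
        last_sequence = record["SequenceNumber"]
    if last_sequence is not None:
        checkpoint(last_sequence)
    return last_sequence
```

Routing failures to `on_failure` keeps a poison record from blocking the shard while still leaving a trace of what was not processed.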
6. Shard Iterator Expired Exception
- Issue: The `ShardIterator` used in a `GetRecords` call is no longer valid.
- Cause:
- Inactivity: The `ShardIterator` has not been used for more than 5 minutes (its lifespan).
- Incorrect `NextShardIterator` Usage: Not using the `NextShardIterator` returned by the previous `GetRecords` call.
- DynamoDB Write Capacity (for KCL): If using KCL, the underlying DynamoDB table (used by KCL for checkpointing) might not have enough write capacity, preventing KCL from updating checkpoints, which can then lead to iterators expiring.
- Troubleshooting:
- Continuously Use `NextShardIterator`: Always use the `NextShardIterator` returned by the previous `GetRecords` call for subsequent calls.
- Increase DynamoDB Write Capacity: If using KCL and suspecting DynamoDB throttling, increase the write capacity units (WCUs) of the KCL’s checkpointing table.
- Restart from `TRIM_HORIZON` or `LATEST` (with caution): In recovery scenarios, if a consumer has completely lost its state or an iterator expires, you might need to re-initialize an iterator from `TRIM_HORIZON` (oldest data) or `LATEST` (newest data), depending on whether you can tolerate reprocessing or data loss.
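A custom consumer can recover from an expired iterator by requesting a fresh one. The sketch below assumes `client` behaves like a boto3 Kinesis client, which exposes `ExpiredIteratorException` under `client.exceptions`; the function name and the single-retry policy are assumptions made for the example. When the last processed sequence number is known it resumes with `AFTER_SEQUENCE_NUMBER`, which avoids both reprocessing and data loss, and otherwise falls back to `TRIM_HORIZON`.

```python
def get_records_resilient(client, stream_name, shard_id, shard_iterator,
                          last_sequence_number=None):
    """Call GetRecords; if the iterator has expired, fetch a new one and retry once."""
    try:
        return client.get_records(ShardIterator=shard_iterator)
    except client.exceptions.ExpiredIteratorException:
        if last_sequence_number is not None:
            # Resume right after the last record that was processed.
            fresh = client.get_shard_iterator(
                StreamName=stream_name,
                ShardId=shard_id,
                ShardIteratorType="AFTER_SEQUENCE_NUMBER",
                StartingSequenceNumber=last_sequence_number,
            )
        else:
            # No known position: re-read everything still within the retention period.
            fresh = client.get_shard_iterator(
                StreamName=stream_name,
                ShardId=shard_id,
                ShardIteratorType="TRIM_HORIZON",
            )
        return client.get_records(ShardIterator=fresh["ShardIterator"])
```

Falling back to `TRIM_HORIZON` means records may be processed again, so the idempotent processing discussed earlier in these notes becomes important.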
7. Consumers Falling Behind
- Issue: Consumers are processing data slower than it’s being ingested, causing the `MillisBehindLatest` or `GetRecords.IteratorAgeMilliseconds` metrics to continuously increase.
- Cause:
- Insufficient Consumer Resources: The consumer application doesn’t have enough compute resources (CPU, memory) to process the data fast enough.
- Inefficient Processing Logic: The code within the consumer is too slow or inefficient.
- Insufficient Shard Capacity: The stream’s total read capacity is not enough for all consumers.
- Troubleshooting:
- Monitor `MillisBehindLatest`: Use CloudWatch to track the `MillisBehindLatest` metric for your consumers.
  - Spiky increases: May indicate transient network issues or temporary upstream bursts.
  - Steady increases: Indicate a persistent problem that needs attention.
- Increase Retention Period: As a temporary measure, increase the Kinesis stream’s data retention period to avoid data loss while you troubleshoot.
- Scale Consumer Resources: Increase the compute power (e.g., number of Lambda invocations, EC2 instance size) of your consumer application.
- Optimize Consumer Code: Improve the efficiency of your processing logic.
- Increase Shard Count: If multiple consumers are sharing throughput, or even a single consumer needs more capacity, consider increasing the number of shards or using Enhanced Fan-Out.
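The distinction between spiky and steady lag can be expressed as a small check over a series of `MillisBehindLatest` samples. The rule and the tolerance below are a hypothetical heuristic for illustration, not an AWS-defined classification.

```python
def classify_lag(samples_ms, tolerance_ms=1000):
    """Classify MillisBehindLatest samples (oldest first).

    'steady'  : every sample is higher than the previous one and lag ends above the tolerance
    'spiky'   : lag exceeded the tolerance at some point but did not rise continuously
    'healthy' : lag never exceeded the tolerance
    """
    rising = len(samples_ms) >= 2 and all(
        later > earlier for earlier, later in zip(samples_ms, samples_ms[1:])
    )
    if rising and samples_ms[-1] > tolerance_ms:
        return "steady"
    if max(samples_ms, default=0) > tolerance_ms:
        return "spiky"
    return "healthy"
```

A "steady" result is the one that calls for scaling consumers or adding shards; a "spiky" one usually only needs watching.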
8. KMS Master Key Permissions Error
- Issue: Inability to read or write to an encrypted Kinesis Data Stream.
- Cause: The IAM role or user attempting to interact with the encrypted stream lacks the necessary permissions for the AWS Key Management Service (KMS) key used to encrypt/decrypt the data.
- Troubleshooting:
- KMS Key Policy: Ensure the KMS key policy grants the IAM principal interacting with the stream `kms:Decrypt` (for consumers) and `kms:GenerateDataKey` (for producers, which Kinesis uses to encrypt incoming records).
- IAM Policy: Ensure the IAM policy attached to the producer/consumer has permissions to use the KMS key (e.g., `kms:Encrypt`, `kms:Decrypt`, `kms:GenerateDataKey`).
- Cross-Account Access: If accessing from a different account, ensure both the KMS key policy and the IAM role/user policy are configured correctly for cross-account access.
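As a rough illustration of the IAM side, the sketch below builds a policy document for a producer or a consumer of an encrypted stream. The function name and the exact action lists are assumptions made for the example: it grants producers `kms:GenerateDataKey` and consumers `kms:Decrypt`, and a real policy may need further Kinesis actions depending on the client library you use.

```python
def stream_access_policy(role, stream_arn, kms_key_arn):
    """Build an IAM policy document for a 'producer' or 'consumer' of an encrypted stream."""
    if role == "producer":
        kinesis_actions = ["kinesis:PutRecord", "kinesis:PutRecords"]
        kms_actions = ["kms:GenerateDataKey"]  # used when records are encrypted on write
    elif role == "consumer":
        kinesis_actions = ["kinesis:GetRecords", "kinesis:GetShardIterator", "kinesis:DescribeStream"]
        kms_actions = ["kms:Decrypt"]  # used when records are decrypted on read
    else:
        raise ValueError("role must be 'producer' or 'consumer'")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": kinesis_actions, "Resource": stream_arn},
            {"Effect": "Allow", "Action": kms_actions, "Resource": kms_key_arn},
        ],
    }
```

The returned dictionary can be serialized with `json.dumps` and attached to a role; the key policy on the KMS key must still allow the same principal.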
Kinesis Firehose
Amazon Kinesis Data Firehose (Kinesis Firehose; renamed Amazon Data Firehose in 2024, which is why the console labels it “Data Firehose”) is a fully managed AWS service designed for near real-time data processing and loading. It simplifies the process of capturing, transforming, and loading data streams into various data stores without requiring extensive setup or manual infrastructure management.
Key features and how it works:
- Fully Managed: Unlike traditional data processing systems, Kinesis Firehose handles infrastructure management, including scaling, automatically. You simply define data sources, destinations, and any transformations needed, and the service manages the rest.
- Automatic Scaling: It automatically adjusts resources to handle increased incoming data volumes, ensuring continuous data processing without manual intervention.
- Data Producers: Kinesis Firehose integrates easily with various AWS services and applications as data producers. Common sources include:
- Kinesis Data Streams: Often used as a primary source.
- Amazon CloudWatch Logs: For forwarding logs for further processing and analysis.
- AWS IoT: For publishing data directly from connected devices.
- Amazon CloudWatch Events (Amazon EventBridge): For routing detailed event data for real-time analytics.
- Buffering Mechanism: A key difference from Kinesis Data Streams is its buffering mechanism. Incoming records are collected until either a configurable time interval elapses or a configurable buffer size (up to 128 MB) is reached, whichever comes first. The buffered records are then delivered to the destination as one batch. This makes it more efficient and cost-effective for near real-time processing, as it reduces the number of delivery calls.
- Data Transformations: Kinesis Firehose allows for on-the-fly data transformations using AWS Lambda. You can reformat, filter, or modify data before it’s loaded into target systems, which is useful for meeting specific storage or analytical tool requirements.
- Reliability and Fallback: If there are issues during data processing, Kinesis Firehose can route problematic data to a fallback S3 bucket. This allows for data integrity and reliability, enabling retries if something doesn’t work as expected. You can also send original records to an S3 bucket for archiving.
- Supported Destinations: Kinesis Firehose can deliver data to various destinations, including:
- Amazon S3: A primary storage option.
- Amazon Redshift: Data is first put into an S3 bucket and then moved to Redshift using a copy command.
- Amazon OpenSearch Service
- Third-party services: Such as Splunk or MongoDB.
- Compression and Encryption: It supports out-of-the-box data compression and encryption capabilities, making data storage and transfer more efficient and secure.
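The buffering rule in the list above (deliver when either the size limit or the time limit is reached) can be modelled in a few lines. This is a toy model with hypothetical names, not Firehose’s implementation; in particular it only checks the timer when a new record arrives, whereas the real service also flushes on the interval when the stream is idle.

```python
class DeliveryBuffer:
    """Toy model of size-or-interval buffering."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time at which the current batch received its first record

    def add(self, record, now):
        """Buffer one record (bytes) at time `now` (seconds); return a batch if a limit is hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            return self.flush()
        return None

    def flush(self):
        """Hand over the buffered records and start a new, empty batch."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch
```

Larger limits mean fewer, bigger deliveries and higher latency; smaller limits mean the opposite, which is the trade-off behind the "near real-time" label.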
Kinesis Firehose is well-suited for:
- Real-time Analytics (Near Real-time): Collecting and loading large volumes of log, event, or clickstream data for analysis where near real-time insights are sufficient.
- Log Data and Event Data Capturing: Centralizing logs from various sources into a data lake or analytics platform.
- Data Transformation and Enrichment: Applying simple or custom transformations to streaming data before it lands in its final destination.
- Streamlined ETL: Automating the Extract, Transform, Load process for streaming data without managing servers.
Key Features Summary:
- Near Real-time Processing: Achieved through buffering and batching for efficiency.
- Diverse Data Formats and Transformations: Supports various structured and unstructured formats (e.g., Parquet, ORC, JSON, CSV). Seamless integration with AWS Lambda for custom transformations, and built-in compression and format conversion.
- Automatic Scaling: Adjusts resources dynamically to handle fluctuating data volumes without manual intervention.
- Cost-Effective: Pay-as-you-go pricing based on the volume of data processed, leveraging batching for efficient resource utilization.
- High Reliability: Automatic retries and fallback to S3 for undelivered records.
Kinesis Firehose vs. Kinesis Data Streams: A Comparison
When deciding between Kinesis Data Streams (KDS) and Kinesis Firehose, it’s important to understand their core differences, especially in terms of management, real-time capabilities, and data handling.
Amazon Kinesis Data Streams (KDS):
- Setup and Management:
- Requires a more manual and extensive setup. In provisioned mode you manage stream capacity yourself, for example by choosing and adjusting the number of shards.
- Involves more manual intervention and often requires custom coding for producers and consumers.
- Real-time Capabilities:
- Offers true real-time processing.
- Achieves very low latency: typically around 200 milliseconds, or as low as 70 milliseconds with the enhanced fan-out feature.
- Data Storage:
- Stores data for a configurable retention period (default is 24 hours, can be extended up to 365 days).
Amazon Kinesis Data Firehose (Kinesis Firehose):
- Setup and Management:
- It’s a fully managed service, meaning AWS handles all the infrastructure, scaling, and maintenance for you.
- Designed for simplicity and efficient data delivery, requiring minimal setup.
- Automatically scales to handle varying data volumes.
- Real-time Capabilities:
- Operates in near real-time. This is due to its buffering mechanism, which collects data in batches (either by time intervals or file size) before delivery.
- More cost-efficient due to batching, as it reduces the number of API calls and optimizes resource usage.
- Data Storage:
- Does not store data itself; it directly delivers data to its specified destinations.
- Built-in Features:
- Supports on-the-fly data transformations using AWS Lambda.
- Offers out-of-the-box compression and encryption.
- Includes a fallback mechanism to an S3 bucket for failed deliveries or archiving original records, enhancing reliability.
In essence, if your priority is the lowest possible latency and fine-grained control over your data streams, even if it means more manual setup, Kinesis Data Streams is the way to go. If you need a simple, fully managed, and cost-effective solution for near real-time data loading with built-in transformations and automatic scaling, Kinesis Firehose is the better choice.
Creating Data Firehose Stream (Hands-on)
This section describes the practical steps for setting up and testing a Kinesis Firehose stream.
Before starting, it’s useful to check the cost calculation. Kinesis Firehose pricing is generally consumption-based, with tiers. For instance, the first 500TB per month might cost around $0.03 per gigabyte, though prices can vary by region. For small amounts of data, the cost will be minimal.
Setting Up the Firehose Stream:
- Access the Service: From the AWS console, select “Data Firehose” and then choose “Create a firehose stream.”
- Choose Source and Destination:
- Source: The most common source is Amazon Kinesis Data Streams. You can select an existing Kinesis Data Stream (e.g., one set up in a previous lecture) from the list.
- Destination: You can choose from various destinations like OpenSearch, Amazon Redshift, Amazon S3, Snowflake, or Splunk. For this example, Amazon S3 is chosen for simplicity, to verify data arrival in an S3 bucket.
- Stream Naming:
- Give your stream a descriptive name (e.g., “KDS-Test-S3-5837”).
- Data Transformation, Conversion, and Compression (Optional):
- Kinesis Firehose offers built-in options to transform records, convert formats (e.g., to Apache Parquet or Apache ORC), and compress records.
- You can enable data transformation using AWS Lambda, convert record format, or decompress source records by checking the respective boxes. (In this hands-on, these options are initially skipped for simplicity).
- Destination Settings (S3 Specific):
- S3 Bucket Selection: Choose an existing S3 bucket or create a new one to serve as the destination.
- Prefixing (Optional): You can enable dynamic partitioning, add a prefix (e.g., for organizing data within the bucket), or an error output prefix. Timezone-based prefixes are also an option. (For this example, default settings are kept).
- Buffering Hints:
- Configure the buffer size (from 1MB to 128MB) and buffer interval (from 0 to 900 seconds).
- The data will be batched and delivered when either the size limit or the time interval is reached, whichever comes first. (Default settings are typically used for initial setup).
- Compression and Encryption (Optional):
- You can easily enable compression and server-side encryption for data records. (These are skipped in this example).
- Tags (Optional):
- Add tags for resource organization and cost allocation. (Skipped for this example).
- Create Stream: Proceed to create the Firehose stream with the configured settings.
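The console steps above map onto a single API request. The sketch below builds the keyword arguments for boto3’s Firehose `create_delivery_stream` call with a Kinesis Data Stream source and an S3 destination. The function name, the ARNs you pass in, the use of one IAM role for both source and destination, and the 5 MB / 300 s defaults are assumptions made for the example; the range checks mirror the buffering limits listed above.

```python
def firehose_stream_request(name, source_stream_arn, bucket_arn, role_arn,
                            buffer_mb=5, buffer_seconds=300):
    """Build keyword arguments for create_delivery_stream (KDS source, S3 destination)."""
    if not 1 <= buffer_mb <= 128:
        raise ValueError("buffer size must be between 1 and 128 MB")
    if not 0 <= buffer_seconds <= 900:
        raise ValueError("buffer interval must be between 0 and 900 seconds")
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": source_stream_arn,
            "RoleARN": role_arn,  # role Firehose assumes to read from the source stream
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,  # role Firehose assumes to write to the bucket
            "BucketARN": bucket_arn,
            "BufferingHints": {"SizeInMBs": buffer_mb, "IntervalInSeconds": buffer_seconds},
            "CompressionFormat": "UNCOMPRESSED",
        },
    }
```

With AWS credentials configured, the result could be passed as `boto3.client("firehose").create_delivery_stream(**request)`.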
Testing the Stream:
- Monitor Stream Status: After creation, the stream status will indicate if it’s active. You can monitor data ingestion metrics (bytes read, etc.) from the stream’s details page.
- Send Demo Data: Kinesis Firehose provides a feature to send demo data directly from the console to test the stream’s functionality.
- Verify Data in S3:
- After sending demo data for a few seconds, stop the demo data flow.
- Navigate to the designated S3 bucket.
- You will see new objects (files) in the bucket. Because no custom prefix or partitioning was specified, Firehose places them under its default UTC date-based prefix (year/month/day/hour).
- You can download these files and open them to confirm that the demo data has arrived successfully.
This basic setup demonstrates the core functionality of Kinesis Firehose in delivering data from a Kinesis Data Stream to an S3 bucket. Further lectures will explore adding transformations and other advanced options.
Data Firehose - Transformations with Lambda (Hands-on)
This section details how to integrate AWS Lambda for real-time data transformations within a Kinesis Firehose stream.
- Accessing Transformation Settings:
- Select your Firehose stream from the Kinesis Firehose console.
- Navigate to the “Configurations” tab.
- Look for the “Transform and convert records” section.
- Enabling Data Transformation:
- Click “Edit” within the “Transform and convert records” section.
- Check the box to “Enable data transformation using AWS Lambda.”
- Creating a New Lambda Function (or Selecting an Existing One):
- Browse or Create New Function:
- You can browse for an existing Lambda function or choose to create a new one directly from this interface. Creating a new one is often simpler for a fresh setup.
- Choose a Blueprint:
- When creating a new function, select a pre-built blueprint. A common choice for Firehose integration is “General Amazon Data Firehose Processing” or a similar template that processes records sent to a Firehose stream. These blueprints provide a basic structure for handling Firehose events.
- Function Naming and Runtime:
- Give your Lambda function a descriptive name (e.g., “FirehoseFunction”).
- Select your preferred runtime (e.g., Python).
- Execution Role:
- Choose an execution role for the Lambda function. The role needs the basic permissions to run the function (for example, writing its logs to CloudWatch Logs); delivery to the S3 destination bucket is performed by the Firehose stream using its own IAM role. You can either create a new role or select an existing one with appropriate permissions (like the LambdaKinesisRole that we created in a previous lesson).
- Review Code (Optional):
- The blueprint provides default code that decodes input records and encodes the output. You can customize this code to perform your specific transformations (e.g., data cleansing, enrichment, format conversion). For this example, the default simple processing will be used.
- Create Function:
- Click “Create function.”
- Associating Lambda Function with Firehose Stream:
- Once the Lambda function is created and deployed, go back to your Firehose stream’s configuration.
- Under “Transform and convert records”, select the newly created Lambda function from the list. It might take a short while for the function to appear in the browse list, so a refresh might be needed.
- You can also adjust buffer size and interval if needed.
- Save the changes to your Firehose stream.
- Testing the Transformation:
- Send Data to Kinesis Data Stream (Source): Since the Kinesis Data Stream is the source for Firehose, use AWS CloudShell or your preferred method to send records to your Kinesis Data Stream with the `aws kinesis put-record` command. Ensure you are using the correct Kinesis Data Stream name.
- Wait for Processing: Kinesis Firehose operates in “near real-time,” so it will take a few minutes for the data to be buffered, processed by Lambda, and delivered to the S3 bucket.
- Verify Transformed Data in S3:
- Navigate to your S3 destination bucket.
- You should now observe new files appearing. Unless you configured a custom prefix, Firehose organizes delivered objects into date-based folders (year, month, day, and hour in UTC).
- Download and open a file to verify that the Lambda function has processed the data. Even with the default Lambda blueprint, you should see the data structured as processed by the function.
- Cleanup (Important):
- After completing your hands-on exercise, remember to delete your resources to avoid incurring unnecessary costs.
- Delete the Firehose Stream:
- Go to the Kinesis Firehose console, select your stream, and choose “Delete.” You will need to confirm the deletion by typing the stream name.
- Delete the Kinesis Data Stream:
- Go to the Kinesis Data Streams console, select your stream, and choose “Delete.” Confirm the deletion.
- Delete the Lambda Function:
- Go to the Lambda console, find your function, and delete it.
- Delete the S3 Bucket (Optional): If you created a new S3 bucket solely for this exercise, you might also consider deleting it.
This process demonstrates how Kinesis Firehose, in conjunction with AWS Lambda, provides a powerful and flexible way to perform on-the-fly transformations before delivering data to its final destination.
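For reference, a transformation function of the kind the blueprint provides follows a fixed contract: Firehose passes a batch under `event["records"]`, each record carrying a `recordId` and base64-encoded `data`, and expects every `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The handler below is a minimal sketch of that contract rather than the blueprint’s actual code; the transformation itself (uppercasing string values in JSON records) is an arbitrary example.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: uppercase every string value in each JSON record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            transformed = {
                key: value.upper() if isinstance(value, str) else value
                for key, value in payload.items()
            }
            # A trailing newline keeps records separated once they are concatenated in S3.
            encoded = base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Payloads that are not JSON objects are returned unchanged and flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records marked `ProcessingFailed` are the ones Firehose routes to the error output location in S3, which ties back to the fallback behaviour described earlier.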
Amazon Managed Service for Apache Flink
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink, and often referred to as MSF or AMS Flink) is a fully managed service that allows you to query and process streaming data in real-time. It enables the creation of real-time analytics applications using Apache Flink, a powerful open-source stream processing framework, without the need to manage the underlying infrastructure.
Key Features and Use Cases:
- Fully Managed and Serverless: AWS handles all the underlying resources, scaling, and integration, allowing users to focus solely on application logic. It automatically scales based on workload demand.
- Real-time Analytics: Ideal for any scenario requiring real-time processing of data streams, such as:
- Calculating real-time metrics and aggregations for monitoring.
- Monitoring website traffic and user activity in real-time.
- Implementing streaming ETLs (Extract, Transform, Load) for continuous data processing.
- Querying and Application Development:
- Data can be queried using SQL.
- Applications can be built using Python, Scala, or Java.
How it Works: Amazon Managed Service for Apache Flink integrates seamlessly with various data sources and destinations.
- Flink Sources (Data Ingestion): Data can be ingested from:
- Amazon Kinesis Data Streams (KDS): A widely used real-time streaming service.
- Amazon Managed Streaming for Apache Kafka (MSK): A managed service for Apache Kafka clusters.
- Amazon S3: For batch processing or historical data.
- Custom Data Sources: Accessible via Apache Flink connectors or APIs.
- Real-time Data Processing:
- Once ingested, data streams are processed on the fly using Flink’s fast and reliable streaming engine.
- This includes operations like filtering, aggregating, and enriching data.
- It operates in near real-time with very low latency, enabling rapid analytics.
- Stateful Computations: MSF excels at maintaining and updating a state based on incoming data, crucial for use cases like anomaly detection.
- Checkpoints and Snapshots: Periodically creates checkpoints and snapshots to ensure fault tolerance and data integrity, allowing recovery from failures.
- Anomaly Detection: Users can implement algorithms within their applications to detect unusual patterns or deviations, triggering alerts or automated responses (e.g., in fraud detection).
- Event-Driven Actions: Define actions based on processed data.
- AWS Service Integration: Integrates with other AWS services like AWS Lambda for executing custom code, enabling diverse data processing scenarios.
- Flink Sinks (Data Destinations): Processed data can be sent to:
- Amazon S3: For long-term storage or further analysis.
- Amazon Kinesis Data Streams: For re-streaming processed data.
- Amazon MSK: For further Kafka-based processing.
- Custom Data Sources/Analytical Tools: For analysis or visualization.
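To make "stateful computation" concrete, the sketch below counts events per key in fixed (tumbling) time windows. It is plain Python used purely as an illustration and does not use the Flink or PyFlink API; in a Flink application the per-window counts would live in managed state that the checkpoints mentioned above protect.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) for (timestamp_seconds, key) pairs."""
    counts = defaultdict(int)  # the running state: one counter per window and key
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


page_views = [(0, "home"), (30, "home"), (61, "home"), (65, "cart")]
per_minute = tumbling_window_counts(page_views, window_seconds=60)
```

Here `per_minute` holds two views of "home" in the window starting at 0, and one view each of "home" and "cart" in the window starting at 60.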
Pricing structure. Amazon Managed Service for Apache Flink uses a “pay-as-you-go” consumption-based pricing model with no upfront costs.
- Processing Units (CPU-hours):
- The core unit of charge is the Kinesis Processing Unit (KPU).
- One KPU consists of one vCPU and 4 GB of memory.
- You are charged based on the number of KPUs used per hour.
- Every application requires one additional KPU for orchestration purposes.
- KPUs automatically scale based on application needs, but users can also manually provision a desired number for more control.
- Data Storage:
- Charged based on the amount of data stored per month (in GB-months) if your application uses stateful storage.
- Interactive Mode (Flink Studio):
- If using the interactive development environment (Flink Studio notebooks), an additional charge of two KPUs applies for the interactive development.
- The same storage costs apply for interactive mode.
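The compute part of the bill can be estimated with simple arithmetic. The sketch below assumes that the two extra KPUs charged in interactive mode already include the orchestration KPU (so a streaming application adds one KPU and an interactive notebook adds two); treat that reading, the function names, and any price you plug in as assumptions to check against the current pricing page.

```python
def billed_kpus(application_kpus, interactive=False):
    """KPUs charged per hour: the application's own KPUs plus the service overhead."""
    overhead = 2 if interactive else 1  # assumption: interactive mode's two KPUs include orchestration
    return application_kpus + overhead


def compute_cost(application_kpus, price_per_kpu_hour, hours, interactive=False):
    """Compute charge for running the application for `hours` hours (storage excluded)."""
    return billed_kpus(application_kpus, interactive) * price_per_kpu_hour * hours
```

For example, an application using 2 KPUs is billed for 3 KPU-hours every hour, so the orchestration overhead is proportionally larger for small applications.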
Amazon MSK
Amazon Managed Streaming for Apache Kafka (or just Amazon MSK) is a fully managed service that simplifies running Apache Kafka clusters. It offers an alternative to Kinesis for handling real-time data streams, particularly for users who prefer or require Apache Kafka.
Core Components of MSK Infrastructure:
- Kafka Brokers: These are the servers responsible for storing and processing data within your streams. They are the workhorses that provide compute power and handle data movement.
- ZooKeeper: Apache ZooKeeper nodes act as the “brain” of the operation, managing the cluster’s state and performing critical operational tasks, such as leader election among brokers.
- Clusters: An MSK cluster is a collection of broker nodes working together to manage data streams.
- Availability Zones: MSK distributes brokers across multiple Availability Zones to ensure high data availability and resilience against network failures.
- Amazon EBS Volumes: Data is durably and securely stored on Amazon EBS volumes, replicated across different brokers and Availability Zones to protect against data loss and enable quick recovery from failures.
Key Differences and Configuration with Kinesis:
- Message Size Limit: A significant difference is the message size limit. MSK allows messages up to 10 MB, significantly larger than Kinesis’s 1 MB limit. This makes MSK suitable for applications requiring larger data payloads.
- Customization vs. Convenience:
- MSK: Offers more granular control and customization options for Kafka clusters. This is beneficial for complex setups that require fine-tuning for specific application needs and performance optimization. It provides flexibility but comes with a slightly more complicated setup and management overhead.
- Kinesis: Provides a more straightforward, fully managed, and generally easier-to-use experience, handling more configurations automatically.
- Data Organization:
- Kinesis: Data is organized into streams and shards. Throughput can be adjusted by splitting or merging shards.
- MSK (Kafka): Data is managed using topics and partitions. Scaling involves adding partitions, but once added, partitions generally cannot be removed, making it a bit more rigid and requiring more careful management compared to Kinesis.
Producers and Consumers:
- Producers: Applications or sensors that send data to MSK.
- Consumers: Applications or endpoints that read and process data from MSK. MSK simplifies the real-time data flow from producers to consumers.
Access Control and Security. Both Kinesis Data Streams and Amazon MSK offer robust security features, with some differences in access control models.
- Encryption:
- Both support in-flight TLS encryption.
- MSK also has the option for plain text communication.
- Both support KMS encryption for data at rest.
- Access Control Models:
- MSK: Offers three authentication and authorization models:
- Mutual TLS with Kafka ACLs: For topic-level authorization.
- Username and Password (SASL/SCRAM) with Kafka ACLs: Credentials are stored in AWS Secrets Manager, and Kafka ACLs handle authorization.
- IAM Access Control: Integrates with AWS IAM for simplified authentication and authorization management within the AWS ecosystem.
- Kinesis Data Streams: Primarily uses IAM policies for both authentication and authorization, simplifying access control management.
When to Choose MSK vs. Kinesis:
- Choose Kinesis when:
- You need a more straightforward, simpler setup, and easier management.
- Your data payloads are within the 1 MB message limit.
- You prefer a highly managed service with less operational overhead.
- Choose MSK when:
- Your application requires a higher message size limit (over 1 MB).
- You need more granular control and flexibility over your streaming infrastructure and configurations.
- You are already familiar with or have existing applications built on Apache Kafka.
- Your use case demands advanced Kafka-specific features or optimizations.
MSK Connect & MSK Serverless
Amazon MSK offers additional features and deployment options to further simplify managing Apache Kafka. Two notable ones are MSK Connect and MSK Serverless.
MSK Connect
MSK Connect is a fully managed service that simplifies the deployment and management of Kafka Connect connectors within Amazon MSK.
- What is Kafka Connect?
- Kafka Connect is an open-source component of Apache Kafka.
- Its primary purpose is to simplify the integration of Kafka with other data systems, such as databases, search indexes, and file systems.
- It enables streaming data into and out of Kafka without the need for writing custom integration code.
- How MSK Connect Helps:
- Fully Managed: MSK Connect manages the Kafka Connect infrastructure for you, eliminating the operational complexity of deploying, monitoring, and scaling Kafka Connect clusters.
- Simplified Data Integration: It streamlines the process of integrating Kafka with various data sources and sinks.
- Scalability: Allows for easy scaling of your data integrations as your data volumes grow, without requiring manual management of the underlying Connect infrastructure.
- Wide Range of Connectors: Supports a wide range of pre-built and custom connectors, facilitating common scenarios like moving data to Amazon S3, Apache Flink, or other systems.
MSK Serverless
MSK Serverless is an innovative cluster type within Amazon MSK designed to further simplify Kafka operations by providing a serverless experience.
- Eliminates Capacity Management:
- The key benefit is that it removes the need to provision, manage, or scale Kafka cluster capacity.
- Resources (compute and storage) are automatically provisioned and scaled dynamically based on your workload requirements.
- Ideal for Varying Workloads:
- It’s perfectly suited for use cases with varying or unpredictable data volumes, as it automatically adjusts to the experienced load.
- Pay-Per-Use Pricing:
- You only pay for the actual throughput (data read and written) and storage consumed, aligning costs directly with usage.
- Hands-Off Operation:
- Offers a significantly more hands-off approach to managing Kafka clusters, reducing operational overhead.
- Retains Kafka Features:
- Despite its serverless nature, MSK Serverless retains the powerful features and Apache Kafka APIs that users expect, allowing seamless migration of existing Kafka applications.