AWS Lambda
AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. You only pay for the compute time you consume, making it a cost-effective solution for various tasks.
Key Characteristics of AWS Lambda:
- Serverless: You don’t need to manage the underlying infrastructure. AWS handles all the server management, patching, and scaling.
- Automatic Scaling: Lambda automatically scales your application by running code in response to each trigger. You don’t need to worry about capacity planning.
- Pay-as-you-go: You are charged based on the number of requests and the compute time your code consumes, measured in milliseconds.
- Language Support: Lambda supports various programming languages, including Python, Node.js, Java, Go, C#, and Ruby.
- Event-Driven: Lambda functions are typically triggered by events from other AWS services or custom applications.
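To make the pay-as-you-go model concrete, here is a minimal arithmetic sketch. It assumes the usual billing shape of a per-request charge plus a compute charge in GB-seconds (duration multiplied by configured memory). The function name `estimate_invocation_cost` and its parameters are illustrative, not part of any AWS API, and the rates are passed in by the caller because actual prices vary by region and architecture. It also ignores any free tier.

```python
def estimate_invocation_cost(requests, avg_duration_ms, memory_mb,
                             price_per_million_requests, price_per_gb_second):
    """Rough cost estimate; all rates are supplied by the caller (assumptions)."""
    # Request charge: billed per invocation, usually quoted per million requests.
    request_cost = requests / 1_000_000 * price_per_million_requests

    # Compute charge: duration in seconds times configured memory in GB.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second

    return {
        "request_cost": request_cost,
        "gb_seconds": gb_seconds,
        "compute_cost": compute_cost,
        "total": request_cost + compute_cost,
    }
```

For example, calling `estimate_invocation_cost(1_000_000, 200, 512, 0.20, 0.00002)` uses placeholder rates; substitute the figures from the current pricing page before relying on the result.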
Use Cases for Data Ingestion and Processing:
AWS Lambda is highly versatile for data-related tasks:
- Real-time Event Processing:
- S3 Event Handling: Trigger a Lambda function when a file is uploaded to an S3 bucket. The function can then process the file (e.g., data validation, format conversion, data enrichment) and move it to another location or store the processed data. This enables event-driven data ingestion pipelines.
- Kinesis Stream Processing: Continuously process data streams from services like Kinesis Data Streams. Lambda can read data in batches as it arrives and perform real-time analytics, transformations, or load the data into a data store.
- Automated Task Workflows: Trigger Lambda functions based on schedules or events to automate data-related workflows, such as data archiving, log analysis, or database maintenance.
Advantages of Using AWS Lambda for Data Tasks:
- Scalability: Handles varying workloads automatically without manual intervention.
- Cost Efficiency: You only pay for the compute resources used during function execution.
- Simplicity: Serverless nature reduces operational overhead.
- Statelessness (by default): Each invocation of a Lambda function is independent and doesn’t retain state from previous executions. While you can integrate with external services to manage state, the core Lambda execution is stateless.
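The statelessness point can be illustrated with a small sketch. Module-level variables live as long as the execution environment that loaded the code, so a warm environment may carry values from one invocation to the next, while a new environment starts from scratch. The counter below is purely illustrative: it shows why in-memory values should be treated as a cache at most, and why durable state belongs in an external service.

```python
# Module scope: runs once per execution environment, not once per invocation.
_invocations_in_this_environment = 0


def lambda_handler(event, context):
    global _invocations_in_this_environment
    _invocations_in_this_environment += 1

    # This number is NOT a reliable global count: a fresh environment starts
    # again from zero, and concurrent environments each keep their own copy.
    return {"seen_by_this_environment": _invocations_in_this_environment}
```

Called twice in the same process, the handler reports 1 and then 2; across separate environments, each one counts independently.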
Example Use Cases in Detail:
- S3 Event-Driven Ingestion:
- An S3 bucket is configured to send notifications when a new object (file) is created.
- This S3 notification acts as a trigger for a Lambda function.
- The Lambda function receives event data containing information about the uploaded file (e.g., bucket name, file key).
- The function’s code can then read the file from S3, perform data processing (cleaning, transformation), and potentially write the processed data to another S3 bucket, a database, or another AWS service.
- Real-time Kinesis Data Stream Processing:
- A Kinesis Data Stream receives a continuous flow of data (e.g., from IoT devices).
- A Lambda function is configured as a trigger for this Kinesis stream.
- Lambda polls the stream for new records and invokes the function with batches of records (default batch size is 100).
- The Lambda function’s code processes each record in the batch (e.g., performs real-time analytics, filtering, transformations) and can then store the results in a database, data lake, or trigger other actions.
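- Lambda scales processing with the stream’s shard count: by default it runs one concurrent invocation per shard, and the parallelization factor setting can raise this to as many as 10 concurrent batches per shard.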
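The Kinesis flow described above can be sketched as a handler that decodes a batch of records. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`. Everything about the payload itself is an assumption made for illustration: the sketch supposes JSON messages from IoT devices with a hypothetical `temperature` field and an arbitrary threshold of 30.

```python
import base64
import json

TEMPERATURE_THRESHOLD = 30  # hypothetical filter threshold


def lambda_handler(event, context):
    readings = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the event.
        payload = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers send UTF-8 JSON objects.
        readings.append(json.loads(payload))

    # Example transformation: keep only readings above the threshold.
    hot = [r for r in readings if r.get("temperature", 0) > TEMPERATURE_THRESHOLD]

    # A real function would write `hot` to a data store here.
    return {"received": len(readings), "above_threshold": len(hot)}
```

Returning a small summary keeps the sketch testable without any AWS resources; the storage step is deliberately left out.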
In the next lecture, we will implement the S3 event-driven ingestion use case with AWS Lambda hands-on.
Event-Driven Ingestion with AWS Lambda (Hands-on)
Alright, let’s break down the hands-on implementation of event-driven data ingestion using AWS Lambda:
- Create Source and Destination S3 Buckets:
- Create a new S3 bucket that will serve as the source for your file uploads. Let’s call it source-bucket-23094.
- Identify the destination S3 bucket where you want the uploaded files to be moved (our-first-bucket-66543).
- Create the AWS Lambda Function:
- Navigate to the AWS Lambda service in the AWS console.
- Click Create function.
- Choose Author from scratch.
- Provide a Function name (e.g., my-event-function).
- Select the Runtime (e.g., Python 3.x).
- Under Permissions, choose Create a new role with basic Lambda permissions. You will modify this role later.
- Click Create function.
- Add the Lambda Function Code:
- In the Lambda function editor, in the Code source section, replace the default code with your Python code. Ensure you modify the destination_bucket variable in your code to match the name of your destination S3 bucket. The source bucket name will be dynamically obtained from the event trigger:

    import urllib.parse

    import boto3

    s3_client = boto3.client('s3')

    def lambda_handler(event, context):
        # The first record describes the object that triggered this invocation
        record = event['Records'][0]['s3']
        # Source bucket
        source_bucket = record['bucket']['name']
        # File key in source bucket (keys are URL-encoded in S3 event notifications)
        source_key = urllib.parse.unquote_plus(record['object']['key'])
        # Destination bucket
        destination_bucket = 'our-first-bucket-66543'
        # Copy object to the destination bucket
        s3_client.copy_object(
            Bucket=destination_bucket,
            CopySource={'Bucket': source_bucket, 'Key': source_key},
            Key=source_key,
        )
        # Optionally, delete the file from the source bucket after copying
        # s3_client.delete_object(Bucket=source_bucket, Key=source_key)
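To see which fields the function reads, it can help to parse a sample event locally before deploying. The sketch below is an assumption-laden illustration: `extract_s3_objects` is a hypothetical helper name, and the sample event is heavily abridged to the fields used here. It also loops over every record rather than only the first, which is one way to avoid depending on a single record per invocation.

```python
import urllib.parse


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification (spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Abridged sample event: only the fields read above are included.
SAMPLE_EVENT = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "source-bucket-23094"},
                "object": {"key": "reports/my+file%281%29.csv"},
            }
        }
    ]
}
```

Running `extract_s3_objects(SAMPLE_EVENT)` decodes the key to its original form with the space and parentheses restored, which is the name `copy_object` needs.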
- Modify the Lambda Function’s IAM Role Permissions:
- In the Lambda function editor, navigate to the Configuration tab and then Permissions.
- Click on the Role name link. This will open the IAM console in a new tab.
- Click Add permissions and then Attach policies.
- Search for and select the AmazonS3FullAccess policy (for simplicity in this demonstration). In a production environment, it’s recommended to grant more granular permissions, allowing the Lambda function only the necessary s3:GetObject, s3:PutObject, and optionally s3:DeleteObject permissions on the specific source and destination buckets.
- Click Add permissions.
- Close the IAM tab.
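As a sketch of the more granular alternative mentioned above, the following builds an IAM policy document limited to the two buckets. The helper name `build_copy_policy` is hypothetical, and the action list mirrors only the permissions named in this lecture; depending on object features such as tags or KMS encryption, a real function may need additional actions.

```python
import json


def build_copy_policy(source_bucket, destination_bucket, allow_delete=False):
    """Build a least-privilege style policy for copying between two buckets."""
    source_actions = ["s3:GetObject"]
    if allow_delete:
        # Only needed if the function deletes the source object after copying.
        source_actions.append("s3:DeleteObject")

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": source_actions,
                "Resource": f"arn:aws:s3:::{source_bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{destination_bucket}/*",
            },
        ],
    }


def policy_as_json(source_bucket, destination_bucket, allow_delete=False):
    # JSON text suitable for pasting into the IAM policy editor.
    return json.dumps(
        build_copy_policy(source_bucket, destination_bucket, allow_delete), indent=2
    )
```

The output of `policy_as_json("source-bucket-23094", "our-first-bucket-66543")` could be attached to the function's role as an inline policy in place of AmazonS3FullAccess.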
- Add an S3 Trigger to the Lambda Function:
- Back in the Lambda function editor, in the Function overview section, click Add trigger.
- Select S3 from the trigger sources and, in the Bucket dropdown, choose the name of your source S3 bucket (source-bucket-23094).
- For Event type, select All object create events. You could also choose just PUT events if you only want to trigger on new file uploads.
- You can optionally add a Prefix or Suffix to filter the S3 events (e.g., only trigger for files in a specific folder or with a .csv extension). Leave these blank for now.
- Acknowledge the message about potential recursive invocations and click Add.
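The trigger configured in the console corresponds to a bucket notification configuration. The sketch below builds that structure in the shape accepted by boto3's `put_bucket_notification_configuration`, including the optional prefix and suffix filters. The helper name `build_notification_config` and the example ARN are placeholders, and the sketch only constructs the dictionary; applying it also requires that S3 has permission to invoke the function, which the console sets up for you.

```python
def build_notification_config(function_arn, prefix="", suffix=""):
    """Build an S3 notification configuration for object-created events."""
    lambda_config = {
        "LambdaFunctionArn": function_arn,
        # Equivalent to "All object create events" in the console.
        "Events": ["s3:ObjectCreated:*"],
    }

    # Optional key filters, e.g. prefix="incoming/" or suffix=".csv".
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    if rules:
        lambda_config["Filter"] = {"Key": {"FilterRules": rules}}

    return {"LambdaFunctionConfigurations": [lambda_config]}


# Placeholder ARN for illustration only; substitute your function's ARN.
EXAMPLE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-event-function"
# With boto3 this could be applied as:
#   s3_client.put_bucket_notification_configuration(
#       Bucket="source-bucket-23094",
#       NotificationConfiguration=build_notification_config(EXAMPLE_ARN, suffix=".csv"),
#   )
```

Restricting the trigger with a prefix or suffix is also one way to reduce the risk of recursive invocations if a function ever writes back to the bucket that triggers it.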
- Deploy the Lambda Function:
- Click the Deploy button in the Lambda function editor to save your code and configurations.
- Test the Event-Driven Ingestion:
- Navigate to your source S3 bucket (source-bucket-23094) in the S3 console.
- Upload a test file (e.g., the customers.csv file you’ve been using).
- Wait for a short period (typically less than a minute).
- Navigate to your destination S3 bucket (our-first-bucket-66543). You should see the uploaded file now present in this bucket, indicating that the Lambda function was triggered by the S3 event and successfully copied the file. The original remains in the source bucket unless you enable the delete_object call in the function code.
- Monitor the Lambda Function (Optional):
- In the Lambda function editor, go to the Monitor tab.
- You should see invocation metrics, including the number of invocations, duration, and any errors. This helps you verify that the function was triggered and executed successfully.
Lambda Layers
Lambda Layers offer a mechanism to manage code and dependencies independently from your actual Lambda function code. This allows you to centrally manage shared code, libraries, custom runtimes, and other dependencies across multiple Lambda functions, simplifying updates and reducing redundancy.
What is a Lambda Layer?
A Lambda Layer is essentially a ZIP file that contains various components that your Lambda function can use during runtime. These components can include:
- Additional code (e.g., utility functions).
- Libraries (e.g., Python packages, Java JAR files).
- Custom runtimes.
- Dependencies.
- Configuration files.
These layers are separate from your main function’s deployment package but are made available to the function’s execution environment.
Lambda Without Layers vs. With Layers:
- Without Layers: In a traditional setup, each Lambda function contains all its code and dependencies within its deployment package. If multiple functions use the same libraries or shared code, this code is duplicated across each function. Updating these shared components requires updating each function individually, which can be cumbersome and error-prone. The size of each function’s deployment package can also become large.
- With Layers: By using Lambda Layers, you can package these shared dependencies into a layer. Multiple Lambda functions can then reference and utilize this layer.
How Lambda Layers Work:
- Package Layer Content: First, you create a ZIP file containing the code, libraries, or other dependencies you want to include in the layer. This is similar to preparing a deployment package for a function.
- Create the Lambda Layer: You then upload this ZIP file to AWS Lambda and register it as a new layer. You’ll need to specify a name for the layer and the compatible runtimes (e.g., Python 3.9).
- Add Layer to Function: In the configuration of your Lambda function, you specify which layers to include. You can add multiple layers to a single function (up to five). Lambda extracts the content of these layers into the /opt directory of the function’s execution environment, where it is available during runtime.
- Function Access: During execution, your Lambda function code can access the content of the attached layers as if it were part of its own deployment package. For example, if a layer contains a Python library, you can import and use it in your Python code.
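The packaging step above can be sketched with the standard library alone. For Python runtimes, Lambda looks for layer libraries under a top-level `python/` directory inside the ZIP, so the sketch places every file from a source directory under that prefix. The function name `build_layer_zip` and the directory names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def build_layer_zip(source_dir, zip_path):
    """Zip the files in source_dir under a top-level python/ directory."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # python/<relative path> is where the Python runtime expects
                # importable layer content to live.
                arcname = Path("python") / path.relative_to(source_dir)
                archive.write(path, arcname.as_posix())
    return zip_path
```

For instance, `build_layer_zip("shared_code", "shared-layer.zip")` would package a local `shared_code` folder; the resulting archive can then be uploaded when creating the layer. Write the ZIP outside the source directory so it is not swept into itself.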
Benefits of Lambda Layers:
- Dependency Sharing: Layers allow you to share libraries and common code across multiple Lambda functions within the same AWS account. This reduces redundancy and simplifies management.
- Separation of Concerns: You can separate your core function logic from its dependencies. This makes it easier to manage and update the function code and its dependencies independently.
- Simplified Updates: When you need to update a shared library or dependency, you publish a new version of the layer instead of repackaging every function. Layer versions are immutable, so each function keeps using the version it references until you update its configuration to point to the new one.
- Reduced Deployment Package Size: By moving dependencies into layers, you can significantly reduce the size of your individual Lambda function deployment packages, which can lead to faster deployment times and potentially better cold start performance.
In essence, Lambda Layers provide a powerful mechanism for modularizing your serverless applications, promoting code reuse, and simplifying the management of dependencies in your Lambda functions.