AWS S3 - Basics

AWS S3 (Simple Storage Service) is a core AWS service that provides simple, cost-effective object storage. It is the main storage solution in AWS, so most data workflows on the platform read from it or write to it at some point.

Here’s a breakdown of the key concepts:

Buckets: These are containers for your files (objects). Each bucket is created in a specific AWS region, where the data is physically stored. Choosing a region can impact latency (how quickly data is accessed) and compliance requirements.
Globally Unique Names: Every S3 bucket must have a globally unique name across all AWS regions and accounts.
Naming Conventions: Bucket names must be between 3 and 63 characters long and can only include lowercase letters, numbers, dots, and hyphens. They must start and end with a letter or a number and cannot be IP addresses.
Objects and Keys: Objects are the actual files you store in S3. Each object is identified by a key. The key represents the full path of the object within the bucket. For example, if you upload example.txt directly, the key is example.txt. If you upload it into a folder named documents, the key is documents/example.txt. The bucket name is not part of the key.
Use Cases: S3 has diverse applications, including backup and recovery, website hosting, application storage, data archiving, and building data lakes for analytics. It can store various types of data (structured, unstructured, semi-structured). You can even query some file types like CSV using Amazon Athena.
Storage Classes: AWS offers different storage classes designed for various use cases, with slightly different features and availability:
- Durability: All S3 storage classes are designed for extremely high durability (99.999999999% annually, often called “eleven nines”), meaning the loss of an object is very unlikely.
- Availability: This refers to the percentage of time the service is operational and you can access your data. Availability can vary between storage classes.
- Lifecycle Rules: You can set up rules to automatically change the storage class of objects based on their age. This helps optimize costs by moving less frequently accessed data to cheaper storage over time.
Versioning: This feature allows you to keep previous versions of your files. If you accidentally modify or delete a file, you can recover an earlier version, providing an extra layer of data protection.
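
The lifecycle-rule idea above can be sketched in plain Python. This is only a model of the decision, not a call to AWS: the 30-day and 90-day thresholds and the helper name `storage_class_for_age` are hypothetical, chosen for illustration. STANDARD, STANDARD_IA, and GLACIER are existing S3 storage class identifiers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Transition:
    """Move objects to `storage_class` once they are at least `days` old."""
    days: int
    storage_class: str


# Hypothetical rule set: the thresholds are illustrative, not AWS defaults.
TRANSITIONS = (
    Transition(days=30, storage_class="STANDARD_IA"),
    Transition(days=90, storage_class="GLACIER"),
)


def storage_class_for_age(age_days: int,
                          transitions: Sequence[Transition] = TRANSITIONS) -> str:
    """Return the storage class an object of the given age would end up in."""
    current = "STANDARD"  # objects start in the default class
    # Apply rules from the smallest threshold up; the last match wins.
    for rule in sorted(transitions, key=lambda t: t.days):
        if age_days >= rule.days:
            current = rule.storage_class
    return current
```

With these assumed thresholds, a 45-day-old object maps to STANDARD_IA and a one-year-old object maps to GLACIER.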

Create a Bucket in S3 (Hands-on)

This part of the video demonstrates how to create an S3 bucket in the AWS Management Console.

Here are the key steps and points:

Accessing S3: You can find the S3 service by using the search bar at the top of the AWS console. You can also add it to your favorites for quick access.
Bucket Overview: Once in the S3 service, you’ll see a list of your existing buckets (if any).
Creating a New Bucket: To create a new bucket, click the “Create bucket” button.
Choosing a Region: You need to select the AWS region where you want your bucket to be located. This decision can be based on factors like proximity to users (for lower latency), pricing differences between regions, and compliance requirements. The example uses US East (N. Virginia).
Choosing a Bucket Name: You need to provide a globally unique name for your bucket. The system will check if the name is already taken. If you try to use a name that exists, you’ll get an error. The name must also follow the specified naming conventions (3-63 characters, lowercase letters, numbers, dots, hyphens, must start and end with a letter or number, and cannot be an IP address). The example shows the process of trying and adjusting a bucket name to meet these requirements.
Bucket Configuration (Initial): The bucket creation process also presents options for:
- Bucket Versioning: This can be enabled during bucket creation or later.
- Tags: These are key-value pairs that you can assign to your bucket (and other AWS resources) for cost tracking and organizational purposes.
- Encryption: Options for encrypting your data at rest are available.
Creating the Bucket: After configuring the bucket, you can click “Create bucket” to finalize the process.
Viewing Bucket Details: Once created, you can view the details of your new bucket.
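
The naming conventions mentioned in these steps can be checked locally before you try a name in the console. The following is a minimal sketch that covers only the rules listed in these notes (length, allowed characters, first and last character, no IP addresses). The function name `is_valid_bucket_name` is hypothetical, and global uniqueness cannot be checked this way because only AWS knows which names are taken.

```python
import ipaddress
import re

# 3-63 characters: lowercase letters, digits, dots, and hyphens,
# with a letter or digit in the first and last position.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check a candidate bucket name against the rules listed in the notes."""
    if not _BUCKET_NAME_RE.fullmatch(name):
        return False
    try:
        # Names that parse as an IPv4 address are not allowed.
        ipaddress.IPv4Address(name)
    except ValueError:
        return True
    return False
```

For example, `is_valid_bucket_name("my-data-bucket-2024")` passes, while `"My-Bucket"` (uppercase letters) and `"192.168.1.1"` (an IP address) are rejected.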

The next step in the video will be to demonstrate how to upload objects (files) into the newly created bucket.

Uploading files to S3 (Hands-on)

This part of the video demonstrates how to upload files to an S3 bucket.

Here’s a summary:

Accessing Buckets: You can view your existing buckets in the S3 service. You can sort them by creation date to easily find the most recently created ones.
Uploading Files: To upload files, select the desired bucket. You have the option to:
- Create folders to organize your files.
- Upload files directly into the bucket or into a specific folder.
Upload Methods: You can upload files by:
- Clicking the “Add files” button and selecting them from your computer.
- Dragging and dropping files from your computer into the upload area.
File Information: After uploading, you can select a file to view its details, including the region, owner, size, type, and key. The key represents the full path of the object within the bucket.
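
The relationship between folders, file names, and keys can be made concrete with a small sketch. The helper names `object_key` and `s3_uri` and the bucket name `my-example-bucket` are hypothetical. The code only builds strings and does not talk to S3.

```python
def object_key(filename: str, folder: str = "") -> str:
    """Build the key for a file uploaded to the bucket root or into a folder."""
    folder = folder.strip("/")  # tolerate "documents/" as well as "documents"
    return f"{folder}/{filename}" if folder else filename


def s3_uri(bucket: str, key: str) -> str:
    """Combine bucket and key into an s3:// URI; the bucket is not part of the key."""
    return f"s3://{bucket}/{key}"


root_key = object_key("example.txt")                 # uploaded directly
nested_key = object_key("example.txt", "documents")  # uploaded into a folder
uri = s3_uri("my-example-bucket", nested_key)
```

Here `root_key` is `example.txt`, `nested_key` is `documents/example.txt`, and the bucket name appears only in the URI, never in the key itself.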

The next part of the video will discuss data ingestion in more detail.

Streaming vs Batch Ingestion

This part of the video provides an overview of different data ingestion methods:

Streaming Ingestion: This method involves ingesting data in real-time. It’s used when data is time-sensitive and requires immediate processing, such as in fraud detection systems. While crucial for specific use cases, it’s generally more complex and expensive than batch ingestion. Amazon Kinesis is a service used for streaming ingestion.
Batch Ingestion: This method involves ingesting data in larger, periodic batches. It’s suitable for scenarios where data isn’t time-critical. Batch ingestion is generally simpler and more cost-effective than streaming. AWS Glue is a commonly used AWS service for batch ingestion.
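
The difference between the two methods can be illustrated without any AWS service. In the sketch below, all function names are hypothetical, and the batch size stands in for what would usually be a time-based schedule in a real batch pipeline. Streaming calls the handler once per record, while batch ingestion calls it once per group of records.

```python
from typing import Callable, Iterable, Iterator, List


def ingest_streaming(records: Iterable[dict],
                     handle: Callable[[dict], None]) -> int:
    """Handle every record as soon as it arrives; return the number of calls."""
    calls = 0
    for record in records:
        handle(record)
        calls += 1
    return calls


def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group records into lists of at most `size` items."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def ingest_batch(records: Iterable[dict], size: int,
                 handle_batch: Callable[[List[dict]], None]) -> int:
    """Handle records one group at a time; return the number of calls."""
    calls = 0
    for batch in batches(records, size):
        handle_batch(batch)
        calls += 1
    return calls
```

For ten records and a batch size of four, the streaming handler runs ten times and the batch handler runs three times, which hints at why batch processing tends to have less per-record overhead.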

The video will now focus on AWS Glue for batch data ingestion.

AWS Glue

This part of the video introduces AWS Glue, a central service for data engineering on AWS.

Here’s a summary of AWS Glue’s key features and concepts:

Fully Managed ETL Service: AWS Glue is a serverless Extract, Transform, Load (ETL) service that simplifies the process of moving and transforming data between different data stores.
Visual Interface: Glue provides a user-friendly visual interface where you can create ETL jobs using a drag-and-drop approach. This allows you to connect to various data sources and sinks and apply pre-built transformations.
Integration with AWS Services and Pre-built Transformations: Glue seamlessly integrates with many AWS analytical and data storage services, including S3, Amazon Redshift, and other databases. Furthermore, Glue offers a library of pre-built transformations that you can use in your ETL workflows, making data manipulation easier.
Automatic Script Generation: Behind the visual interface, Glue automatically generates and executes scripts (using Apache Spark) to perform the ETL tasks. You don’t need to manage the underlying infrastructure.
Serverless and Scalable: Glue is a serverless service, meaning AWS manages the Spark clusters and infrastructure for you. It’s also highly scalable, allowing you to process large volumes of data.
Pay-as-you-go Pricing: You only pay for the compute resources consumed while your ETL jobs are running. The cost depends on the duration and compute power used.
Customization: While the visual interface is convenient, you can also edit the generated scripts for more advanced and custom transformations and data loading logic.
Glue Data Catalog: This is another crucial component of AWS Glue. It acts as a centralized metadata repository.
Schema Discovery (Crawlers): You can use Glue Crawlers to automatically discover the schema of your data stored in sources like S3 (e.g., CSV files). Crawlers analyze the data, infer the column names, data types, and file format, and store this metadata in the Glue Data Catalog.
Schema Definition in ETL Jobs: You can also define the schema of your data within your ETL jobs.
Querying Data with Athena: Once the metadata is in the Glue Data Catalog, you can use Amazon Athena, a serverless query service, to run SQL queries directly against your data in S3 without moving it. This makes it easy to analyze and visualize data in its original location.
Integration with Other Services: The Glue Data Catalog also makes the metadata available to other AWS services like Amazon Redshift and QuickSight for querying and analysis.
Scheduling and Triggers: You can schedule Glue ETL jobs and Crawlers to run at specific intervals or trigger them based on events.
Incremental Loads: Glue ETL jobs can be configured to perform incremental loads, processing only new or changed data since the last run. Similarly, Crawlers can be set to incrementally update the schema based on new data.
Cost Management: Understanding Glue’s pricing model and utilizing features like incremental loads and efficient job design are important for managing costs.
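
The idea behind incremental loads can be sketched as a bookmark that remembers how far the previous run got. The code below is a simplified model under stated assumptions: it does not use Glue's own bookmarking feature, and the names `SourceObject` and `select_new_objects` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SourceObject:
    """A file in the source location, with its last-modified timestamp."""
    key: str
    last_modified: datetime


def select_new_objects(
    objects: Iterable[SourceObject],
    bookmark: Optional[datetime],
) -> Tuple[List[SourceObject], Optional[datetime]]:
    """Return objects modified after the bookmark, plus the updated bookmark."""
    new = [o for o in objects
           if bookmark is None or o.last_modified > bookmark]
    new.sort(key=lambda o: o.last_modified)
    # Advance the bookmark only when something new was found.
    new_bookmark = new[-1].last_modified if new else bookmark
    return new, new_bookmark
```

On the first run the bookmark is `None`, so every object is selected. Later runs pass in the bookmark returned by the previous run and process only the objects that arrived since then.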

The video will next delve into the practical aspects of using AWS Glue and discuss cost considerations.

Setting Up Crawlers (Hands-on)

This lecture explains how to use AWS Glue crawlers to automatically discover the structure of data stored in S3 buckets and record it as metadata in the Glue Data Catalog. With this metadata in place, you can query the data with SQL through services such as Amazon Athena, without having to describe the underlying data format or location yourself.

First, navigate to the AWS Glue service in the AWS Management Console, making sure you are in the same region as your S3 bucket. In the Glue console, under Data Catalog, you will find Databases, Tables, and Crawlers:

Databases are containers for organizing tables.
Tables represent the metadata of your data, such as column names and data types, pointing to the actual data location in S3.
Crawlers are the tools that automatically scan your data source (like an S3 bucket), infer the schema, and create or update tables in the Glue Data Catalog.

To set up a crawler:

Go to Crawlers and click New Crawler.
Give your crawler a name and an optional description.
When asked whether a table already exists for this data, choose Not yet, because the crawler will create the table on its first run.
Click Add a data source and select S3.
Browse and select the specific S3 bucket or folder containing your data. It’s important that all files within the selected path have the same data format and schema for the crawler to create an accurate table. You can choose to crawl all subfolders or only new ones in subsequent runs.
Configure the IAM role that the crawler will use to access your S3 bucket. You can either choose an existing role or create a new one, which will automatically have the necessary permissions.
In the Configure output section, you need to specify a database where the table will be stored. If you haven’t created one yet, you can create a new database.
The table name will default to the folder name in S3, but you can add an optional prefix.
Set the Schedule for the crawler. For this hands-on exercise, it’s set to On demand for manual execution. You can also schedule it to run hourly, daily, weekly, or monthly.
Review your settings and click Create Crawler.
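
The same settings can be collected in code instead of the console. The sketch below only builds a configuration dictionary and does not call AWS. The key names are meant to mirror the parameters of the boto3 Glue client's `create_crawler` call and should be checked against the boto3 documentation before use. The function name and all example values (crawler name, role, database, S3 path) are hypothetical.

```python
from typing import Any, Dict, Optional


def build_crawler_config(
    name: str,
    role: str,
    database: str,
    s3_path: str,
    table_prefix: str = "",
    schedule: Optional[str] = None,
    description: str = "",
) -> Dict[str, Any]:
    """Collect the crawler settings from the steps above in one dictionary."""
    config: Dict[str, Any] = {
        "Name": name,                                   # crawler name
        "Role": role,                                   # IAM role used to read S3
        "DatabaseName": database,                       # output database
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # S3 data source
    }
    if description:
        config["Description"] = description
    if table_prefix:
        config["TablePrefix"] = table_prefix
    if schedule:
        # Assumption: leaving the schedule out corresponds to "On demand".
        config["Schedule"] = schedule
    return config


# Hypothetical example values, for illustration only.
config = build_crawler_config(
    name="sales-crawler",
    role="GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://my-example-bucket/sales/",
)
```

Keeping the settings in a dictionary like this makes it easier to review them or reuse them across environments before any crawler is actually created.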

Once the crawler is created, you can run it. You can monitor its status in the crawlers overview. After a short period (around a minute or two), the crawler will complete its run.

To see the results, navigate to the Databases section in the Glue Data Catalog and select the database you created. You should now see a new table listed. This table contains the metadata inferred by the crawler, including the location of the data in S3, the data format (e.g., CSV), and the schema (column names and data types).

By selecting the table, you can verify the inferred schema and data types.
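
To get a feel for what schema inference involves, here is a toy version in plain Python. It is not how Glue crawlers are implemented. It reads a small CSV sample, takes the column names from the header row, and picks the narrowest type that fits every value in a column. The function names are hypothetical, and the type labels (`bigint`, `double`, `string`) are used here purely as illustrative names.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Pick the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)  # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text: str) -> Dict[str, str]:
    """Map each column in the CSV header to an inferred type label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {col: infer_type([(row[col] or "") for row in rows])
            for col in columns}


sample = "id,name,price\n1,apple,0.5\n2,pear,1\n"
schema = infer_schema(sample)
```

For this sample, `id` is inferred as `bigint`, `name` as `string`, and `price` as `double`. The sketch also hints at why the notes recommend that all files under one path share the same format and schema: mixed files would produce conflicting answers for the same column.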

This metadata now allows you to query the data in your S3 bucket using services like Amazon Athena, which will be covered in the following lectures. Importantly, the crawler only creates metadata; it does not move or duplicate your actual data in S3.
