Glue Costs

This lecture provides an overview of the cost structure for various AWS Glue components, which is crucial for managing expenses, especially when following along with hands-on exercises.

Glue Crawlers

  • Pricing: You are billed an hourly rate based on the number of Data Processing Units (DPUs) used.
  • DPUs: These represent the compute power allocated to execute crawler tasks. A single DPU provides 4 vCPU and 16 GB of memory.
  • Billing Increment: Billing occurs in one-second increments.
  • Minimum Charge: There is a ten-minute minimum charge per crawler run.

Glue Data Catalog

  • Free Tier: The first 1 million objects stored in the Data Catalog are free. For typical training purposes, it’s unlikely you will exceed the 1 million object limit.
  • Beyond Free Tier: For anything exceeding 1 million objects, the cost is $1 per 100,000 objects per month.

Glue ETL Jobs

  • Pricing: Similar to crawlers, you are billed an hourly rate for the DPUs used to run the ETL job. The current rate mentioned is around $0.44 per DPU-Hour (though this can change, so it’s important to verify on the AWS website).
  • Billing Increment:
    • For Glue versions 0.8 and 0.9: Billing is in one-second increments with a ten-minute minimum.
    • For Glue version 2.0 and later: Billing is in one-minute increments with a one-minute minimum.
  • How many DPUs are used?
    • Apache Spark Jobs (underlying ETL): Minimum of 2 DPUs, with a default of 10 DPUs. For training, it’s advisable to reduce this default to the minimum if your workload allows.
    • Spark Streaming: Default and minimum of 2 DPUs.
    • Ray Jobs (Preview, for ML/AI): Use M-DPUs (Memory-Optimized DPUs) with 32GB memory (compared to standard 16GB). Default of 6 M-DPUs, minimum of 2 M-DPUs.
    • Python Shell Jobs: Suitable for simpler tasks not requiring distributed Spark compute. Can use a lower number of DPUs.

Glue Interactive Sessions (for Notebook Development)

  • Pricing: Billed based on the duration the session is active and the number of DPUs used for code execution.
  • Billing Increment: 1-minute minimum billing.
  • DPU Allocation: Minimum of 2 DPUs, with a default of 5 DPUs.
  • Idle Timeout: You can configure idle timeouts to automatically shut down sessions that are not in use to prevent unnecessary charges. It’s crucial to be mindful of how long these sessions remain active.

Cost Examples

  • ETL Job (Apache Spark):
    • Run time: 15 minutes (0.25 hours)
    • DPUs used: 6
    • Cost: 0.25 hours * 6 DPUs * 0.66
  • Interactive Session:
    • Active time: 24 minutes (24/60 = 0.4 hours)
    • DPUs used: 5 (default)
    • Cost: 0.4 hours * 5 DPUs * 0.88

AWS Budgets

To help manage and control your AWS costs, it’s recommended to set up budgets. In the following lectures, you will learn how to create a zero-spend budget and a $5 budget to receive alerts and stay within your desired spending limits, especially when following along with the hands-on exercises.

AWS Budget

AWS Budgets, found in the AWS Cost and Billing center, lets you set spending limits for your AWS account. You can also create alarms to receive email notifications when these limits are exceeded. This is a useful way to keep track of your AWS costs and avoid surprises, especially when you are practicing with different services.

You can configure budgets based on:

  • Actual Cost: This tracks your real spending as it happens.
  • Forecasted Cost: This predicts your future spending, allowing you to get alerts before you go over budget.

There are different kinds of budgets you can create:

  • Cost Budget: This is the most common type and allows you to set a specific monetary limit for your spending.
  • Usage Budget: For some services, you can set limits on how much you use them. This is an indirect way to control costs.
  • Savings Plans and Reservations Budget: If you are using Savings Plans or Reservations to save money, you can track how well these are being used.

For our purposes, the cost budget will be the most important.

Generally, using budgets for notifications is free. However, there’s a limit of two free action-enabled budgets (which can trigger automated actions). After that, they cost $0.10 per day. Since we will only be using notifications, this will be free for us.

Now that we understand AWS Budgets, let’s move on to setting one up in your AWS account.

Setting Up Budgets (Hands-on)

Now, let’s see how to set up budgets in your AWS account. In the AWS console, on the home page, you can often see a cost and usage dashboard. Clicking on the cost information will take you to the Cost Explorer, which is part of the Billing and Cost Management service.

In the Cost Explorer, you can view your cost data broken down by month and service. You can also see a more detailed table view and filter the data on the right side. Common filters include:

  • Services: To see the cost incurred by specific AWS services (e.g., Athena).
  • Region: To filter costs by AWS region.
  • Tags: If you’ve assigned tags to your AWS resources, you can filter costs based on these tags. This is useful for categorizing costs by department, business unit, or project.

To set up budgets, navigate to the Budgets section within the Billing and Cost Management service. You might see a list of existing budgets if you’ve created any before. In this example, two budgets are already set up:

  • Zero Spend Budget: This budget is designed to alert you if your spending exceeds $0.01. This helps you stay aware of any costs incurred, even small ones, and can indicate when you might be going beyond free tier limits or using paid services.
  • 5 USD Budget: This is a budget set at a $5.00 limit to help control spending on services like Glue, which can incur costs.

The budget overview shows information like the amount already used, the forecasted amount, and the current versus budgeted spending.

To create a new budget, click on the Create budget button. You have two options:

  1. Use a template: This can simplify the process for common budget types.
  2. Customize budget: This allows for more granular control over the budget settings.

Let’s walk through creating both the “Zero Spend Budget” using a template and the “5 USD Budget” using the customized approach.

Creating a Zero Spend Budget (using a template):

  1. Click on Create budget and choose the Zero spend budget template.
  2. Give your budget a name (it needs to be unique).
  3. Enter the email address where you want to receive notifications when the budget is exceeded.
  4. Click Create Budget.

Creating a 5 USD Budget (customized approach):

  1. Click on Create budget and choose Customize budget.
  2. Select the Cost budget type, as we want to track the actual cost incurred. Click Next.
  3. Give your budget a name (e.g., “5 USD Budget”).
  4. Set the Period to Monthly and ensure it’s a Recurring budget that does not expire. The Starting month will be the current month.
  5. Leave the Budgeting method as Fixed.
  6. Enter your desired budget amount (e.g., 5 USD).
  7. In the Configure budget scope section, you can choose to exclude specific services, but in this case, we want to monitor all services, so we’ll leave the default settings. Click Next.
  8. In the Configure alerts section, you can add alert thresholds. For the 5 USD budget, using a percentage of the budget might be useful. For example, you can set an alert at 80% of the budget (which would be $4.00).
  9. Enter the email address where you want to receive the alert notifications. Click Next.
  10. In the Configure actions (optional) section, we will choose No actions as we only want to receive notifications, which keeps the budget free. Click Next.
  11. Review your budget settings and click Create budget.

Once created, you will see your new budget listed. With these budgets set up, you will be notified if your AWS spending exceeds 5.00 in a month. This helps you stay within your desired spending limits as we move on to explore services like Glue in the next lecture.

Run Glue ETL Jobs (Hands-on)

In this lecture, we will create a simple ETL (Extract, Transform, Load) job in AWS Glue to move data from one location in S3 to another.

First, let’s look at our S3 bucket. We have a CSV file in our initial bucket. We want to move this data to a new location. For this demonstration, a new folder named “target-customers” has been created within the same bucket.

Now, let’s navigate to the AWS Glue service. In the Glue console, we’ll go to ETL jobs and create a new Visual ETL job. The visual ETL editor allows us to create jobs using a drag-and-drop interface with nodes representing sources, transformations, and targets.

  1. Name the Job: Give your job a descriptive name, such as “Our First ETL job”.
  2. Add a Source: Click the plus icon to add a node and select S3 as the source.
    • Configure the S3 source node (you can rename it for clarity, e.g., “S3 source”).
    • Specify the S3 path to your CSV file using the browse option and selecting the file.
    • Click Infer schema. Glue will automatically try to understand the structure of your CSV file, similar to how a crawler works. Interestingly, this process can also create metadata in the Glue Data Catalog.
  3. (Optional) Add Transformations: For this initial job, we will skip any transformations and focus on ingestion. Glue offers various transformation options like SQL queries, data quality checks, and aggregations, which we can explore later.
  4. Add a Target: Add another node by clicking the plus icon and select S3 as the target.
    • Connect the output of the source node to the input of the target node by dragging the endpoint of the source node to the starting point of the target node.
    • Configure the S3 target node. The Format is crucial here. Notice that the default format is Parquet. Glue automatically suggests converting the data to Parquet, which is a columnar format highly efficient for analytical workloads. We will accept this default.
    • Select the Target S3 location by browsing and choosing the “target-customers” folder we created in our bucket.
    • Configure Data Catalog update options. We want to “Create a table in the Data Catalog and on subsequent runs, update schema and add new partitions”.
    • Choose the Database (e.g., “customers”).
    • Enter a Table name (e.g., “destination-customers”). This will create a table in the Glue Data Catalog pointing to the Parquet data in the target location.
  5. Save the Job: Before running, save your Glue ETL job.
  6. Configure Job Details: Go to the Job details tab.
    • IAM role: You need to select an IAM role that grants Glue permission to access your S3 source and target locations. We will initially use the role created for the crawler.
    • Type: Use Spark.
    • Glue version: You can leave this at the default.
    • Worker type: We will use the default worker type.
    • Number of workers (DPUs): The default might be 10. To reduce costs, we will set this to the minimum allowed, which is 2.
    • Job bookmark: This feature is for incremental loads and will be discussed later.
  7. Run the Job (Initial Attempt and Troubleshooting): Run the job. You might encounter a “S3 access denied” error. This indicates that the IAM role assigned to the Glue job does not have sufficient permissions to read from the source bucket and write to the target bucket.
  8. Grant Necessary Permissions: To fix this, navigate to the IAM service.
    • Find the IAM role assigned to your Glue job.
    • Attach a policy to this role. For simplicity in this learning environment, you can attach the pre-built AmazonS3FullAccess policy. In a production environment, it’s recommended to grant more granular permissions.
    • Attach the policy to the role.
  9. Rerun the Job: Once the IAM role has the necessary S3 permissions, go back to the Glue job and rerun it.
  10. Verify the Output: After a few minutes, the job should succeed. To confirm:
    • Go to your S3 bucket and navigate to the “target-customers” folder. You should see the Parquet file(s) containing the moved and converted data.
    • Go to the Glue Data Catalog, under the “customers” database. You should see the “destination-customers” table, with its schema inferred from the Parquet data.

In the next lecture, we will learn how to schedule these Glue ETL jobs and also the Glue crawlers.

Scheduling crawlers & ETL Jobs

In the last lecture, we created an ETL job. If you look at the details, you can see the run history. You can manually run the job again, but you can also schedule it to run automatically.

To schedule an ETL job, navigate to Schedules tab. You can create a schedule to run the job daily or monthly. For example, you can schedule a job to run daily at 3:30 AM or on the first of the month at 3:30 AM. You can also add multiple schedules if needed, but usually, a single schedule is sufficient. For demonstration purposes, we will create a schedule and then delete it to avoid incurring costs.

To remove a schedule, select it and choose the delete option.

You can also schedule crawlers in a similar way. Go to the crawler overview and select the crawler you want to schedule. You can edit the crawler properties and navigate to the scheduling section. Here, you can set the frequency (e.g., daily) and the time for the crawler to run. For this example, we will not set up a schedule but just demonstrate how it can be done for both crawlers and ETL jobs.

Stateful vs Stateless

In this lecture, we will discuss the concepts of stateful and stateless systems, particularly in the context of data ingestion. This is a relevant topic for the exam. We will cover the theory and then look at practical examples of stateful versus stateless approaches.

Stateful System: A stateful system remembers the state or context of past interactions with a user or client. For example, when you log into a website and set preferences, the website remembers these preferences for future visits. The benefit of a stateful system is an improved user experience because previous interactions are recalled and used for subsequent requests.  

Stateless System: In contrast, a stateless system treats each new request independently, completely ignoring any previous interactions. For instance, in a RESTful API, you need to provide all necessary information with each request because no session data or prior interactions are stored. Every new request is executed in isolation.

These concepts also apply to data ingestion:

  • Stateful Data Ingestion: A stateful data ingestion system remembers what has been done in previous ingestion events. It might use timestamps, offsets, or processing status to track which data has already been loaded or processed. This allows it to only load or process new data that hasn’t been handled before.
  • Stateless Data Ingestion: Stateless ingestion, on the other hand, involves loading or processing all data again in each run, without remembering previous operations. While seemingly less efficient in some cases, it can be simpler as it doesn’t require storing any information about past runs.

AWS offers both stateful and stateless options for data ingestion:

  • Amazon Kinesis Data Streams: For real-time data streaming, maintaining state is crucial. Kinesis needs to track the sequence number of processed data to ensure that ingestion can resume correctly after failures or to avoid reprocessing data.  
  • AWS Data Pipelines: This service orchestrates workflows for both stateful and stateless data ingestion, depending on how you configure your pipeline.
  • AWS Glue: Glue ETL jobs can be configured for stateful ingestion using bookmarks. Bookmarks allow Glue to track the progress of previous runs and only process new data that has appeared since the last run. This is particularly useful when dealing with continuously arriving files in an S3 bucket.  

In the next lecture, we will explore a practical demonstration of how to implement stateful data ingestion using AWS Glue and its bookmark feature.

Stateless Data Ingestion in Glue (Hands-on)

Let’s examine stateless data ingestion in AWS Glue. We previously created an ETL job to load data from a “documents” folder to the “target-customers” folder in our S3 bucket.

Now, let’s simulate a scenario where a new file (“customers2.csv”) is added to our source location. If we rerun our existing Glue ETL job, it will process all the data in the source location again, including the data it has already processed in previous runs.

To demonstrate this, we will:

  1. Upload a new file: We upload “customers2.csv” to our source S3 location. This file contains additional customer records.
  2. Rerun the Glue ETL job: We manually trigger the same ETL job that we ran previously.
  3. Verify the data in the target: We can use Amazon Athena to query the table in our Glue Data Catalog that was created by the ETL job.

When we query the table after the second ETL job run, we will observe that the records from the first file have been loaded again, resulting in duplicate entries. For example, if the first file had 11 records and the second had 10, the table will now contain 32 records (11 original + 11 reloaded + 10 new).

This behavior is characteristic of stateless data ingestion. The ETL job does not remember which data it has processed before and therefore reprocesses everything in the source location each time it runs.

In many real-world scenarios, this is not the desired behavior. We often want to perform an incremental load, where only the new or updated data is processed. To achieve this in AWS Glue, we can use a feature called bookmarks, which allows us to implement stateful data ingestion. We will explore how to use Glue bookmarks in the next lecture.

Stateful Ingestion with Bookmarks (Hands-on)

Let’s summarize how to use bookmarks to create stateful data ingestion in AWS Glue.

To implement stateful data ingestion using bookmarks in AWS Glue:

  1. Enable Bookmarks: In your Glue ETL job’s settings (under Job details), enable bookmarks. When enabled, Glue remembers previously processed data. If a file is modified (rows added or data changed), bookmarks will also process these changes. If disabled, Glue will ignore state information.
  2. Ensure a Clean Target: Before the first run with bookmarks enabled, ensure your target location (e.g., an S3 bucket) is empty.
  3. Run the ETL Job with the Initial File: Remove “customers2.csv” file and then run the ETL job with the first file. Glue will now store the state information about this file.
  4. Add a New File to the Source: Add again the new file “customers2.csv” to the source location.
  5. Run the ETL Job Again: Rerun the same ETL job. Because bookmarks are enabled, Glue will only process the new file and any modifications to the original file. It will not reprocess the original data.
  6. Verify the Results: After the second run, the target location should contain data from both files, but without duplicates. The query results should reflect the combined data, not the duplicated data that would occur in a stateless ingestion.

In the example discussed, if the first file had 11 records and the second had 10, the final table will contain 21 unique records. This demonstrates a successful incremental load, where only the new data is processed.

Glue Transformations (ETL)

Let’s explore the important transformation capabilities available in AWS Glue as part of the ETL (Extract, Transform, Load) process.

The general ETL flow in Glue involves:

  1. Extract: Connecting to various data sources to extract data. Glue supports numerous data sources.
  2. Transform: Processing and modifying the extracted data.
  3. Load: Writing the transformed data into a target data store.
  4. Data Catalog Update (Optional according to Data Catalog update options): Updating the Glue Data Catalog with metadata about the loaded data, including creating new tables or updating existing ones and specifying the database.

After extracting the data, you can apply various transformations, including:

  • Basic Operations: Filtering rows, joining datasets, and performing different types of aggregations.
  • Advanced Transformations:
    • FindMatches ML: This machine learning-powered transform helps identify duplicate or matching records across different datasets, even when the data isn’t perfectly consistent (e.g., due to spelling errors or structural differences). This is useful for tasks like matching product catalogs from different sources without a common identifier. Another use case is to identify and remove duplicate records within a single dataset, which can be helpful when data is inconsistent or incomplete.
  • Data Privacy and Security:
    • Detect PII (Personally Identifiable Information): This transform scans your data to identify and manage sensitive information according to privacy regulations like GDPR. It can detect various types of PII, such as names, social security numbers, and credit card details. Once detected, you can configure actions to protect the data, such as tokenization or removal.
  • Data Format Conversion: A very common ETL task is changing the format of data. For example, converting CSV files from the source into the more analytically efficient Parquet format in the target. Parquet is a columnar storage format that significantly improves query performance for analytical workloads. Glue supports transformations between common data formats like JSON, Avro, and others.

After performing the necessary transformations, the data is loaded into the target data store. This could be:

  • A database.
  • An Amazon Redshift data warehouse.
  • Back into an Amazon S3 bucket.

If the target is S3, you can then use services like Amazon Athena or Amazon Redshift Spectrum to query the data directly in S3 without needing to load it into a separate database.

These transformation capabilities in AWS Glue provide a robust set of tools for preparing and moving your data for various analytical and operational needs.

Glue Data Quality (Hands-on)

Let’s explore the data quality feature within AWS Glue, which helps improve the reliability and integrity of your data. We’ll go through a hands-on example to understand how it works.

Glue Data Quality allows you to define a set of rules to ensure your data meets your expectations. You can create custom rules specific to your data and also leverage machine learning algorithms to detect unusual patterns beyond traditional rule-based checks. You can apply Glue Data Quality to data at rest (e.g., in S3) or to data in transit within your ETL pipelines.

In this example, we will use a simple ETL job that reads data from an S3 bucket and loads it into another target bucket. We will add a new node to this job: Evaluate Data Quality.

  1. Add the “Evaluate Data Quality” Node: In your Glue ETL job editor, search for “data quality” and select Evaluate data quality. This will add the node to your canvas. Connect the output of your data source node to the input of this new “Evaluate data quality” node.
  2. Configure the Data Quality Rules: Select the “Evaluate data quality” node and navigate to the rule set editor.
    • You have two main options:
      • Anomaly Detection: Glue can automatically scan your data and generate observations about potential anomalies.
      • Rule Set Editor: This allows you to define specific rules based on your data expectations. We will use this option.
    • In the rule set editor, you can define conditions. For example:
      • columnCount > 10 (checks if the number of columns is greater than ten).
      • completeness BETWEEN 0.4 AND 0.8 (checks if the completeness of a specific column is within a certain range).
    • Using the Rule Helper: If you find it easier, you can use the helper to add rules. For instance, to check the column count:
      • Select “ColumnCount” from the rules.
      • Set the condition (e.g., == 7 if your data has seven columns).
    • Previewing Rules: Glue allows you to preview how your rules would apply to your data in the Data preview tag just below the drag-and-drop interface. This helps in verifying if your rules are correctly configured. For example, a rule columnCount == 10 would fail on a dataset with seven columns. You would then adjust the rule to columnCount == 7 to make it pass:
    • Adding Multiple Rules: You can add multiple rules by separating them with commas. You can check data types (e.g., columnDataType("name") == "date"), check for the existence of columns (columnExists), and use more dynamic rules with functions like avg(last(10)):
  3. Configure Output Options: Still in the “Evaluate data quality” node’s configuration, you can specify what to do with the results:
    • Original data: You can choose to output the original data along with the quality evaluation.
    • Data quality results: You can configure Glue to output the data quality results (e.g., the outcome of each rule) to a specific location, such as another S3 bucket. This allows you to monitor and analyze your data quality metrics.
  4. Define Actions on Rule Failure: Select again the Evaluate Data Quality node. In its settings, you can define what should happen if the data quality rules fail. For example, you can set the job to Fail job without loading the target data. This is crucial for preventing the loading of low-quality data into your target and can help save resources. If the rules fail, the job will stop, and you can potentially trigger other actions as part of a workflow (which will be discussed later in the course):

By using the “Evaluate data quality” feature in AWS Glue, you can build robust ETL pipelines that not only move and transform data but also ensure its quality and reliability. This helps maintain data integrity and improves the overall value of your data.

Glue Workflows

AWS Glue Workflows provide an orchestration capability within Glue to manage the execution of your crawlers and jobs. This is an ideal way to manage your Glue operations by coordinating the different components.

Within a Glue Workflow, you can connect:

  • Crawlers: To discover and catalog your data.
  • Jobs: Your ETL processes to transform and load data.
  • Triggers: To define the conditions under which crawlers and jobs are executed. Triggers can be based on the successful or failed completion of a previous crawler or job, allowing you to create dependencies between tasks.

For more complex orchestration that involves services beyond Glue, AWS Step Functions would be a more comprehensive solution. However, for managing primarily Glue components, Glue Workflows are well-suited.

You can create Glue Workflows visually through the AWS console, making it easy to understand the flow of tasks. You can configure the components and triggers within the visual interface and monitor the execution of the entire workflow. Workflows can also be created from Glue Blueprints, manually through the interface, or using the AWS Glue API.

Triggers are a key aspect of Glue Workflows. They define when a job or crawler should start. Triggers can initiate both jobs and crawlers and are activated upon the completion (success or failure) of a preceding job or crawler.

Each Glue Workflow is initiated by a primary trigger, also known as a start trigger. These come in different types:

  • Schedule Triggers: These triggers start the workflow at regular intervals defined by a schedule. You can specify the schedule using options like daily, weekly, monthly, or custom periods using cron expressions.
  • On-demand Triggers: These triggers allow you to manually start the workflow from the Glue console.
  • EventBridge Event: You can configure a rule to schedule the workflow or to trigger it in response to specific events captured by Amazon EventBridge. This allows for more flexible and event-driven orchestration. For example, you could trigger a workflow when a new file is uploaded to an S3 bucket.

On-demand triggers offer flexibility and can be combined with other AWS services like Lambda functions or EventBridge rules to initiate workflows based on external events or custom logic.

Glue Workflows (Hands-on)

Let’s walk through the practical use of AWS Glue Workflows:

  1. Access Glue: Navigate to the AWS Glue service in the AWS console. In the left-hand menu, select Workflows (orchestration).
  2. Create a New Workflow: Click on Add workflow. Give your workflow a name (e.g., “Workflow_one”). You can also configure optional settings like maximum concurrency and add tags. Click Create workflow:
  3. Visualize the Workflow: Once created, select your workflow to open the graphical interface.
  4. Add a Trigger: Every workflow starts with a trigger. Click Add trigger. You can choose an existing trigger or create a new one. Let’s create a new On-demand trigger named “Trigger_On_Demand”. With an on-demand trigger, you can manually start the workflow from the Glue console. This type of trigger can also be initiated by other AWS services like EventBridge or Lambda. Click Add.
  5. Add a Job: Now, let’s add a Glue job to our workflow. Click Add node and select Job. Choose your existing ETL job (e.g., “Our First ETL job”) and click Add. This adds the job node to the workflow.
  6. Add a Trigger Based on Job Completion: To chain another action after the job, select the ETL job node. Under Actions, choose Add trigger. This will create a new trigger linked to the job’s events. You can configure the trigger to start after the job succeeds or fails. Let’s name this trigger “Second_trigger” and set it to trigger After any watched event where the watched event is the successful completion of the previous job. Click Add.
  7. Add a Crawler: Let’s say you want to run a crawler after the ETL job succeeds. Choose Add node and select Crawler. Choose one of your existing crawlers and click Add. Now, your workflow will run the job, and upon successful completion, it will trigger the selected crawler.
  8. Configure Trigger Events (Optional): You can modify the event that triggers a subsequent action. For example, if you select the “Our First ETL job”, you can click on Edit watched event in the Action menu to trigger on job failure instead of success:
  9. Add More Components (Optional): You can continue adding more jobs, crawlers, and triggers to build a more complex workflow. You can also add multiple dependencies and parallel execution paths.
  10. Run the Workflow: You can run the workflow manually using the on-demand trigger by selecting the initial trigger node and choosing Run workflow. Alternatively, if you had set up a scheduled trigger or an EventBridge trigger, the workflow would run automatically based on those configurations.
  11. Delete the Workflow (Cleanup): To avoid incurring unnecessary costs from scheduled triggers, it’s good practice to delete the workflow if you were just experimenting. Select your workflow and choose Delete workflow from the Actions menu.

This demonstrates the basic steps of creating and configuring a Glue Workflow to orchestrate your Glue jobs and crawlers. The visual interface makes it easy to manage dependencies and build automated data processing pipelines.

Glue Job Types

When creating AWS Glue jobs, you have several options for the underlying execution engine, which are often referred to as different job types. The choice of engine depends on your specific data processing needs and can impact performance and cost.

Here’s an overview of the different Glue job types:

  • Visual ETL: When you use the visual interface (drag-and-drop) to create Glue jobs, the underlying engine is automatically selected for you based on the chosen data sources and transformations. Typically, this defaults to Spark, and you don’t have direct control over changing it within the visual editor.
  • Notebook: You can interactively develop and experiment with data processing code using Glue notebooks. Similar to visual ETL, the engine for notebooks is usually pre-selected, often Spark.
  • Authoring Script (Custom Code): When you write your Glue job script from scratch, you have more control over selecting the execution engine. The available options include:
    • Spark: This is the most common engine for large-scale data processing workflows in Glue. It leverages the distributed computing capabilities of Apache Spark to handle big data workloads. You can scale the processing capacity from 2 to 100 Data Processing Units (DPUs). One DPU consists of 4 vCPUs and 16GB of memory. You are billed based on DPU hours, with a minimum of 2 DPUs for scaling. Glue also offers Spark Streaming ETLs for real-time data analysis and processing from streaming sources like Kinesis Data Streams and Kafka, with similar DPU options.
    • Python Shell: This engine is designed for simpler, smaller-scale data processing tasks where the distributed power of Spark isn’t necessary. It allows you to run Python scripts in a managed environment with significantly fewer resources, ranging from 0.0625 to 1 DPU. This can be a cost-effective option for lightweight tasks.
    • Ray: This is a more recently added engine that utilizes the Ray framework, an open-source Python framework for building distributed applications. Ray allows you to use your familiar Python skills and libraries (potentially from machine learning or data science) and scale them for large datasets in a multi-node environment. It’s well-suited for tasks like parallel data transformations, machine learning ingestion, and distributed batch inference. Ray jobs run on new Graviton-based EC2 worker types, which are exclusive to this engine.

Beyond the choice of engine, Glue also offers different execution types for your jobs:

  • Standard: This is the default execution type where jobs are run as soon as they are triggered.
  • Flex: This is a cost-effective alternative for less time-sensitive ETL workloads. Jobs with flex execution are queued and run when resources are available, which might introduce some delay in start times. However, this can lead to significant cost savings, making it suitable for tasks like weekly data transformations that don’t have strict time constraints.

Glue Job Types (Hands-on)

Let’s take a quick look at how the different Glue job types manifest in the AWS console:

  1. Visual ETL Job: When you create a job using the visual editor, the engine is automatically determined based on your chosen sources and transformations. If you navigate to an existing visual ETL job and go to the Job details tab, you’ll see the Type field. In most cases for visual jobs, this will be set to Spark. You can adjust parameters like the language, worker type, and the number of workers (DPUs). Remember that for Spark jobs, the minimum number of workers is typically 2. You can also enable Flex execution for cost savings on less time-critical workloads:
  2. Script Editor Job: If you choose to write your Glue job script directly using the script editor, you have more control over the engine. When creating a new job, you can select the job type (e.g., Spark, Python shell, or Ray) before you start writing your code: This allows you to choose the engine that best suits your needs. For instance, if you wanted to use fewer resources than the minimum for Spark, you could choose Python shell. Or, if you wanted to leverage the Ray framework, you could select Ray as the job type and then write your Python script accordingly.

Partitioning

Let’s discuss partitioning in the context of AWS Glue, which is a crucial performance optimization technique, especially when querying large datasets stored in locations like Amazon S3. While primarily beneficial for query performance, partitioning also plays a role in optimizing Glue’s data processing.

Partitioning involves organizing your data into separate file locations based on specific conditions. A common partitioning strategy is by date. For example, you might have an S3 bucket with a structure like:

s3://your-bucket/data/year=2023/
s3://your-bucket/data/year=2024/
s3://your-bucket/data/year=2025/

Within each year folder, you might have further partitions by month or day.

The primary advantage of partitioning for querying is that when you have a query with a filter condition on the partitioned column (e.g., WHERE year = 2023), query engines like Amazon Athena can skip scanning the data in all other partitions (e.g., the folders for 2024 and 2025). This selective scanning significantly reduces the amount of data read (I/O operations), leading to faster query execution and improved response times.

For ETL operations in Glue, when data is organized into partitions, Glue can often process each partition independently. This can improve data manageability and potentially parallelize processing for different partitions, leading to better performance.

Furthermore, partitioning can result in cost savings, particularly when querying data with services like Athena, which charges based on the amount of data scanned. By reducing the scanned data, partitioning directly contributes to lower query costs.

When using AWS Glue, you typically define partitioning when setting up your ETL job, specifying how the data should be organized in the target location. Glue crawlers can also automatically recognize and work with data that is already partitioned in S3, provided the directory structure follows a common partitioning scheme like key=value (as seen in the year example above).

In S3, the partition structure is often reflected in the directory organization, making it a key aspect of your data lake design for both efficient processing and querying.

AWS Glue DataBrew

AWS Glue DataBrew is a visual data preparation tool that allows you to clean and normalize data without writing any code. If you’re familiar with Power Query in Excel or Power BI, you’ll find the interface and concepts quite similar.  

The primary purpose of DataBrew is to provide a user-friendly visual environment where you can preview your data and apply a series of data transformation steps. This is particularly useful for:

  • Data Cleaning: Handling missing values, correcting inconsistencies, and standardizing formats.
  • Data Restructuring: Reshaping data, splitting and combining columns, and pivoting tables.
  • Data Pre-processing: Preparing data for ETL pipelines or machine learning models.

DataBrew is part of the broader AWS Glue suite, which is designed for data integration. It offers over 250 pre-built transformations that simplify the process of modifying and transforming data. The no-code visual approach makes it accessible to users without extensive programming skills.

Key benefits of using DataBrew include:

  • Ease of Use: The graphical interface allows anyone to perform complex data transformations visually.
  • Automation: You can schedule your data preparation jobs to automate repetitive tasks, ensuring consistent and well-prepared data.
  • Integration: DataBrew seamlessly integrates with other AWS services like S3 for data storage, Redshift for data warehousing, and AWS Lake Formation for security and governance. It also works with AWS IAM for managing permissions.

The DataBrew interface provides several key features:

  • Data Preview: You can see a sample of your data, allowing you to understand its structure and identify potential issues.
  • Column Profiles: DataBrew generates statistics and visualizations for each column, such as the number of distinct values, data distribution, mean, mode, and the presence of null values. This helps you quickly get an overview of your data and spot inconsistencies.
  • Project: A project in DataBrew is where you configure your transformation tasks. You associate a dataset with a project and then add a series of transformation steps.
  • Steps: These are the individual transformations you apply to your data within a project. Examples include splitting columns, rounding numbers, changing text case, joining datasets, and grouping data.
  • Recipe: A recipe is an ordered sequence of the transformation steps you’ve applied in a project. It represents the entire data preparation workflow. You can save recipes and reuse them across different projects and datasets.
  • Job: A job is the execution of a recipe on a specific dataset. When creating a job, you specify the output location (e.g., an S3 bucket) for the transformed data. Jobs can be scheduled for automated execution.
  • Data Profiling: You can run profiling jobs on your entire dataset to get a comprehensive analysis, including statistics and visualizations for all columns. This helps in understanding the overall quality and characteristics of your data.

In the next lecture, we will delve deeper into some specific and important data transformations available in DataBrew that are useful for reshaping data and gaining different perspectives, especially when preparing it for visualization or machine learning.

AWS Glue DataBrew - Transformations

Before we dive into the practical demonstration of AWS Glue DataBrew, let’s discuss some key data transformations that are fundamental in data preparation and data engineering. Understanding these concepts is crucial for effectively manipulating data for reporting, analysis, and other downstream processes.

Here are some important transformations:

  • Nest to Map: This transformation combines multiple columns into a single column containing a map (or dictionary) of key-value pairs. The column names become the keys, and the corresponding cell values become the values. For example, columns “name”, “age”, “city” with values “Alice”, 30, “New York” would be transformed into a single column with the value {"name": "Alice", "age": 30, "city": "New York"}:
  • Nest to Array: Similar to “Nest to Map,” this combines multiple columns into a single column, but instead of a map, it creates an array of the column values. Using the same example, the result would be a single column with the value ["Alice", 30, "New York"]:
  • Nest to Struct: This function is very similar to “Nest to Map” but guarantees the retention of the exact data type and the order of the original columns. Use this when data type and order precision are critical.
  • Unnest Array: This transformation takes a column containing an array and expands it into multiple columns. Each element of the array in a row becomes the value in a new column for that row. For an array column ["Alice", 30, "New York"], it would create new columns (potentially named based on index or you can specify names) with these values:
  • Unnest Map: This is the reverse of “Nest to Map.” It takes a column containing a map (key-value pairs) and expands it into multiple columns. The keys of the map become the new column names, and the corresponding values populate the cells.
  • Pivot: This transformation rotates data from rows into columns. You select a “pivot column” (e.g., Quarter) and a “value column” (e.g., Sales). The unique values in the pivot column become new column headers, and the values from the value column are distributed accordingly. This is useful for creating more report-like table structures:
  • Unpivot: This is the opposite of “Pivot.” It transforms columns back into rows. You select the columns you want to “unpivot,” and their names become values in a new “attribute” column, while their original values go into a new “value” column. This is often preferred for data visualization as it creates a longer, narrower table that is easier to work with in charting tools:
  • Transpose: This transformation swaps the rows and columns of a dataset. What were column headers become values in a new column, and the original row values become the new column headers:

In addition to these, some other very common and generally important transformations include:

  • Join: Combining two datasets based on a common column (key).
  • Split Column: Dividing a single column into multiple columns based on a delimiter.
  • Filter: Selecting rows based on specific conditions.
  • Sort: Ordering the rows of a dataset based on one or more columns (often an intermediate step for other transformations or for visual inspection).
  • Convert Data Type: Changing the data type of a column (e.g., string to datetime, string to number).
  • Aggregate: Summarizing data (e.g., calculating counts, sums, averages).

AWS Glue DataBew (Hands-on)

Let’s get hands-on with AWS Glue DataBrew to see its visual data preparation capabilities in action.

  1. Access AWS Glue DataBrew: Search for “DataBrew” in the AWS Management Console. You’ll find the service for visual data preparation.
  2. Pricing Overview: Before starting, it’s good to understand the pricing. DataBrew offers:
    • Interactive Sessions: Billed at $1 per session (first 40 sessions are free for new users). A session lasts for 30 minutes. This is what we’ll use for our interactive exploration.
    • DataBrew Jobs: For production use and scheduled runs, billing is $0.48 per node per hour, broken down to the minute. You can configure the number of nodes.
  3. Create a Sample Project:
    • Navigate to Projects in the DataBrew console and click Create project.
    • Give your project a name.
    • Choose a Dataset. You can select from your existing datasets or connect to a new one (e.g., from S3). For this example, you can choose any sample dataset provided (like the chess game moves dataset).
    • Select an IAM role that DataBrew can use to access your data. You can create a new role if needed. Provide a prefix for the role name (e.g., “Test2”) and click Create project.
  4. Explore the Interactive Session:
    • DataBrew will provision resources and load your dataset into an interactive session. This might take a minute. Once ready, you’ll see the data preview in the middle of the screen:
    • Column Details and Profile: When you select a column, you’ll see details like data type and statistics in the column profile pane. You can also navigate to the Profile tab to run a full data profile for more in-depth analysis of data quality and distribution. The Schema tab provides information about the data structure.
  5. Apply Transformations Interactively:
    • Transform from Column Selection: Select a column in the data preview. A menu will appear with suggested transformations. For example, if you select a string column, you might see options to convert case, trim whitespace, or replace values:
    • Add Steps Manually: You can also click Add step to browse the extensive list of available transformations.
    • Example Transformations:
      • Delete Column: Select the “created_at” column (or any column you want to remove), choose Delete, and click Apply. This step will be added to your recipe.
      • Apply a Function: Select the “turns” column (or a numeric column), choose FUNCTION in the top bar, and then select a mathematical function like ADD. Enter a value to add (e.g., 2) and click Preview changes to see the effect. If you’re satisfied, click Apply.
  6. View the Recipe: The Recipe pane on the left will show the sequence of transformation steps you’ve applied. You can edit or delete steps here.
  7. Create a Job (for Scheduled Execution and Output):
    • Click Create job.
    • Give your job a name (e.g., “ML-Job1”).
    • Output: Configure the output location (e.g., an S3 bucket) and the desired file format (e.g., CSV with a specific delimiter and compression).
    • Advanced job settings: Adjust the capacity (number of nodes) as needed to manage costs.
    • Associated schedules (Optional): You can attach a schedule to automate the job. Click Create schedule if you don’t have one. Define the schedule name (e.g., “daily”), frequency (e.g., specific days like Tuesday and Wednesday), and time intervals (e.g., every 8 hours or at a specific time like 23:00). Click Add.
    • IAM Role: Select an IAM role for the job to access resources.
    • Click Create and run job to save and immediately execute the job.
  8. Publish Recipe (for Reusability): You can click Publish recipe to save the sequence of transformations so you can reuse it on other datasets or in different projects.
  9. Clean Up (Delete Projects): To avoid unnecessary costs, especially from lingering interactive sessions, navigate back to the Projects page. Select the project(s) you created and click Delete. Confirm the deletion.

This hands-on exercise provides a basic understanding of how to use AWS Glue DataBrew for interactive data exploration, transformation, and job scheduling. Remember to clean up your resources to avoid unexpected charges.