Aws Athena - Overview

AWS Athena is an interactive, serverless query service that allows you to analyze data directly in Amazon S3 using standard SQL. It can query various file formats, including CSV, Avro, and Parquet, as long as their metadata is defined in the AWS Glue Data Catalog.

Here’s a breakdown of how it works:

  1. Data in S3: Your data resides in S3 buckets.
  2. Glue Data Catalog: You use AWS Glue crawlers to infer the schema of your data in S3 and create tables in the Glue Data Catalog. These tables contain the metadata (structure and format) of your data.
  3. Athena Querying: You can then use the tables in the Glue Data Catalog to run SQL queries on your data through the Athena interface. Athena uses the metadata to understand how to read and process the data in S3.
  4. Integration with other services: Athena can be used as a data source for other AWS services, like QuickSight for data visualization and report building. It can also connect to various applications using ODBC drivers.

Key benefits and use cases of AWS Athena include:

  • Serverless: You don’t need to manage any infrastructure. AWS handles provisioning and scaling automatically.
  • Pay-as-you-go: You are charged based on the amount of data scanned by your queries and the number of queries you run.
  • High Performance and Scalability: Athena is designed to provide fast query results even with large datasets.
  • Log Analysis: Analyze logs stored in S3 using SQL.
  • Ad-hoc Analysis: Perform quick data exploration directly on files in S3.
  • Data Lake Querying: Enables querying data within a data lake architecture built on S3.
  • Integration with Streaming Data: Can be integrated with services like Amazon Kinesis (though this will be discussed later).

The next lecture will demonstrate how to use Athena in practice to query the table created by the Glue crawler, illustrating the connection between the data catalog and the querying process.

Query data using Athena (Hands-on)

This lecture demonstrates how to use AWS Athena to query the data whose metadata was created by the Glue crawler in the previous step.

  1. Accessing Athena: From the Glue console, you can navigate to Athena by selecting your database (e.g., “customers”) and choosing “View data” or by selecting the table and under “Actions,” choosing “View data in Athena.” You might be prompted to confirm that you want to go to Athena.
  2. Athena Interface: This will take you to the Athena query editor. You might see a message about needing to set up a query result location in S3:
  3. Setting Up Query Result Location:
    • Click on “Edit settings.”
    • Choose an S3 bucket to store the results of your Athena queries. You can either create a new bucket or reuse the same bucket where your data is stored. It’s recommended to select the top-level bucket.
    • Athena will typically create a new folder within the selected bucket to store the query results.
    • Click “Save.”
  4. Querying Your Data:
    • In the Athena query editor, on the left side, you will see the databases and tables from your Glue Data Catalog.
    • Expand your database (e.g., “customers”) and then the table (e.g., the one corresponding to your S3 folder, like “documents”). This will show you the columns and their data types inferred by the crawler:
    • You can preview the table data by selecting the three dots next to the table name and choosing “Preview table.” This will run a basic SELECT * FROM "AwsDataCatalog"."customers"."table-documents" limit 10; query.
    • You can then write and execute standard SQL queries against your table. For example, to find the average of an “age” column, you can write:
    SELECT ROUND(AVG(age), 2) AS average_age
    FROM customers.your_table_name;
    • Execute the query by pressing Ctrl+Enter or clicking the “Run query” button. The results will be displayed in the editor, and a copy of the results will be stored in the S3 location you configured.

This demonstration shows how Athena uses the metadata in the Glue Data Catalog to query the actual data stored in your S3 bucket using familiar SQL syntax.

Federated Queries

Federated queries in AWS Athena extend its querying capabilities beyond data stored in S3. This advanced feature allows you to run SQL queries that span across data residing in various types of data stores, providing a unified interface for data analysis.

With federated queries, you can query:

  • Relational data sources: Such as Amazon RDS (MySQL, PostgreSQL, etc.) and other JDBC-compliant databases.
  • Non-relational data sources: Like Amazon DynamoDB and Amazon DocumentDB.
  • Object data sources: Including Amazon S3 (which is the standard Athena use case).
  • Custom data sources: You can even create your own connectors to access unique data stores.

The key to this functionality lies in federated data sources, which act as connectors. These connectors are essentially pieces of code that translate between the specific data source and the Athena query engine. Think of them as extensions that allow Athena to understand and interact with different database systems or data formats.

AWS provides several pre-built data source connectors, including:

  • Amazon CloudWatch Logs
  • Amazon DynamoDB
  • Amazon DocumentDB
  • Amazon RDS (for various database engines)
  • Other JDBC-compliant relational databases (like MySQL and PostgreSQL)

The primary benefit of federated queries is the ability to combine data from multiple, disparate sources for analysis without the need for complex ETL (Extract, Transform, Load) processes.

Consider a scenario where you need to analyze a specific customer’s purchase history. The purchase details might be stored in an RDS database, product information in another RDS database, and detailed customer profiles along with website interaction data in DynamoDB. Instead of building a pipeline to move and combine this data, you could use an Athena federated query to join data across these different sources directly. This allows for quick, ad-hoc analysis and provides a holistic view of the data.

In essence, federated queries in Athena offer a powerful way to break down data silos and gain insights from a wider range of data sources using familiar SQL.

Performance & Cost

This lecture focuses on strategies to optimize both the performance and cost of using AWS Athena.

Cost Optimization

  • Pay-per-query and data scanned: You are charged only for the queries you run and the amount of data scanned by each query. Failed queries do not incur charges.
  • Reduce data scanned: The primary way to reduce cost is to minimize the amount of data Athena needs to process for each query.
  • Reserved capacity: If you have predictable, high usage, you can consider reserved capacity for potential cost savings.

Performance Optimization (TODO: da risentire??)

  • Data Partitioning:
    • Structuring your data in S3 into logical partitions (e.g., by date, region) allows Athena to use partition pruning. Partition pruning eliminates irrelevant partitions before query processing, significantly reducing the amount of data scanned and improving performance.
    • Metadata about these partitions is stored in the Glue Data Catalog.
    • Athena makes a GetPartitions call to the Glue Data Catalog before pruning.
  • Partition Projection:
    • For tables with a large number of partitions, retrieving all partition metadata from Glue can impact performance.
    • Partition projection automates partition management and speeds up query processing for highly partitioned tables.
    • Instead of relying on the Glue Data Catalog for partition information, Athena calculates the partitions based on a defined range and projection type (e.g., for date or numerical sequences).
    • This eliminates the need to manually specify partitions in Glue and avoids the overhead of fetching a large number of partitions during query planning.
  • Glue Data Catalog Partition Indexes:
    • For tables with many partitions, retrieving all of them for query planning can be time-consuming.
    • Partition indexes in the Glue Data Catalog allow Athena to fetch only a subset of relevant partitions, improving query planning and reducing runtime.
  • Query Result Reuse:
    • Athena stores the results of queries in S3 as CSV files.
    • Enabling query result reuse allows Athena to return the results of a previously executed query if a subsequent query is identical (within a configurable time frame, e.g., 60 minutes).
    • This avoids rescanning the data, leading to faster performance and lower costs, especially for frequently run queries on relatively static data.
  • Data Format and Compression:
    • Compression: Compressing your data in S3 reduces its size, leading to lower storage costs and faster query performance because less data needs to be read.
    • Columnar Formats: Using columnar data formats like Parquet or Apache Avro (instead of row-based formats like CSV or JSON) significantly improves analytical query performance.
      • Columnar storage allows Athena to read only the specific columns needed for a query, rather than entire rows.
      • This reduces the amount of data scanned and improves query execution time.
      • Columnar formats also often offer better compression capabilities.

By implementing these performance and cost optimization strategies, you can make your Athena queries more efficient and cost-effective.

Workgroups

AWS Athena Workgroups provide a way to organize and isolate queries for different teams, applications, or use cases. They allow you to manage query execution settings, control costs, and configure access for specific workloads.

Key benefits and features of workgroups:

  • Query Isolation: You can separate queries from different projects or teams, preventing interference and making it easier to manage resources. For example, you might have one workgroup for report generation used by QuickSight and another for ad-hoc analysis by data scientists.
  • Dedicated Configurations: Each workgroup can have its own settings, such as:
    • Query execution settings: These control how queries are run within the workgroup.
    • Result location: You can specify a dedicated S3 bucket or path within a bucket to store the query results for each workgroup. This helps in organizing and managing query outputs.
  • Cost Tracking and Control: Workgroups enable you to track and control the costs associated with different types of queries or teams. This allows for better cost allocation and management.
  • Access Control: You can control access to workgroups using standard AWS Identity and Access Management (IAM) policies. This allows you to grant specific users or roles permission to use particular workgroups. Users with access will see a dropdown in the Athena console to select the workgroup they want to use for their queries.
  • Engine Type: While the default engine for Athena is Athena SQL, you can also create workgroups that are enabled for Apache Spark. This allows you to run Spark-based data processing within Athena. You would need to create a separate workgroup and configure its engine type to Apache Spark.
  • Limits: You can create up to 1000 workgroups per AWS region.
  • Primary Workgroup: Every AWS account has a default “primary” workgroup. All authenticated users in the account have access to this workgroup by default, and it cannot be deleted. For other workgroups you create, you can configure access using IAM trust policies.

In summary, Athena Workgroups are a valuable tool for organizing your Athena usage, managing costs, and controlling access for different workloads and teams, providing a more structured and efficient way to interact with your data.

Workgroups (Hands-on)

This lecture provides a brief demonstration on how to set up and configure a workgroup in AWS Athena.

  1. Accessing Workgroup Settings: In the Athena service, navigate to the left-hand navigation panel and expand it. Under the “Administration” section, you will find “Workgroups.”
  2. Primary Workgroup: By default, you’ll see the “primary” workgroup, which is always available. This workgroup uses the Athena SQL engine.
  3. Creating a New Workgroup: To create a new workgroup, click the “Create workgroup” button.
  4. Workgroup Configuration:
    • Name: Give your workgroup a descriptive name (e.g., “report generation” or “development team”).
    • Engine: Choose the engine type. The default is “Athena SQL” for interactive queries. You can also select “Apache Spark” to use the serverless Spark engine within Athena.
    • Authentication: You can leave the authentication settings at their default values for a basic setup.
    • Query Result Location: Optionally, specify an S3 bucket or path within a bucket where the query results for this workgroup will be stored.
    • Usage Control: You can set data limits or usage alerts to track and control costs for this specific workgroup. This allows you to isolate and manage costs for different teams or workloads.
  5. Using the Workgroup: Once the workgroup is created, you can select it from the dropdown menu in the query editor to run queries within that specific workgroup context.

This demonstration shows how to create and configure workgroups in the Athena console to organize and manage your queries and associated resources.