Leveraging Glue to act as a central Metadata store

Christopher Lagali
6 min read · Oct 23, 2019


I have often used Hive in conjunction with HDFS to power my big data jobs on pyspark, a combination that is easy to set up and use.

A Hive Metastore hosted on the master node made perfect sense in an on-premise setup where the cluster was always up, save for the times when a master or slave node would crash.

However, this setup breaks down when EMR is being considered and cost effectiveness is the need of the hour. Not to mention, transient EMR clusters that are invoked only for a particular set of jobs cannot afford to waste precious time and resources setting up Hive on every launch.

EMR Cluster that uses Hive as the Metastore

Here comes Glue to the rescue!!!!

Imagine an external persistent data store, managed by AWS, that houses all your metadata and is highly available. Ding Ding Ding! We have a winner!

This article will demonstrate the steps used to set up the Glue Data Catalog and access it from an EMR cluster using pyspark.

Architecture diagram depicting the potential data flows with Glue acting as a catalog for our data sets

Data files used for this demonstration will be housed in an S3 bucket and accessed using EMRFS.

Step 1: Setup Demo Data Set

Create a bucket with 2 folders:

  • read
  • write
S3 bucket setup for the Tutorial

I have used the movie data set for this demonstration that can be downloaded here.

Add this file to the read folder.
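
If you prefer to script this step, a minimal boto3 sketch along these lines would create the bucket layout and upload the file; the bucket name glue-blog-tutorial and the file name movies.csv are placeholders I have assumed, so substitute your own:

import boto3

s3 = boto3.client("s3")
bucket = "glue-blog-tutorial"  # placeholder; bucket names are globally unique

s3.create_bucket(Bucket=bucket)  # add a CreateBucketConfiguration outside us-east-1
# S3 has no real folders; zero-byte keys ending in "/" appear as folders in the console
s3.put_object(Bucket=bucket, Key="read/")
s3.put_object(Bucket=bucket, Key="write/")

# Upload the movie data set into the read folder
s3.upload_file("movies.csv", bucket, "read/movies.csv")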

Step 2: Setup a Glue Table

The first step in setting up a data catalog is to create a table in Glue that will house the metadata of the target data set.

It is essential to understand a couple of terms before we proceed:

  • Crawler: A program in Glue that scans data in a repository (of our choice) to classify it, extract the schema information, and store it in a Glue table.
  • Catalog Table: A table that contains the schema information (populated by the crawler) of one or more data sets that share a similar structure.

Setting up a Crawler

Head over to AWS Glue from your AWS Console and select Crawlers from the Data Catalog Menu located to the left.

Creating Crawlers in AWS Glue

Select Add Crawler to create a new Crawler which will scan our data set and create a Catalog Table.

There are multiple steps one must go through to set up a crawler, but I will walk you through the crucial ones.

1. Crawler Info: Specify the name of the crawler and tags that you wish to add.

2. Crawler Source Type: Here we specify the data source type.

Specifying a Data Store Type

3. Data Store: Specify S3 as the Data Store and the bucket name in the Include path section.

Setting up the data store and the path

4. IAM Role: Glue needs an IAM Role that will allow it to access an S3 bucket.

First-time users are required to create an IAM role that the crawler can use to access our S3 bucket.

Though AWS Glue offers to create an IAM role for us, I would suggest creating it first in another window and specifying it here via Choose an existing IAM role.

Adding IAM Role for the Crawler

Note: AWS recommends attaching an S3 managed policy as well as an inline policy for the specific file path. However, I found it easier to provide S3 read access via the managed policy, which ruled out the need for an inline policy.
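
If you would rather script the role creation, here is a rough boto3 sketch under the same assumptions; the role name GlueS3ReadRole is a placeholder:

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GlueS3ReadRole",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWSGlueServiceRole covers Glue's own permissions; S3 read access covers the data
for arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
):
    iam.attach_role_policy(RoleName="GlueS3ReadRole", PolicyArn=arn)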

5. Schedule: This section is concerned with setting up a schedule for running our crawler. This schedule can be configured based on how often our dataset schema changes.

Specifying a schedule for the Crawler

Since my data set is static and the schema won't change, I have chosen to Run on demand. If my data set schema changes (say, a new column is added), then I will need to rerun this crawler.
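
Should you later decide to put the crawler on a schedule instead, Glue accepts a cron expression; a one-line boto3 sketch (the crawler name movie-data-crawler is a placeholder):

import boto3

glue = boto3.client("glue")
# Run the crawler daily at 02:00 UTC; adjust the cron expression as needed
glue.update_crawler(Name="movie-data-crawler", Schedule="cron(0 2 * * ? *)")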

6. Output: Here we are asked to specify a database for the catalog tables.

Grouping behavior, when enabled, will group data sets that have a similar schema into one table, which comes in handy when we are scanning multiple files with the same schema.

Creating a Glue database for the Catalog tables

I have chosen to go with the defaults here, considering the fact that I am scanning a single data set.

We now have a Crawler Configured!!! Yay!!!
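
For completeness, the same configuration can be expressed in code. A minimal boto3 sketch, reusing the placeholder names from earlier (database glue-blog-tutorial-db, role GlueS3ReadRole, bucket glue-blog-tutorial):

import boto3

glue = boto3.client("glue")

# Create the database that will hold the catalog tables
glue.create_database(DatabaseInput={"Name": "glue-blog-tutorial-db"})

# Create and start a crawler equivalent to the console setup above
glue.create_crawler(
    Name="movie-data-crawler",  # placeholder name
    Role="GlueS3ReadRole",      # the IAM role from step 4
    DatabaseName="glue-blog-tutorial-db",
    Targets={"S3Targets": [{"Path": "s3://glue-blog-tutorial/read/"}]},
)
glue.start_crawler(Name="movie-data-crawler")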

Run the Crawler and check the Tables:

movie_txt Glue catalog table that displays the schema

If executed successfully, the crawler will create a table populated with details about the schema of the data set.
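
You can also verify the result programmatically; this small sketch prints the columns the crawler inferred (the table name movie_txt is taken from the screenshot above):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="glue-blog-tutorial-db", Name="movie_txt")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])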

Step 3: Configure pyspark to use Glue

Feel free to check out my article Launch Jupyter notebooks with pyspark on an EMR Cluster if you wish to leverage Jupyter notebooks hosted on an EMR Cluster.

In the Software and Steps section, be sure to check the Use for Spark table metadata option to avoid using Hive as the metadata store.
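
Under the hood, that checkbox applies the spark-hive-site configuration classification. If you create clusters programmatically, the equivalent setting (sketched here as the Configurations entry you would pass to boto3's run_job_flow) looks roughly like this:

spark_glue_config = [{
    "Classification": "spark-hive-site",
    "Properties": {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    },
}]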

In our Jupyter notebook we will now perform the following:

1. Configure Spark Session to use Glue:

By default, Spark is configured to use Hive as its metastore.

We will thus set hive.metastore.client.factory.class to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory:

from pyspark.sql import SparkSession

# Point the Hive metastore client at the Glue Data Catalog
spark = SparkSession.builder \
    .appName("myjob") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

# Set the current database to the Glue database name
spark.catalog.setCurrentDatabase("glue-blog-tutorial-db")

2. Query the table using Spark SQL

movie_data = spark.sql("select * from movie_txt")
movie_data.show()
Spark SQL to query movie data hosted on S3 via EMRFS
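
Equivalently, the DataFrame API resolves the table name through the same Glue catalog:

# Same query via the DataFrame API; the table is looked up in Glue
movie_data = spark.table("movie_txt")
movie_data.printSchema()
movie_data.show()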

Query Movie data via Athena [Optional]

AWS has integrated Athena with Glue, enabling users to fire SQL queries against data stored on S3.

Cool isn’t it!!!

Head over to Athena from your AWS Console.

Select the Database in Athena

Select the database that we just created in Glue.

The Tables section will display a list of all tables under glue-blog-tutorial-db database.

Athena queries Glue to retrieve a list of tables in the specified database

Before querying the table, be sure to set up an output folder for the results of the query.

Setup output folder for Athena to write the results of the query

You are now ready to fire queries against your data set via Athena.

Querying our movie_txt table
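
Athena queries can also be fired programmatically; a minimal boto3 sketch, pointing the output location at the write folder created in Step 1 (the bucket name is the same placeholder as before):

import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM movie_txt LIMIT 10",
    QueryExecutionContext={"Database": "glue-blog-tutorial-db"},
    ResultConfiguration={"OutputLocation": "s3://glue-blog-tutorial/write/"},
)
print(response["QueryExecutionId"])  # results land in the write folder on S3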

In Conclusion:

  • Glue can act as a persistent, highly available data catalog for any AWS service that wishes to query your data set residing on S3.
  • EMR clusters can now be run in a transient fashion, which is highly cost-effective.
