Leveraging Glue to act as a central Metadata store
I have often used Hive in conjunction with HDFS to power my big data jobs on PySpark, a setup that is easy to configure and use.
A Hive Metastore hosted on the master node made perfect sense for an on-premises setup where the cluster was always up, save for the times when the master or a worker node would crash.
However, this setup falls apart when EMR is being considered and cost effectiveness is the need of the hour. Not to mention, transient EMR clusters that are spun up only for a particular set of jobs cannot afford to waste precious time and resources setting up Hive on every launch.
Here comes Glue to the rescue!!!!
Imagine an external, persistent data store that is managed by AWS and houses all your metadata with high availability. Ding Ding Ding! We have a winner!
This article walks through the steps to set up the Glue Data Catalog and access it from an EMR cluster using PySpark.
Data files used for this demonstration will be housed in an S3 bucket and accessed using EMRFS.
Step 1: Set Up the Demo Data Set
Create a bucket with 2 folders:
- read
- write
For this demonstration I have used the movie data set, which can be downloaded here.
Add this file to the read folder.
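If you prefer to script this step, here is a minimal sketch using boto3; the bucket name and local file name are placeholders you would replace with your own.

import boto3

s3 = boto3.client("s3")

bucket = "my-glue-demo-bucket"   # hypothetical bucket name
s3.create_bucket(Bucket=bucket)  # us-east-1; other regions need a CreateBucketConfiguration

# Create the two "folders" (S3 prefixes) and upload the movie file into read/
s3.put_object(Bucket=bucket, Key="read/")
s3.put_object(Bucket=bucket, Key="write/")
s3.upload_file("movies.csv", bucket, "read/movies.csv")  # local file name is a placeholder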
Step 2: Set Up a Glue Table
The first step in setting up a data catalog is to create a table in Glue that will house the metadata of the target data set.
It is essential to understand some terminology before we proceed:
- Crawler: Programs in Glue that scan data in any repository (of our choice) to classify it, extract the schema information and store it in a Glue table.
- Catalog Table: A table that contains the schema information (populated by the crawler) of one or more data sets that have similar structure.
Setting up a Crawler
Head over to AWS Glue from your AWS Console and select Crawlers from the Data Catalog Menu located to the left.
Select Add Crawler to create a new Crawler which will scan our data set and create a Catalog Table.
There are multiple steps one must go through to set up a crawler, but I will walk you through the crucial ones (a boto3 sketch of the equivalent configuration follows the list).
1. Crawler Info: Specify the name of the crawler and tags that you wish to add.
2. Crawler Source Type: Here we specify the data source type.
3. Data Store: Specify S3 as the Data Store and the bucket name in the Include path section.
4. IAM Role: Glue needs an IAM Role that will allow it to access an S3 bucket.
First time users are required to create an IAM Role that the Crawler can use to access our S3 bucket.
Though AWS Glue offers to create an IAM role for us, I would suggest creating it first in another window and specifying it here via Choose an existing IAM role.
Note: AWS recommends attaching an S3 managed policy as well as an inline policy scoped to the specific file path. However, I found it easier to grant S3 read access via the managed policy alone, which ruled out the need for an inline policy.
5. Schedule: This section is concerned with setting up a schedule for running our crawler. This schedule can be configured based on how often our dataset schema changes.
Since my data set is static and the schema won't change, I have chosen Run on demand. If my data set's schema changes (say, a new column is added), I will need to rerun this Crawler.
6. Output: Here we are asked to specify a database for the catalog tables.
Grouping Behavior, when enabled, will group data sets that have a similar schema into one table, which comes in handy when we are scanning multiple files with the same schema.
I have chosen to go with the defaults here, considering the fact that I am scanning a single data set.
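For reference, here is a minimal boto3 sketch of the same crawler configuration; the crawler name, role ARN, database name and bucket path are placeholders for whatever you chose in the console.

import boto3

glue = boto3.client("glue")

# All names below are placeholders for this sketch
glue.create_crawler(
    Name="movie-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="glue-blog-tutorial-db",
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/read/"}]},
)

glue.start_crawler(Name="movie-data-crawler")  # same effect as Run on demand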
We now have a Crawler Configured!!! Yay!!!
Run the Crawler and check the Tables:
If it executes successfully, the Crawler will create a table populated with the schema details of the data set.
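You can also verify the result programmatically; a small sketch assuming the database name used in this walkthrough:

import boto3

glue = boto3.client("glue")

# List the tables the crawler created in our database, along with their columns
tables = glue.get_tables(DatabaseName="glue-blog-tutorial-db")
for table in tables["TableList"]:
    print(table["Name"], [c["Name"] for c in table["StorageDescriptor"]["Columns"]])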
Step 3: Configure pyspark to use Glue
Feel free to check out my article Launch Jupyter notebooks with pyspark on an EMR Cluster if you wish to leverage Jupyter notebooks hosted on an EMR Cluster.
In the Software and Steps section, be sure to check the Use for Spark table metadata option so that Spark uses Glue instead of the local Hive metastore.
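For those creating the cluster from code rather than the console, that checkbox corresponds (roughly) to passing the spark-hive-site classification below when launching the cluster, e.g. as the Configurations parameter of boto3's run_job_flow; the remaining cluster details are elided in this sketch.

glue_catalog_config = [
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }
]
# emr.run_job_flow(..., Configurations=glue_catalog_config, ...)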
In our Jupyter notebook we will now perform the following:
1. Configure Spark Session to use Glue:
By default, Spark is configured to use Hive as its metastore.
We will thus set hive.metastore.client.factory.class to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("myjob") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()

# Set the current database to the Glue database created by the crawler
spark.catalog.setCurrentDatabase("glue-blog-tutorial-db")
2. Query the table using Spark SQL
movie_data = spark.sql("select * from movie_txt")
movie_data.show()
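Since we created a write folder in Step 1, here is a quick sketch of persisting the query results back to S3 (the bucket name is again a placeholder):

# Write the query results back to the write/ prefix as Parquet
movie_data.write.mode("overwrite").parquet("s3://my-glue-demo-bucket/write/movie_data/")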
Query Movie data via Athena [Optional]
AWS provides integration between Athena and Glue, enabling users to fire SQL queries against data stored on S3.
Cool isn’t it!!!
Head over to Athena from your AWS Console.
Select the database that we just created in Glue.
The Tables section will display a list of all tables under glue-blog-tutorial-db database.
Before querying the table be sure to setup the output folder for the results of the query.
You are now ready to fire queries against your data set via Athena.
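If you would rather query from code, here is a minimal boto3 sketch; the output location is a placeholder for the results folder you configured above.

import boto3

athena = boto3.client("athena")

# Fire the same query against the Glue-backed table; results land in the output folder
response = athena.start_query_execution(
    QueryString="select * from movie_txt limit 10",
    QueryExecutionContext={"Database": "glue-blog-tutorial-db"},
    ResultConfiguration={"OutputLocation": "s3://my-glue-demo-bucket/athena-results/"},
)
print(response["QueryExecutionId"])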
In Conclusion:
- Glue can act as a persistent, highly available data catalog for any AWS service that wishes to query your data set residing on S3.
- EMR clusters can now be run in a transient fashion, which is highly cost effective.
Links:
Special Thanks to Mikael Ahonen’s article that set the ball rolling.