Launch Jupyter notebooks with pyspark on an EMR Cluster
The Beginner’s Guide describes it well: “The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. The Jupyter Notebook App can be executed on a local desktop requiring no internet access or can be installed on a remote server and accessed through the internet.”
This blog is about the second option: setting up the infrastructure to run Spark on AWS Elastic MapReduce (AWS EMR) and drive it from a Jupyter Notebook.
Step 1: Launch an EMR Cluster
To start off, navigate to the EMR section of your AWS Console.
Switch over to Advanced Options to choose from a list of EMR release versions.
In the advanced window, each EMR release comes bundled with specific versions of Spark, Hue, and other packaged applications.
For this tutorial I have chosen EMR release 5.20, which comes with Spark 2.4.0.
I am comfortable with Spark 2.4.0, as it avoids the surprises that can come with newer versions.
Setup Hardware:
In the hardware section I have used m4.xlarge EC2 instances with 200 GB of EBS storage each.
To reduce the cost of ownership I have opted for spot instances, which (at the time of provisioning) cost $0.060 per hour.
Here is a configuration you can use for a temporary EMR cluster (usually fit for exploration); an equivalent AWS CLI command appears after the notes below.
Master Node:
- Number of nodes: 1
- EC2 instance class: m4.xlarge
- EBS storage: 200 GB
- Type: Spot instance
Core Node:
- Number of nodes: 1
- EC2 instance class: m4.xlarge
- EBS storage: 200 GB
- Type: Spot instance
Both instances are deployed in us-east-1e, which offered the lowest spot price at the time.
Note:
- Costs for spot instances vary depending on the region (and availability zone) you choose.
- If the spot price rises above your bid price, the instances are terminated immediately, so be very careful with this instance type.
- In production, spot instances are preferred only for provisioning task nodes, i.e. extra capacity for when the core nodes cannot handle the memory/compute requirements.
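If you prefer the command line, a roughly equivalent cluster can be provisioned with the AWS CLI. This is only a sketch: the cluster name and key name are placeholders, and the EBS volume sizing is omitted.
aws emr create-cluster \
  --name "spark-jupyter-exploration" \
  --release-label emr-5.20.0 \
  --applications Name=Spark Name=Hue \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m4.xlarge,InstanceCount=1,BidPrice=0.060 \
    InstanceGroupType=CORE,InstanceType=m4.xlarge,InstanceCount=1,BidPrice=0.060 \
  --ec2-attributes KeyName=my-emr-key \
  --use-default-roles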
General Cluster Settings
Here we give the cluster a distinct name, which helps it stand out if you have a bunch of terminated clusters in your EMR console window.
Logging:
Logging is enabled by default; the log files are stored in an S3 bucket, which serves as the source of truth for any failures that occur on your EMR cluster.
Termination Protection:
As the name suggests, this flag prevents users from terminating the cluster unless they first turn the protection off.
Since I am the only user who will interact with this cluster, I will leave the flag as it is.
Security:
This is an important section: it determines which external clients and applications have the right to access your EMR cluster.
Be careful while setting it up!
EC2 Key Pair:
This is a key pair that you can set up and use to connect to your EC2 instances via SSH [link].
If you do not wish to use a key, you can proceed without specifying one.
EC2 Security Groups:
By default, EMR provisions two different security groups (one for the Master Node and another for the Core Nodes).
I, however, have used a custom security group that allows all traffic to these nodes (unwise to use on production systems).
Security Group configuration:
This configuration allows any IP address to connect to your EMR cluster, whether on arbitrary TCP ports or on the SSH port we normally use to log in to the EC2 instances.
Warning: this configuration is not secure and is not suitable for production systems.
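For reference, the permissive inbound rules described above look something like this in the EC2 console (illustrative values only):
Type: All TCP | Port range: 0-65535 | Source: 0.0.0.0/0
Type: SSH | Port range: 22 | Source: 0.0.0.0/0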
Cluster Provisioning:
You are now ready to use your cluster! Yay!
Step 2: Connecting to your EMR Cluster
Windows users: you can use PuTTY (download link) to SSH into the Master Node.
Click on the SSH link in the cluster summary, which gives you detailed instructions for logging in.
Use the host name (provided by AWS) to jump into the Master Node, which logs you in as the hadoop user (the super user).
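If you are on macOS or Linux (or simply prefer a terminal over PuTTY), the same login is a single ssh command; the key file and host name below are placeholders for your own values:
ssh -i ~/my-emr-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com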
PuTTY
In the Session section paste the hostname and use 22 as the port.
In the Connection -> SSH section click on Auth.
Here we specify the private key if you wish to use one. If you have opted not to use a private key then leave this section blank.
Tunneling
This is how we will configure a “tunnel” between your browser and your EMR cluster, which sits inside a Virtual Private Cloud (VPC).
Important: you must specify the ports on which you wish to set up your tunnels; applications running on your EMR cluster will be accessible via those ports only.
In the Connection -> SSH -> Tunnels section, specify the ports you wish to use for your tunnels, marking them as Dynamic so that PuTTY opens a SOCKS proxy.
Here we will use port 8989, through which we will later reach the Jupyter notebook launched by pyspark.
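For macOS/Linux users, the PuTTY dynamic tunnel above is equivalent to adding -D to the earlier ssh command (key file and host name are again placeholders):
ssh -i ~/my-emr-key.pem -D 8989 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com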
Once done, you will be logged into your EMR’s Master Node.
If you see the EMR welcome banner, you are good!
Configuring your browser
Configuring your browser to connect via the tunnel (created via PuTTY) is the tricky part.
You can refer to this link that will walk you through this process.
Do not fret! I will walk you through this process too.
Browser Used: Chrome
Extension used: FoxyProxy Standard
1. Install the FoxyProxy extension in your Chrome browser.
2. Create an XML file and paste in the code below.
<?xml version=”1.0" encoding=”UTF-8"?><foxyproxy><proxies><proxy name=”emr-socks-proxy” id=”2322596116" notes=”” fromSubscription=”false” enabled=”true” mode=”manual” selectedTabIndex=”2" lastresort=”false” animatedIcons=”true” includeInCycle=”true” color=”#0055E5" proxyDNS=”true” noInternalIPs=”false” autoconfMode=”pac” clearCacheBeforeUse=”false” disableCache=”false” clearCookiesBeforeUse=”false” rejectCookies=”false”><matches><match enabled=”true” name=”*ec2*.amazonaws.com*” pattern=”*ec2*.amazonaws.com*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false” /><match enabled=”true” name=”*ec2*.compute*” pattern=”*ec2*.compute*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false” /><match enabled=”true” name=”10.*” pattern=”http://10.*" isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false” /><match enabled=”true” name=”*10*.amazonaws.com*” pattern=”*10*.amazonaws.com*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false” /><match enabled=”true” name=”*10*.compute*” pattern=”*10*.compute*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false” /><match enabled=”true” name=”*.compute.internal*” pattern=”*.compute.internal*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false”/><match enabled=”true” name=”*.ec2.internal* “ pattern=”*.ec2.internal*” isRegEx=”false” isBlackList=”false” isMultiLine=”false” caseSensitive=”false” fromSubscription=”false”/></matches><manualconf host=”localhost” port=”8989" socksversion=”5" isSocks=”true” username=”” password=”” domain=”” /></proxy></proxies></foxyproxy>
In the manualconf tag we specify 8989 as the local SOCKS port, which allows the browser to route traffic through the tunnel to your EMR cluster.
3. Import the XML file into FoxyProxy.
In the Options window, click on Import/Export.
Specify the XML file we created earlier and select yes to overwrite the configuration.
Once done, test whether your browser can reach your EMR cluster’s addresses through the proxy.
If it can, your tunneling has been set up successfully and you can now access the applications running on your EMR cluster.
Step 3: Install Anaconda
Head over to your Master Node EC2 instance to install Anaconda and the packages necessary for Jupyter notebooks.
Choose your Anaconda version from https://repo.continuum.io/archive/ and copy the file name for use in the next step.
Download Anaconda:
Type the following in your command prompt:
wget https://repo.continuum.io/archive/Anaconda2-2019.07-Linux-x86_64.sh
Install Anaconda:
bash Anaconda2-2019.07-Linux-x86_64.sh
Once installed, check the version of Python using python --version.
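As a quick sanity check that the shell now picks up Anaconda’s Python (assuming the installer added Anaconda to your PATH; you may need to run source ~/.bashrc first):
which python      # should point at the Anaconda install, e.g. ~/anaconda2/bin/python
python --version  # Anaconda2 2019.07 ships Python 2.7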
Step 4: Launch pyspark
Let us start by configuring Jupyter as the default notebook for pyspark.
Add the following lines to your ~/.bashrc:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --ip=0.0.0.0 --port=8989'
Then reload your shell configuration:
source ~/.bashrc
What do the above flags mean?
- --no-browser: launch the Jupyter notebook server without invoking a browser window on the cluster.
- --ip=0.0.0.0: by default Jupyter binds to localhost (127.0.0.1), which would not be reachable through our tunnel. Binding to 0.0.0.0 makes Jupyter listen on all of the Master Node’s network interfaces, so we can connect to it from our browser.
- --port=8989: the port on which Jupyter is accessible.
Launch pyspark
Type pyspark at the prompt.
Jupyter will start and print a URL (containing a one-time login token) in the terminal; copy and paste this URL in your browser.
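The terminal output will look something like this (token truncated):
[C 10:58:21.021 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://0.0.0.0:8989/?token=...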
And voilà!
Test whether the SparkContext is configured properly.
In a new notebook, try the following (a minimal check; pyspark pre-creates the SparkContext for you as sc):
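# sc is created by pyspark itself, so no imports are needed.
sc.version                        # should print '2.4.0' for EMR 5.20
sc.master                         # should print 'yarn'
sc.parallelize(range(100)).sum()  # quick sanity check: returns 4950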
As you can see, the SparkContext is configured properly, with YARN as its resource negotiator.
You are now free to create and run pyspark applications in Jupyter notebooks on your EMR’s Master Node.