Blog: Running AWS Glue PySpark Locally
Posted on Sat 16 May 2020 in blogs
Download and install Maven
- Download Maven from https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
- Untar the archive and move the extracted folder to a location of your choice. For example,
  mv apache-maven-3.6.0 $HOME/Documents/opt/apache-maven
- Add mvn to your PATH (adjust the path if you installed Maven elsewhere),
  echo 'export PATH=$PATH:$HOME/Documents/opt/apache-maven/bin' >> ~/.profile
- Restart the session so that the new PATH takes effect.
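The Maven steps above can be collected into two small shell functions. This is only a sketch: the download URL and the apache-maven-3.6.0 folder name come from this post, while the names INSTALL_DIR, PROFILE_FILE, install_maven and add_maven_to_path are made up for illustration, and it assumes curl and tar are available.

```shell
#!/usr/bin/env bash
# Sketch only: INSTALL_DIR and PROFILE_FILE are assumed defaults, change them to suit.
MAVEN_URL="https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz"
INSTALL_DIR="${INSTALL_DIR:-$HOME/Documents/opt}"
PROFILE_FILE="${PROFILE_FILE:-$HOME/.profile}"

# Download the Maven tarball, unpack it, and rename the folder to apache-maven.
install_maven() {
  local archive
  archive="$(mktemp)" || return 1
  mkdir -p "$INSTALL_DIR" || return 1
  curl -fsSL "$MAVEN_URL" -o "$archive" || return 1
  tar -xzf "$archive" -C "$INSTALL_DIR" || return 1
  rm -f "$archive"
  mv "$INSTALL_DIR/apache-maven-3.6.0" "$INSTALL_DIR/apache-maven"
}

# Append the PATH export to the profile file, but only if it is not already there.
add_maven_to_path() {
  local line="export PATH=\$PATH:$INSTALL_DIR/apache-maven/bin"
  grep -qxF "$line" "$PROFILE_FILE" 2>/dev/null || echo "$line" >> "$PROFILE_FILE"
}
```

After sourcing the file, running install_maven followed by add_maven_to_path and then restarting the session should make mvn available. The grep guard keeps the profile from collecting duplicate PATH lines if the function is run more than once.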
Download the Spark distribution
At the moment, AWS provides two Spark distributions for Glue,
- Glue version 0.9: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-0.9/spark-2.2.1-bin-hadoop2.7.tgz
- Glue version 1.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
- Download the Spark distribution that matches your Glue version, untar it, and move it to a folder of your choice.
- Export SPARK_HOME so that it points to that folder,
  export SPARK_HOME=$HOME/Documents/opt/spark/latest
  You can also add this line to your ~/.profile file.
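As a rough sketch of the Spark step, the function below maps a Glue version to the matching tarball URL from the list above, unpacks it, and exports SPARK_HOME. The function names and the SPARK_DIR variable are hypothetical, and the use of --strip-components=1 assumes the tarball unpacks into a single top-level directory, which is worth checking before relying on it.

```shell
#!/usr/bin/env bash
# Print the Spark tarball URL for a Glue version (URLs taken from this post).
glue_spark_url() {
  case "$1" in
    0.9) echo "https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-0.9/spark-2.2.1-bin-hadoop2.7.tgz" ;;
    1.0) echo "https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz" ;;
    *)   echo "unsupported Glue version: $1" >&2; return 1 ;;
  esac
}

# Download and unpack Spark for the given Glue version, then export SPARK_HOME.
# SPARK_DIR is an assumed variable; it defaults to the folder used in this post.
install_glue_spark() {
  local version="$1" url target archive
  url="$(glue_spark_url "$version")" || return 1
  target="${SPARK_DIR:-$HOME/Documents/opt/spark}/latest"
  archive="$(mktemp)" || return 1
  mkdir -p "$target" || return 1
  curl -fsSL "$url" -o "$archive" || return 1
  # Assumes one top-level directory inside the tarball.
  tar -xzf "$archive" -C "$target" --strip-components=1 || return 1
  rm -f "$archive"
  export SPARK_HOME="$target"
}
```

For example, install_glue_spark 1.0 would fetch the Spark 2.4.3 build and point SPARK_HOME at $HOME/Documents/opt/spark/latest for the current session.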
Download the aws-glue-libs
AWS maintains two versions of aws-glue-libs in the same repository, on separate branches,
- 0.9 -> supports Python 2 -> git@github.com:awslabs/aws-glue-libs.git
- 1.0 -> supports Python 3 -> git@github.com:awslabs/aws-glue-libs.git
- Clone the aws-glue-libs repo. To check out a specific branch,
  git clone -b {branch-name} git@github.com:awslabs/aws-glue-libs.git
- Run gluepyspark,
  cd aws-glue-libs
  ./bin/gluepyspark
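The clone and run steps can be wrapped with a couple of sanity checks, sketched below. The function names are invented for illustration, and the branch name is left as a required argument because it depends on the Glue version you target; check the repository for the exact branch names.

```shell
#!/usr/bin/env bash
# Clone aws-glue-libs at the given branch into an optional destination folder.
clone_glue_libs() {
  local branch="$1" dest="${2:-aws-glue-libs}"
  if [ -z "$branch" ]; then
    echo "usage: clone_glue_libs <branch-name> [dest]" >&2
    return 1
  fi
  git clone -b "$branch" git@github.com:awslabs/aws-glue-libs.git "$dest"
}

# Start the Glue PySpark shell from the cloned repo, after checking SPARK_HOME.
run_gluepyspark() {
  local dest="${1:-aws-glue-libs}"
  if [ -z "$SPARK_HOME" ]; then
    echo "SPARK_HOME is not set; export it before running gluepyspark" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$dest" && ./bin/gluepyspark)
}
```

Failing early when SPARK_HOME is missing gives a clearer message than whatever error the gluepyspark script itself would produce later.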
Configure PyCharm for PySpark development
- Install pyspark as a Python package