This project helps you set up a Spark cluster in standalone mode on macOS or Windows, running inside Docker. The idea is to provide a playground for learning Spark with PySpark or any other interpreters that can be set up on this image.
Here is the reference article used to create the working version of this Docker setup:
Before running any commands, make sure the following property is set in Docker Desktop: the BuildKit property must be set to false.
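In Docker Desktop this property usually sits in the Docker Engine JSON configuration (Settings, Docker Engine) under features.buildkit. The sketch below assumes that layout and shows the change on a sample configuration. The helper name disable_buildkit and the sample input are illustrative and not part of this project.

```python
import copy
import json


def disable_buildkit(engine_config: dict) -> dict:
    """Return a copy of a Docker Engine config with BuildKit turned off."""
    updated = copy.deepcopy(engine_config)
    # Assumption: the flag lives under the "features" key of the engine config.
    updated.setdefault("features", {})["buildkit"] = False
    return updated


# Sample input for illustration only; your actual configuration may differ.
sample = {"builder": {"gc": {"enabled": True}}, "features": {"buildkit": True}}
print(json.dumps(disable_buildkit(sample), indent=2))
```

The printed JSON is what the Docker Engine settings would look like after the change. Docker Desktop needs to be restarted (Apply & restart) for the setting to take effect.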
Clone the repo.
- Run build.bat or build.sh, depending on whether you are running on Windows or Linux:
build.bat
Depending on the size of the images and the speed of your connection, downloading all the images for the first time may take a while.
- Once the build step is complete, start the containers by running:
docker compose up
Once the above steps are done, you can access the Spark cluster using the following links:
- JupyterLab at localhost:8888;
- Spark master at localhost:8080;
- Spark worker I at localhost:8081;
- Spark worker II at localhost:8082;
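Before opening these links in a browser, it can help to confirm that each port is accepting connections. The following is a minimal sketch that uses only the ports listed above. The function names is_port_open and check_services are illustrative.

```python
import socket

# Ports taken from the list above; the labels are only for display.
SERVICES = {
    "JupyterLab": 8888,
    "Spark master": 8080,
    "Spark worker I": 8081,
    "Spark worker II": 8082,
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

A port that accepts connections only shows that something is listening there. It does not prove that the Spark UI behind it has finished starting.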
Optionally, you can add entries to the /etc/hosts file so that the node names can be used in place of localhost.
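As a rough illustration, the snippet below prints lines that could be appended to the hosts file. The name spark-master is taken from the master URL used in this README's notebook code. The worker names are hypothetical placeholders and should be replaced with the service names from your compose file. On Windows the hosts file is located at C:\Windows\System32\drivers\etc\hosts.

```python
def hosts_lines(node_names, ip="127.0.0.1"):
    """Build hosts-file lines mapping each node name to the given IP."""
    return [f"{ip}\t{name}" for name in node_names]


# "spark-master" comes from this README; the worker names are hypothetical.
nodes = ["spark-master", "spark-worker-1", "spark-worker-2"]
print("\n".join(hosts_lines(nodes)))
```

Editing the hosts file requires administrator privileges, so the sketch only prints the lines and does not write them.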
Open JupyterLab at localhost:8888, create a notebook, and paste the code below to see Jupyter in action:
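import wget
from pyspark.sql import SparkSession

# Connect to the standalone master started by docker compose
spark = (
    SparkSession.builder
    .appName("pyspark-notebook")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "512m")
    .getOrCreate()
)

# Download the Iris dataset into the notebook's working directory
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
wget.download(url)

# Read the CSV file into a DataFrame and show the first five rows
data = spark.read.csv("iris.data")
data.show(n=5)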
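The Iris file has no header row, so a plain read names the columns _c0 to _c4 and treats every value as a string. The following is a minimal sketch of reading it with explicit column names and types. The names read_iris and IRIS_SCHEMA are illustrative, and the column names follow the dataset's documented attribute order.

```python
# DDL-style schema string; column names follow the Iris attribute order.
IRIS_SCHEMA = (
    "sepal_length DOUBLE, sepal_width DOUBLE, "
    "petal_length DOUBLE, petal_width DOUBLE, species STRING"
)


def read_iris(spark, path="iris.data"):
    """Read the headerless Iris CSV with named, typed columns."""
    return spark.read.csv(path, schema=IRIS_SCHEMA, header=False)
```

In the notebook, with the spark session created earlier, calling read_iris(spark).groupBy("species").count().show() should print the number of rows per species.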
- Playing around: a job can be submitted on the master node using the command below. The job is submitted in client mode.
spark-submit \
--class com.sparkTutorial.input \
--deploy-mode client \
--master "spark://master-node:7077" \
target/scala-2.12/sql-mongo-validation-assembly-0.1.jar
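If you prefer to launch the same job from a script, the command can be assembled as an argument list. This is only a sketch: the helper name spark_submit_command is illustrative, and the values are copied from the command above.

```python
import shlex


def spark_submit_command(jar, main_class, master, deploy_mode="client"):
    """Return the spark-submit argument list for a JVM application jar."""
    return [
        "spark-submit",
        "--class", main_class,
        "--deploy-mode", deploy_mode,
        "--master", master,
        jar,
    ]


# Values copied from the command above.
cmd = spark_submit_command(
    jar="target/scala-2.12/sql-mongo-validation-assembly-0.1.jar",
    main_class="com.sparkTutorial.input",
    master="spark://master-node:7077",
)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where spark-submit is on the PATH, for example inside the master container.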