Setting up a Spark cluster using Docker Compose


Prerequisites

Before we begin, make sure that you have Docker and Docker Compose installed on your system.
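
You can quickly verify that both are available from a terminal:

docker --version
docker-compose --version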

Step 1: Create a Docker Compose file

Create a new file called docker-compose.yml in your project directory and add the following code to it:

docker-compose.yml
version: '2'
services:
  spark-master:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker1:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
  spark-worker2:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark

Alternatively, you can download Bitnami's reference Compose file directly:

curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml

This will create three services: a Spark master node (spark-master) and two Spark worker nodes (spark-worker1 and spark-worker2). The spark-master service exposes ports 8080 and 7077, which you can use to access the Spark Web UI and connect to the Spark master from your PySpark code.

Step 2: Start the Spark cluster

To start the Spark cluster, open a terminal window in your project directory and run the following command:

docker-compose up

This will download the necessary Docker images and start the Spark master and worker nodes. You should see the Spark Web UI at http://localhost:8080, which you can use to monitor the status of your Spark cluster.
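
If you'd rather not keep the logs in the foreground, you can start the cluster in detached mode, list the running containers, and tear everything down when you're done:

docker-compose up -d
docker-compose ps
docker-compose down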

Step 3: Connect to the Spark cluster from PySpark

Now that your Spark cluster is up and running, you can connect to it from your PySpark code. Here's some sample code to get you started:

from pyspark.sql import SparkSession

# create a new SparkSession
spark = SparkSession.builder \
    .appName('MyApp') \
    .master('spark://localhost:7077') \
    .getOrCreate()

# read some data from a CSV file
df = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('/path/to/my/data.csv')

# do some processing on the data
result = df.groupBy('category') \
    .agg({'price': 'max', 'quantity': 'sum'}) \
    .orderBy('category')

# write the result to a Parquet file
result.write \
    .format('parquet') \
    .mode('overwrite') \
    .save('/path/to/my/result.parquet')

# stop the SparkSession
spark.stop()

This code creates a new SparkSession and connects to the Spark master at spark://localhost:7077. It then reads some data from a CSV file, processes it with the DataFrame API (grouping, aggregating, and sorting), and writes the result to a Parquet file. Finally, it stops the SparkSession.
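
As a quick sanity check, you can read the Parquet output back and print a few rows. The following is a minimal sketch that reuses the same master URL and the example result path from above (the app name VerifyResult is only illustrative):

from pyspark.sql import SparkSession

# reconnect to the cluster (the previous session was stopped)
spark = SparkSession.builder \
    .appName('VerifyResult') \
    .master('spark://localhost:7077') \
    .getOrCreate()

# read the Parquet result back and inspect it
result = spark.read.parquet('/path/to/my/result.parquet')
result.printSchema()
result.show(10)

spark.stop()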

Conclusion

In this blog post, we walked through the steps of setting up a Spark cluster using Docker Compose with Bitnami images and connecting to it from PySpark. With this setup, you can easily experiment with PySpark and build big data applications without having to worry about the complexities of setting up a distributed cluster.