
How To Install Apache Spark on Debian 11 / Debian 10

In this article we shall walk you through the installation of Apache Spark on a Debian 11 / Debian 10 Linux system. Apache Spark is an open-source, multi-language engine for executing data engineering, data science, and machine learning workloads on a single server or on a fleet of servers working together as a Spark cluster. Spark offers a unified analytics engine for large-scale data processing, and because it can be deployed on commodity hardware, you get a highly available solution at a relatively low cost.

Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time stream processing. Follow the steps outlined in the following sections to deploy and use Apache Spark on a Debian 11 / Debian 10 Linux system.

Key features of Apache Spark

  • SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting.
  • Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
  • Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
  • Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

Install Apache Spark on Debian 11 / Debian 10

Before we begin the installation, let’s update and upgrade all the packages on our Debian system. Run the commands below to perform a system upgrade.

sudo apt update
sudo apt -y upgrade

After a successful upgrade, consider performing a system reboot to use the latest kernel.

[ -f /var/run/reboot-required ] && sudo reboot -f

Now proceed with the first step of installing Apache Spark on Debian 11 / Debian 10.

Step 1: Install Java on Debian System

Apache Spark requires Java to run its binaries. Since Java is not installed by default on Debian, use the command below to install Java on Debian 11 / Debian 10.

sudo apt install default-jdk mlocate curl -y

Check Java version:

$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.13+8-post-Debian-1deb11u1, mixed mode, sharing)
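
Spark will use the java binary found on your PATH, but some setups prefer to set JAVA_HOME explicitly. A minimal sketch, assuming the default-jdk package created the /usr/lib/jvm/default-java symlink (verify the actual path on your system, for example with ls /usr/lib/jvm):

# Set JAVA_HOME explicitly; the path below is an assumption based on default-jdk, adjust if different
echo 'export JAVA_HOME=/usr/lib/jvm/default-java' | tee -a ~/.bashrc
source ~/.bashrc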

If the add-apt-repository command is missing on your system, check How to Install add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Use the command below to download the Apache Spark release archive from the official downloads page.

wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
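
Optionally, verify the integrity of the downloaded archive before extracting it. A minimal sketch: compute the SHA-512 checksum locally and compare the output against the checksum published for this release on the Apache Spark downloads page.

# Compute the SHA-512 checksum of the archive; compare it with the published value
sha512sum spark-3.2.0-bin-hadoop3.2.tgz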

Use tar to extract the Apache Spark archive:

tar xvf spark-3.2.0-bin-hadoop3.2.tgz

Move the extracted Spark folder to the /opt directory.

sudo mv spark-3.2.0-bin-hadoop3.2/ /opt/spark 

Configure the Spark environment in your ~/.bashrc file:

tee -a ~/.bashrc<<EOF
export SPARK_HOME=/opt/spark
export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin
EOF

Source the file to activate the environment:

source ~/.bashrc

Confirm it works:

$ echo $SPARK_HOME
/opt/spark

$ echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/opt/spark/bin:/opt/spark/sbin
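
Optionally, Spark-specific settings can go into conf/spark-env.sh, which ships as a template with the distribution. A minimal sketch, assuming you want to pin the master hostname and bind address to the values used later in this walkthrough (adjust them for your environment):

# Create spark-env.sh from the bundled template
cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh

# Append example settings; the hostname and IP below are from this walkthrough, not universal values
tee -a /opt/spark/conf/spark-env.sh<<EOF
export SPARK_MASTER_HOST=debian.localdomain
export SPARK_LOCAL_IP=192.168.200.50
EOF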

Step 3: Start a standalone Spark master server

Use the start-master.sh script to start a standalone Spark master server.

$ start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-jkmutai-org.apache.spark.deploy.master.Master-1-debian.out

When the service starts, it binds to port 8080, which can be confirmed using the ss command-line tool:

$ sudo ss -tunelp | grep 8080
tcp   LISTEN 0      1                       *:8080             *:*    users:(("java",pid=5119,fd=270)) uid:1000 ino:25440 sk:c cgroup:/user.slice/user-1000.slice/session-3.scope v6only:0 <->

Access the Apache Spark web interface on http://[serverip_or_hostname]:8080:

(Screenshot: Apache Spark master web interface)

Note the Spark master URL shown in the web interface; in my case it is spark://debian.localdomain:7077.
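
If port 8080 is already taken by another application on your server, the master web UI can be started on a different port. A sketch, assuming port 8081 is free:

# Restart the master with its web UI bound to another port
/opt/spark/sbin/stop-master.sh
/opt/spark/sbin/start-master.sh --webui-port 8081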

Step 4: Start the Spark Worker Process

The start-worker.sh script (named start-slave.sh in older Spark releases) is used to start the Spark worker process.

$ start-worker.sh spark://debian.localdomain:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-jkmutai-org.apache.spark.deploy.worker.Worker-1-debian.out

If you don’t have the script in your $PATH, you can first locate it.

$ sudo updatedb
$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

Once located, run the script using its absolute path.
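
By default the worker offers all CPU cores and most of the host's memory to Spark applications. If you want to cap what this worker advertises, the start script accepts core and memory options. A sketch, assuming a limit of 2 cores and 2 GiB of memory (adjust to your hardware):

# Start a worker with capped resources
/opt/spark/sbin/start-worker.sh --cores 2 --memory 2G spark://debian.localdomain:7077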

Step 5: Access the Apache Spark shell

Use the spark-shell script to access the Spark shell.

$ /opt/spark/bin/spark-shell
21/12/30 15:00:26 WARN Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.200.50 instead (on interface enp1s0)
21/12/30 15:00:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/30 15:00:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.200.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1640894449280).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
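
Note that spark-shell launched this way runs with a local master (master = local[*] in the banner above). To attach the shell to the standalone master and worker started earlier, pass the master URL explicitly:

/opt/spark/bin/spark-shell --master spark://debian.localdomain:7077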

If you’re more of a Python person, use pyspark.

$ /opt/spark/bin/pyspark
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/12/30 15:01:45 WARN Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.200.50 instead (on interface enp1s0)
21/12/30 15:01:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/30 15:01:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://192.168.200.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1640894512293).
SparkSession available as 'spark'.
>>>
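
Beyond the interactive shells, you can submit a packaged application to the standalone cluster with spark-submit. A minimal sketch using the SparkPi example bundled with the distribution (the exact jar name under /opt/spark/examples/jars may differ on your install, so check the directory first):

# Submit the bundled SparkPi example to the standalone master, computing Pi with 100 partitions
/opt/spark/bin/spark-submit \
  --master spark://debian.localdomain:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar 100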

Shut down the master and worker Spark processes using the commands below.

$ /opt/spark/sbin/stop-worker.sh
stopping org.apache.spark.deploy.worker.Worker
$ /opt/spark/sbin/stop-master.sh
stopping org.apache.spark.deploy.master.Master

You now have Apache Spark installed and working on your Debian 11 / Debian 10 Linux system. Apache Spark utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size, and it is one of the most widely used distributed big data processing frameworks. Feel free to contribute towards its improvement and visit the project's official documentation to read more.
