In this article we shall walk you through the installation of Apache Spark on a Debian 11 / Debian 10 Linux system. Apache Spark is an open-source, multi-language engine for executing data science, data engineering, and machine learning workloads on a single server or across a fleet of servers working together as a Spark cluster. Spark offers a unified analytics engine for large-scale data processing, and because it runs on commodity hardware, it delivers a highly available solution at low cost.
Spark provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Follow the steps outlined in the sections below to deploy and use Apache Spark on a Debian 11 / Debian 10 Linux system.
Key features of Apache Spark
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
Install Apache Spark on Debian 11 / Debian 10
Before we begin the installation, let’s update and upgrade all the packages on our Debian system. Run the commands below to perform a system upgrade.
sudo apt update
sudo apt -y upgrade
After a successful upgrade, consider performing a system reboot to load the latest kernel.
[ -f /var/run/reboot-required ] && sudo reboot -f
Now proceed to the first step of installing Apache Spark on Debian 11 / Debian 10.
Step 1: Install Java on Debian System
Apache Spark requires Java to run. Since Java is not installed by default on Debian, use the command below to install Java, together with the mlocate and curl utilities, on Debian 11 / Debian 10.
sudo apt install default-jdk mlocate curl -y
Check Java version:
$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.13+8-post-Debian-1deb11u1, mixed mode, sharing)
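Spark uses whichever java binary comes first in your PATH. If you later install more than one JDK, it can help to pin JAVA_HOME explicitly; the path below is the symlink that the default-jdk package provides on Debian, so confirm it exists on your system (ls /usr/lib/jvm/) before adding it.
echo 'export JAVA_HOME=/usr/lib/jvm/default-java' | tee -a ~/.bashrc
source ~/.bashrc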
Step 2: Download Apache Spark
Use the commands below to download the latest Apache Spark release from the official Apache Spark downloads page.
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
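Optionally, verify the integrity of the archive before extracting it. Apache publishes a SHA512 checksum next to each release; the URL below assumes the 3.2.0 release is still on the main download site (older releases move to archive.apache.org).
wget https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
sha512sum spark-3.2.0-bin-hadoop3.2.tgz
# Compare the computed hash with the contents of the downloaded .sha512 file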
Use tar to extract Apache Spark archive:
tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Move the extracted Spark folder to the /opt directory.
sudo mv spark-3.2.0-bin-hadoop3.2/ /opt/spark
Configure the Spark environment in your ~/.bashrc file:
tee -a ~/.bashrc<<EOF
export SPARK_HOME=/opt/spark
export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin
EOF
Source the file to activate the environment:
source ~/.bashrc
Confirm it works:
$ echo $SPARK_HOME
/opt/spark
$ echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/opt/spark/bin:/opt/spark/sbin
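The ~/.bashrc approach only applies to your own user. As an optional alternative, you can place the same variables in /etc/profile.d/ so that every login shell on the system picks them up; this is a sketch, adjust it to your environment.
sudo tee /etc/profile.d/spark.sh <<EOF
export SPARK_HOME=/opt/spark
export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin
EOF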
Step 3: Start a standalone Spark master server
Use the Spark service start script to initiate a standalone master server.
$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-jkmutai-org.apache.spark.deploy.master.Master-1-debian.out
When the service starts, it binds to port 8080, which you can confirm with the ss command-line tool:
$ sudo ss -tunelp | grep 8080
tcp LISTEN 0 1 *:8080 *:* users:(("java",pid=5119,fd=270)) uid:1000 ino:25440 sk:c cgroup:/user.slice/user-1000.slice/session-3.scope v6only:0 <->
Access the Apache Spark web interface at http://[serverip_or_hostname]:8080:
Take note of the Spark master URL shown on the page; in my case it is spark://debian.localdomain:7077.
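If port 8080 is already in use, or you want the master to bind to a specific address, the start scripts read optional settings from /opt/spark/conf/spark-env.sh. A minimal sketch, using the IP address from my setup as an example value:
sudo cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
echo 'SPARK_MASTER_HOST=192.168.200.50' | sudo tee -a /opt/spark/conf/spark-env.sh
echo 'SPARK_MASTER_WEBUI_PORT=8080' | sudo tee -a /opt/spark/conf/spark-env.sh
Restart the master with stop-master.sh followed by start-master.sh for the changes to take effect.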
Step 4: Start the Spark Worker Process
The start-worker.sh script (formerly start-slave.sh) is used to start the Spark worker process and register it with the master.
$ start-worker.sh spark://debian.localdomain:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-jkmutai-org.apache.spark.deploy.worker.Worker-1-debian.out
If you don’t have the script in your $PATH, you can first locate it.
$ sudo updatedb
$ locate start-worker.sh
/opt/spark/sbin/start-worker.sh
You can then run the script using its absolute path.
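By default a worker offers all of the machine's CPU cores and most of its memory to the master. If you want to cap that, start-worker.sh accepts the standard worker options after the master URL; the values below are only examples.
/opt/spark/sbin/start-worker.sh spark://debian.localdomain:7077 --cores 2 --memory 2G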
Step 5: Access the Apache Spark shell
Use the spark-shell script to access the Spark shell.
$ /opt/spark/bin/spark-shell
21/12/30 15:00:26 WARN Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.200.50 instead (on interface enp1s0)
21/12/30 15:00:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/30 15:00:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.200.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1640894449280).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
If you’re more of a Python person, use pyspark.
$ /opt/spark/bin/pyspark
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/12/30 15:01:45 WARN Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.200.50 instead (on interface enp1s0)
21/12/30 15:01:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/30 15:01:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/
Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://192.168.200.50:4040
Spark context available as 'sc' (master = local[*], app id = local-1640894512293).
SparkSession available as 'spark'.
>>>
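Beyond the interactive shells, you can submit a job to the standalone cluster with spark-submit. A quick smoke test, assuming the master URL from Step 3 and the examples jar shipped with the 3.2.0 binary distribution:
/opt/spark/bin/spark-submit \
  --master spark://debian.localdomain:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar 100
The job should print a line similar to "Pi is roughly 3.14..." before exiting.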
Easily shut down the master and worker Spark processes using the commands below.
$ /opt/spark/sbin/stop-worker.sh
stopping org.apache.spark.deploy.worker.Worker
$ /opt/spark/sbin/stop-master.sh
stopping org.apache.spark.deploy.master.Master
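Spark also ships start-all.sh and stop-all.sh helpers in /opt/spark/sbin that manage the master together with the workers listed in conf/workers (localhost by default). Note that they reach the worker hosts over SSH, so they expect SSH access to be set up even for localhost.
/opt/spark/sbin/stop-all.sh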
You now have Apache Spark installed and working on your Debian 11 / Debian 10 Linux system. Apache Spark uses in-memory caching and optimized query execution for fast analytic queries against data of any size, and it is one of the most widely used distributed big data processing frameworks. Feel free to contribute towards its improvement, and visit the project's official documentation to read more.