How to Install Spark On Windows
Apache Spark is one of the most popular data processing tools. It ships with several useful libraries for streaming, machine learning, and more. In this blog we are going to learn how to install Spark on Windows. It is a common misconception that Spark is part of the Hadoop ecosystem and needs a Hadoop installation to work. We will see how easy it is to set up Spark on Windows and use it for practice.
Java Virtual Machine
Before we start, we need to make sure Java is set up on our machine. This is necessary because Spark runs on the JVM. We can check whether Java is installed by running the command below in PowerShell.
java -version
java version "1.8.0_281"
Java(TM) SE Runtime Environment (build 1.8.0_281-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.281-b09, mixed mode)
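If you would rather script this check, here is a minimal Python sketch (Python is not needed for the rest of this guide). It assumes `java` is reachable from your Path, and the function names `parse_java_version` and `installed_java_version` are made up for illustration. One detail worth knowing: the JVM prints its version banner to stderr, so the sketch reads both output streams.

```python
import re
import subprocess


def parse_java_version(output):
    """Return the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def installed_java_version():
    """Run `java -version`; return the version string, or None if Java cannot be run."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        # Raised when the java executable is not found on Path.
        return None
    # The version banner goes to stderr, so look at both streams.
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    print(installed_java_version())
```

A result of `None` means Java could not be started or its banner was not recognised, in which case one of the installation methods in this section applies.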
If you do not have Java installed on your Windows machine, you can follow one of the methods below.
Using Chocolatey to Install Java
You can use the Chocolatey package manager for Windows to set up Java on your machine. All you need to do is run the commands below in PowerShell.
choco install jdk8
choco install javaruntime
You can get more details about these packages on the Chocolatey website.
Downloading Java from Oracle site
You can also download the latest version of Java from the Oracle website and install it on Windows. You can get Java at this link: https://www.java.com/en/download/. Once you have the installer, just run it and it will set up Java on your machine.
Once the installation is complete by either method, check the Java version using the command mentioned above. If the output shows a version, all is good. If you get no output or an error, check that JAVA_HOME is set in your environment variables and points to your Java installation directory.
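The JAVA_HOME check can also be sketched in a few lines of Python. This is only an illustration: the helper name `check_java_home` is hypothetical, and it assumes a valid JAVA_HOME is a directory whose `bin` folder contains the `java` executable.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME points at a Java install."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    bin_dir = Path(java_home) / "bin"
    # The executable is java.exe on Windows and java on other systems.
    if any((bin_dir / name).is_file() for name in ("java.exe", "java")):
        return True, f"JAVA_HOME looks valid: {java_home}"
    return False, f"JAVA_HOME is {java_home}, but no java executable was found in {bin_dir}"


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

If the message says the variable is set but no executable was found, JAVA_HOME is most likely pointing at the wrong folder (for example at `bin` itself rather than its parent).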
Download Spark
Now we can download Spark from the Apache Spark website. You can choose the Spark version you need and the pre-built Hadoop version it is packaged for.
Setting Up Spark On Windows
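Once your download is complete, you will have a compressed archive (Spark releases are packaged as .tgz files). Extract it with any archive tool that handles that format; the extracted folder contains the Spark files.
Now we can place this folder anywhere on our Windows system. I like to create a spark directory under the C drive and put the files there.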
Setting Up WinUtils for Hadoop version
The next thing we need is the winutils file. This will trick Spark into thinking Hadoop is installed on this machine. You can download “winutils” at this GitHub repository; pick the build that matches the Hadoop version of your Spark package.
Once you have this exe file, create another directory on the C drive named hadoop, create a bin directory inside it, and put the exe file in C:\hadoop\bin.
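To double-check that the file ended up where we expect, something like the following Python sketch could be used. The helper names are hypothetical, and the default location `C:\hadoop` simply mirrors the layout described above; pass a different folder if you chose another one.

```python
from pathlib import Path, PureWindowsPath


def expected_winutils(hadoop_home=r"C:\hadoop"):
    """Build the Windows path where winutils.exe should live."""
    return PureWindowsPath(hadoop_home) / "bin" / "winutils.exe"


def winutils_present(hadoop_home=r"C:\hadoop"):
    """Return True if winutils.exe exists at the expected location."""
    return Path(str(expected_winutils(hadoop_home))).is_file()


if __name__ == "__main__":
    print(expected_winutils(), "found" if winutils_present() else "missing")
```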
Setting Up Environment Variables
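The last thing we need to do is set up environment variables for Spark home and Hadoop home so that we can access Spark from anywhere.
To set this up, search for "environment variables" in the Windows Start menu. In the Environment Variables dialog, create two user variables: SPARK_HOME pointing to C:\spark and HADOOP_HOME pointing to C:\hadoop. Spark looks for winutils.exe under %HADOOP_HOME%\bin, so HADOOP_HOME must be the parent of the bin directory. Then go to the “Path” variable for your user.
Select and edit the Path variable and add the two lines below to it. If you placed the Spark files and winutils in different directories, change the paths accordingly.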
C:\spark\bin
C:\hadoop\bin
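Keep in mind that a PowerShell window that was already open will not see the new values, so open a fresh one before testing. As a rough way to confirm that both entries made it into Path, here is a small Python sketch. The function name `missing_path_entries` is hypothetical and the two default directories are the ones used in this post.

```python
import ntpath
import os


def _normalize(entry):
    # Windows paths are case-insensitive; normpath also drops a trailing backslash.
    return ntpath.normcase(ntpath.normpath(entry.strip()))


def missing_path_entries(path_value, required=(r"C:\spark\bin", r"C:\hadoop\bin")):
    """Return the required directories that are absent from a ';'-separated Path value."""
    present = {_normalize(e) for e in path_value.split(";") if e.strip()}
    return [d for d in required if _normalize(d) not in present]


if __name__ == "__main__":
    # On Windows, os.environ lookups ignore case, so "PATH" also finds "Path".
    print(missing_path_entries(os.environ.get("PATH", "")))
```

An empty list means both directories are on Path; anything listed still needs to be added.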
Check Installation Status
If you have come this far and done all the steps correctly, you should be able to use Spark from PowerShell. To check this, try running “spark-shell” or “pyspark” from Windows PowerShell. If you get output with the Spark version, all is good and you can start working with Spark on your own machine.
PS C:\Users\mahesh> spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
....
Spark context Web UI available at http://host.docker.internal:4041
Spark context available as 'sc' (master = local[*], app id = local-1617038636125).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
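If PowerShell reports that the command is not recognised, the launchers are probably not on Path. A quick way to see what resolves is Python's standard `shutil.which`, which on Windows also honours extensions such as `.cmd`. This is only a sketch, and the helper name `find_launchers` is hypothetical.

```python
import shutil


def find_launchers(names=("spark-shell", "pyspark")):
    """Map each launcher name to the full path found on Path, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_launchers().items():
        print(f"{name}: {location or 'not found on Path'}")
```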
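When you launch Spark, you can check the status of your Spark jobs in the web UI, by default at http://localhost:4040/. If that port is already taken, Spark moves to the next free one, which is why the output above shows port 4041.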
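While a Spark shell is running, you can also probe the web UI from a script. The sketch below uses only the Python standard library; the function name `spark_ui_reachable` is hypothetical, and proxies are disabled on purpose because the UI is served from the local machine.

```python
import http.client
import urllib.request


def spark_ui_reachable(port=4040, timeout=2.0):
    """Return True if an HTTP server answers with status 200 on localhost:<port>."""
    url = f"http://localhost:{port}/"
    # An empty ProxyHandler makes sure the request is never sent through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP error responses.
        return False


if __name__ == "__main__":
    for port in (4040, 4041):
        print(port, spark_ui_reachable(port))
```

Note that this only tells you that something is answering on the port; open the address in a browser to confirm it is really the Spark UI.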
Conclusion
In this article, we have learned how to set up Spark on Windows. This was much easier than expected, and we had Spark running in a few minutes. We will use this Spark setup to learn and practise in future blogs. I hope you have found this useful. See you soon.