In the last blog, we set up Spark on our machine. We can access Spark from the console or command prompt, but for day-to-day work this is not ideal: it does not save our commands, fixing errors in the console is difficult, and there is no IntelliSense.
That is why, in this blog, we are going to learn how to use Spark with Jupyter notebooks. We can use Jupyter notebooks from Anaconda, or we can use them inside Visual Studio Code. I like to use Visual Studio Code because it is lightweight, has a lot of good extensions, and we do not need a separate IDE just for working with notebooks. So let's get started.
Setting Up Visual Studio Code
The first thing we need is Visual Studio Code installed on our machine. It is free to download and easy to set up. You can download it from this link.
Once you have installed Visual Studio Code, open it, search for the Python extension, and install it.
Required Python Packages
Next, we need to install the required Python packages on our system. For this, we can use pip. There are two packages to install:
- jupyter – this package lets us use Jupyter notebooks inside Visual Studio Code.
- findspark – this package helps Python locate the Spark installation on our machine so that it can be used from Jupyter notebooks.
We can install both packages using the commands below.
pip install jupyter
pip install findspark
Starting Jupyter Notebook In Visual Studio Code
We can now work with notebooks in Visual Studio Code. Open Visual Studio Code and press "CTRL + SHIFT + P". This will open the Command Palette. Search for "Create: New Jupyter Notebook".

This will create and open our notebook.
To use Spark inside the notebook, we first need to initialize findspark. We can do that using the code below.
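A minimal initialization sketch (this assumes Spark is already installed and the `SPARK_HOME` environment variable points to it; if not, `findspark.init` also accepts the installation path as an argument):

```python
import findspark

# Locate the Spark installation (uses the SPARK_HOME environment
# variable if set) and add pyspark to sys.path for this session
findspark.init()

# If SPARK_HOME is not set, pass the path explicitly, for example:
# findspark.init("D:\\spark")  # hypothetical install location
```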
Now we can create a Spark session to use for our work.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
As a quick test, let us read a CSV file and print its schema.

df = spark.read.csv("D:\\code\\spark\\spark-basics\\data\\flight-data\\csv\\2010-summary.csv")
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
We have set up Visual Studio Code and Jupyter notebooks to work with Spark. This development environment will save you a lot of time and is easy to use when working with Spark. I hope you have found this useful. If you have questions, let me know. See you in the next blog.