Integrate Spark with Jupyter Notebook and Visual Studio Code
In the last blog, we set up Spark on our machine. We can access Spark from the console or command prompt, but while working this is not ideal: it does not save our commands, fixing errors in the console is difficult, and there is no IntelliSense.
That is why, in this blog, we are going to learn how to use Spark with Jupyter notebooks. We can use Jupyter notebooks from Anaconda, or we can use them inside Visual Studio Code as well. I like to use Visual Studio Code because it is lightweight, has a lot of good extensions, and we do not need another IDE just for working with notebooks. So let’s get started.
Setting Up Visual Studio Code
The first thing we will need is Visual Studio Code installed on our machine. It is free and easy to set up. You can download it from this link.
Once you have installed Visual Studio Code, open it, search for the Python extension, and install it.
Required Python Packages
Next, we need to install the required Python packages on our system. For this, we can use pip. There are two packages that we need to install.
- jupyter – this package lets us use Jupyter notebooks inside Visual Studio Code.
- findspark – this package lets the Spark installation on our machine integrate with Jupyter notebooks.
We can install both packages using the commands below.
```shell
pip install jupyter
pip install findspark
```
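After installing, you can quickly confirm that both packages are importable. The small check below uses only the standard library; the helper name `is_installed` is my own, not part of either package.

```python
import importlib.util

def is_installed(package):
    # find_spec returns None when the package cannot be imported
    return importlib.util.find_spec(package) is not None

for pkg in ["jupyter", "findspark"]:
    status = "ok" if is_installed(pkg) else "missing - run pip install " + pkg
    print(pkg, "->", status)
```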
Starting Jupyter Notebook In Visual Studio Code
We can now work with notebooks in Visual Studio Code. Open it and press “CTRL + SHIFT + P”. This will open the Command Palette. Search for “Create: New Jupyter Notebook”.
This will start our notebook.
To use Spark inside it, we first need to initialize findspark. We can do that using the code below.
```python
import findspark

findspark.init()
```
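Under the hood, `findspark.init()` essentially points `SPARK_HOME` at your install and puts Spark’s bundled Python libraries on `sys.path` so that `import pyspark` works from a plain Python process. The sketch below illustrates that mechanism; the function name and directory layout are illustrative, not findspark’s actual source.

```python
import glob
import os
import sys

def init_spark(spark_home):
    # Roughly what findspark.init() does: export SPARK_HOME and
    # prepend Spark's bundled Python libraries to sys.path.
    os.environ["SPARK_HOME"] = spark_home
    python_lib = os.path.join(spark_home, "python")
    # Spark ships py4j inside python/lib as a zip archive
    py4j_zips = glob.glob(os.path.join(python_lib, "lib", "py4j-*.zip"))
    sys.path[:0] = [python_lib] + py4j_zips
    return [python_lib] + py4j_zips
```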
Now we can create a Spark session to use for our work.
```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("D:\\code\\spark\\spark-basics\\data\\flight-data\\csv\\2010-summary.csv")
df.printSchema()
```

```
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
```
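Notice the generic column names (`_c0`, `_c1`, `_c2`): Spark treated the header row as data because we did not pass `header=True` to `spark.read.csv` (adding `inferSchema=True` would also give typed columns instead of all strings). The standard-library sketch below illustrates the difference using a made-up sample row in the shape of the flight-data file; the real file’s columns are assumed, not verified.

```python
import csv
import io

# Hypothetical sample in the shape of the flight-data file
sample = "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\nUnited States,Romania,1\n"

def column_names(text, header=False):
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        return rows[0]            # like spark.read.csv(..., header=True)
    # like Spark's default: generic _c0, _c1, ... names
    return [f"_c{i}" for i in range(len(rows[0]))]

print(column_names(sample))               # ['_c0', '_c1', '_c2']
print(column_names(sample, header=True))  # the real header names
```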
Conclusion
We have set up Visual Studio Code and Jupyter notebooks to use with Spark. This development environment will save you a lot of time and is easy to use when working with Spark. I hope you found this useful. If you have questions, let me know. See you in the next blog.