Integrate Spark with Jupyter Notebook and Visual Studio Code
In the last blog, we set up Spark on our machine. We can access Spark from the console or command prompt, but while working this is not ideal: it does not save our commands, fixing errors in the console is difficult, and there is no IntelliSense.
That is why, in this blog, we are going to learn how to use Spark with Jupyter notebooks. We can use Jupyter notebooks from Anaconda, or we can use them inside Visual Studio Code as well. I like to use Visual Studio Code because it is lightweight, has a lot of good extensions, and we do not need another IDE just for working with notebooks. So let’s get started.
Setting Up Visual Studio Code
The first thing we will need is Visual Studio Code installed on our machine. It is free and easy to set up. You can download it from this link.
Once you have installed Visual Studio Code, open it, search for the Python extension, and install it.
Required Python Packages
Next, we need to install the required Python packages on our system. For this, we can use pip. There are two packages that we need to install.
- jupyter – this package lets us use Jupyter notebooks inside Visual Studio Code.
- findspark – this package lets the Spark installation on our machine integrate with Jupyter notebooks.
We can install both packages using the commands below.
```shell
pip install jupyter
pip install findspark
```
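After installing, you can quickly confirm that both packages are importable. The small check below uses only the standard library; the helper name `is_installed` is my own, not part of either package.

```python
import importlib.util

def is_installed(package):
    # find_spec returns None when the package cannot be imported
    return importlib.util.find_spec(package) is not None

for pkg in ["jupyter", "findspark"]:
    status = "ok" if is_installed(pkg) else "missing - run pip install " + pkg
    print(pkg, "->", status)
```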
Starting Jupyter Notebook In Visual Studio Code
We can now work with notebooks in Visual Studio Code. Open it and press “CTRL + SHIFT + P”. This will open the Command Palette. Search for “Create: New Jupyter Notebook”.
This will start our notebook.
To use Spark inside it, we first need to initialize findspark. We can do that using the code below.
```python
import findspark

findspark.init()
```
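Under the hood, `findspark.init()` essentially points `SPARK_HOME` at your install and puts Spark’s bundled Python libraries on `sys.path` so that `import pyspark` works from a plain Python process. The sketch below illustrates that mechanism; the function name and directory layout are illustrative, not findspark’s actual source.

```python
import glob
import os
import sys

def init_spark(spark_home):
    # Roughly what findspark.init() does: export SPARK_HOME and
    # prepend Spark's bundled Python libraries to sys.path.
    os.environ["SPARK_HOME"] = spark_home
    python_lib = os.path.join(spark_home, "python")
    # Spark ships py4j inside python/lib as a zip archive
    py4j_zips = glob.glob(os.path.join(python_lib, "lib", "py4j-*.zip"))
    sys.path[:0] = [python_lib] + py4j_zips
    return [python_lib] + py4j_zips
```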
Now we can create a Spark session to use for our work.
```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("D:\\code\\spark\\spark-basics\\data\\flight-data\\csv\\2010-summary.csv")
df.printSchema()
```

```
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
```
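Notice the generic column names (`_c0`, `_c1`, `_c2`): Spark treated the header row as data because we did not pass `header=True` to `spark.read.csv` (adding `inferSchema=True` would also give typed columns instead of all strings). The standard-library sketch below illustrates the difference using a made-up sample row in the shape of the flight-data file; the real file’s columns are assumed, not verified.

```python
import csv
import io

# Hypothetical sample in the shape of the flight-data file
sample = "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\nUnited States,Romania,1\n"

def column_names(text, header=False):
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        return rows[0]            # like spark.read.csv(..., header=True)
    # like Spark's default: generic _c0, _c1, ... names
    return [f"_c{i}" for i in range(len(rows[0]))]

print(column_names(sample))               # ['_c0', '_c1', '_c2']
print(column_names(sample, header=True))  # the real header names
```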
Conclusion
We have set up Visual Studio Code and Jupyter notebooks to use with Spark. This development environment will save you a lot of time and is easy to use when working with Spark. I hope you found this useful. If you have questions, let me know. See you in the next blog.