Removing White Spaces From Data in Spark

Updated On September 13, 2020 | By Mahesh Mogal

Spark provides multiple methods to handle white spaces in data. The most flexible of them is "regexp_replace", but it is not always easy to use. So we will first learn some simpler functions - trim, ltrim & rtrim - to remove white spaces.

ltrim

We can use ltrim to remove white spaces from the beginning of a string.

df = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header","true") \
    .load("data/sample.csv")

from pyspark.sql.functions import col, ltrim, rtrim, trim
df.select(ltrim(col("DEST_COUNTRY_NAME"))).show(5)

rtrim

Just like ltrim, we can use rtrim to remove trailing white spaces from a string.

df.select(rtrim(col("DEST_COUNTRY_NAME"))).show(5)

trim

If we want to remove white spaces from both ends of a string, we can use the trim function.

df.select(trim(col("DEST_COUNTRY_NAME"))).show(5)

We can easily check whether this is working by using the length function.

from pyspark.sql.functions import length,col
df.select( \
    col("DEST_COUNTRY_NAME"), \
    length(col("DEST_COUNTRY_NAME")).alias("length_with_whitespace"), \
    trim(col("DEST_COUNTRY_NAME")), \
    length(trim(col("DEST_COUNTRY_NAME"))).alias("length_without_whitespace") \
    ).show(5)
Removing White Spaces using Trim

Using "regexp_replace" to remove white spaces

"regexp_replace" is powerful & multipurpose method. Let us see how we can use it to remove white spaces around string data in spark.

from pyspark.sql.functions import col, length, regexp_replace

reg_exp = "\\s+"
reg = regexp_replace(col("DEST_COUNTRY_NAME"), reg_exp, "")
df.select("DEST_COUNTRY_NAME", \
    length(col("DEST_COUNTRY_NAME")).alias("length_with_whitespace"), \
    reg.alias("white space removed"), \
    length(reg).alias("length_without_whitespace") \
    ).show(5)

Note that this regular expression removes all white space from a string, even the spaces between words. By changing the regular expression, you can adapt the above code to other use cases.
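Since "regexp_replace" accepts regular expressions, the patterns themselves can be sketched and tested locally with Python's re module before applying them to a dataframe. Below is a small sketch of two variations - the sample string and variable names are just assumptions for illustration:

```python
import re

sample = "  United States  "

# "\\s+" matches any run of whitespace, so substituting it with ""
# removes every space, including the one between the two words
no_ws = re.sub(r"\s+", "", sample)
print(no_ws)  # UnitedStates

# "^\\s+|\\s+$" matches only leading and trailing whitespace,
# so this behaves like the trim function
trimmed = re.sub(r"^\s+|\s+$", "", sample)
print(trimmed)  # United States
```

Once a pattern behaves as expected, it can be passed as the second argument to regexp_replace in the Spark code above. Keep in mind that regexp_replace uses Java regular expressions, though for simple patterns like these the behavior matches Python's.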

I hope you found this useful. See you in the next blog 🙂


Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.
