Removing White Spaces From Data in Spark
There are multiple methods provided by Spark to handle white space in data. The most basic way to remove white space is "regexp_replace". Unfortunately, "regexp_replace" is not always easy to use, so we are going to start with some simpler functions: trim, ltrim & rtrim.
ltrim
We can use ltrim to remove white spaces from the beginning of a string.
from pyspark.sql.functions import col, ltrim, rtrim, trim

df = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("data/sample.csv")

df.select(ltrim(col("DEST_COUNTRY_NAME"))).show(5)
rtrim
Just like ltrim, we can use rtrim to remove trailing white spaces from a string.
df.select(rtrim(col("DEST_COUNTRY_NAME"))).show(5)
trim
If we want to remove white spaces from both ends of a string, we can use the trim function.
df.select(trim(col("DEST_COUNTRY_NAME"))).show(5)
We can easily check whether this is working by using the length function.
from pyspark.sql.functions import col, length

df.select(
    col("DEST_COUNTRY_NAME"),
    length(col("DEST_COUNTRY_NAME")).alias("length_with_whitespace"),
    trim(col("DEST_COUNTRY_NAME")),
    length(trim(col("DEST_COUNTRY_NAME"))).alias("length_without_whitespace")
).show(5)
Using “regexp_replace” to remove white spaces
"regexp_replace" is a powerful, multipurpose function. Let us see how we can use it to remove white space around string data in Spark.
from pyspark.sql.functions import col, length, regexp_replace

reg_exp = r"\s+"
reg = regexp_replace(col("DEST_COUNTRY_NAME"), reg_exp, "")
df.select(
    "DEST_COUNTRY_NAME",
    length(col("DEST_COUNTRY_NAME")).alias("length_with_whitespace"),
    reg.alias("white space removed"),
    length(reg).alias("length_without_whitespace")
).show(5)
Note that this regular expression removes all white space from a string, even the spaces between words. By changing the regular expression, you can adapt the code above for many other use cases.
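Before changing the pattern in Spark, it can help to prototype it with Python's built-in re module; for simple patterns like these, the Java regex engine that Spark uses behaves the same way. As a quick sketch (the sample string here is made up for illustration): "\s+" strips every run of white space, while "^\s+|\s+$" anchors to the ends of the string and so removes only leading and trailing white space, mimicking trim.

```python
import re

s = "  United  States  "

# "\s+" matches every run of whitespace, including spaces between words
no_ws = re.sub(r"\s+", "", s)
print(no_ws)  # UnitedStates

# "^\s+|\s+$" matches only at the ends, so inner spaces survive (like trim)
trimmed = re.sub(r"^\s+|\s+$", "", s)
print(trimmed)  # United  States
```

Once a pattern behaves as expected here, you can drop it straight into reg_exp in the Spark code above.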
I hope you found this useful. See you in the next blog 🙂