While processing data, working with strings is one of the most common tasks. That is why Spark provides multiple functions that make it easy to process string data. In this blog, we will list most of the string functions available in Spark.
We can use "initCap" function to capitalize word in string.
df_csv = spark.read.format("csv") \ .option("inferSchema", "true") \ .option("header","true") \ .load("data/flights.csv") from pyspark.sql.functions import initcap df_csv.select(initcap("DEST_COUNTRY_NAME")).show(2) +--------------------------+ |initcap(DEST_COUNTRY_NAME)| +--------------------------+ | United States| | United States| +--------------------------+ |
We can use the "upper" function to convert all characters in a string to uppercase.
from pyspark.sql.functions import upper, lower, col

df_csv.select(upper(col("DEST_COUNTRY_NAME"))).show(2)

+------------------------+
|upper(DEST_COUNTRY_NAME)|
+------------------------+
|           UNITED STATES|
|           UNITED STATES|
+------------------------+
Similarly, we can use the"lower" function to convert string to lowercase characters.
df_csv.select(lower(col("ORIGIN_COUNTRY_NAME"))).show(2) |
We can use the "trim" function to remove leading and trailing white spaces from data in Spark.
from pyspark.sql.functions import ltrim, rtrim, trim

df_csv.select(trim(col("DEST_COUNTRY_NAME"))).show(5)
There are two other functions as well: ltrim and rtrim. These can be used to remove leading white spaces and trailing white spaces respectively. If you need more detail about these functions, you can read this blog.
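Since the flights data has no surrounding white spaces, the effect of these functions is easier to see on a small hand-made dataframe. Below is a minimal sketch, assuming an active SparkSession named spark; the column name value and the padded sample string are only for illustration.

from pyspark.sql.functions import ltrim, rtrim, trim, col

# A tiny dataframe with one padded string to demonstrate the three functions
df_pad = spark.createDataFrame([("   hello   ",)], ["value"])

df_pad.select(
    ltrim(col("value")).alias("left_trimmed"),    # removes leading spaces  -> "hello   "
    rtrim(col("value")).alias("right_trimmed"),   # removes trailing spaces -> "   hello"
    trim(col("value")).alias("both_trimmed")      # removes both            -> "hello"
).show()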
We can use the lpad and rpad functions to add padding to data in Spark. These functions can be used to format data if needed.
from pyspark.sql.functions import lpad, rpad

df_csv.select("DEST_COUNTRY_NAME",
              "count",
              lpad(trim(col("count")), 4, "0").alias("formatted_data")
             ).show(2)
For more details about padding, you can read this blog.
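rpad works the same way but pads on the right. Here is a quick sketch mirroring the lpad example above; the alias right_padded is just an illustrative name, not part of the original example.

from pyspark.sql.functions import rpad, trim, col

# Pad the trimmed count column on the right with zeros up to 4 characters
df_csv.select("count",
              rpad(trim(col("count")), 4, "0").alias("right_padded")
             ).show(2)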
regexp_replace is a powerful function and can be used for multiple purposes, from removing white spaces to replacing a string with something else. In the example below, we are replacing "United States" with "us".
from pyspark.sql.functions import regexp_replace

reg_exp = "United States"

df_csv.select(regexp_replace(col("DEST_COUNTRY_NAME"), reg_exp, "us").alias("dest"),
              "DEST_COUNTRY_NAME"
             ).show(5)
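As a sketch of the other use mentioned above, removing white spaces, we can replace the regular expression \s+ with an empty string. The alias no_spaces is only illustrative, not from the original example.

from pyspark.sql.functions import regexp_replace, col

# Replace one or more whitespace characters with an empty string
df_csv.select(regexp_replace(col("DEST_COUNTRY_NAME"), "\\s+", "")
              .alias("no_spaces")).show(2)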
Though regexp_replace is a powerful function, it can be difficult to use in many cases. That is why Spark provides some simpler functions to deal with strings.
Using the translate function, we can replace one or more characters with other characters; each character in the matching set is replaced by the character at the same position in the replacement set.
from pyspark.sql.functions import translate

df_csv.select(translate(col("DEST_COUNTRY_NAME"), "t", "T")
              .alias("Updated Name")).show(5)

+-------------+
| Updated Name|
+-------------+
|UniTed STaTes|
|UniTed STaTes|
|UniTed STaTes|
|        EgypT|
|UniTed STaTes|
+-------------+
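Because translate maps characters one to one, we can pass several characters in a single call. The following is a small sketch, not from the original example: every "t" becomes "T" and every "e" becomes "3".

from pyspark.sql.functions import translate, col

# With this mapping, "United States" would become "UniT3d STaT3s"
df_csv.select(translate(col("DEST_COUNTRY_NAME"), "te", "T3")
              .alias("Updated Name")).show(2)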
Sometimes we need to check whether a string contains another string. For this, we can use the instr function.
from pyspark.sql.functions import instr

df_csv.select("DEST_COUNTRY_NAME",
              instr(col("DEST_COUNTRY_NAME"), "United")
             ).show(5)
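instr returns the 1-based position of the substring and 0 when it is absent, so the result can also be used inside where to keep only the matching rows. A minimal sketch, assuming the same df_csv dataframe:

from pyspark.sql.functions import instr, col

# Keep only rows whose destination country contains the word "United"
df_csv.where(instr(col("DEST_COUNTRY_NAME"), "United") > 0) \
      .select("DEST_COUNTRY_NAME").show(2)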
We have seen multiple useful string functions in Spark. This list is by no means exhaustive. I will keep adding more functions as I encounter them. If you know other functions for handling strings, do let me know. I hope you found this useful. See you in the next blog (Y)