
How To Concatenate To A Null Column In Pyspark Dataframe

I have the below dataframe and I want to update the rows dynamically with some values:

input_frame.show()
+----------+----------+---------+
|student_id|name      |timestamp|
+------

Solution 1:

Use concat_ws, like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")

# This won't work
df = df.withColumn("concat", concat(df.a, df.b))

# This won't work either
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))

# Do it like this
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()

gives:

+----+----+------+-------------+---------+
|   a|   b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
|   1|   2|    12|           12|       12|
|   2|null|  null|         null|        2|
|   3|   4|    34|           34|       34|
|   4|   5|    45|           45|       45|
|null|   6|  null|         null|        6|
+----+----+------+-------------+---------+

Note specifically that casting a NULL column to string doesn't work as you might hope: concat still returns NULL for the whole value if any of its inputs is null.
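You can see the same behaviour directly in Spark SQL (a minimal check, reusing the spark session from above):

# NULL propagates through concat, even with an explicit cast to string
spark.sql("SELECT concat('1', CAST(NULL AS STRING)) AS c").show()
# +----+
# |   c|
# +----+
# |null|
# +----+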

There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:

df.withColumn("concat_custom", concat(
  when(df.a.isNull(), lit('_')).otherwise(df.a), 
  when(df.b.isNull(), lit('_')).otherwise(df.b))
)

To get, e.g.:

+----+----+-------------+
|   a|   b|concat_custom|
+----+----+-------------+
|   1|   2|           12|
|   2|null|           2_|
|   3|   4|           34|
|   4|   5|           45|
|null|   6|           _6|
+----+----+-------------+
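If you need this over many columns, one way (just a sketch, assuming a '_' placeholder for nulls is acceptable) is to build the same when/otherwise expressions in a comprehension:

from pyspark.sql.functions import concat, when, lit, col

cols = ["a", "b"]  # hypothetical list of columns to concatenate
exprs = [when(col(c).isNull(), lit('_')).otherwise(col(c)) for c in cols]
df.withColumn("concat_custom", concat(*exprs)).show()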

Solution 2:

You can fill null values with empty strings:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
# Add a column that is entirely NULL
data = data.withColumn('test', f.lit(None).cast(StringType()))
# Replace nulls with empty strings before concatenating
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))

Is that what you were looking for?
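Note that display is a Databricks notebook helper; outside Databricks, a plain .show() does the same job (a minimal sketch, reusing the data dataframe from above):

# Outside Databricks, use .show() instead of display()
data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')).show()
# +----+----+----+-----+
# |col1|col2|test|test2|
# +----+----+----+-----+
# |  s1|  t1|    | s1t1|
# |  s2|  t2|    | s2t2|
# +----+----+----+-----+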

Solution 3:

This will resolve the issue:

df = df.withColumn("concat", concat(collease(df.a, lit('')), collease(df.b, lit(''))))
