Creating Dictionary From Large Pyspark Dataframe Showing OutOfMemoryError: Java Heap Space
I have seen and tried many existing StackOverflow posts regarding this issue but none work. I guess my JAVA heap space is not as large as expected for my large dataset, My dataset
Solution 1:
Why not keep as much data and processing in Executors, rather than collecting to Driver? If I understand this correctly, you could use pyspark
transformations and aggregations and save directly to JSON, therefore leveraging executors, then load that JSON file (likely partitioned) back into Python as a dictionary. Admittedly, you introduce IO overhead, but this should allow you to get around your OOM heap space errors. Step-by-step:
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
data = [
("BOND-9129450", "90cb"),
("BOND-1742850", "d5c3"),
("BOND-3211356", "811f"),
("BOND-7630290", "d5c3"),
("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
df.groupBy(
f.col("hash_of_cc_pn_li"),
).agg(
f.collect_set("id").alias("id") # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
Inspecting the output path:
ls -l ./test.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 _SUCCESS
_SUCCESS
Loading to Python as dict
:
import json
from glob import glob
data = []
for file_name in glob('./test.json/*.json'):
with open(file_name) as f:
try:
data.append(json.load(f))
except json.JSONDecodeError: # there is definitely a better way - this is here because some partitions might be empty
pass
Finally
{item['hash_of_cc_pn_li']:item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
'811f': ['BOND-3211356'],
'90cb': ['BOND-9129450', 'BOND-7175508']}
I hope this helps! Thank you for the good question!
Post a Comment for "Creating Dictionary From Large Pyspark Dataframe Showing OutOfMemoryError: Java Heap Space"