
Filtering Files Using Specific Pattern When Reading Tar.gz Archive In Pyspark

I have multiple CSV files in an archive myfolder.tar.gz, which I created as follows: I first put all my files in a folder named myfolder and then made a tar.gz archive of that folder.
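For reference, a minimal sketch of how such an archive could be produced with Python's tarfile module (the folder name myfolder and its contents are taken from the question; adjust to your layout):

import tarfile

# Bundle the folder (and the CSV files inside it, e.g. def_1.csv, def_2.csv, ...)
# into a gzipped tar archive.
with tarfile.open("myfolder.tar.gz", "w:gz") as tar:
    tar.add("myfolder")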

Solution 1:

Based on this post, you can read the .tar.gz file with binaryFiles, then use Python's tarfile module to extract the archive members and filter on the file names with the regex def_[1-9]. The result is an RDD that you can convert into a DataFrame:

import re
import tarfile
from io import BytesIO

# extract only the files whose names match the regex 'def_[1-9]\.csv'
def extract_files(raw):
    tar = tarfile.open(fileobj=BytesIO(raw), mode="r:gz")
    # re.search (rather than re.match) also matches members that carry a
    # folder prefix such as 'myfolder/def_1.csv'
    return [tar.extractfile(x).read() for x in tar if re.search(r"def_[1-9]\.csv$", x.name)]

# read the archive as a single binary file, extract the matching members,
# then split each member into lines and each line into fields
rdd = sc.binaryFiles("/path/myfolder.tar.gz") \
        .mapValues(extract_files) \
        .flatMap(lambda row: [x.decode("utf-8").split("\n") for x in row[1]]) \
        .flatMap(lambda row: [e.split(",") for e in row])

csv_cols = ["col1", "col2"]   # placeholder: replace with your CSV column names
df = rdd.toDF(csv_cols)       # RDD.toDF takes the column names as a list
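Note that the plain split(",") above will misparse quoted fields and keeps header rows and blank trailing lines. A minimal variant, assuming each member file starts with a single header row, that parses with Python's csv module instead:

import csv
import io
import re
import tarfile
from io import BytesIO

def extract_rows(raw):
    # Parse matching members with the csv module (handles quoted fields),
    # skipping each file's header row and blank lines.
    tar = tarfile.open(fileobj=BytesIO(raw), mode="r:gz")
    rows = []
    for member in tar:
        if member.isfile() and re.search(r"def_[1-9]\.csv$", member.name):
            text = tar.extractfile(member).read().decode("utf-8")
            reader = csv.reader(io.StringIO(text))
            next(reader, None)                   # drop the header row
            rows.extend(r for r in reader if r)  # drop blank lines
    return rows

rdd = sc.binaryFiles("/path/myfolder.tar.gz").flatMap(lambda kv: extract_rows(kv[1]))
df = rdd.toDF(csv_cols)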
