
Json Lines (jsonl) Generator To Csv Format

I have a large JSONL file (6GB+) which I need to convert to .csv format. After running:

import json

with open(root_dir + 'filename.json') as json_file:
    for line in json_file:

Solution 1:

The generator seems completely superfluous.

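import csv
import json

# newline='' lets the csv module control line endings itself
with open(root_dir + 'filename.json') as old, open(root_dir + 'output.csv', 'w', newline='') as csvfile:
    new = csv.writer(csvfile)
    for x in old:
        row = json.loads(x)  # one JSON document per input line
        new.writerow(row)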

If one line of JSON does not simply produce an array of strings and numbers, you still need to figure out how to convert it from whatever structure is inside the JSON to something which can usefully be serialized as a one-dimensional list of strings and numbers of a fixed length.

If each line of your JSON can be expected to reliably contain a single dictionary with a fixed set of key-value pairs, maybe try

from csv import DictWriter
import json

with open(jsonfile, 'r') as inp, open(csvfile, 'w', newline='') as outp:
    writer = DictWriter(outp, fieldnames=[
            'url', 'date', 'content', 'renderedContent', 'id', 'username',
            # 'user',  # see below
            'outlinks', 'outlinksss', 'tcooutlinks', 'tcooutlinksss',
            'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
            'conversationId', 'lang', 'source', 'media', 'retweetedTweet',
            'quotedTweet', 'mentionedUsers'],
        # without this, DictWriter raises ValueError on unlisted keys like 'user'
        extrasaction='ignore')
    for line in inp:
        row = json.loads(line)
        writer.writerow(row)

I omitted the user field because in your example, this key contains a nested structure which cannot easily be transformed into CSV without further mangling. Perhaps you would like to extract just user.id into a new field user_id; or perhaps you would like to lift the entire user structure into a flattened list of additional columns in the main record like user_username, user_displayname, user_id, etc?
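As a rough sketch of the second option, the nested user structure could be lifted into prefixed columns before each row is written. The helper name flatten_user, the choice of sub-keys, and the two-record sample are all assumptions made for illustration; the real records may well carry different fields.

```python
import csv
import io
import json


def flatten_user(record, keys=("id", "username", "displayname")):
    """Return a copy of record with the nested 'user' dict lifted into user_* columns."""
    flat = {k: v for k, v in record.items() if k != "user"}
    user = record.get("user") or {}  # tolerate a missing or null user
    for key in keys:
        flat["user_" + key] = user.get(key)
    return flat


# Hypothetical sample standing in for the real JSONL file.
sample = io.StringIO(
    '{"id": 1, "content": "hello", "user": {"id": 10, "username": "alice", "displayname": "Alice"}}\n'
    '{"id": 2, "content": "bye", "user": {"id": 11, "username": "bob", "displayname": "Bob"}}\n'
)
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["id", "content", "user_id", "user_username", "user_displayname"],
    extrasaction="ignore",  # drop any keys that are not listed as columns
)
writer.writeheader()
for line in sample:
    writer.writerow(flatten_user(json.loads(line)))

print(out.getvalue())
```

For the real 6GB file, the two StringIO objects would be replaced by the opened input and output files; the loop still handles one line at a time, so memory use stays flat.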

In some more detail, CSV is basically a two-dimensional matrix where every row is a one-dimensional collection of columns corresponding to one record in the data, where each column can contain one string or one number. Every row needs to have exactly the same number of columns, though you can leave some of them empty.

JSON which can trivially be transformed into CSV would look like

["Adolf", "1945", 10000000]
["Joseph", "1956", 25000000]
["Donald", null, 1000000]
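To make the "trivial" case concrete, here is a minimal sketch that feeds those three arrays, one per line, straight to csv.writer. It uses in-memory StringIO objects instead of real files so that it runs on its own; that substitution is an assumption for the sake of the demo.

```python
import csv
import io
import json

# One JSON array per line, as in the example above.
jsonl = io.StringIO(
    '["Adolf", "1945", 10000000]\n'
    '["Joseph", "1956", 25000000]\n'
    '["Donald", null, 1000000]\n'
)
out = io.StringIO()
writer = csv.writer(out)
for line in jsonl:
    # JSON null becomes Python None, which csv.writer writes as an empty field.
    writer.writerow(json.loads(line))

print(out.getvalue())
```

Each array maps directly onto one CSV row, and the null simply becomes an empty column, so no extra transformation is needed.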

JSON which can be transformed to CSV by some transformation (which you'd have to specify separately, like for example with the dictionary key ordering specified above) might look like

{"name":"Adolf","dod":"1945","death toll":10000000}
{"dod":"1956","name":"Joseph","death toll":25000000}
{"death toll":1000000,"name":"Donald"}

(Just to make it more interesting, one field is missing, and the dictionary order varies from one record to the next. This is not typical, but definitely within the realm of valid corner cases that Python could not possibly guess on its own how to handle.)
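One way to spell out that transformation, sketched under the assumption that the column order name, dod, death toll is the one you want: DictWriter places each value by key, so the varying dictionary order does not matter, and restval supplies the value for a missing key. The in-memory StringIO objects stand in for real files.

```python
import csv
import io
import json

# One JSON object per line, as in the example above.
jsonl = io.StringIO(
    '{"name":"Adolf","dod":"1945","death toll":10000000}\n'
    '{"dod":"1956","name":"Joseph","death toll":25000000}\n'
    '{"death toll":1000000,"name":"Donald"}\n'
)
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["name", "dod", "death toll"],  # fixes the column order
    restval="",  # written for any key a record lacks
)
writer.writeheader()
for line in jsonl:
    writer.writerow(json.loads(line))

print(out.getvalue())
```

The explicit fieldnames list is the piece of information Python cannot guess: it is the specification of which keys become columns and in what order.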

Most real-world JSON is significantly more complex than either of these simple examples, to the point where we can say that the problem is not possible to solve in the general case.
