
Merge CSV Files With Different Column Order, Remove Duplicates

I have multiple CSV files with the same number of columns but a different column order in each. I want to merge them while removing duplicates; the other solutions here don't take the column order into account.

Solution 1:

The following script works properly if:

  • the CSV files aren't too big (i.e., they can be loaded into memory)
  • the first row of each CSV contains the column names

You only have to fill in files and final_headers:

import csv

files = ['c1.csv', 'c2.csv', 'c3.csv']
final_headers = ['col1', 'col2', 'col3']

merged_rows = set()
for f in files:
    with open(f, 'r', newline='') as csv_in:
        csvreader = csv.reader(csv_in, delimiter=',')
        # Map each column name to its position in this particular file
        headers = {h: i for i, h in enumerate(next(csvreader))}
        for row in csvreader:
            # Reorder the fields to final_headers; the set drops duplicates
            merged_rows.add(tuple(row[headers[x]] for x in final_headers))

with open('output.csv', 'w', newline='') as csv_out:
    csvwriter = csv.writer(csv_out, delimiter=',')
    # Write the header row so the output is self-describing
    csvwriter.writerow(final_headers)
    csvwriter.writerows(merged_rows)
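
If the files fit in memory anyway, a pandas sketch does the same in a few lines (pandas isn't part of the original answer; this is just an alternative under the same assumptions). concat aligns columns by header name, so the differing per-file column orders are handled automatically:

import pandas as pd

files = ['c1.csv', 'c2.csv', 'c3.csv']

# concat matches columns by name, so each file's column order is irrelevant;
# drop_duplicates then removes repeated rows before writing the result
merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
merged.drop_duplicates().to_csv('output.csv', index=False)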

Solution 2:

csvkit's csvjoin can do that. The -c flag takes one join column per file, in the same order the files are listed:

csvjoin -c "Column 1,Column 2" --outer file1.csv file2.csv

Solution 3:

Personally, I would separate the two tasks of merging files and removing duplicates. I would also recommend using a database instead of CSV files if that's an option, since managing columns in a database is easier.

Here is an example using Python, which has a csv library that is easy to use.

import csv

# srcPath and destPath are placeholders for your input and output files
with open(srcPath, 'r', newline='') as srcCSV:
    csvReader = csv.reader(srcCSV, delimiter=',')

    with open(destPath, 'w', newline='') as destCSV:
        csvWriter = csv.writer(destCSV, delimiter=',')

        for record in csvReader:
            # writerow takes a single sequence; list the fields in the
            # order you want, e.g. [record[1], record[3], record[2], ...]
            csvWriter.writerow([record[1], record[3], record[2]])

This lets you rewrite the columns in any order you choose. The destination CSV could be an existing file that you append to (open it with mode 'a'), or a new one with a better format. Using the csv library helps prevent the quoting and escaping mistakes that creep in with manual string handling.
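
As a sketch of the merging half under assumed file names and column order, the same pattern extends to several differently-ordered sources by using each file's header row to look up column positions:

import csv

sources = ['a.csv', 'b.csv', 'c.csv']        # assumed input files
final_headers = ['col1', 'col2', 'col3']     # desired output order

with open('merged.csv', 'w', newline='') as dest:
    csvWriter = csv.writer(dest)
    csvWriter.writerow(final_headers)
    for path in sources:
        with open(path, 'r', newline='') as src:
            csvReader = csv.reader(src)
            # Map header names to positions in this particular file
            index = {h: i for i, h in enumerate(next(csvReader))}
            for record in csvReader:
                csvWriter.writerow([record[index[h]] for h in final_headers])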

Once the data is consolidated, you could use the same library to iterate over the single data file to identify records that are identical.

Note: this method reads and writes files one line at a time, so it can process files of any size. I used this method to consolidate 221 million records from files as large as 6 GB each.
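
For the de-duplication pass, here is a minimal streaming sketch (merged.csv and deduped.csv are assumed names). It stores one small digest per unique row, so memory grows with the number of distinct records rather than with file size:

import csv
import hashlib

seen = set()
with open('merged.csv', 'r', newline='') as src, \
     open('deduped.csv', 'w', newline='') as dest:
    csvReader = csv.reader(src)
    csvWriter = csv.writer(dest)
    for record in csvReader:
        # Join fields on an unambiguous separator and hash; a 16-byte
        # digest makes accidental collisions astronomically unlikely
        key = hashlib.blake2b('\x1f'.join(record).encode(), digest_size=16).digest()
        if key not in seen:
            seen.add(key)
            csvWriter.writerow(record)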
