Merge CSV Files With Different Column Order and Remove Duplicates
Solution 1:
The following script works properly if:
- the CSV files aren't too big (i.e. they can be loaded in memory)
- the first row of each CSV file contains the column names
You only have to fill in files and final_headers.
import csv

files = ['c1.csv', 'c2.csv', 'c3.csv']
final_headers = ['col1', 'col2', 'col3']

# A set keeps each distinct row once; row order is not preserved.
merged_rows = set()
for f in files:
    # Python 3: open CSV files in text mode with newline=''.
    with open(f, newline='') as csv_in:
        csvreader = csv.reader(csv_in, delimiter=',')
        # Map each column name in this file to its position.
        headers = {h: i for i, h in enumerate(next(csvreader))}
        for row in csvreader:
            # Reorder the row to match final_headers.
            merged_rows.add(tuple(row[headers[x]] for x in final_headers))

with open('output.csv', 'w', newline='') as csv_out:
    csvwriter = csv.writer(csv_out, delimiter=',')
    csvwriter.writerows(merged_rows)
Solution 2:
Solution 3:
Personally, I would separate the two tasks of merging files and removing duplicates. I would also recommend using a database instead of CSV files if that's an option, since managing columns in a database is easier.
Here is an example using Python, which has a csv library that is easy to use.
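import csv

# srcPath and destPath are placeholders for your own file paths.
with open(srcPath, 'r', newline='') as srcCSV:
    csvReader = csv.reader(srcCSV, delimiter=',')
    # 'a' appends to an existing file or creates a new one.
    with open(destPath, 'a', newline='') as destCSV:
        csvWriter = csv.writer(destCSV, delimiter=',')
        for record in csvReader:
            # writerow takes one sequence; list the columns in the order you want.
            csvWriter.writerow([record[1], record[3], record[2]])  # ... and so on through record[n]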
This allows you to rewrite the columns in any order you choose. The destination CSV could be an existing one that you expand, or it could be a new one with a better format. Using the CSV library will help prevent transcription errors that would happen elsewhere.
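A variant of the same idea, keyed on column names rather than positions, might look like the sketch below. It assumes every source file has a header row, which the example above does not require. The function name merge_reordered and its parameters are hypothetical, not part of the original answer.

```python
import csv


def merge_reordered(src_paths, dest_path, final_headers):
    """Stream rows from each source CSV into dest_path in final_headers order.

    Hypothetical helper: assumes every source file starts with a header row.
    Columns missing from a source are written as empty strings; columns not
    listed in final_headers are dropped.
    """
    with open(dest_path, 'w', newline='') as dest_csv:
        writer = csv.DictWriter(dest_csv, fieldnames=final_headers,
                                restval='', extrasaction='ignore')
        writer.writeheader()
        for path in src_paths:
            with open(path, newline='') as src_csv:
                # DictReader maps each row to {column name: value},
                # so the source column order no longer matters.
                for record in csv.DictReader(src_csv):
                    writer.writerow(record)
```

Because rows are copied one at a time, memory use stays flat regardless of file size. No duplicates are removed at this stage.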
Once the data is consolidated, you could use the same library to iterate over the single data file to identify records that are identical.
Note: this method reads and writes files a line at a time, so it can process files of any size. I used this method to consolidate 221 million records from files as large as 6 GB each.
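The duplicate-removal pass over the consolidated file could be sketched as follows. It still reads one row at a time, but it has to remember which rows it has already seen. This sketch stores a fixed-size SHA-256 digest per distinct row rather than the row itself; that is an assumption about what fits in memory, not something the answer specifies. The function name drop_duplicate_rows is hypothetical.

```python
import csv
import hashlib


def drop_duplicate_rows(src_path, dest_path):
    """Copy src_path to dest_path, keeping only the first copy of each row.

    Returns the number of duplicate rows skipped.
    """
    seen = set()
    skipped = 0
    with open(src_path, newline='') as src_csv, \
         open(dest_path, 'w', newline='') as dest_csv:
        writer = csv.writer(dest_csv)
        for record in csv.reader(src_csv):
            # Join fields with a separator unlikely to occur in the data,
            # so that ['ab', 'c'] and ['a', 'bc'] hash differently.
            key = hashlib.sha256('\x1f'.join(record).encode('utf-8')).digest()
            if key in seen:
                skipped += 1
                continue
            seen.add(key)
            writer.writerow(record)
    return skipped
```

Memory grows with the number of distinct rows (32 bytes of digest each, plus set overhead), so for very large inputs a database with a uniqueness constraint may be the more practical choice.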