How To Read The Csv File Properly If Each Row Contains Different Number Of Fields (number Quite Big)?
Solution 1:
As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:
import csv
import pandas as pd

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', newline='') as f_input:
    # Fields beyond the first four are collected into a list under restkey
    reader = csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                            restkey=csv_header[-1], skipinitialspace=True)
    for row in reader:
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError:
            # No review text on this row
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)
This would display the following:
         user      item  rating                                  review
0  disjiad123  TYh23hs9       5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123       3                         Suck restaurant
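To see what restkey is doing here, a small self-contained sketch may help. The in-memory sample is made up for illustration (it stands in for input.csv). It shows that the extra fields are collected into a list under the restkey name, and that the key is absent when a row has no extra fields, which is why the code above catches KeyError:

```python
import csv
import io

# Made-up sample standing in for input.csv; the second row has no review text
sample = io.StringIO(
    "disjiad123 TYh23hs9 13160032 5 I love this phone\n"
    "hjf2329ccc TGjsk123 14423321 3\n"
)

reader = csv.DictReader(sample, delimiter=' ',
                        fieldnames=['user', 'item', 'time', 'rating'],
                        restkey='review')
parsed = list(reader)

# Row with extra fields: they are gathered into a list under 'review'
print(parsed[0]['review'])
# Row with exactly four fields: there is no 'review' key at all
print('review' in parsed[1])
```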
If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:
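import pandas as pd

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

# newline='' leaves the original line endings untouched
with open('input.csv', newline='') as f_input:
    for row in f_input:
        # Reverse the line, drop the 2-character '\r\n' ending (now at the
        # start), split on spaces, and re-reverse each non-empty entry
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        # First four entries are the fixed columns; the rest is the review text
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print(frame)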
This would display:
   rating      time      item        user  \
0       5  13160032  TYh23hs9   isjiad123
1       3  14423321  TGjsk123  hjf2329ccc

                                    review
0  I love this phone as it is easy to used
1                          Suck restaurant
row[::-1] is used to reverse the text of the whole line, and the [2:] skips over the two-character line ending (\r\n), which is now at the start of the line. Each line is then split on spaces. A list comprehension then re-reverses each split entry. Finally, rows is appended to, first by taking the four fixed column entries (now at the start). The remaining entries are then put back in their original order, joined together with a space, and added as the final column.
The benefit of this approach is that it does not rely on your input data being in an exactly fixed width format, and you don't have to worry if the column widths being used change over time.
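The reversal trick can also be wrapped in a small helper that does not depend on the exact line ending. This is only a sketch: parse_reversed and its n_fixed parameter are names invented here, and the sample line is made up.

```python
def parse_reversed(line, n_fixed=4):
    """Split a 'review ... user item time rating' line by scanning from the right."""
    # Strip any line ending first, so '\n' and '\r\n' are both handled
    reversed_line = line.rstrip('\r\n')[::-1]
    # Split the reversed text on spaces and re-reverse each non-empty piece
    cols = [col[::-1] for col in reversed_line.split(' ') if col]
    # cols[:n_fixed] are the fixed fields (rating, time, item, user);
    # the remaining pieces are the review words in reverse order
    return cols[:n_fixed] + [' '.join(cols[n_fixed:][::-1])]


sample = "I love this phone    isjiad123    TYh23hs9    13160032    5\n"
print(parse_reversed(sample))
```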
Solution 2:
It looks like this is a fixed width file. Pandas supplies read_fwf for this exact purpose. The following code reads the file correctly for me. You may want to mess around with the widths a little if it doesn't work perfectly.
import pandas

pandas.read_fwf('test.fwf',
                widths=[13, 12, 13, 5, 100],
                names=['user', 'item', 'time', 'rating', 'review'])
If the columns still line up with the edited version (where the rating comes first), you just need to add the correct specification. A guide line like the following helps to do this quickly:
0         1         2         3         4         5         6         7         8
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
I love this phone as it is easy to used     isjiad123    TYh23hs9     13160032  5
Suck restaurant                             hjf2329ccc   TGjsk123     14423321  3
So the new command becomes:
pandas.read_fwf('test.fwf',
                colspecs=[[0, 43], [44, 56], [57, 69], [70, 79], [80, 84]],
                names=['review', 'user', 'item', 'time', 'rating'])
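If you know the field widths rather than the start and end positions, the colspecs can be derived from them. A rough sketch follows; the widths and the padded sample data are invented for the demonstration rather than taken from the real file:

```python
import io
import pandas as pd

# Hypothetical field widths; each field is padded to its width with ljust
widths = [44, 13, 13, 10, 5]
records = [
    ("I love this phone", "isjiad123", "TYh23hs9", "13160032", "5"),
    ("Suck restaurant", "hjf2329ccc", "TGjsk123", "14423321", "3"),
]
text = "\n".join(
    "".join(field.ljust(w) for field, w in zip(rec, widths)) for rec in records
)

# Turn the widths into half-open (start, end) column specifications
colspecs, start = [], 0
for w in widths:
    colspecs.append((start, start + w))
    start += w

df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None,
                 names=['review', 'user', 'item', 'time', 'rating'])
print(df)
```

Each colspec is a half-open interval, so a field may start at the first index and must end before the second.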
Solution 3:
usecols refers to the names of the columns in the input file. If your file doesn't have columns with those names (user, item, rating), it won't know which columns you're referring to. Instead you should pass indices, like usecols=[0,1,2].
Also, names refers to what you're calling the columns you import. So, I think you cannot have four names when importing three columns. Does this work?
pd.read_csv(filename, sep=" ",
            header=None,
            names=["user", "item", "rating"],
            usecols=[0, 1, 2])
The tokenizing error looks like a problem with the delimiter. It may try to parse your review text column as many columns, because "I" "love" "this" ... are all separated by spaces. Hopefully if you're only reading the first three columns you can avoid throwing an error, but if not you could consider parsing row-by-row (for example, here: http://cmdlinetips.com/2011/08/three-ways-to-read-a-text-file-line-by-line-in-python/) and writing to a DataFrame from there.
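A minimal sketch of the row-by-row idea, using only the standard library. The sample data is made up and stands in for the real file; only the four leading fields are kept, so the varying number of review words never reaches the parser that builds the frame:

```python
import csv
import io

# Made-up sample standing in for the real file
sample = io.StringIO(
    "disjiad123 TYh23hs9 13160032 5 I love this phone\n"
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n"
)

rows = []
reader = csv.reader(sample, delimiter=' ', skipinitialspace=True,
                    quoting=csv.QUOTE_NONE)  # treat quotes in reviews as plain text
for fields in reader:
    if not fields:
        continue  # skip blank lines
    # Keep the four leading fields; the review words that follow are ignored
    rows.append(fields[:4])

print(rows)
```

The resulting list can then be passed to pd.DataFrame together with whatever column names you want.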
Solution 4:
I think the best approach is using pandas read_csv:
import pandas as pd
import io

temp = u""" disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use
hjf2329ccc TGjsk123 14423321 3 Suck restaurant so I love cooking pizza with onion ham garlic tomatoes """

# estimated max number of columns
N = 20

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",      # separator is arbitrary whitespace
                 header=None,     # first row is not a header, read all data to df
                 names=range(N))
print(df)
            0         1         2  3     4           5     6      7     8  \
0  disjiad123  TYh23hs9  13160032  5     I        love  this  phone    as
1  hjf2329ccc  TGjsk123  14423321  3  Suck  restaurant    so      I  love

         9     10    11     12   13      14        15   16   17   18   19
0       it     is  easy     to  use     NaN       NaN  NaN  NaN  NaN  NaN
1  cooking  pizza  with  onion  ham  garlic  tomatoes  NaN  NaN  NaN  NaN
# select the wanted columns
df = df.iloc[:, [0, 1, 2]]
# rename columns
df.columns = ['user', 'item', 'time']
print(df)
         user      item      time
0  disjiad123  TYh23hs9  13160032
1  hjf2329ccc  TGjsk123  14423321
If you need all columns, you first need preprocessing to find the maximum number of columns for the usecols parameter, and then postprocessing to join the trailing columns into one:
import pandas as pd

# preprocessing: largest number of whitespace-separated fields in any row
def get_max_len():
    with open('file1.csv', 'r') as csvfile:
        return max(len(line.split()) for line in csvfile)

df = pd.read_csv('file1.csv',
                 sep=r"\s+",      # separator is arbitrary whitespace
                 header=None,     # first row is not a header, read all data to df
                 usecols=range(get_max_len()))  # every column up to the widest row
print(df)
            0         1         2  3     4           5     6      7    8  \
0  disjiad123  TYh23hs9  13160032  5     I        love  this  phone   as
1  hjf2329ccc  TGjsk123  14423321  3  Suck  restaurant   NaN    NaN  NaN

     9   10    11   12   13
0   it   is  easy   to  use
1  NaN  NaN   NaN  NaN  NaN
# columns from index 4 to the last
print(df.iloc[:, 4:])
      4           5     6      7    8    9   10    11   12   13
0     I        love  this  phone   as   it   is  easy   to  use
1  Suck  restaurant   NaN    NaN  NaN  NaN  NaN   NaN  NaN  NaN
# concatenate the remaining columns into one review text
df['review text'] = df.iloc[:, 4:].apply(lambda x: ' '.join([e for e in x if isinstance(e, str)]), axis=1)
df = df.rename(columns={0: 'user', 1: 'item', 2: 'time', 3: 'rating'})
# get the columns that have string names
cols = [x for x in df.columns if isinstance(x, str)]
# keep only the string-named columns
print(df[cols])
         user      item      time  rating  \
0  disjiad123  TYh23hs9  13160032       5
1  hjf2329ccc  TGjsk123  14423321       3

                              review text
0  I love this phone as it is easy to use
1                         Suck restaurant
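The two steps above (find the widest row, then join the trailing columns) can be condensed into one short sketch. It swaps usecols for names=range(...) and reads everything as strings with dtype=str, so that numbers inside a review are not reformatted; the sample text and the variable names are made up for illustration:

```python
import io
import pandas as pd

# Made-up sample standing in for file1.csv
raw = (
    "disjiad123 TYh23hs9 13160032 5 I love this phone\n"
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n"
)

# The widest row decides how many placeholder column names are needed
n_cols = max(len(line.split()) for line in raw.splitlines())

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=range(n_cols), dtype=str)

# Join the review words, skipping the NaN padding of shorter rows
review = df.iloc[:, 4:].apply(lambda row: ' '.join(row.dropna()), axis=1)

result = df.iloc[:, :4].copy()
result.columns = ['user', 'item', 'time', 'rating']
result['review'] = review
print(result)
```

One caveat with this sketch: read_csv treats some tokens such as NA as missing values by default, so a review containing such a word would lose it in the join.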
Solution 5:
Since the first four (now last four) of the fields are never going to contain spaces or need to be surrounded by quotes, let's forget about the csv library and use Python's awesome string handling directly. Here is a one-liner that splits each line into exactly five columns, courtesy of the maxsplit argument to rsplit():
import pandas as pd

with open("myfile.dat") as data:
    frame = pd.DataFrame(line.strip().rsplit(maxsplit=4) for line in data)
The above should solve your problem, but I prefer to unpack it into a generator function that is easier to understand, and can be extended if necessary:
def splitfields(data):
    """Generator that parses the data correctly into fields"""
    for line in data:
        fields = line.rsplit(maxsplit=4)
        fields[0] = fields[0].strip()   # trim line-initial spaces
        yield fields

with open("myfile.dat") as data:
    frame = pd.DataFrame(splitfields(data))
Both versions avoid building a large intermediate list in your own code only to hand it over to the DataFrame constructor. Each line is parsed as it is read from the file and passed straight on to the constructor (which may still collect the rows internally before building the frame).
The above is for the format in the updated question, which has the free text on the left. (For the original format, use line.split instead of line.rsplit and strip the last field, not the first.)
I love this phone as it is easy to used isjiad123 TYh23hs9 13160032 5
Suck restaurant hjf2329ccc TGjsk123 14423321 3
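For the original layout (free text last), the variant mentioned in the parenthesis might look roughly like this. The generator name split_leading_fields and its n_fixed parameter are invented for this sketch, and the sample lines are illustrative:

```python
def split_leading_fields(lines, n_fixed=4):
    """Yield [user, item, time, rating, review] for lines with the free text last."""
    for line in lines:
        # Split off the first n_fixed fields; the rest stays together as one string
        fields = line.split(maxsplit=n_fixed)
        if not fields:
            continue  # skip blank lines
        fields[-1] = fields[-1].strip()  # trim the trailing newline and spaces
        yield fields


sample = [
    " disjiad123 TYh23hs9 13160032 5 I love this phone as it is easy to use\n",
    "hjf2329ccc TGjsk123 14423321 3 Suck restaurant\n",
]
parsed = list(split_leading_fields(sample))
print(parsed)
```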
There's more you could do depending on what the data actually looks like: If the fields are separated by exactly four spaces (as it seems from your example), you could split on "    " (four spaces) instead of splitting on all whitespace. That would also work correctly if some other fields can contain spaces. In general, pre-parsing like this is flexible and extensible; I leave the code simple since there's no evidence from your question that more is needed.
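As a sketch of the four-space idea, assuming the separator really is exactly four spaces (the helper name and the sample line, including a user name containing a single space, are made up):

```python
def split_on_four_spaces(line, sep='    '):
    """Split from the right on a four-space separator into at most five fields."""
    return [field.strip() for field in line.rstrip('\r\n').rsplit(sep, 4)]


# A single space inside a field no longer breaks the split
sample = "Suck restaurant    John Smith    TGjsk123    14423321    3\n"
print(split_on_four_spaces(sample))
```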