Why Is Pandas Read_csv Not Reading The Right Number Of Rows?
I'm trying to open part of a CSV file using pandas read_csv. The section I am opening has a header on line 746 and runs to line 1120.
gr = read_csv(inputfile, header=746, nrows=374)
Solution 1:
Unless I'm reading the docs wrong, this looks like a bug in read_csv
(I recommend filing an issue on GitHub!).
A workaround, since your data is smallish (read in the lines as a string):
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

with open(inputfile) as f:
    df = pd.read_csv(StringIO(''.join(f.readlines()[:1120])), header=746, nrows=374)
I tested this with the CSV you provided and it works (it doesn't raise)!
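If the file were larger, the same idea could be done without loading every line first. The following is only a sketch using the standard library; the helper name first_lines_buffer is made up for illustration and is not part of pandas.

```python
from io import StringIO
from itertools import islice


def first_lines_buffer(path, n_lines):
    """Return a StringIO holding only the first n_lines lines of the file.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    # newline="" keeps the original line endings untouched
    with open(path, newline="") as f:
        # islice stops after n_lines lines, so the rest of the file is never read
        return StringIO("".join(islice(f, n_lines)))


# Possible use with the call from the answer above (not executed here):
#   df = pd.read_csv(first_lines_buffer(inputfile, 1120), header=746, nrows=374)
```

The returned buffer behaves like an open file, so it should be usable anywhere read_csv accepts a file-like object.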
Solution 2:
I reckon this is an off-by-one counting (user) error! That is, pd.read_csv(inputfile, header=746, nrows=374)
reads up to the 1121st line (1-indexed), so you should read one fewer row. I could be mistaken, but here's what I'm thinking...
In Python, line indexing (as with most Python indexing) starts at 0.
In [11]: s = 'a,b\nA,B\n1,2\n3,4\n1,2,3,4'
In [12]: for i, line in enumerate(s.splitlines()): print(i, line)
0 a,b
1 A,B
2 1,2
3 3,4
4 1,2,3,4
The usual way you think of line numbers is from 1:
In [12]: for i, line in enumerate(s.splitlines(), start=1): print(i, line)
1 a,b
2 A,B
3 1,2
4 3,4
5 1,2,3,4
In the following we are reading up to the 3rd row (with Python indexing) or the 4th (with 1-indexing):
In [13]: pd.read_csv(StringIO(s), header=1, nrows=2) # Note: header + nrows == 3
Out[13]:
   A  B
0  1  2
1  3  4
And if we include the next line it'll raise:
In [15]: pd.read_csv(StringIO(s), header=1, nrows=3)
CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 4
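The counting argument above can be written down as a small conversion from 1-indexed line numbers (as a text editor shows them) to the header and nrows values. This is a sketch only: header_and_nrows is a hypothetical helper, and it assumes no blank or skipped lines above the section being read.

```python
def header_and_nrows(header_line, last_line):
    """Convert 1-indexed line numbers into (header, nrows) for read_csv.

    header_line: 1-indexed line holding the column names.
    last_line:   1-indexed line of the last data row to read.
    Hypothetical helper; assumes no blank or skipped lines before last_line.
    """
    if last_line < header_line:
        raise ValueError("last_line must not come before header_line")
    header = header_line - 1         # read_csv counts the header row from 0
    nrows = last_line - header_line  # data rows strictly after the header
    return header, nrows


# Toy data above: header on line 2, last 2-column row on line 4
print(header_and_nrows(2, 4))       # (1, 2)

# If header=746 is right (header on the 747th line) and the data ends on line 1120
print(header_and_nrows(747, 1120))  # (746, 373)
```

Under that reading, nrows comes out as 373 rather than 374, which is the "one fewer row" mentioned earlier.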