
Pandas read_csv Not Obeying a Regex Sep

Data:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 0...

Solution 1:

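Let's look at this SO post, which covers the same problem.

Use the regular expression from that post, r',(?=\S)'. It matches a comma only when the next character is not whitespace, so a comma followed by a space inside a text field (as in "at home, at work, other") is not treated as a separator.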

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key,last
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00,ynot
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00,okkk
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00,abc
376918925,M,A78,Which ONE (select only one),A78E,Milk,2004-02-02 00:00:00,launch Wed., '''

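# A regex separator is only supported by the Python parsing engine;
# naming it explicitly avoids the ParserWarning about the fallback.
df = pd.read_csv(StringIO(s), sep=r',(?=\S)', engine='python')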

Output:

              ID                                 Level   QID      Text  \
375280046 S  D3M               Which is your favorite?  D5M0  option 1   
          S  D3M  How often? (at home, at work, other)  D3M0      Work   
          M  A78             Do you prefer a, b, or c?  A78C         a   
376918925 M  A78           Which ONE (select only one)  A78E      Milk   

                ResponseID  responseText  date_key          last  
375280046 S  2012-08-080000          ynot  
          S  2010-03-310000          okkk  
          M  2010-03-310000           abc  
376918925 M  2004-02-020000  launch Wed.,  
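
To see what the lookahead does on its own, here is a minimal sketch that applies the same pattern to one row of the sample data with the standard-library re module instead of pandas. The names SEP, row and fields are only illustrative.

```python
import re

# A comma followed by a non-whitespace character is a field boundary.
# A comma followed by a space (ordinary prose) is left alone.
SEP = re.compile(r',(?=\S)')

row = ("375280046,S,D3M,How often? (at home, at work, other),"
       "D3M0,Work,2010-03-31 00:00:00,okkk")

fields = SEP.split(row)

print(len(fields))   # number of fields found in this row
print(fields[3])     # the question text, with its inner commas intact
```

The row splits into eight fields, matching the eight header names, and the question text keeps its commas because each of them is followed by a space.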

Solution 2:

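read_csv appears to strip trailing whitespace from each line before applying the separator. In the last row, "launch Wed., " then ends in a bare comma, and a negative lookahead for whitespace no longer rejects it, so the row gains an extra empty field. This can be worked around by extending the lookahead so that a comma at the very end of the input is not treated as a separator either: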

pd.read_csv(StringIO(s), sep=r',(?!\s|\Z)', engine='python')
Out[347]: 
          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   
3  376918925     M  A78           Which ONE (select only one)       A78E   

  responseText             date_key          last  
0     option 1  2012-08-08 00:00:00          ynot  
1         Work  2010-03-31 00:00:00          okkk  
2            a  2010-03-31 00:00:00           abc  
3         Milk  2004-02-02 00:00:00  launch Wed.,  
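
The effect of the stripping can be reproduced without pandas. The sketch below assumes, as this answer suggests, that each line is stripped before it is split, and it assumes the pattern being fixed was r',(?!\s)' (the question's exact pattern is not shown here). The variable names are illustrative.

```python
import re

last_row = ("376918925,M,A78,Which ONE (select only one),A78E,Milk,"
            "2004-02-02 00:00:00,launch Wed., ")

# Assumption: the parser strips the line before applying the separator.
stripped = last_row.strip()

# Assumed original pattern: comma NOT followed by whitespace.
# At the end of the stripped line nothing follows the comma, so it matches.
naive = re.split(r',(?!\s)', stripped)

# Pattern from this answer: also refuse a comma at the end of the string.
guarded = re.split(r',(?!\s|\Z)', stripped)

print(len(naive), repr(naive[-1]))      # one extra, empty trailing field
print(len(guarded), repr(guarded[-1]))  # trailing comma stays in the text
```

With the assumed original pattern the stripped row yields nine fields, the last one empty. With the \Z alternative it yields eight, and "launch Wed.," keeps its comma.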
