
How To Use np.genfromtxt And Fill In Missing Columns?

I am trying to use np.genfromtxt to load data that looks something like this into a matrix: 0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -

Solution 1:

Pandas has more robust readers and you can use the DataFrame methods to handle the missing values.

You'll have to figure out how many columns to use first:

with open('data.txt') as f:
    columns = max(len(line.split()) for line in f)

To read the file:

import pandas
# sep=r'\s+' splits on runs of whitespace (it replaces the deprecated delim_whitespace=True)
df = pandas.read_table('data.txt',
                       sep=r'\s+',
                       header=None,
                       usecols=range(columns),
                       engine='python')

To convert to a numpy array:

import numpy
a = numpy.array(df)

This will fill in NaNs in the blank positions. You can use .fillna() to get other values for blanks.

filled = numpy.array(df.fillna(999))
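
As a small self-contained check of the steps above, here is a sketch that runs them on an in-memory sample instead of data.txt. The sample text and the variable names are made up for illustration. Note one deliberate swap: it passes names=list(range(columns)) rather than usecols, on the assumption that an explicit list of column names is a dependable way to get pandas to pad the short rows.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: whitespace-separated rows of unequal length.
sample = "1 2 3\n4 5\n6 7 8 9\n"

# The widest row decides how many columns the table needs.
columns = max(len(line.split()) for line in sample.splitlines())

# Naming all columns up front lets pandas pad the short rows with NaN.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None,
                 names=list(range(columns)))

a = df.to_numpy(dtype=float)                    # blanks stay NaN
filled = df.fillna(999).to_numpy(dtype=float)   # blanks replaced by 999
```

For a real file, replace io.StringIO(sample) with the path and compute columns from the file's lines in the same way.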

Solution 2:

Set the filling_values argument to np.nan (which is a float, so you won't have the string conversion issue), and specify the comma delimiter explicitly, since by default genfromtxt splits only on whitespace:

import numpy as np

trainData = np.genfromtxt('data.txt', usecols=range(0, 5), invalid_raise=False,
                          missing_values="", filling_values=np.nan, delimiter=',')
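
To see what the explicit delimiter buys you, here is a minimal sketch on a made-up comma-separated sample (the sample string is an assumption, not the asker's file). Every row has the same number of commas, so genfromtxt can tell which fields are empty and fill them with NaN.

```python
import io

import numpy as np

# Hypothetical comma-separated sample with two empty fields.
sample = "1,2,3\n4,,6\n7,8,\n"

# With an explicit delimiter, empty fields are recognised and filled.
arr = np.genfromtxt(io.StringIO(sample), delimiter=",", filling_values=np.nan)
```

This only helps when each row carries the same number of delimiters. Rows with a different number of fields are a separate case: genfromtxt raises an error for them, or skips them with a warning when invalid_raise=False.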

Solution 3:

I managed to figure out a solution.

import numpy as np
import pandas
df = pandas.DataFrame([line.strip().split() for line in open('data.txt', 'r')])
data = np.array(df)  # short rows are padded with missing values; entries stay strings
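
The same split-every-line idea can be done without pandas if you want a float array directly. This is only a sketch: load_ragged is a hypothetical helper name, and the sample rows are invented for the example.

```python
import numpy as np


def load_ragged(lines, fill=np.nan):
    """Hypothetical helper: parse whitespace-separated rows of unequal length."""
    # Split each non-blank line and convert its tokens to float.
    rows = [[float(tok) for tok in line.split()] for line in lines if line.strip()]
    width = max(len(row) for row in rows)
    # Start from an array full of the fill value, then copy each row in.
    out = np.full((len(rows), width), fill, dtype=float)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out


sample = ["0.79 0.10 0.91", "-0.17 0.10", "0.33 -0.90 0.10 -0.19"]
data = load_ragged(sample)
```

Any iterable of lines works, so an open file object can be passed in place of sample.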

Solution 4:

Using a copy-and-paste of the three long lines from the question (held in txt as bytes), this pandas reader works:

In [149]: pd.read_csv(BytesIO(txt), delim_whitespace=True, header=None,
     ...:             error_bad_lines=False, names=list(range(91)))
Out[149]: 
     0    1     2     3    4     5    6    7     8    9   ...     81   82  \
0  0.79  0.1  0.91 -0.17  0.1  0.33 -0.9  0.1 -0.19 -0.0  ...    515  163   
1  0.79  0.1  0.91 -0.17  0.1  0.33 -0.9  0.1 -0.19 -0.0  ...    515  163   
2  0.79  0.1  0.91 -0.17  0.1  0.33 -0.9  0.1 -0.19 -0.0  ...    125   30   

    83     84     85    86     87     88     89     90  
0  535    NaN    NaN   NaN    NaN    NaN    NaN    NaN  
1  509  112.0  535.0   NaN    NaN    NaN    NaN    NaN  
2  412  422.0  556.0  55.0  355.0  485.0  112.0  515.0  

Use _.values to get the array (in IPython, _ holds the previous output).

The key is specifying a names list at least as long as the widest row. Pandas fills incomplete lines with NaN, while genfromtxt can only fill missing fields when explicit delimiters mark them.
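
If you would rather stay with genfromtxt, one possible workaround is to add the explicit delimiters yourself before parsing. The sketch below is an assumption-laden illustration, not part of the answer above: the sample text is invented, and it rewrites each whitespace-separated row as a comma-separated one padded with empty fields.

```python
import numpy as np

# Hypothetical sample: whitespace-separated rows of unequal length.
sample = "1 2 3\n4 5\n6 7 8 9\n"

rows = [line.split() for line in sample.splitlines() if line.strip()]
width = max(len(row) for row in rows)

# Re-join with commas and pad short rows with empty fields,
# so every line has the same number of delimiters.
padded = [",".join(row + [""] * (width - len(row))) for row in rows]

# genfromtxt accepts a list of strings; empty fields become NaN.
arr = np.genfromtxt(padded, delimiter=",")
```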

