
Tokenise Text And Create More Rows For Each Row In Dataframe

I want to do this with Python and pandas. Suppose I have the following:

   file_id  text
   1        I am the first document. I am a nice document.
   2        I am the second document. I am an even nicer document.

and I want to split each text into sentences, producing one row per sentence while keeping its file_id.
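For reference, the sample frame can be built like this (reconstructed from the data in the question and in Solution 2):

import pandas as pd

df = pd.DataFrame({
    'file_id': [1, 2],
    'text': ["I am the first document. I am a nice document.",
             "I am the second document. I am an even nicer document."],
})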

Solution 1:

Use:

s = (df.pop('text')                        # extract and drop the text column
       .str.rstrip('.')                    # remove the trailing period
       .str.split(r'\.\s+', expand=True)   # split on '.' plus whitespace (regex)
       .stack()                            # reshape sentence columns into rows
       .rename('text')
       .reset_index(level=1, drop=True))   # drop the helper level added by stack

df = df.join(s).reset_index(drop=True)
print(df)
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

Explanation:

First use DataFrame.pop to extract the column, remove the trailing . with Series.str.rstrip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape with DataFrame.stack into a Series, drop the helper index level with reset_index, and rename the Series so that DataFrame.join can attach it back to the original DataFrame.
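To see what each stage produces, here is the same chain unrolled step by step (a minimal sketch using the question's sample data; the intermediate names are illustrative):

import pandas as pd

df = pd.DataFrame({
    'file_id': [1, 2],
    'text': ["I am the first document. I am a nice document.",
             "I am the second document. I am an even nicer document."],
})

text = df.pop('text')                         # Series of full texts; df keeps file_id
text = text.str.rstrip('.')                   # drop the final '.' so the split leaves no empty piece
wide = text.str.split(r'\.\s+', expand=True)  # one sentence per column:
#                           0                            1
# 0   I am the first document        I am a nice document
# 1  I am the second document  I am an even nicer document
s = wide.stack()                              # MultiIndex (row, column) -> one sentence per row
s = s.rename('text').reset_index(level=1, drop=True)  # keep only the original row index
out = df.join(s).reset_index(drop=True)       # file_id is repeated for every sentence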

Solution 2:

import pandas as pd

# Note: this answer names the id column field_id rather than the question's file_id.
df = pd.DataFrame({'field_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

# Split each text on '.' and keep only the non-empty sentences.
df['sents'] = df.text.apply(lambda txt: [x for x in txt.split(".") if len(x) > 1])

# One column per sentence, then stack into one row per sentence.
df = (df.set_index('field_id')
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['field_id', 'text']
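On pandas 0.25 or newer, the same reshape can be written more directly with Series.str.split plus DataFrame.explode; this is a sketch, not part of either answer above:

import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

out = (df.assign(text=df['text'].str.rstrip('.').str.split(r'\.\s+'))
         .explode('text')            # one row per list element, repeating file_id
         .reset_index(drop=True))
print(out)

explode keeps the other columns as-is and avoids the intermediate wide frame, which is why it is usually the more direct choice on newer pandas versions.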
