Split Text In Cells And Create Additional Rows For The Tokens
Let's suppose that I have the following in a DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing ten
Solution 1:
You can use something like:
def divide_chunks(l, n):
    # looping till length l
    for i in range(0, len(l), n):
        yield l[i:i + n]
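As a quick sanity check, here is what the generator yields for a single sentence. This is only a sketch; the variable names words and chunks are made up for illustration, and the sentence is taken from the output shown further down.

```python
def divide_chunks(l, n):
    # yield successive slices of length n (the last one may be shorter)
    for i in range(0, len(l), n):
        yield l[i:i + n]

words = "I am the first document and I am very happy.".split()
# join each 3-word slice back into a string
chunks = [' '.join(c) for c in divide_chunks(words, 3)]
print(chunks)
# ['I am the', 'first document and', 'I am very', 'happy.']
```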
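Then, using an unnesting helper (a user-defined function that expands list-valued columns into rows; it is not part of pandas):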
df['text_new']=df.text.apply(lambda x: list(divide_chunks(x.split(),3)))
df_new=unnesting(df,['text_new']).drop(columns='text')
df_new.text_new=df_new.text_new.apply(' '.join)
print(df_new)
text_new id
0 I am the 1
0 first document and 1
0 I am very 1
0 happy. 1
1 Here is the 2
1 second document and 2
1 it likes playing 2
1 tennis. 2
2 This is the 3
2 third document and 3
2 it looks very 3
2 good today. 3
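Note that unnesting is not defined in this answer. If you are on pandas 0.25 or later, the built-in DataFrame.explode should be able to do the same job. The following is a minimal sketch under that assumption, not the original helper; the sample frame is rebuilt from the output printed above, so treat it as illustrative.

```python
import pandas as pd

def divide_chunks(l, n):
    # yield successive slices of length n
    for i in range(0, len(l), n):
        yield l[i:i + n]

# sample data reconstructed from the printed output (assumption)
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': [
        'I am the first document and I am very happy.',
        'Here is the second document and it likes playing tennis.',
        'This is the third document and it looks very good today.',
    ],
})

df_new = (
    # one list of 3-word strings per row
    df.assign(text_new=df['text'].apply(
        lambda s: [' '.join(c) for c in divide_chunks(s.split(), 3)]))
      # one row per list element; the original index is repeated
      .explode('text_new')
      .drop(columns='text')
)
print(df_new)
```

The index is repeated for each source row, as in the output above; add .reset_index(drop=True) if you want a fresh one.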
EDIT:
m=(pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(),3))).values.tolist())
   .unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns=df.columns
print(m)
id text
0 0 I am the
1 0 first document and
2 0 I am very
3 0 happy.
0 1 Here is the
1 1 second document and
2 1 it likes playing
3 1 tennis.
0 2 This is the
1 2 third document and
2 2 it looks very
3 2 good today.
Solution 2:
A self-contained solution, possibly a little slower:
# Split every n words
n = 3
# in case id is not the index yet
df.set_index('id', inplace=True)
new_df = df.text.str.split(' ', expand=True).stack().reset_index()
new_df = (new_df.groupby(['id', new_df.level_1//n])[0]
          .apply(lambda x: ' '.join(x))
          .reset_index(level=1, drop=True)
         )
Here new_df is a Series:
id
1 I am the
1 first document and
1 I am very
1 happy.
2 Here is the
2 second document and
2 it likes playing
2 tennis.
3 This is the
3 third document and
3 it looks very
3 good today.
Name: 0, dtype: object
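To see why grouping on level_1//n works: after stack().reset_index(), level_1 holds each word's position within its row, so integer division by n gives a chunk number. The sketch below makes that key explicit and also turns the result back into a DataFrame. The names tokens, chunk and result are made up for illustration, and the two sample rows are taken from the output above.

```python
import pandas as pd

n = 3
df = pd.DataFrame({
    'id': [1, 2],
    'text': [
        'I am the first document and I am very happy.',
        'Here is the second document and it likes playing tennis.',
    ],
}).set_index('id')

# one row per word; level_1 is the word's position within its document
tokens = df.text.str.split(' ', expand=True).stack().reset_index()
# words 0-2 go to chunk 0, words 3-5 to chunk 1, and so on
tokens['chunk'] = tokens['level_1'] // n

result = (tokens.groupby(['id', 'chunk'])[0]
          .apply(' '.join)
          .reset_index(level=1, drop=True)
          .rename('text')
          .reset_index())
print(result)
```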