Extracting @mentions From Tweets Using Findall Python (giving Incorrect Results)

I have a CSV file that looks something like this: text RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://

Solution 1:

You can use the str.findall method to avoid the for loop. Use a negative lookbehind in place of (^|[^@\w]), which forms an extra capture group that your regex does not need:

df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
df
#                                                 text                  mention
# 0  RT @CritCareMed: New Article: Male-Predominant...              CritCareMed
# 1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
# 2  RT @gvwilson: Where's the theory for software ...                 gvwilson
# 3  RT @sciencemagazine: What’s killing off the se...          sciencemagazine
# 4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous
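To see why the extra group matters, here is a minimal sketch that uses only the re module. The two-group pattern is an assumed reconstruction of the question's regex (built from the (^|[^@\w]) fragment quoted above), and the sample tweet is made up for illustration.

```python
import re

# Assumed reconstruction of the question's pattern: two capture groups.
two_groups = re.compile(r'(^|[^@\w])@(\w{1,25})')
# Pattern from this answer: a negative lookbehind plus one capture group.
lookbehind = re.compile(r'(?<![@\w])@(\w{1,25})')

# Made-up sample text with two mentions and one e-mail address.
tweet = "RT @CritCareMed: New Article, mail me@example.com cc @gvwilson"

# With two groups, findall returns one tuple per match (prefix, name).
pairs = two_groups.findall(tweet)
# With a single group, findall returns just the captured names.
names = lookbehind.findall(tweet)

print(pairs)  # [(' ', 'CritCareMed'), (' ', 'gvwilson')]
print(names)  # ['CritCareMed', 'gvwilson']
```

Joining the tuples from the first pattern is what produces unexpected output. The lookbehind version checks the preceding character without capturing it, so the e-mail address is still skipped.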

Also, X.iloc[:i,:] returns a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0]. A better way to iterate through a column is items (called iteritems before pandas 2.0, which removed the old name):

import re
for index, s in df.text.items():
    result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
    print(','.join(result))

# CritCareMed
# CellCellPress
# gvwilson
# sciencemagazine
# MHendr1cks,nucAmbiguous
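The difference between a sliced frame and a cell value can be checked directly. This is a small sketch, assuming a hypothetical one-row frame named X like the one in the question; the @late_mention handle is invented for illustration.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the question's X.
X = pd.DataFrame({
    "text": ["RT @CritCareMed: New Article: Male-Predominant Plasma "
             "Transfusion Strategy, via @late_mention"]
})

sliced = X.iloc[:1, :]    # still a DataFrame, not a string
frame_repr = str(sliced)  # printable table: header, index, and a display form of the cell
cell = X.text.iloc[0]     # the actual Python string stored in the cell

print(type(sliced).__name__)
print(frame_repr)
print(cell)
```

Because the printed table may shorten long cells for display, running a regex over frame_repr can miss mentions that sit near the end of a tweet, whereas cell always holds the full text.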

Solution 2:

While you already have your answer, you could also try to optimize the whole import process like so:

import re, pandas as pd

rx = re.compile(r'@([^:\s]+)')

with open("test.txt") as fp:
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())

    df = pd.DataFrame(dft, columns = ['text', 'mention'])
    print(df)


Which yields:
                                                text                  mention
0  RT @CritCareMed: New Article: Male-Predominant...              CritCareMed
1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
2  RT @gvwilson: Where's the theory for software ...                 gvwilson
3  RT @sciencemagazine: What’s killing off the se...          sciencemagazine
4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous

This might be a bit faster as you don't need to change the df once it's already constructed.
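Note that @([^:\s]+) and the lookbehind pattern from Solution 1 do not agree on every input. The sketch below assumes made-up sample lines and uses an in-memory io.StringIO in place of test.txt to show where they differ.

```python
import io
import re

rx_loose = re.compile(r'@([^:\s]+)')              # pattern from Solution 2
rx_strict = re.compile(r'(?<![@\w])@(\w{1,25})')  # pattern from Solution 1

# Made-up lines standing in for the contents of test.txt.
fp = io.StringIO(
    "RT @gvwilson: Where's the theory?\n"
    "Thanks @MHendr1cks, write to me@example.com\n"
)

# One row per line: the text, then the matches from each pattern.
rows = [
    (line.rstrip("\n"), rx_loose.findall(line), rx_strict.findall(line))
    for line in fp
]

for text, loose, strict in rows:
    print(text, loose, strict)
```

On the second line the looser pattern keeps the trailing comma ('MHendr1cks,') and also picks up 'example.com' from the e-mail address, while the stricter one returns only 'MHendr1cks'. Which behavior is preferable depends on how clean the input is.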

Solution 3:

mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')

This is the same approach as in "Extract hashtags from columns of a pandas dataframe", but for mentions.

  • @.*? carries out a non-greedy match for a word starting with @
  • (?=\s|$) look-ahead for the end of the word or end of the sentence
  • (?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a @ is used in the middle of a word

The regex lookbehind asserts that either a space or the start of the sentence must precede a @ character.
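A quick check of this pattern with plain re, as a sketch on a made-up tweet. The cleanup step at the end is an assumption added for illustration, not part of the original answer.

```python
import re

pattern = re.compile(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')

# Made-up sample text with two mentions and one e-mail address.
tweet = "RT @CritCareMed: New Article, mail me@example.com cc @gvwilson"

# The pattern has no capture group, so findall returns the whole match,
# including the leading @ and any punctuation before the next whitespace.
raw = pattern.findall(tweet)

# Hypothetical cleanup: drop the @ and common trailing punctuation.
cleaned = [m[1:].rstrip(':,.') for m in raw]

print(raw)      # ['@CritCareMed:', '@gvwilson']
print(cleaned)  # ['CritCareMed', 'gvwilson']
```

The e-mail address is skipped because its @ is preceded by a letter rather than whitespace, but the match runs up to the next whitespace, so trailing punctuation such as the colon has to be removed afterwards.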
