How To Make My Session.get() Link Into Variable?
My goal is to scrape multiple profile links and then scrape specific data on each of these profiles. Here is my code to get multiple profile links (it should work fine): from bs4 i
Solution 1:
This is fairly straightforward. Instead of printing the profile links, store them in a list variable. Then loop through the list to scrape each link and write the results to the CSV file. Some pages do not have all the details, so you have to handle those cases as well. In the code below I have marked missing values as 'NA', following the convention used in your code. One other note for the future: consider using Python's built-in csv module for reading and writing CSV files.
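As a rough illustration of the csv module suggestion, the sketch below writes one row per profile with `csv.DictWriter`. The function name `write_rows` and the `FIELDS` list are assumptions made for this example (the column names are copied from the header line in the script); this is not part of the original code.

```python
import csv

# Column names copied from the header line used in the script.
FIELDS = ["date_joined", "points", "videos", "questions", "votes", "answers",
          "flags", "project_request", "project_replies", "comments",
          "tips_thx", "last_date"]


def write_rows(path, rows):
    """Write a list of dicts to a CSV file, using 'NA' for missing fields."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        # restval is the value written for any field missing from a row.
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="NA")
        writer.writeheader()
        writer.writerows(rows)
```

With this approach, values that contain commas (such as a points total with thousands separators) are quoted automatically, so removing the commas by hand should not be necessary.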
Merged Script
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')

# Collect the profile links in a list instead of printing them.
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list = []
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    profile_list.append(final_profile_link)

filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)

# Scrape each profile in turn and write one CSV row per profile.
for link in profile_list:
    print("Scraping ", link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup = BeautifulSoup(r.html.html, 'html.parser')

    # Statistics table: date joined, points, videos.
    user_info_table = soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates = points = videos = 'NA'

    # Discussion statistics: map each category label to its number.
    user_socio_table = soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previous_sibling.strip()
        data[category_text] = number

    # Fill in 'NA' for any category missing from this profile.
    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value] = 'NA'

    # Last activity date, taken from the streak calendar if present.
    user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span', class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            # find() returned None: no filled cell in the calendar.
            last_activity_date = 'NA'
    else:
        last_activity_date = 'NA'

    f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")

f.close()
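
The slice `text_link[:-10]` in the script always cuts the last 10 characters of the href, whatever they are. A slightly more defensive sketch is shown below. It assumes the hrefs end in the word "discussion" (inferred from the variable name `text_link_nodiscussion`, not confirmed by the source); the helper name `build_profile_url` and the example href in the comment are hypothetical.

```python
BASE_URL = "https://www.khanacademy.org"


def build_profile_url(href, suffix="discussion"):
    """Turn a relative profile href into an absolute profile URL.

    Assumption: hrefs look like '/profile/kaid_.../discussion'.
    """
    # Strip the trailing tab name only when it is actually present,
    # rather than always dropping the last 10 characters.
    if href.endswith(suffix):
        href = href[:-len(suffix)]
    return BASE_URL + href
```

Inside the loop this would replace the slicing and concatenation steps, for example `profile_list.append(build_profile_url(links_no_list['href']))`.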
Sample Output from khanscrapetry1.csv
date_joined,points,videos,questions,votes,answers,flags,project_request,project_replies,comments,tips_thx,last_date
6yearsago,1527829,1123,25,100,2,0,NA,NA,0,0,SaturdayJun42016
6yearsago,1527829,1123,25,100,2,0,NA,NA,0,0,SaturdayJun42016
6yearsago,3164708,1276,164,2793,348,67,16,3,5663,885,WednesdayOct312018
6yearsago,3164708,1276,164,2793,348,67,16,3,5663,885,WednesdayOct312018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,MondayDec242018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,MondayDec242018
5yearsago,240334,56,7,42,6,0,2,NA,12,2,TuesdayNov202018
5yearsago,240334,56,7,42,6,0,2,NA,12,2,TuesdayNov202018
...
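
Each row in the sample output appears twice, which suggests the same profile link was collected more than once. A minimal sketch of one way to drop repeats while keeping the original order, assuming the links are plain strings (the function name `unique_links` is made up for this example):

```python
def unique_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(links))
```

Calling `profile_list = unique_links(profile_list)` before the scraping loop would then visit each profile only once.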