Need To Scrape A Table Which Is Loaded Through Ajax Using Python (Selenium)
Solution 1:
You can get the data using requests and bs4. With almost all (if not all) ASP.NET sites, there are a few POST parameters that always need to be provided, such as __EVENTTARGET, __EVENTVALIDATION, etc.:
from bs4 import BeautifulSoup
import requests

data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
        "__EVENTARGUMENT": "LISTINGS;0",
        "ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
        "ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
        "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
        "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
        "__ASYNCPOST": "true"}
And for the actual post, we need to add a few more values to our post data:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
soup = BeautifulSoup(s.get(post).content)
data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
r = s.post(post, data=data)
soup2 = BeautifulSoup(r.content)
table = soup2.select_one("div.GridListings")
print(table)
You will see the table printed when you run the code.
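If you would rather have the rows as structured data than raw HTML, here is a minimal sketch that walks the rows of that element. It assumes the listings are rendered as an ordinary <table> with <tr>/<td> cells inside div.GridListings; adjust the selectors if the real markup differs:

# minimal sketch, assuming `table` from above holds the div.GridListings markup
# and that it contains a regular <table> with <tr>/<td> rows (an assumption)
rows = []
if table is not None:
    for tr in table.select("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)

for row in rows:
    print(row)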
Solution 2:
If you want to scrape something, it helps to first install a web debugger (Firebug for Mozilla Firefox, for example) to watch how the website you want to scrape works.
Next, you need to reproduce the requests the website makes to its back office.
As you said, the content you want to scrape is loaded asynchronously (only when the document is ready).
Assuming the debugger is running and you have refreshed the page, you will see the following request on the network tab:
POST https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx
The final process flow to reach your goal will be:
- 1/ Use the requests Python module
- 2/ Open a requests session to the site's index page (with cookie handling)
- 3/ Scrape all the inputs of the specific POST form
- 4/ Build a POST parameter dict containing all input name/value fields scraped in the previous step, plus a few specific fixed params
- 5/ POST the request (with the required data)
- 6/ Finally, use the BS4 module (as usual) to soup the returned HTML and scrape your data
Please see below a working example:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup
import requests

base_url = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"

# create requests session
s = requests.session()

# get index page
r = s.get(base_url)

# soup page
bs = BeautifulSoup(r.text, "html.parser")

# extract FORM html
form_soup = bs.find('form', {'name': 'aspnetForm'})

# extract all inputs
input_div = form_soup.findAll("input")

# build the data parameters for the POST request
# we add some required <fixed> data parameters for the post
data = {
    '__EVENTARGUMENT': 'LISTINGS;0',
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV',
    '__EVENTVALIDATION': '/wEWGwKis6fzCQLDnJnSDwLq4+CbDwK9jryHBQLrmcucCgL56enHAwLRrPHhCgKDk6P+CwL1/aWtDQLm0q+gCALRvI2QDAKch7HjBAKWqJHWBAKil5XsDQK58IbPAwLO3dKwCwL6uJOtBgLYnd3qBgKyp7zmBAKQyTBQK9qYAXAoieq54JAuG/rDkC1djKyQMC1qnUtgoC0OjaygUCv4b7sAhfkEODRvsa3noPfz2kMsxhAwlX3Q=='
}

# we add some <dynamic> data parameters
for input_d in input_div:
    try:
        data[input_d['name']] = input_d['value']
    except KeyError:
        pass  # skip input fields without a name or value attribute

# post request
r2 = s.post(base_url, data=data)

# write the result
with open("post_result.html", "w", encoding="utf-8") as f:
    f.write(r2.text)
Now, please take a look at the content of "post_result.html" and you will find the data!
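As a follow-up, here is a minimal sketch of pulling the table back out of the saved file. It assumes, as in Solution 1, that the listings sit inside a div with class GridListings; that class name is taken from the other answer, not verified here:

# minimal sketch, assuming (as in Solution 1) the listings live in div.GridListings
from bs4 import BeautifulSoup

with open("post_result.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

listings = soup.select_one("div.GridListings")
if listings is not None:
    for tr in listings.select("tr"):
        print([cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])])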
Regards