Accessing Hidden Tabs, Web Scraping With Python 3.6
Solution 1:
From your other post I'm guessing the URL is https://www.sciencedirect.com/journal/construction-and-building-materials/issues
The page loads JSON from another URL when you click the link, so you can request that JSON yourself without clicking anything. All you need is the journal's ISSN, which never changes (09500618), and the year, which you can pass in from a range. This also returns data for the tabs that are already expanded.
import json

import requests

# The site rejects requests from user agents it has blacklisted, so send a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'
}

# 09500618 is the journal's ISSN; range(1999, 2019) covers the years 1999 through 2018
for i in range(1999, 2019):
    url = "https://www.sciencedirect.com/journal/09500618/year/" + str(i) + "/issues"
    r = requests.get(url, headers=headers)
    j = r.json()
    for d in j['data']:
        # Print the whole JSON object
        print(json.dumps(d, indent=4, sort_keys=True))
        # Or print specific values
        print(d['coverDateText'], d['volumeFirst'], d['uriLookup'], d['srctitle'])
Outputs:
{
    "cid": "271475",
    "contentFamily": "serial",
    "contentType": "JL",
    "coverDateStart": "19991201",
    "coverDateText": "1 December 1999",
    "hubStage": "H300",
    "issn": "09500618",
    "issueFirst": "8",
    "pages": [
        {
            "firstPage": "417",
            "lastPage": "470"
        }
    ],
    "pii": "S0950061800X00323",
    "sortField": "1999001300008zzzzzzz",
    "srctitle": "Construction and Building Materials",
    "uriLookup": "/vol/13/issue/8",
    "volIssueSupplementText": "Volume 13, Issue 8",
    "volumeFirst": "13"
}
1 December 1999 13 /vol/13/issue/8 Construction and Building Materials
...
Solution 2:
BeautifulSoup is used to parse HTML/XML content. You can't click around on a webpage with it.
I recommend you look through the page source first to check whether the click merely moves or reveals content that is already in the HTML. If the content is loaded through AJAX when the button is clicked, then you will have to use something like selenium to trigger the click.
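A minimal sketch of that first check, assuming the hidden tab's content is already present in the HTML. The markup, the ids tab-1999 and tab-2000, and the helper tab_links are made up for illustration and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one collapsed tab whose links are already in the page,
# and one empty tab whose content would have to be fetched separately.
html = """
<div id="tab-1999" class="tab" style="display: none">
  <a href="/vol/13/issue/8">Volume 13, Issue 8</a>
</div>
<div id="tab-2000" class="tab"></div>
"""

soup = BeautifulSoup(html, "html.parser")


def tab_links(soup, tab_id):
    """Return the hrefs found inside the element with the given id."""
    tab = soup.find(id=tab_id)
    if tab is None:
        return []
    return [a["href"] for a in tab.find_all("a", href=True)]


print(tab_links(soup, "tab-1999"))
print(tab_links(soup, "tab-2000"))
```

If a tab comes back empty like the second one here, the content is probably fetched on click, and parsing the initial HTML alone will not be enough.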
An easier option could be to check which URL the content is fetched from when you click the button and make a similar request from your script, if possible.
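As a rough sketch of that approach, the URL pattern and field names below are the ones shown in Solution 1. The helper names issues_url and summarize are hypothetical, and the sample payload is trimmed from the output above so the example runs without a network call:

```python
import json

# URL pattern as used in Solution 1; issn and year are filled in per request
BASE = "https://www.sciencedirect.com/journal/{issn}/year/{year}/issues"


def issues_url(issn, year):
    """Build the per-year JSON URL for a journal."""
    return BASE.format(issn=issn, year=year)


def summarize(payload):
    """Pull a few fields out of one decoded JSON response."""
    return [
        (d["coverDateText"], d["volumeFirst"], d["uriLookup"])
        for d in payload.get("data", [])
    ]


# Stand-in for requests.get(url, headers=headers).json()
sample = json.loads(
    '{"data": [{"coverDateText": "1 December 1999",'
    ' "volumeFirst": "13", "uriLookup": "/vol/13/issue/8"}]}'
)

print(issues_url("09500618", 1999))
print(summarize(sample))
```

Keeping the URL construction and the field extraction in separate functions makes it easy to check the parsing against a saved response before hitting the site in a loop.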