
Accessing Hidden Tabs, Web Scraping With Python 3.6

I'm using bs4 and urllib.request in Python 3.6 to scrape a web page. I have to open tabs (that is, toggle the 'aria-expanded' attribute on the tab buttons) in order to access the div tabs I need. The b

Solution 1:

From your other post I'm guessing the URL is https://www.sciencedirect.com/journal/construction-and-building-materials/issues

The page loads JSON from another URL when you click the link, so you can request that JSON yourself without clicking anything. All you need is the journal's ISSN, which never changes (09500618), and the year, which you can supply from a range. The request even returns data for the tabs that are already expanded.

import json

import requests

# The site rejects requests from user agents it has blacklisted,
# so send a browser-like User-Agent header.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'
}

# range(1999, 2019) covers the years 1999 through 2018.
for year in range(1999, 2019):
    url = "https://www.sciencedirect.com/journal/09500618/year/" + str(year) + "/issues"
    r = requests.get(url, headers=headers)
    j = r.json()

    for d in j['data']:
        # Print the whole JSON object
        print(json.dumps(d, indent=4, sort_keys=True))
        # Or print specific values
        print(d['coverDateText'], d['volumeFirst'], d['uriLookup'], d['srctitle'])

Outputs:

{
    "cid": "271475",
    "contentFamily": "serial",
    "contentType": "JL",
    "coverDateStart": "19991201",
    "coverDateText": "1 December 1999",
    "hubStage": "H300",
    "issn": "09500618",
    "issueFirst": "8",
    "pages": [
        {
            "firstPage": "417",
            "lastPage": "470"
        }
    ],
    "pii": "S0950061800X00323",
    "sortField": "1999001300008zzzzzzz",
    "srctitle": "Construction and Building Materials",
    "uriLookup": "/vol/13/issue/8",
    "volIssueSupplementText": "Volume 13, Issue 8",
    "volumeFirst": "13"
}
1 December 1999 13 /vol/13/issue/8 Construction and Building Materials
...

Solution 2:

BeautifulSoup is used to parse HTML/XML content. You can't click around on a webpage with it.

I recommend you look through the document first to make sure the page isn't just moving content that is already there from one place to another. If the content is loaded through AJAX when the button is clicked, then you will have to use something like Selenium to trigger the click.
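One way to run that first check is to look at the raw HTML and see whether the collapsed panels already contain text. The sketch below is only an illustration: the markup in SAMPLE_HTML is hypothetical, and it assumes each button names its panel through an 'aria-controls' attribute, which the real page may not do. It uses only the standard library so it runs without extra installs.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PanelChecker(HTMLParser):
    """Map each toggle button to its panel and collect the panel's text."""

    def __init__(self):
        super().__init__()
        self.controls = {}    # panel id -> aria-expanded value of its button
        self.panel_text = {}  # element id -> text nested inside that element
        self._open_ids = []   # id (or None) of every currently open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button" and "aria-controls" in attrs:
            self.controls[attrs["aria-controls"]] = attrs.get("aria-expanded")
        if "id" in attrs:
            self.panel_text.setdefault(attrs["id"], "")
        if tag not in VOID_TAGS:
            self._open_ids.append(attrs.get("id"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open_ids:
            self._open_ids.pop()

    def handle_data(self, data):
        # Credit the text to every open element that has an id.
        for element_id in self._open_ids:
            if element_id is not None:
                self.panel_text[element_id] += data


def empty_panels(html):
    """Return ids of panels that a button controls but that hold no text."""
    checker = PanelChecker()
    checker.feed(html)
    return [panel_id for panel_id in checker.controls
            if not checker.panel_text.get(panel_id, "").strip()]


# Hypothetical markup: one expanded tab with content, one collapsed and empty.
SAMPLE_HTML = """
<button aria-expanded="true" aria-controls="y2018">2018</button>
<div id="y2018"><a href="/vol/190">Volume 190</a></div>
<button aria-expanded="false" aria-controls="y2017">2017</button>
<div id="y2017"></div>
"""

print(empty_panels(SAMPLE_HTML))  # ['y2017']
```

Any panel id this returns has no content in the static HTML, so its data must arrive through a later request. With bs4 already in use, the same test on a single panel could be written as soup.find(id=panel_id).get_text(strip=True).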

An easier option could be to check which URL the content is fetched from when you click the button (the Network tab of your browser's developer tools shows this) and make a similar call in your script if possible.
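Since the question uses urllib.request, here is a minimal sketch of such a call with the standard library only. It assumes the per-year JSON URL and the field names shown in Solution 1, which may have changed since; the helper names issues_url, parse_issues and fetch_issues are made up for this example.

```python
import json
import urllib.request

# URL pattern and User-Agent are taken from Solution 1 (assumed still valid).
BASE = "https://www.sciencedirect.com/journal/{issn}/year/{year}/issues"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}


def issues_url(issn, year):
    """Build the URL the page itself requests when a year tab is opened."""
    return BASE.format(issn=issn, year=year)


def parse_issues(body):
    """Turn the JSON text into (cover date, volume, link) tuples."""
    payload = json.loads(body)
    return [(d["coverDateText"], d["volumeFirst"], d["uriLookup"])
            for d in payload["data"]]


def fetch_issues(issn, year):
    """Request one year of issues. This performs a network call."""
    request = urllib.request.Request(issues_url(issn, year), headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return parse_issues(response.read().decode("utf-8"))


# Offline check of the parsing step, using values from Solution 1's output.
sample = ('{"data": [{"coverDateText": "1 December 1999", '
          '"volumeFirst": "13", "uriLookup": "/vol/13/issue/8"}]}')
print(parse_issues(sample))  # [('1 December 1999', '13', '/vol/13/issue/8')]
```

To hit the live site, call fetch_issues("09500618", 1999). Keeping the parsing separate from the request makes it easy to test against a saved response before scraping every year.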

