Python->beautifulsoup->webscraping->looping Over Url (1 To 53) And Saving Results

January 23, 2024

Here is the website I am trying to scrape: http://livingwage.mit.edu/ The specific URLs are http://livingwage.mit.edu/states/01 http://livingwage.mit.edu/states/02 http://li

Solution 1:

Just get all the states from the initial page, then select the second table and use the CSS classes odd results to find the tr you need. There is no need to slice, as the class names are unique:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get all state urls and state names from the anchor tags on the base page.
# Select all the anchors inside each li that is a child of the
# ul with the css classes "states list-unstyled".
# (In parse below, td + td skips the first td, which is *Required annual income before taxes*.)
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split once on / from the right -> "/states/51"
    # and join to the base url. The anchor text also holds the state name,
    # so we store the full url and the state, i.e. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table; indexing in css starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text, we just need to find all the tds and call .text on each.
    # The row we want has the css classes "odd results"; td + td starts from the second td,
    # as we don't want the first.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
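
Since the question also asks about saving the results, here is a minimal sketch of writing rows like the ones printed above to CSV. It assumes each row is a (state, values) tuple as produced by the loop. The helper name write_rows and the generic value_1, value_2, ... column headers are made up for illustration, because the real column labels are not shown here. The sample data is just the first three values of the Alabama and Alaska rows.

```python
import csv
import io

# First three values of two rows from the output above (shortened for the example).
SAMPLE_ROWS = [
    ("Alabama", ["$21,144", "$43,213", "$53,468"]),
    ("Alaska", ["$24,070", "$49,295", "$60,933"]),
]


def write_rows(rows, handle):
    """Write (state, values) tuples to an open text handle as CSV."""
    writer = csv.writer(handle)
    # Hypothetical headers: the real column names are not known from the output alone.
    width = max((len(values) for _, values in rows), default=0)
    writer.writerow(["state"] + [f"value_{i}" for i in range(1, width + 1)])
    for state, values in rows:
        # csv.writer quotes fields that contain commas, such as "$21,144".
        writer.writerow([state, *values])


buffer = io.StringIO()
write_rows(SAMPLE_ROWS, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be called with a handle from open("living_wage.csv", "w", newline=""), where the file name is only an example.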

You could loop over a range from 1 to 53, but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each state page would give you text like Living Wage Calculation for Alabama, which you would then have to parse to get just the name; that is not trivial, considering some states have names of more than one word.
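
The selector logic in Solution 1 can be checked offline before hitting the site. The sketch below runs the same selectors against two small inline HTML strings. The markup in BASE_HTML and STATE_HTML and the function names state_links and parse_row are assumptions made for illustration; they only mimic the structure the selectors expect and are not a copy of the real pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://livingwage.mit.edu"

# Hypothetical stand-in for the list of states on the base page.
BASE_HTML = """
<ul class="states list-unstyled">
  <li><a href="/states/01/locations">Alabama</a></li>
  <li><a href="/states/02/locations">Alaska</a></li>
</ul>
"""

# Hypothetical stand-in for a state page with two sibling tables.
STATE_HTML = """
<div>
  <table>
    <tr class="odd"><td>Living Wage</td><td>$10.00</td></tr>
  </table>
  <table>
    <tr class="even"><td>Annual taxes</td><td>$1</td><td>$2</td></tr>
    <tr class="odd results">
      <td>Required annual income before taxes</td><td>$21,144</td><td>$43,213</td>
    </tr>
  </table>
</div>
"""


def state_links(html, base=BASE):
    """Return (url, state name) tuples from the anchors in the states list."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        # rsplit("/", 1)[0] drops the trailing "/locations" segment.
        (urljoin(base, a["href"].rsplit("/", 1)[0]), a.text)
        for a in soup.select("ul.states.list-unstyled li a")
    ]


def parse_row(html):
    """Return the cell texts of the "odd results" row in the second table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table:nth-of-type(2)")
    row = table.select_one("tr.odd.results")
    # "td + td" matches every td that directly follows another td,
    # so the first (label) cell is skipped.
    return [td.text.strip() for td in row.select("td + td")]


print(state_links(BASE_HTML))
print(parse_row(STATE_HTML))
```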

Solution 2:

Problem 1: These are the things that I want in the desired output, but how can I get python to give it to me in a string format rather than HTML like above?

You can get the text simply by doing something along the lines of:

state_name=states.find('h1').text

The same can be applied for each of the rows too.
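
As a small illustration of the point above, the sketch below pulls plain strings out of a heading and out of table cells. The inline SAMPLE_HTML is a made-up fragment, not the real page; only the heading text follows the wording quoted in Solution 1. get_text(strip=True) is used for the cells so that surrounding whitespace is removed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics a heading and one table row.
SAMPLE_HTML = """
<h1>Living Wage Calculation for Alabama</h1>
<table>
  <tr class="odd results">
    <td>Required annual income before taxes</td>
    <td> $21,144 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .text returns the string content of the tag instead of the HTML markup.
heading = soup.find("h1").text

# The same idea applied to every cell; strip=True trims leading/trailing whitespace.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

print(heading)
print(cells)
```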

Problem 2: How do I loop through the request.get(url01 to url56)?

The same code block can be put inside a loop from 1 to 56 like so:

for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...

zfill pads the number with leading zeroes (so 1 becomes 01). Also, it would be better to enclose requests.get in a try-except block so that the loop continues gracefully even when a URL is wrong.
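
One possible way to combine zfill with the suggested try-except is sketched below. The function names state_urls and fetch_pages, the get parameter (which lets a stand-in be passed instead of requests.get) and the 10 second timeout are assumptions for illustration, not part of the original answer.

```python
import requests

BASE_URL = "http://livingwage.mit.edu/states/"


def state_urls(first=1, last=56):
    """Build the state urls; zfill(2) turns 1 into "01", 2 into "02", and so on."""
    return [BASE_URL + str(i).zfill(2) for i in range(first, last + 1)]


def fetch_pages(urls, get=requests.get, timeout=10):
    """Fetch each url, skipping the ones that fail instead of stopping the loop."""
    pages = {}
    for url in urls:
        try:
            res = get(url, timeout=timeout)
            # Turn 4xx/5xx responses into exceptions so they are skipped too.
            res.raise_for_status()
        except requests.RequestException as exc:
            print(f"skipping {url}: {exc}")
            continue
        pages[url] = res.text
    return pages
```

Calling fetch_pages(state_urls()) would then return a dict mapping each url that loaded to its HTML, ready to be passed to BeautifulSoup.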
