
Complex Python3 CSV Scraper

I've got the code below working great when pulling data from a row, in my case row[0]. I'm wondering how to tweak it to pull data from multiple rows. Also, I would love to be able…

Solution 1:

To add a per-column find parameter, you could create a dictionary mapping each column index to the required find parameters, as follows:

from bs4 import BeautifulSoup
import requests
import csv

class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}
class_3 = {"class": "id-fix"}

# map a column number to the required find parameters
class_to_find = {
    0: class_3,    # Not defined in question
    1: class_1,
    2: class_1,
    3: class_3,    # Not defined in question
    4: class_2,
    5: class_2}

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        output_row = []

        for index, url in enumerate(row):
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                # print('col: {}\nurl: {}\nclass: {}\n\n'.format(index, url, class_to_find[index]))

                # Fetch content from the server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    output_row.extend([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    output_row.extend([url, '', 'missing http...'])
                    continue

                # Soup the fetched content
                soup = BeautifulSoup(html, 'html.parser')

                # Find the div matching this column's class
                divTag = soup.find("div", class_to_find[index])

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that the link is valid
                        try:
                            r = requests.get(url_sub)
                            output_row.extend([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            output_row.extend([url, url_sub, 'bad link'])
                else:
                    output_row.extend([url, '', 'no results'])

        writer.writerow(output_row)
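For reference, the script assumes urls.csv holds one URL per column, with the class_to_find dictionary covering columns 0 through 5. A hypothetical input file might look like this (the URLs are placeholders, not taken from the question):

# urls.csv (illustrative example)
https://www.example.com/products/1,https://www.example.com/items/2
https://www.example.com/products/3,,https://www.example.com/items/4

Empty cells are skipped by the len(url) check, so rows do not need every column filled.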

The enumerate() function returns a counter along with each item whilst iterating over a list. So index will be 0 for the first URL, 1 for the next, and so on. This index is then used with the class_to_find dictionary to look up the required parameters to search on.
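As a minimal, self-contained sketch of that lookup (the row values and classes here are placeholders matching the dictionary above):

class_to_find = {0: {"class": "id-fix"}, 1: {"class": "productsPicture"}}
row = ['http://example.com/a', 'http://example.com/b']

for index, url in enumerate(row):
    print(index, url, class_to_find[index])
# 0 http://example.com/a {'class': 'id-fix'}
# 1 http://example.com/b {'class': 'productsPicture'}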

Each URL results in 3 columns being created: the url, the sub-url if successful, and the result. These can be removed if not needed.
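To illustrate (the values are made up), a row with two URLs where the first yields a valid link and the second fails to connect would be written to results.csv as a single six-column row:

# output_row just before writer.writerow() is called
['http://example.com/a', 'http://example.com/a/detail', 'ok',
 'http://example.com/b', '', 'bad url']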
