
Pass Url Column's Values One By One To Web Crawler Code In Python

Based on the answer in this link, I'm able to create a new column: df['url'] = 'https://www.cspea.com.cn/list/c01/' + df['projectCode']. As the next step, I would like to pass the values of the url column one by one to the web crawler code.
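The column construction described above can be sketched as follows; the projectCode values here are hypothetical stand-ins for the data in items_scraped.xlsx:

```python
import pandas as pd

# Hypothetical frame standing in for the scraped spreadsheet
df = pd.DataFrame({"projectCode": ["gr2021bj1000186", "gr2021bj1000187"]})

# String concatenation is vectorized, so this builds every URL at once
df['url'] = 'https://www.cspea.com.cn/list/c01/' + df['projectCode']

print(df['url'].tolist())
# → ['https://www.cspea.com.cn/list/c01/gr2021bj1000186',
#    'https://www.cspea.com.cn/list/c01/gr2021bj1000187']
```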

Solution 1:

You need to combine the dfs generated in the loop. You could add them to a list and then call pd.concat on that list.

import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.read_excel('items_scraped.xlsx')

urls = df.url.tolist()
dfs = []

for url in urls:
    # e.g. "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
    # verify=False skips TLS certificate checks; requests will emit an
    # InsecureRequestWarning for each call
    soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")

    index, data = [], []
    for th in soup.select(".project-detail-left th"):
        index.append(th.get_text(strip=True))
        data.append(th.find_next("td").get_text(strip=True))

    # Build a one-row frame for this page: headers become columns,
    # cell texts become the values
    page_df = pd.DataFrame(data, index=index, columns=["value"]).T
    page_df.reset_index(drop=True, inplace=True)
    dfs.append(page_df)

df = pd.concat(dfs, ignore_index=True)
df.to_excel('result.xlsx', index=False)

Solution 2:

Use

urls = df.url.tolist()

to create a list of URLs, then iterate through it, using an f-string to insert each value into your base URL.
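A minimal sketch of that f-string approach, assuming the frame holds project codes rather than full URLs (the codes below are hypothetical examples):

```python
import pandas as pd

# Hypothetical data mirroring the question's projectCode column
df = pd.DataFrame({"projectCode": ["gr2021bj1000186", "gr2021bj1000187"]})

base_url = "https://www.cspea.com.cn/list/c01/"

# Build each request URL with an f-string and iterate over the result
urls = [f"{base_url}{code}" for code in df.projectCode.tolist()]

for url in urls:
    print(url)  # here you would call requests.get(url, ...) and scrape
```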
