Following Links In Python Assignment Using BeautifulSoup

February 28, 2024

I have this assignment for a Python class where I have to start from a specific link at a specific position, then follow that link a specific number of times. Supposedly the fi

Solution 1:

[Edit: Cut+pasted this line from comments] Hi! I had to work on a similar exercise, and because I had some doubts I found your question. Here is my code, and I think it works. I hope it will be helpful for you.

import urllib.request
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
count = 8
position = 18
tags_lst = []

# Follow the link at the given (1-based) position, count-1 times
for x in range(count - 1):
    tags = soup('a')
    my_tags = tags[position-1]
    needed_tag = my_tags.get('href', None)
    tags_lst.append(needed_tag)
    url = str(needed_tag)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
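
The loop in this answer can also be read apart from the network calls. Here is a minimal sketch of the same idea, not the answer author's code: `follow_links`, `get_links`, and the `fake_site` dictionary are hypothetical names, and the page names are made up for illustration. In real use, `get_links` would wrap the urllib and BeautifulSoup calls shown above.

```python
def follow_links(start_url, position, count, get_links):
    """Follow the link at a 1-based position, `count` times.

    `get_links` is any callable that maps a URL to the list of hrefs
    found on that page (hypothetical interface, injected for testing).
    Returns the list of URLs visited, starting with `start_url`.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        links = get_links(url)
        if not 1 <= position <= len(links):
            raise IndexError(f'{url} has {len(links)} links, no position {position}')
        url = links[position - 1]  # convert 1-based position to 0-based index
        visited.append(url)
    return visited


# Made-up site: each page maps to the hrefs it contains, in order
fake_site = {
    'a.html': ['b.html', 'c.html'],
    'b.html': ['c.html', 'a.html'],
    'c.html': ['a.html', 'b.html'],
}

path = follow_links('a.html', position=2, count=3, get_links=fake_site.__getitem__)
print(path)
```

Keeping the fetch step behind a callable makes the position and count arithmetic easy to check without hitting the real site.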

Solution 2:

I've put my solution below; it was tested and works as of today.

# Import the required modules

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re

# Access the website

url = "http://py4e-data.dr-chuck.net/known_by_Vairi.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
all_num_list = list()
link_position = 18
Process_repeat = 7

# Retrieve all of the anchor tags

tags = soup('a')

# Follow the link at link_position, Process_repeat times
while Process_repeat > 0:
    print("Process round", Process_repeat)
    target = tags[link_position - 1]
    print("target:", target)
    url = target.get('href', None)
    print("Current url", url)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    Process_repeat = Process_repeat - 1
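
Each href here is passed straight to urlopen, which works when the pages use absolute links. If a page used relative links instead (an assumption; the question does not say), urllib.parse.urljoin could resolve them against the current page. A small sketch, with known_by_Example.html as a made-up file name:

```python
from urllib.parse import urljoin

current_url = 'http://py4e-data.dr-chuck.net/known_by_Vairi.html'

# A relative href is resolved against the page it was found on
relative = urljoin(current_url, 'known_by_Example.html')

# An absolute href comes back unchanged
absolute = urljoin(current_url, 'http://py4e-data.dr-chuck.net/known_by_Example.html')

print(relative)
print(absolute)
```

Both calls give the same full URL, so wrapping every href in urljoin before opening it would be harmless for absolute links.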

Solution 3:

Try this. You can skip entering the URL: the sample link from your question is already in the code. Good luck!

import urllib.request
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Leave the URL blank to use the sample link from the question
url = input('Enter url: ') or 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
cn = input('Enter count: ')
cnint = int(cn)
pos = input('Enter position: ')
posint = int(pos)
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

tags_lst = list()
for x in range(0, cnint):
    tags = soup('a')
    my_tags = tags[posint-1]
    needed_tag = my_tags.get('href', None)
    url = str(needed_tag)
    html = urllib.request.urlopen(url,context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    print(my_tags.get('href', None))
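
The three ctx lines near the top turn off certificate verification, which only matters for HTTPS URLs. The order of the two assignments is significant: Python refuses to set verify_mode to CERT_NONE while check_hostname is still enabled. A minimal sketch of that behaviour (the helper name `make_unverified_context` is hypothetical):

```python
import ssl


def make_unverified_context():
    """Build an SSL context that skips certificate checks (unsafe outside exercises)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must come first
    ctx.verify_mode = ssl.CERT_NONE  # allowed only once hostname checks are off
    return ctx


ctx = make_unverified_context()
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_NONE)

# Doing it in the opposite order raises ValueError
wrong_order_failed = False
try:
    strict = ssl.create_default_context()
    strict.verify_mode = ssl.CERT_NONE
except ValueError:
    wrong_order_failed = True
print(wrong_order_failed)
```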

Solution 4:

Your BeautifulSoup import was wrong; I don't think it works with the code you show. Your lower loop was also confusing. You can get the list of URLs you want by slicing the full list of retrieved tags.

I've hardcoded your url in my code because it was easier than typing it in each run.

Try this:

import urllib.request
from bs4 import BeautifulSoup

# url = input('Enter - ')
url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# print(soup)
count = int(input('Enter count: ')) + 1
position = int(input('Enter position: '))

tags = soup('a')
# next line gets count tags starting from position
my_tags = tags[position: position+count]
tags_lst = []
for tag in my_tags:
    needed_tag = tag.get('href', None)
    tags_lst.append(needed_tag)
print(tags_lst)
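
One thing worth checking with the slice is where it starts. Python slices are 0-based, so tags[position] is the link after the one a person would call "number position" when counting from 1. A small sketch with made-up href strings in place of real tags (and without the +1 this answer adds to count):

```python
# Hypothetical stand-ins for the href values of the anchor tags on one page
hrefs = ['link1', 'link2', 'link3', 'link4', 'link5', 'link6']

position = 3  # "third link" when counting from 1
count = 2

# Slice in the style of the answer: starts at index `position` (0-based)
zero_based = hrefs[position: position + count]

# Slice shifted by one so that `position` is counted from 1
one_based = hrefs[position - 1: position - 1 + count]

print(zero_based)  # ['link4', 'link5']
print(one_based)   # ['link3', 'link4']
```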

Solution 5:

Almost all solutions to this assignment have two sections to load the urls. Instead, I defined a function that prints the relevant link for any given url.

Initially, the function takes the Fikret.html URL as input. Subsequent calls use the URL found at the required position on each retrieved page. The important line of code is this one: url = allerretour(url)[position-1]. It gets the new URL that feeds the next round of the loop.

import urllib.request
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'  # input('Enter URL : ')
position = 3  # int(input('Enter position : '))
count = 4  # int(input('Enter count : '))


def allerretour(url):
    # Print the URL being fetched and return every href found on that page
    print('Retrieving: ' + url)
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    return link


for x in range(1, count + 2):
    url = allerretour(url)[position-1]
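
To try the "return every href, then pick one by position" step without a network connection, here is a rough offline stand-in for allerretour. It swaps BeautifulSoup for the standard library's html.parser and works on an HTML string instead of a URL; `HrefCollector`, `links_in`, and the sample markup are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self.hrefs.append(dict(attrs).get('href'))


def links_in(html_text):
    """Rough stand-in for allerretour, working on HTML text instead of a URL."""
    collector = HrefCollector()
    collector.feed(html_text)
    return collector.hrefs


sample = '<ul><li><a href="one.html">One</a></li><li><a href="two.html">Two</a></li></ul>'
position = 2
print(links_in(sample)[position - 1])
```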
