
Beautifulsoup Not Downloading Files As Expected

I'm trying to download all the .txt files from this website with the following code:

from bs4 import BeautifulSoup as bs
import urllib
import urllib2
baseurl = 'http://m-selig.ae

Solution 1:

You are reading the text of the links, not the href, and the text contains an extra space. This retrieves the hrefs:

links = soup.findAll("a", href=True)
for link in links:
    print link['href']
    urllib.urlretrieve(baseurl+link['href'], link['href'])

I would actually like to only download specific files with names such as *geom.txt

Within the loop you can check, for example, if "geom" in link['href']:.
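
A plain substring test also accepts any name that merely contains "geom". As a minimal sketch (Python 3, standard library only), the check can be tightened to the asker's *geom.txt wording with fnmatch. The second and third names in the list below are hypothetical and only there to show the difference:

```python
from fnmatch import fnmatch

# Hypothetical href values; only the first name appears in this thread.
hrefs = [
    "ance_8.5x6_2850cm_5004.txt",
    "ance_8.5x6_geom.txt",
    "geometry_notes.html",
]

# Substring test from the answer: it also matches "geometry_notes.html".
loose = [h for h in hrefs if "geom" in h]

# Shell-style pattern that mirrors the "*geom.txt" wording of the question.
strict = [h for h in hrefs if fnmatch(h, "*geom.txt")]

print(loose)
print(strict)
```

If the listing only ever contains .txt files, the simpler substring test is probably enough.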

Solution 2:

You need to pull the href to get the link. You can also select only the links that contain geom.txt with a CSS selector:

from bs4 import BeautifulSoup as bs
import urllib
import urllib2
from urlparse import urljoin

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
# The attribute value is quoted, which CSS requires when it contains a dot.
links = (a["href"] for a in soup.select('a[href*="geom.txt"]'))
for link in links:
    # urljoin builds the absolute URL; the href doubles as the local file name.
    urllib.urlretrieve(urljoin(baseurl, link), link)

a[href*="geom.txt"] matches every anchor tag whose href contains geom.txt. It is the CSS counterpart of Python's substring in main_string test.

You could also use $= in your css to find hrefs ending in geom.txt:

links = (a["href"] for a in soup.select('a[href$="geom.txt"]'))
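
To see how the two selectors differ without hitting the server, here is a rough Python 3 sketch that runs them against a small inline listing. The HTML is a made-up stand-in for the real directory page, the last two file names are hypothetical, and select_urls is just an illustrative helper. It uses the built-in "html.parser" backend so lxml is not required:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

# Hypothetical stand-in for the directory listing; the real page is not fetched.
SAMPLE_HTML = """
<ul>
<li><a href="ance_8.5x6_2850cm_5004.txt"> ance_8.5x6_2850cm_5004.txt</a></li>
<li><a href="ance_8.5x6_geom.txt"> ance_8.5x6_geom.txt</a></li>
<li><a href="geom.txt.bak"> geom.txt.bak</a></li>
</ul>
"""


def select_urls(html, selector, base_url=BASE_URL):
    """Return absolute URLs for the anchors matched by a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select(selector)]


# *= matches hrefs that contain the value anywhere.
contains = select_urls(SAMPLE_HTML, 'a[href*="geom.txt"]')
# $= matches only hrefs that end with the value.
ends_with = select_urls(SAMPLE_HTML, 'a[href$="geom.txt"]')

print(contains)
print(ends_with)
```

In Python 3 each resulting URL could then be passed to urllib.request.urlretrieve, which takes the place of the Python 2 urllib.urlretrieve call used above.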

Solution 3:

Maybe you should use link['href'] instead of the link text. That way you avoid the leading space that appears in the displayed text:

<li><a href="ance_8.5x6_2850cm_5004.txt"> ance_8.5x6_2850cm_5004.txt</a></li>

In the text you have " ance_8.5x6_2850cm_5004.txt", with a leading space, and in the 'href' field you have "ance_8.5x6_2850cm_5004.txt", without it.
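
The difference is easy to check on that one list item. The sketch below is Python 3 and deliberately uses html.parser from the standard library instead of BeautifulSoup, so it runs with no extra packages. AnchorCollector is an illustrative class, not something from the original code:

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text)))
            self._href = None


collector = AnchorCollector()
collector.feed(
    '<li><a href="ance_8.5x6_2850cm_5004.txt"> ance_8.5x6_2850cm_5004.txt</a></li>'
)
href, text = collector.pairs[0]
print(repr(href))
print(repr(text))
```

The two printed values differ only by the leading space in the text, which is what ends up inside the URL when the text is concatenated onto the base URL.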

Solution 4:

from bs4 import BeautifulSoup as bs
import urllib 
import urllib2

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    print link.text
    data = urllib.urlopen(baseurl+link.text.strip())
    with open(link.text.strip(), "wb") as fs:
        fs.write(data.read())

Use the strip() method to remove the surrounding spaces from the link text before building the URL, and it will work fine.
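
As a small Python 3 sketch of the same idea, the name can be stripped once and reused for both the URL and the local file name. download_target is a hypothetical helper introduced only for illustration. It assumes the visible link text really is the file name, which holds for the item quoted in Solution 3 but may not hold for every link on the page:

```python
from urllib.parse import urljoin

BASE_URL = "http://m-selig.ae.illinois.edu/props/volume-1/data/"


def download_target(link_text, base_url=BASE_URL):
    """Return (url, filename) for a link's visible text, whitespace removed."""
    name = link_text.strip()
    return urljoin(base_url, name), name


url, filename = download_target(" ance_8.5x6_2850cm_5004.txt")
print(url)
print(filename)
```

The pair could then be handed to urllib.request.urlretrieve(url, filename). Reading the href attribute, as the other answers do, avoids depending on the link text at all.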
