Beautifulsoup Not Downloading Files As Expected
Solution 1:
You are reading the text of the links, not the href, and the text contains an extra space. This retrieves the hrefs:
links = soup.findAll("a", href=True)
for link in links:
    print link['href']
    urllib.urlretrieve(baseurl+link['href'], link['href'])
I would actually like to only download specific files with names such as *geom.txt
Within the loop you can check the href before downloading, for example: if "geom" in link['href']:
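The asker's pattern *geom.txt is a shell-style wildcard, so one way to express the check is with fnmatch from the standard library. This is a minimal Python 3 sketch, assuming the hrefs have already been collected into a list. The helper name wanted_hrefs and the sample file names are hypothetical and only there for illustration:

```python
from fnmatch import fnmatchcase


def wanted_hrefs(hrefs, pattern="*geom.txt"):
    """Return only the hrefs that match a shell-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [h for h in hrefs if fnmatchcase(h, pattern)]


# Hypothetical names, not taken from the real directory listing.
sample = ["ance_8.5x6_geom.txt", "ance_8.5x6_2850cm_5004.txt", "readme.html"]
print(wanted_hrefs(sample))
```

Only the hrefs returned by the filter would then be passed to the download call.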
Solution 2:
You need to pull the href to get the link. You can also select only the links that contain geom.txt using a CSS selector:
from bs4 import BeautifulSoup as bs
import urllib
import urllib2
from urlparse import urljoin
baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = (a["href"] for a in soup.select("a[href*='geom.txt']"))
for link in links:
    urllib.urlretrieve(urljoin(baseurl, link), link)
The selector a[href*='geom.txt'] finds all anchor tags whose href contains geom.txt; it is equivalent to using if substring in main_string in Python. You could also use $= in your CSS to find hrefs ending in geom.txt:
links = (a["href"] for a in soup.select("a[href$='geom.txt']"))
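The two attribute selectors map onto plain string checks, and urljoin is what turns a relative href into a full URL. The sketch below is a Python 3 illustration using only the standard library (urljoin lives in urllib.parse there). The href values are made up for the example and are not taken from the real page:

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

# Hypothetical hrefs, for illustration only.
hrefs = ["ance_8.5x6_geom.txt", "geom.txt.bak", "ance_8.5x6_2850cm_5004.txt"]

# "contains" check, the string equivalent of the *= selector.
contains = [h for h in hrefs if "geom.txt" in h]

# "ends with" check, the string equivalent of the $= selector.
ends_with = [h for h in hrefs if h.endswith("geom.txt")]

# urljoin resolves each relative href against the base URL.
urls = [urljoin(baseurl, h) for h in ends_with]

# Without the trailing slash, urljoin replaces the last path segment.
no_slash = urljoin(baseurl.rstrip("/"), hrefs[0])

print(contains)
print(ends_with)
print(urls)
print(no_slash)
```

The last line shows why the trailing slash on baseurl matters: without it, "data" is treated as a file name and dropped from the result.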
Solution 3:
You should use link['href'] instead of the link text. That way you avoid the leading space that appears in the displayed text:
<li><a href="ance_8.5x6_2850cm_5004.txt"> ance_8.5x6_2850cm_5004.txt</a></li>
The text is " ance_8.5x6_2850cm_5004.txt", with a leading space, while the href attribute is "ance_8.5x6_2850cm_5004.txt", with no spaces.
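To see the difference without touching the network, the snippet above can be parsed directly. This sketch deliberately swaps in Python 3's built-in html.parser instead of BeautifulSoup so that it runs with the standard library alone; AnchorCollector is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text)))
            self._href = None


snippet = '<li><a href="ance_8.5x6_2850cm_5004.txt"> ance_8.5x6_2850cm_5004.txt</a></li>'
collector = AnchorCollector()
collector.feed(snippet)
collector.close()

href, text = collector.anchors[0]
print(repr(href))
print(repr(text))
```

The href comes back clean, while the text keeps the leading space that breaks the download URL.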
Solution 4:
from bs4 import BeautifulSoup as bs
import urllib
import urllib2
baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    print link.text
    data = urllib.urlopen(baseurl+link.text.strip())
    with open(link.text.strip(),"wb") as fs:
        fs.write(data.read())
Use the strip() function to remove the surrounding spaces from the link text before building the URL and the file name, and it will work fine.
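The effect of strip() on the URL and the file name can be checked in isolation. This is a small Python 3 sketch; download_target is a hypothetical helper, and the link text is the one quoted in Solution 3:

```python
def download_target(baseurl, link_text):
    """Return the (url, filename) pair built from a link's visible text."""
    # strip() removes leading and trailing whitespace only.
    name = link_text.strip()
    return baseurl + name, name


baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
url, filename = download_target(baseurl, " ance_8.5x6_2850cm_5004.txt")
print(url)
print(filename)
```

Because the same stripped name is used for both values, the saved file does not end up with a leading space in its name either.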