UnicodeError: URL Contains Non-ASCII Characters (Python 2.7)
So I've managed to make a crawler, and I'm searching for all links, and when I arrive at a product link I make some finds and take all the product information, but when it arrives to c
Solution 1:
You need to percent-encode the UTF-8 representation of your unicode string.
As explained here:
All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.
In Python code, that means:
    import urllib
    url = urllib.quote(url.encode('utf8'), ':/')
The second argument to quote, ':/', is there to prevent the colon in the protocol part (http:) and the path separator (/) from being encoded.
(In Python 3, the quote function has been moved to the urllib.parse module.)
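For Python 3, a minimal sketch of the same percent-encoding step might look like the following. The helper name to_ascii_url and the example.com URL are made up for illustration. In Python 3, quote accepts a str directly and encodes it as UTF-8 by default, so no explicit encode call is needed.

```python
from urllib.parse import quote


def to_ascii_url(url):
    """Percent-encode non-ASCII characters (hypothetical helper)."""
    # safe=':/' leaves the scheme colon and the path separators untouched;
    # quote() encodes str input as UTF-8 before percent-encoding it.
    return quote(url, safe=':/')


# Illustrative URL, not taken from the original question.
encoded = to_ascii_url('http://example.com/café')
print(encoded)  # http://example.com/caf%C3%A9

# With the default safe='/', the colon after the scheme is escaped as well.
print(quote('http://example.com/café'))  # http%3A//example.com/caf%C3%A9
```

With safe=':/', characters such as ? and & in a query string would also be escaped, so a URL that carries a query string may need a wider safe set or per-component quoting.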
Solution 2:
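You can try encoding the URL as ASCII before opening it. Note that the 'ignore' error handler drops any non-ASCII characters, so the resulting URL may no longer point to the intended resource. Your code may look like: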
    import urllib
    import urllib2

    def get_html_text(url):
        try:
            # 'ignore' drops non-ASCII characters instead of raising an error
            return urllib2.urlopen(url.encode('ascii', 'ignore')).read()
        except (urllib2.URLError, urllib2.HTTPError, urllib.ContentTooShortError):
            print "Error getting " + url
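To see what the 'ignore' handler does to a URL, here is a small Python 3 sketch. The URL is made up for illustration, and the comparison assumes the server expects UTF-8 percent-encoded paths.

```python
from urllib.parse import quote

# Illustrative URL, not taken from the original question.
url = 'http://example.com/café/menü'

# encode('ascii', 'ignore') silently drops every non-ASCII character.
stripped = url.encode('ascii', 'ignore').decode('ascii')

# Percent-encoding keeps the characters as escaped UTF-8 bytes.
quoted = quote(url, safe=':/')

print(stripped)  # http://example.com/caf/men
print(quoted)    # http://example.com/caf%C3%A9/men%C3%BC
```

The stripped form avoids the UnicodeError but requests a different path, which is why percent-encoding is usually the safer choice when the non-ASCII characters are part of the resource name.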