UnicodeError: URL contains non-ASCII characters (Python 2.7)

So I've managed to make a crawler, and I'm searching for all links. When I arrive at a product link I make some finds and take all the product information, but when it arrives to c

Solution 1:

You need to percent-encode the UTF-8 representation of your Unicode string.

As explained here:

All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.

In Python code, that means:

import urllib
url = urllib.quote(url.encode('utf8'), ':/')

The second argument to quote, ':/', lists the characters to leave unencoded. It keeps the colon after the scheme (http:) and the path separator / from being percent-encoded.

(In Python 3, the quote function has been moved to the urllib.parse module.)
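
As a slightly fuller sketch of the same idea: the helper name to_ascii_url and the example URL below are hypothetical, not part of the original answer. It also treats the query-string delimiters and existing % escapes as safe, which is an assumption about what a crawler usually wants. It is written so that it should run on both Python 2.7 and Python 3.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3


def to_ascii_url(url):
    """Percent-encode the UTF-8 bytes of a Unicode URL (hypothetical helper)."""
    # Assumption: ':', '/', '?', '&', '=', '#' are URL delimiters to keep,
    # and '%' is kept so already-encoded escapes are not encoded twice.
    return quote(url.encode('utf8'), ':/?&=#%')


print(to_ascii_url(u'http://example.com/caf\xe9?q=1'))
# http://example.com/caf%C3%A9?q=1
```

This only covers the path and query. A non-ASCII host name is encoded differently (with IDNA), so the sketch assumes the host part is already ASCII.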

Solution 2:

You can try encoding the URLs. Your code may look like:

import urllib

def get_html_text(url):
    try:
        # 'ignore' drops any non-ASCII characters from the URL
        return urllib.urlopen(url.encode('ascii', 'ignore')).read()
    # URLError, HTTPError and urllib.ContentTooShortError all subclass IOError
    except IOError:
        print "Error getting " + repr(url)
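
Keep in mind that the 'ignore' error handler silently removes the non-ASCII characters, so the URL that gets requested may not be the one you crawled. A quick check, using a made-up example URL:

```python
# Hypothetical URL containing one non-ASCII character (e with acute accent).
url = u'http://example.com/caf\xe9'

# 'ignore' discards characters that cannot be represented in ASCII.
stripped = url.encode('ascii', 'ignore').decode('ascii')

print(stripped)
# http://example.com/caf
```

If the server needs the original path, percent-encoding the UTF-8 bytes is likely the safer choice.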