How To Extract Links From A Webpage Using Lxml, Xpath And Python?
I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute, and it returns the href values in Firefox's XPath Checker add-on. How can I run the same query from Python using lxml?
Solution 1:
I was able to make it work with the following code:
from io import StringIO  # StringIO.StringIO under Python 2
from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html lang="en"><head/><body><table border="1"><tbody><tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr><tr><td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td></tr></tbody></table></body></html>'''

tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

This prints:

['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
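As a side note, the StringIO wrapper isn't strictly needed. A minimal sketch of the same query using lxml's HTML parser, which also copes with markup that isn't well-formed XML (html_string is the string defined above):

from lxml import html

# html.fromstring() accepts the string directly and is forgiving
# of real-world HTML; absolute XPath still works from the root.
tree = html.fromstring(html_string)
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))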
Solution 2:
Firefox adds extra HTML tags (such as <tbody>) when it renders a page, so the XPath returned by the Firebug tool can be inconsistent with the actual HTML the server sends (and with what urllib or urllib2 will return). Removing the <tbody> step from the query generally does the trick.
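A minimal sketch of that fix, fetching the HTML the server actually returns; the URL here is a placeholder, not a real page:

import urllib.request  # urllib2 under Python 2
from lxml import html

# Hypothetical URL; substitute the page you are actually scraping.
url = 'http://example.com/page-with-table'
page = urllib.request.urlopen(url).read()

# Parse the server's HTML as-is and query without the tbody step,
# since the server's markup may never have contained <tbody>.
tree = html.fromstring(page)
print(tree.xpath('/html/body//tr/td/a[@title]/@href'))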