Retrieving Ad Urls
Solution 1:
BeautifulSoup alone isn't going to cut it. The ads are injected via JavaScript (they're DoubleClick ads).
Your options are:
- script something like Selenium to look for the URLs 10-15 seconds after page load
If you stay in pure Python, you'll need to:
- make the initial request and parse the response with BeautifulSoup
- figure out what Google was going to inject with JavaScript
- make a secondary request to DoubleClick for the iframe or payload URL
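A rough sketch of the first pure-Python step, under some loud assumptions: it only catches DoubleClick references that are already present in the static HTML (script, iframe, or anchor tags pointing at a doubleclick.net host), it uses the standard library's html.parser in place of BeautifulSoup to stay dependency-free, and the names DoubleClickRefs and find_doubleclick_refs are made up for illustration. Anything the page builds at runtime in JavaScript will not show up here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class DoubleClickRefs(HTMLParser):
    """Collects src/href values that point at a doubleclick.net host."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "iframe", "a"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Resolve relative and protocol-relative URLs against the page URL.
                absolute = urljoin(self.base_url, value)
                host = urlparse(absolute).hostname or ""
                if host == "doubleclick.net" or host.endswith(".doubleclick.net"):
                    self.urls.append(absolute)


def find_doubleclick_refs(html, base_url):
    """Return DoubleClick URLs referenced in already-downloaded HTML."""
    parser = DoubleClickRefs(base_url)
    parser.feed(html)
    return parser.urls
```

You would feed it the body of the initial response plus the URL it came from; whatever it returns are candidates for the secondary request.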
Those methods will only get you the DoubleClick URLs that handle conversion tracking. If you want to find out where they redirect to, you'll need to open those URLs and follow their redirects.
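Following the redirects could look roughly like this. It is a sketch using only the standard library; fetch_location and redirect_chain are names invented here, and it only sees HTTP-level redirects (3xx responses with a Location header), not redirects done in JavaScript or meta refresh tags.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stops urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_location(url, timeout=10):
    """Return the Location header if `url` answers with a redirect, else None."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout):
            return None  # normal response, no redirect
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise


def redirect_chain(url, fetch=fetch_location, max_hops=10):
    """Follow redirects one hop at a time and return every URL visited."""
    chain = [url]
    for _ in range(max_hops):
        location = fetch(chain[-1])
        if not location:
            break
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain
```

The last element of the returned list is where the ad finally lands, as far as plain HTTP redirects go. The fetch parameter is there so the hop logic can be exercised without touching the network.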
Solution 2:
Beautiful Soup is exactly what you're looking for.
Solution 3:
I'd check out Scrapy.
It's a web scraping library. It allows you to crawl and scrape information from websites quite easily. The link above is to the official tutorial, which contains lots of sample code, including something close to what it appears you need.
Simple scraping according to the site:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
That's a minor example of using the library to scrape links, which yields:
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n], 'link': [u'http://gnosis.cx/TPiP/'], 'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'], 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'], 'title': [u'XML Processing with Python']}
Pretty cool.
Solution 4:
I assume you are talking about the textual ads at the top of the screen.
You won't be able to directly use a Python parsing library for this because these links are loaded in using JavaScript after the page has loaded.
One option would be to use a tool like selenium that allows you to load a page in a browser. Once you have done that you can use BeautifulSoup to scan for the links you are looking for:
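from bs4 import BeautifulSoup

# html is the rendered page source, e.g. driver.page_source from Selenium
soup = BeautifulSoup(html, 'html.parser')
ads_div = soup.find('div', attrs={'class': 'ads'})
if ads_div:
    # href=True skips anchors that have no href attribute
    for link in ads_div.find_all('a', href=True):
        print(link['href'])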
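For the Selenium half, something along these lines should produce the rendered source to hand to BeautifulSoup. It's a sketch, assuming Firefox and its driver are installed; rendered_html, wait_seconds, and driver_factory are names made up here, and the fixed sleep is a blunt stand-in for waiting until the ads have actually been injected.

```python
import time


def rendered_html(url, wait_seconds=10, driver_factory=None):
    """Load `url` in a real browser and return the page source after a pause."""
    if driver_factory is None:
        # Imported lazily so the function can be exercised without Selenium.
        from selenium import webdriver
        driver_factory = webdriver.Firefox  # assumption: Firefox is available

    driver = driver_factory()
    try:
        driver.get(url)
        # Give the page's JavaScript time to inject the ads.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

The string it returns can be passed straight to BeautifulSoup in place of html. The driver_factory argument is only there so a different browser (or a stand-in object) can be swapped in.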