
Retrieving Ad URLs

I'm looking for a way to retrieve the ad URLs for this website: http://www.quiltingboard.com/resources/. What I want to do is probably write a script to continuously refresh the page…

Solution 1:

BeautifulSoup alone isn't going to cut it. The ads are injected via JavaScript (they're DoubleClick ads).

Your options are:

  • script something like Selenium to look for the URLs 10–15 seconds after page load
  • if you stay in pure Python, you'll need to:

    1. make the initial request and parse it with Beautiful Soup
    2. figure out what Google was going to inject with JavaScript
    3. make a secondary request to DoubleClick for the iframe or payload URL

Those methods will only get you the DoubleClick URLs that handle conversion tracking. If you want to find out where they redirect to, you'll need to open those URLs to discover their redirects.
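As a minimal sketch of that last step: many (though not all) DoubleClick click-through links carry the final landing page in an `adurl=` query parameter, so you can sometimes recover the destination without making a request at all. The URL below is a hypothetical example of that shape, not one taken from the site in question:

```python
from urllib.parse import urlparse, parse_qs

def destination_from_doubleclick(url):
    # Many DoubleClick click-through links embed the landing page in an
    # adurl= query parameter; parse_qs percent-decodes the value for us.
    qs = parse_qs(urlparse(url).query)
    return qs['adurl'][0] if 'adurl' in qs else None

# hypothetical click-through URL of the shape DoubleClick often emits
click = ('https://ad.doubleclick.net/ddm/clk/12345;67890;0'
         '?adurl=https%3A%2F%2Fwww.example.com%2Fquilting-sale')
print(destination_from_doubleclick(click))
# https://www.example.com/quilting-sale
```

When no `adurl` parameter is present, you'd fall back to actually requesting the URL and following the redirect chain (e.g. `requests.get(url).url` gives the final URL after redirects).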

Solution 2:

Beautiful Soup is exactly what you're looking for.

Solution 3:

I'd check out Scrapy.

It's a web-scraping framework that lets you crawl sites and extract information quite easily. The link above goes to the official tutorial, which contains plenty of sample code, including an example close to what you appear to need.

A simple spider, following the tutorial:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

That's a small example of using the library to scrape links; it yields:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'], 'link': [u'http://gnosis.cx/TPiP/'], 'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'], 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'], 'title': [u'XML Processing with Python']}

Pretty cool.

Solution 4:

I assume you are talking about the textual ads at the top of the screen.

You won't be able to directly use a Python parsing library for this because these links are loaded in using JavaScript after the page has loaded.

One option would be to use a tool like Selenium, which lets you load the page in a real browser. Once the page has rendered, you can feed its HTML to BeautifulSoup and scan for the links you are looking for:

from bs4 import BeautifulSoup

# html is the rendered page source (e.g. from Selenium's driver.page_source)
soup = BeautifulSoup(html, 'html.parser')
ads_div = soup.find('div', attrs={'class': 'ads'})
if ads_div:
    for link in ads_div.find_all('a'):
        print(link['href'])
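If installing BeautifulSoup isn't an option, the same scan can be sketched with the standard library's html.parser. The markup below is a hypothetical stand-in for what Selenium's page_source might return once the ads have been injected; the `ads` class name is the same assumption as in the snippet above:

```python
from html.parser import HTMLParser

class AdLinkParser(HTMLParser):
    """Collect href values from <a> tags inside <div class="ads">."""

    def __init__(self):
        super().__init__()
        self.in_ads = 0   # depth counter so nested divs don't end the block early
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.in_ads:
            if tag == 'div':
                self.in_ads += 1
            elif tag == 'a' and 'href' in attrs:
                self.links.append(attrs['href'])
        elif tag == 'div' and 'ads' in (attrs.get('class') or '').split():
            self.in_ads = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.in_ads:
            self.in_ads -= 1

# hypothetical rendered markup of the kind Selenium might return
html = ('<div class="ads">'
        '<a href="https://ad.doubleclick.net/clk/111">Ad 1</a>'
        '<a href="https://ad.doubleclick.net/clk/222">Ad 2</a>'
        '</div>')
parser = AdLinkParser()
parser.feed(html)
print(parser.links)
# ['https://ad.doubleclick.net/clk/111', 'https://ad.doubleclick.net/clk/222']
```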
