Skip to content Skip to sidebar Skip to footer

Issue With Html Tags While Scraping Data Using Beautiful Soup

Common piece of code: # -*- coding: cp1252 -*- import csv import urllib2 import sys import time from bs4 import BeautifulSoup from itertools import islice page = urllib2.urlopen('

Solution 1:

The page uses a large JavaScript structure to load the prices. You can load just that structure:

scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]

This results in a large JavaScript structure, starting with:

{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verfügbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}

Unfortunately, that's not directly loadable with the json module; although valid JavaScript, without quoting around the keys it is not valid JSON. You'd need to use regular expressions to clean that up further, or grab the p:"someprice" information directly from that string.

Luckily the structure can be fixed with a small amount of regular expression magic:

import re
import json

datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)

This gives you a large dictionary, with SKU keys and dictionaries with nested dicts as data, including nested SKUs with p product codes and e prices:

>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
 u'image': u'/images/m707491_300465.jpg',
 u'name': u'BlackBerry Bold 9900',
 u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
 u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
 u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
 u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
 u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}

Post a Comment for "Issue With Html Tags While Scraping Data Using Beautiful Soup"