Scrapy Outputs [ Into My .json File
A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and first paragraph of each article it visits.
Solution 1:
An output file containing only [ usually means that nothing was scraped and no items were extracted, so the JSON exporter never got past the opening bracket.
In your case, fix your allowed_domains setting:
allowed_domains = ["news24.com"]
Aside from that, just a bit of cleaning up from a perfectionist:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

# Import your item definition (adjust the module path to match your own items.py)
from basic.items import BasicItem


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com"]
    start_urls = [
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    ]

    rules = [
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    ]

    def parse_items(self, response):
        # One item per page, pulled from the main form element
        for title in response.xpath('//*[@id="aspnetForm"]'):
            item = BasicItem()
            item['Headline'] = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = title.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = title.xpath('//*[@id="spnDate"]/text()').extract()
            yield item
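For completeness, here is a minimal sketch of an items.py that matches the fields used above. The class and module names are assumptions, so adjust them to whatever your project actually uses:

    import scrapy

    # Minimal item definition matching the fields the spider populates
    # (assumed layout; this would live in your project's items.py)
    class BasicItem(scrapy.Item):
        Headline = scrapy.Field()
        Article = scrapy.Field()
        Date = scrapy.Field()

You can then run the spider with something like scrapy crawl basic_spider -o items.json; once items are actually yielded, the .json file should contain more than just the opening [.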