Scrapy Outputs [ Into My .json File

A genuine Scrapy and Python noob here, so please be patient with any silly mistakes. I'm trying to write a spider to recursively crawl a news site and return the headline, date, and

Solution 1:

A lone "[" in the output file usually means that nothing was scraped: no items were extracted.

In your case, fix your allowed_domains setting:

allowed_domains = ["news24.com"]
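To see why the exact form of this value matters, here is a minimal sketch of how a domain filter of this kind might behave. It is a simplified approximation written for illustration, not Scrapy's actual offsite filtering code; the helper name is_allowed and the "bad" value with a scheme and trailing slash are assumptions, since the original setting is not shown here.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Rough approximation of an offsite check (hypothetical helper):
    the URL's host must equal an allowed domain or be a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328"

# A bare domain name matches the host and its subdomains.
print(is_allowed(url, ["news24.com"]))              # True

# Assumed example of a bad value: a full URL is not a domain name,
# so no host ever matches it and every link is treated as offsite.
print(is_allowed(url, ["http://www.news24.com/"]))  # False
```

Under this model, a value that is not a plain domain name filters out every followed link, which would leave the spider with nothing to parse.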

Aside from that, here is a bit of cleanup from a perfectionist:

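import scrapy
# scrapy.contrib was deprecated in Scrapy 1.0 and removed in later releases;
# on current versions, import from these modules instead.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


# Assumption: BasicItem is not shown in the original post. It would normally
# live in the project's items.py; it is defined here so the spider is complete.
class BasicItem(scrapy.Item):
    Headline = scrapy.Field()
    Article = scrapy.Field()
    Date = scrapy.Field()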


class BasicSpiderSpider(CrawlSpider):
    name = "basic_spider"
    allowed_domains = ["news24.com"]
    start_urls = [
        'http://www.news24.com/SouthAfrica/News/56-children-hospitalised-for-food-poisoning-20150328',
    ]

    rules = [
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    ]

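    def parse_items(self, response):
        # Note: an XPath that starts with // searches the whole document, not
        # just the `title` node; prefix it with .// to search relative to it.
        for title in response.xpath('//*[@id="aspnetForm"]'):
            item = BasicItem()
            # extract() returns a list of all matching strings (possibly empty).
            item["Headline"] = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            item["Article"] = title.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item["Date"] = title.xpath('//*[@id="spnDate"]/text()').extract()
            yield item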
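
After re-running the crawl, a quick way to confirm that items were actually exported is to try parsing the output as JSON. The following is a rough sketch using only the standard library; the helper name count_exported_items is hypothetical, and it assumes the feed is a single JSON array.

```python
import json


def count_exported_items(text):
    """Return the number of items in a JSON feed, or 0 if the feed is
    empty or incomplete (for example, a lone "[")."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return 0
    return len(data) if isinstance(data, list) else 0


# A feed containing only the opening bracket is not valid JSON.
print(count_exported_items("["))                                     # 0

# A feed with one exported item (made-up sample data).
print(count_exported_items('[{"Headline": ["Example headline"]}]'))  # 1
```

To check a real run, read the contents of the output file and pass them to the function.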
