Force Scrapy To Crawl Links In The Order They Appear
I'm writing a spider with Scrapy to crawl a website. The index page is a list of links like www.link1.com, www.link2.com, www.link3.com, and the site is updated really often, so my
Solution 1:
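Try this example.
Build a list and add all the links to it in the order they appear on the page.
Then pop them off one at a time, issuing the next request only after the previous response has arrived, so the requests stay in order.
I also recommend doing what @Hassan mentions and piping your contents to a database.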
# Import paths below are the ones used by older Scrapy releases
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log


class SymantecSpider(BaseSpider):
    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    allLinks = []
    base_url = "http://www.symantec.com"

    def start_requests(self):
        return [Request('http://www.symantec.com/security_response/landing/vulnerabilities.jsp', callback=self.parseMgr)]

    def parseMgr(self, response):
        # Grab all the links, in page order, and add them to allLinks
        hxs = HtmlXPathSelector(response)
        self.allLinks.extend(hxs.select("//table[@class='defaultTableStyle tableFontMD tableNoBorder']/tbody/tr/td[2]/a/@href").extract())
        return self.nextRequest()

    def pageParser(self, response):
        log.msg('response: %s' % response.url, level=log.INFO)
        return self.nextRequest()

    def nextRequest(self):
        # Cycle through allLinks in order; return nothing once the list is empty
        if self.allLinks:
            return Request(self.base_url + self.allLinks.pop(0), callback=self.pageParser)
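The pop-one-at-a-time idea can be sketched without Scrapy at all. The snippet below is only an illustration using the standard library: `crawl_in_order` and its `fetch` argument are hypothetical names, with `fetch` standing in for "send a request and wait for the response". It also shows the guard you need so the loop stops cleanly when the list runs out.

```python
from collections import deque


def crawl_in_order(links, fetch):
    """Visit links strictly one at a time, in the order given.

    `fetch` is a hypothetical stand-in for issuing a request and
    waiting for its response before moving on.
    """
    pending = deque(links)  # popleft() is O(1); list.pop(0) is O(n)
    visited = []
    while pending:  # stop cleanly once every link has been handled
        url = pending.popleft()
        fetch(url)
        visited.append(url)
    return visited


if __name__ == "__main__":
    demo_links = ["/page1", "/page2", "/page3"]
    print(crawl_in_order(demo_links, fetch=lambda url: None))
```

Because the next link is only taken after the previous `fetch` returns, the visiting order always matches the list order. The cost is that nothing is fetched in parallel.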
Solution 2:
SgmlLinkExtractor will extract links in the same order they appear on the page.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

links = SgmlLinkExtractor(
    restrict_xpaths='//div[@class="mrgnMD"]/following-sibling::table',
).extract_links(response)
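You can use the same extractor in the rules of your CrawlSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ThreatSpider(CrawlSpider):
    name = 'threats'
    start_urls = [
        'http://www.symantec.com/security_response/landing/vulnerabilities.jsp',
    ]
    # The trailing comma keeps rules a one-element tuple
    rules = (
        Rule(
            SgmlLinkExtractor(
                restrict_xpaths='//div[@class="mrgnMD"]/following-sibling::table'),
            callback='parse_threats'),
    )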
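To make "the same order they appear on the page" concrete, here is a small standard-library sketch of document-order link extraction. It is not Scrapy's implementation: `OrderedLinkParser` and `extract_links` are hypothetical names, and the sample markup is made up for the demo.

```python
from html.parser import HTMLParser


class OrderedLinkParser(HTMLParser):
    """Collect href values from <a> tags in the order they are parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that carry no href
                self.links.append(href)


def extract_links(html):
    """Return every link in `html`, first to last as it appears."""
    parser = OrderedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<table><tr><td><a href="/b">B</a></td><td><a href="/a">A</a></td></tr></table>'
    print(extract_links(sample))
```

Keep in mind that extraction order is only half of the problem. Scrapy downloads pages concurrently, so responses for links extracted in order can still come back in a different order unless the requests are chained one after another.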