
The Order Of Scrapy Crawling Urls With Long Start_urls List And Urls Yielded From Spider

Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list: about 3,000,000 urls in a file. So I build start_urls like this: start_urls = re

Solution 1:

First of all, please see this thread - I think you'll find all the answers there.

The order of the urls used by the downloader? Will the requests made by just_test1, just_test2 be used by the downloader only after all the start_urls are used? (I have made some tests, and it seems that the answer is No.)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)

What decides the order? Why is the order like this, and how can we control it?

By default, there is no predefined order: you cannot know when the responses to the Requests created by make_requests_from_url will arrive, because downloading is asynchronous.

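See this answer on how you may control the order. Long story short, you can override start_requests and mark each yielded Request with a priority key in its meta (like yield Request(url, meta={'priority': 0})). For example, the value of priority can be the line number where the url was found. The key does not change the download order by itself: it travels with the request, is available in the callback as response.meta['priority'], and lets you put the results back into the original order afterwards. If you want to influence the scheduler directly, Request also accepts a priority argument, and requests with a higher value are scheduled earlier.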
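To make the idea of the priority key concrete, here is a minimal sketch in plain Python (no Scrapy needed) of how results that arrive in arbitrary order can be put back into file order once each one carries the line number it came from. The function name restore_order and the sample urls are hypothetical and only serve the illustration; in a real spider the pairs would be built in the callback from response.meta['priority'].

```python
import random


def restore_order(results):
    """Sort (priority, item) pairs by priority and return only the items."""
    return [item for _, item in sorted(results, key=lambda pair: pair[0])]


# Hypothetical urls, tagged with the line number they were read from.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
tagged = list(enumerate(urls))

# Simulate responses coming back in an unpredictable order.
random.shuffle(tagged)

# Sorting by the tag recovers the original order of the file.
ordered = restore_order(tagged)
```

With 3,000,000 urls you would probably not sort everything in memory like this; storing the tag with each item and sorting later (for example in a database) follows the same idea.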

Is this a good way to deal with so many urls which are already in a file? What else can I do?

I think you should read your file and yield the urls directly in the start_requests method: see this answer.

So, you should do something like this:

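import codecs
from scrapy import Request

def start_requests(self):
    # Read the file lazily, so the urls are never all held in memory at once.
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            try:
                # The line number is readable later as response.meta['priority'].
                request = Request(url, meta={'priority': index})
            except ValueError:
                continue  # skip blank or malformed lines (no url scheme)
            yield request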
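If you want to check the file-reading part without starting a crawl, one option is to move it into a plain generator. The sketch below assumes Python 3 (it uses the built-in open instead of codecs.open) and keeps the GB18030 encoding from the snippet above as a default. The name iter_urls is hypothetical and not part of Scrapy.

```python
def iter_urls(file_path, encoding="GB18030"):
    """Lazily yield (line_index, url) pairs from a file, skipping blank lines."""
    with open(file_path, "r", encoding=encoding) as f:
        for index, line in enumerate(f):
            url = line.strip()
            if url:
                yield index, url


# Possible use inside a spider (requires Scrapy, so it is left as a comment):
#
#     def start_requests(self):
#         for index, url in iter_urls(self.file_path):
#             yield Request(url, meta={'priority': index})
```

Because iter_urls is a generator, only one line of the file is held in memory at a time, which matters with a file of about 3,000,000 urls.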

Hope that helps.
