
Scrapy Get All Links From Any Website

I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(
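(A minimal version of such a requests/BeautifulSoup link collector might look like the sketch below; this is a reconstruction of the idea, not the original code, reusing the names from the snippet above. The re import is omitted since it is not needed for the basic case.)

import requests
from bs4 import BeautifulSoup

def get_links(link):
    # fetch the page and collect the href of every <a> tag on it
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        return_links.append(a['href'])
    return return_links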

Solution 1:

There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
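For example, the broad crawls page suggests tuning settings such as these (the values below are only illustrative):

# settings.py -- illustrative values for a broad crawl, adjust to your hardware
CONCURRENT_REQUESTS = 100          # raise the global concurrency limit
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # DEBUG logging gets expensive on large crawls
COOKIES_ENABLED = False            # most broad crawls do not need cookies
RETRY_ENABLED = False              # skip retries to keep the crawl moving
DOWNLOAD_TIMEOUT = 15              # give up on slow sites sooner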

To recreate the behaviour you need in Scrapy, you must:

  • set your start URL to your page.
  • write a parse function that follows all links and recursively calls itself, adding each requested URL to a spider attribute.

An untested example (which can, of course, be refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []  # every url the spider has visited

    def parse(self, response):
        self.links.append(response.url)
        # follow every <a href="..."> on the page and parse it recursively
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
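To run the spider and read back the collected list, one option is to drive it from a script. A minimal, untested sketch along those lines, assuming the AllSpider class above (the DEPTH_LIMIT setting is only there to keep the example crawl finite):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

collected = []

def on_spider_closed(spider):
    # copy the urls gathered by the spider before the process shuts down
    collected.extend(spider.links)

process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',
    'DEPTH_LIMIT': 2,  # cap recursion so the example crawl terminates
})
crawler = process.create_crawler(AllSpider)
crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished

print(len(collected), 'urls visited')

Alternatively, scrapy crawl all works as usual; self.links then only exists on the spider instance for the duration of that run.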

Solution 2:

If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

A simple spider that follows all links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'

    start_urls = ['https://example.com']
    # LinkExtractor() with no arguments extracts every link on a page;
    # follow=True tells the rule to keep crawling from each extracted link
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass  # process each crawled page here
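If the goal is simply a list of the crawled URLs, parse_item can yield one item per page and Scrapy's feed exports will write them out. A small sketch, again assuming the spider above (the output file name is arbitrary):

    def parse_item(self, response):
        # emit each visited url as an item so the feed exporter can save it
        yield {'url': response.url}

Running it with scrapy crawl follow_all -o links.json then collects the crawled URLs into links.json.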
