Scrapy Get All Links From Any Website
I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(
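For reference, a minimal requests/BeautifulSoup link extractor along the lines the snippet describes might look like this (a sketch, not the original code):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_links(link):
    # fetch the page and collect an absolute URL for every <a href="..."> on it
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        return_links.append(urljoin(link, a['href']))
    return return_links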
Solution 1:
There is an entire section of the Scrapy documentation dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
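For example, the broad-crawls guide suggests tuning settings along these lines in settings.py (the values below are illustrative, not prescriptive):

# settings.py -- illustrative values for a broad crawl
CONCURRENT_REQUESTS = 100          # raise global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # reduce logging overhead
COOKIES_ENABLED = False            # cookies rarely matter for broad crawls
RETRY_ENABLED = False              # don't retry failed pages
DOWNLOAD_TIMEOUT = 15              # give up quickly on slow sites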
To recreate the behaviour you need in Scrapy, you must:
- set your start URL to your page;
- write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.
An untested example (that can, of course, be refined):

import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []  # every URL the spider has visited

    def parse(self, response):
        self.links.append(response.url)
        # follow every link on the page and parse it with this same callback
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
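If you run the spider from a script instead of scrapy crawl, you can read the collected URLs off the spider instance afterwards; a rough sketch, assuming the spider above lives in the same module:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
crawler = process.create_crawler(AllSpider)
process.crawl(crawler)
process.start()               # blocks until the crawl finishes
print(crawler.spider.links)   # URLs collected by AllSpider.parse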
Solution 2:
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    # no allowed_domains, so links to any domain are followed
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
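If you want the visited URLs saved somewhere, one option (a sketch, not the only way) is to yield them as items from parse_item and let a feed export write them out, e.g. with scrapy runspider follow_all.py -o links.json:

    def parse_item(self, response):
        # yield each visited URL as an item so the feed export records it
        yield {'url': response.url}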