
How To Use Scrapy With Both Splash And Tor Over Privoxy In Docker Compose

I'm trying to run a Scrapy spider with two 'extensions': Splash to render JavaScript, and Tor-Privoxy to provide anonymity. As an example, I'm using the scraper of quotes.toscrap

Solution 1:

Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to route Splash's traffic through Tor, rather than configuring the proxy on the spider itself.
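As a minimal sketch of what this means on the Scrapy side, the settings below only point the spider at Splash and set no proxy at all. This assumes the scrapy-splash plugin is used (the answer does not show its settings.py) and that Splash listens on its default port 8050 under the Compose service name splash; the middleware entries follow the scrapy-splash README.

```python
# scrashtest/settings.py (sketch, not the author's actual file)

BOT_NAME = "scrashtest"
SPIDER_MODULES = ["scrashtest.spiders"]
NEWSPIDER_MODULE = "scrashtest.spiders"

# Assumption: Splash is reachable under its Compose service name on port 8050.
SPLASH_URL = "http://splash:8050"

# Middleware configuration as documented in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# No proxy is configured here on purpose: proxying is left to Splash.
```

With settings like these, the spider never talks to Tor or Privoxy itself; it only sends its requests to Splash.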

My adapted project has the following structure:

.
├── docker-compose.yml
├── example
│   ├── Dockerfile
│   ├── scrapy.cfg
│   └── scrashtest
│       ├── __init__.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
│           └── quotes.py
└── splash
    └── proxy-profiles
        └── default.ini

and the docker-compose.yml is

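version: '3'

services:
  # Scrapy project; it talks only to Splash
  scraper:
    build: ./example
    links:
      - splash

  # Tor with Privoxy in front of it, reachable as an HTTP proxy on port 8118
  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  # Splash reads its proxy settings from the mounted profiles directory
  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy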

Here I've mounted the proxy-profiles directory read-only into the splash container at /etc/splash/proxy-profiles, following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles. The default.ini reads

[proxy]

host=tor-privoxy
port=8118

(I also noticed that the file must be named default.ini: Splash applies the profile called default whenever a request does not specify a proxy argument.)
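Since a typo in the profile only shows up once Splash is running, it can help to sanity-check the file first. The sketch below parses the same profile text with Python's standard configparser; the helper name read_proxy_profile is hypothetical, and this is only a local check, not a statement about how Splash itself reads the file.

```python
import configparser

# Same content as splash/proxy-profiles/default.ini above.
PROFILE_TEXT = """\
[proxy]

host=tor-privoxy
port=8118
"""


def read_proxy_profile(text):
    """Return (host, port) from the [proxy] section of a profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["proxy"]  # raises KeyError if the section is missing
    return section["host"], section.getint("port")


host, port = read_proxy_profile(PROFILE_TEXT)
print(f"Splash will proxy through {host}:{port}")
```

The host must match the Compose service name (tor-privoxy), because that is the name under which the Splash container resolves the proxy.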

With this setup, running docker-compose build followed by docker-compose up starts the scraper, which runs successfully through Splash.
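A successful run does not by itself show that traffic leaves through Tor. One way to check, sketched below, is to ask Splash to render the Tor Project's check page and look at the result. The sketch makes several assumptions: Splash is reachable at http://splash:8050 (so it has to run inside the scraper container, since the Compose file publishes no ports), https://check.torproject.org/api/ip answers with a small JSON object containing an IsTor field, and that object survives rendering through render.html. The helper names are hypothetical.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_BASE = "http://splash:8050"  # assumption: Compose service name, default port
CHECK_URL = "https://check.torproject.org/api/ip"  # assumed to return {"IsTor": ..., "IP": ...}


def build_render_url(target, splash_base=SPLASH_BASE, wait=0.5):
    """Build a Splash render.html URL for the given target page."""
    query = urlencode({"url": target, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def extract_is_tor(rendered):
    """Find the first flat JSON object in the rendered page and return its IsTor value."""
    match = re.search(r"\{[^{}]*\}", rendered)
    if match is None:
        return None
    try:
        return json.loads(match.group(0)).get("IsTor")
    except json.JSONDecodeError:
        return None


def check_via_splash():
    """Render the check page through Splash; needs the containers to be up."""
    with urlopen(build_render_url(CHECK_URL), timeout=60) as response:
        body = response.read().decode("utf-8", errors="replace")
    return extract_is_tor(body)
```

Calling check_via_splash() from inside the scraper container should return True if Splash picked up the default profile; None means the response did not have the assumed shape.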

