How To Use Scrapy With Both Splash And Tor Over Privoxy In Docker Compose
I'm trying to run a Scrapy spider with two 'extensions': Splash for rendering JavaScript and Tor-Privoxy for anonymity. As an example, I'm using the scraper of quotes.toscrap
Solution 1:
Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to make Splash use Tor, not the spider directly.
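On the Scrapy side, this means the settings only need to point at Splash and carry no proxy configuration of their own. The answer does not show its settings.py, so the following is only a sketch: it assumes the scrapy-splash plugin is installed in the scraper image, that Splash listens on its default port 8050, and that the Splash container is reachable under the Compose service name splash.

```python
# settings.py (sketch): Scrapy talks to Splash only; Splash is the one that uses Tor.

BOT_NAME = "scrashtest"
SPIDER_MODULES = ["scrashtest.spiders"]

# Assumption: "splash" is the Compose service name, 8050 is Splash's default port.
SPLASH_URL = "http://splash:8050"

# Middleware wiring as described in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"

# Deliberately no proxy settings here: the proxy is configured inside Splash.
```

With settings like these, the spider would issue its requests through scrapy_splash.SplashRequest and never address Privoxy itself.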
My adapted project has the following structure:
.
├── docker-compose.yml
├── example
│   ├── Dockerfile
│   ├── scrapy.cfg
│   └── scrashtest
│       ├── __init__.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
│           └── quotes.py
└── splash
    └── proxy-profiles
        └── default.ini
and the docker-compose.yml is
version: '3'
services:
  scraper:
    build: ./example
    links:
      - splash
  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine
  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy
where I've mounted the proxy-profiles directory as a volume into the splash container following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles. The default.ini reads
[proxy]
host=tor-privoxy
port=8118
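(I also noticed that the file must be named default.ini: Splash applies the profile called default automatically when a request does not name a proxy profile itself.)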
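Since a typo in the profile is easy to miss, it may help to sanity-check the file before mounting it. The sketch below is not part of the original answer; the helper name read_proxy is hypothetical, and it only checks that the profile has a [proxy] section with a host and a numeric port.

```python
import configparser

# The profile text from this answer, inlined so the sketch is self-contained.
PROFILE_TEXT = """\
[proxy]
host=tor-privoxy
port=8118
"""


def read_proxy(profile_text):
    """Return (host, port) from a proxy profile, or raise ValueError."""
    parser = configparser.ConfigParser()
    parser.read_string(profile_text)
    if not parser.has_section("proxy"):
        raise ValueError("profile needs a [proxy] section")
    host = parser.get("proxy", "host")
    port = parser.getint("proxy", "port")  # raises ValueError if not an integer
    return host, port


if __name__ == "__main__":
    print(read_proxy(PROFILE_TEXT))
```

The host should match the Compose service name (tor-privoxy here), and 8118 is the port Privoxy listens on by default.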
With this setup, after docker-compose build and docker-compose up, the scraper runs successfully using Splash.
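A successful run does not by itself prove that traffic leaves through Tor. One way to check, sketched here under assumptions, is to ask Splash's render.html endpoint to load the Tor Project's check page and look for its success message. The function names are hypothetical, and the marker string is an assumption about the current wording of that page. Because the Compose file publishes no ports, the script would have to run inside the Compose network, for example in the scraper container.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: Compose service name "splash" and Splash's default port 8050.
SPLASH_BASE = "http://splash:8050"
CHECK_URL = "https://check.torproject.org/"
# Assumption: the check page greets Tor users with a message starting like this.
TOR_MARKER = "Congratulations"


def build_render_url(target, splash_base=SPLASH_BASE, wait=2.0, timeout=60):
    """Build a Splash render.html URL for the given target page."""
    query = urlencode({"url": target, "wait": wait, "timeout": timeout})
    return f"{splash_base}/render.html?{query}"


def splash_uses_tor():
    """Render the Tor check page through Splash and look for the marker."""
    with urlopen(build_render_url(CHECK_URL), timeout=90) as response:
        html = response.read().decode("utf-8", errors="replace")
    return TOR_MARKER in html


if __name__ == "__main__":
    print("Splash exits via Tor:", splash_uses_tor())
```

No proxy argument is passed in the request, so the result reflects whatever the default profile does.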