Renaming Downloaded Images In Scrapy 0.24 With Content From An Item Field While Avoiding Filename Conflicts?
I'm attempting to rename the images that are downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the file names. I'd l
Solution 1:
The pipelines.py:
from scrapy.contrib.pipeline.images import ImagesPipeline  # scrapy.pipelines.images in Scrapy 1.0+
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy import log


class MyImagesPipeline(ImagesPipeline):

    # Name download version
    def file_path(self, request, response=None, info=None):
        # 'model' is an item field; it reaches request.meta via get_media_requests below
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    # Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # Pass the item's fields along as request meta so file_path can read them
        yield Request(item['image_urls'][0], meta=item)
Your settings.py is set up incorrectly. Use this instead:
ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}
For thumbnails to work, add this to settings.py:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}
Solution 2:
Because the URL hash already gives every downloaded file a unique identifier, you could leave the default file names alone and write each item's value together with its URL hash to a separate file during the crawl.
Once the crawl has finished, loop over that file and rename the images, using a Counter dictionary to append a number whenever several items share the same value.
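The renaming step could look roughly like the sketch below. It is only an illustration of the idea, not tested against a real crawl: the function names plan_renames and rename_images, the (url_hash, value) pair format, the "_1", "_2" suffix style, and the .jpg extension are all assumptions made for the example.

```python
import os
from collections import Counter


def plan_renames(pairs, ext=".jpg"):
    """Map each hash-based file name to a name built from the item value.

    pairs: iterable of (url_hash, value) tuples recorded during the crawl
    (hypothetical format). The first file seen for a value keeps the plain
    name; later files with the same value get _1, _2, ... appended.
    """
    seen = Counter()  # how many times each value has been used so far
    plan = {}
    for url_hash, value in pairs:
        n = seen[value]
        suffix = "" if n == 0 else "_%d" % n
        plan[url_hash + ext] = "%s%s%s" % (value, suffix, ext)
        seen[value] += 1
    return plan


def rename_images(image_dir, pairs):
    """Apply the plan to the files in image_dir (for example IMAGES_STORE/full)."""
    for old_name, new_name in plan_renames(pairs).items():
        os.rename(os.path.join(image_dir, old_name),
                  os.path.join(image_dir, new_name))
```

Calling rename_images('/path/to/images/full', pairs) after the spider has closed would then rename the files in place. Note that this sketch does not guard against an item value that itself already ends in a suffix such as "_1", nor against characters in the value that are invalid in file names.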