Renaming Downloaded Images In Scrapy 0.24 With Content From An Item Field While Avoiding Filename Conflicts?

June 10, 2023

I'm attempting to rename the images that are downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the file names. I'd l

Solution 1:

The pipelines.py:

# Scrapy 0.24 import path; from Scrapy 1.0 onward the module is scrapy.pipelines.images
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy import log

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download after the item's 'model' field
    def file_path(self, request, response=None, info=None):
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    # Name the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    # Pass the item as request meta so file_path can read its fields
    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)

Your settings.py is misconfigured. Register the custom pipeline like this:

ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}

For thumbnails to work, add this to settings.py:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}

Solution 2:

Since the URL hash already gives each image a unique identifier, you could keep the default hashed filenames and write the item's value and the URL hash to a separate file as each item is processed.

Once the crawl has finished, loop over that file and rename the images, using a Counter dictionary to append a number to each name based on how many items have already used the same value.
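
A minimal sketch of that renaming pass follows. It is not part of the original answer and it makes several assumptions: the mapping file is tab-separated with one `value<TAB>hashed_filename` pair per line, all images sit in one directory, and the function names `plan_renames`, `read_mapping` and `rename_images` are hypothetical. It also does not guard against an item value that already ends in a numeric suffix such as `_1`.

```python
import os
from collections import Counter


def plan_renames(rows):
    """Pair each hashed filename with a unique name built from the item value.

    `rows` is an iterable of (item_value, hashed_filename) pairs. The first
    file seen for a value keeps the plain name; later ones get _1, _2, ...
    """
    seen = Counter()
    plan = []
    for value, hashed_name in rows:
        ext = os.path.splitext(hashed_name)[1]
        count = seen[value]
        suffix = '' if count == 0 else '_%d' % count
        plan.append((hashed_name, '%s%s%s' % (value, suffix, ext)))
        seen[value] += 1
    return plan


def read_mapping(mapping_path):
    """Read (item_value, hashed_filename) pairs from the assumed TSV file."""
    rows = []
    with open(mapping_path) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if line:
                value, hashed_name = line.split('\t')
                rows.append((value, hashed_name))
    return rows


def rename_images(mapping_path, image_dir):
    """Rename every file listed in the mapping file inside image_dir."""
    for old_name, new_name in plan_renames(read_mapping(mapping_path)):
        os.rename(os.path.join(image_dir, old_name),
                  os.path.join(image_dir, new_name))
```

Run after the crawl, a call such as `rename_images('mapping.tsv', 'images/full')` would rename the files in place (both paths here are placeholders). Keeping the planning step in its own function lets you inspect the proposed names before anything on disk is touched.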

Python Guru
