How to Download Files with Scrapy in Python
Scrapy is a powerful, high-performance web crawling and scraping framework for Python. While it's commonly used for extracting structured data from websites, Scrapy also includes a robust file download pipeline that makes downloading files from the web efficient and scalable.
In this guide, you'll learn how to set up a Scrapy project, build a crawl spider that discovers download links, configure the file download pipeline, and customize file naming, all with step-by-step instructions.
Always check a website's robots.txt file and terms of service before crawling. Not all websites permit automated access to their content.
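Before pointing a spider at a site, you can check its robots.txt rules programmatically with Python's standard library. This is a minimal sketch using urllib.robotparser with an illustrative robots.txt body (in practice you would call rp.set_url(...) and rp.read() against the live site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content -- replace with the target site's actual rules
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths not matched by a Disallow rule are allowed
print(rp.can_fetch("MyFileDownloader", "https://example.com/downloads/data.zip"))  # True
print(rp.can_fetch("MyFileDownloader", "https://example.com/private/secret.pdf"))  # False
```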
Prerequisites
Install Scrapy using pip:
pip install scrapy
Step 1: Create a Scrapy Project
Create a directory for your project and initialize a new Scrapy project:
mkdir scrapy_downloads
cd scrapy_downloads
# Create a new Scrapy project called "file_downloader"
scrapy startproject file_downloader
This generates the following project structure:
file_downloader/
├── scrapy.cfg
└── file_downloader/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Navigate into the project directory:
cd file_downloader
Step 2: Define Item Fields
Edit file_downloader/items.py to define the data fields your spider will collect. The file_urls field is required by Scrapy's file pipeline:
import scrapy

class FileDownloaderItem(scrapy.Item):
    file_urls = scrapy.Field()           # URLs of files to download
    files = scrapy.Field()               # Metadata about downloaded files
    original_file_name = scrapy.Field()  # Original filename for reference
Scrapy's FilesPipeline expects a field called file_urls (a list of URLs). After downloading, it populates the files field with download results including file path, URL, and checksum.
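If you prefer different field names, Scrapy lets you remap them via the FILES_URLS_FIELD and FILES_RESULT_FIELD settings (the values shown here are the defaults):

```python
# settings.py -- optional: rename the fields the pipeline reads and writes
FILES_URLS_FIELD = 'file_urls'   # item field containing the list of URLs to fetch
FILES_RESULT_FIELD = 'files'     # item field the pipeline fills with download results
```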
Step 3: Create the Crawl Spider
Generate a crawl spider using Scrapy's template:
scrapy genspider -t crawl example_spider example.com
Replace example.com with the actual domain you want to crawl. Edit the generated spider file in file_downloader/spiders/example_spider.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    # Define rules for which links to follow
    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        # Extract download links from the page
        file_url = response.css('a.download-link::attr(href)').get()
        if file_url:
            # Convert relative URLs to absolute
            file_url = response.urljoin(file_url)
            # Filter by file extension
            file_extension = file_url.split('.')[-1].lower()
            if file_extension not in ('zip', 'exe', 'pdf', 'csv'):
                return
            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1]
            yield item
Understanding the Key Components
rules defines which links the spider should follow:
rules = (
    Rule(
        LinkExtractor(allow=r'/downloads/'),  # Only follow links matching this pattern
        callback='parse_item',                # Call this method for each matched page
        follow=True                           # Continue following links from matched pages
    ),
)
parse_item() processes each crawled page:
- Uses a CSS selector to find the download link element.
- Converts relative URLs to absolute with response.urljoin().
- Filters files by extension to download only desired types.
- Yields an item with the file URL for the pipeline to download.
Use your browser's Inspect Element tool (Ctrl+Shift+C / Cmd+Shift+C) to examine download links on the target page. Look for the HTML element and its class or ID to build your CSS selector:
# Common patterns for download links
response.css('a.download-link::attr(href)').get()
response.css('a[download]::attr(href)').get()
response.css('.file-list a::attr(href)').getall()
Step 4: Configure Settings
Edit file_downloader/settings.py to enable the file pipeline and set the download location:
# Enable the file download pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
# Set the destination folder for downloaded files
FILES_STORE = './downloads'
# Optional: Set a download delay to be polite
DOWNLOAD_DELAY = 1
# Optional: Respect robots.txt
ROBOTSTXT_OBEY = True
# Optional: Set a user agent
USER_AGENT = 'MyFileDownloader/1.0 (+https://example.com/bot)'
Step 5: Run the Spider
Execute the spider from the project directory (where scrapy.cfg is located):
scrapy crawl example_spider
Downloaded files will appear in the downloads/full/ directory.
Custom Pipeline: Preserving Original Filenames
By default, Scrapy saves downloaded files using their SHA1 hash as the filename (e.g., 0a79c461a4...e5d7.zip). To preserve the original human-readable filenames, create a custom pipeline.
Edit file_downloader/pipelines.py:
from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        """Override to use the original filename instead of the SHA1 hash."""
        filename = request.url.split('/')[-1]
        # Remove query parameters if present
        filename = filename.split('?')[0]
        return filename
Update settings.py to use your custom pipeline instead of the default:
ITEM_PIPELINES = {
    'file_downloader.pipelines.CustomFilesPipeline': 1,
}
Now when you run the spider, files will be saved with their original names like tool_v2.1.zip instead of hash codes.
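One caveat with this approach: two different URLs that end in the same basename will map to the same path and overwrite each other, and URLs may carry percent-encoding or query strings. A small helper (a sketch, not part of Scrapy's API) can derive a safer name to return from file_path():

```python
import hashlib
from urllib.parse import urlsplit, unquote

def filename_from_url(url: str) -> str:
    """Derive a readable, collision-resistant filename from a URL."""
    path = urlsplit(url).path                 # drops any query string and fragment
    name = unquote(path.rsplit('/', 1)[-1])   # last path segment, percent-decoded
    if not name:                              # URL ended in '/': fall back to a hash
        name = hashlib.sha1(url.encode()).hexdigest()[:12]
    # Prefix a short hash of the full URL so identical basenames don't collide
    prefix = hashlib.sha1(url.encode()).hexdigest()[:8]
    return f"{prefix}_{name}"
```

Inside the pipeline, the override then becomes `return filename_from_url(request.url)`.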
Downloading Images Instead of Files
Scrapy also includes an ImagesPipeline for downloading images, with automatic thumbnail generation and image processing. Note that it requires the Pillow library (pip install Pillow):
# In items.py
class ImageDownloaderItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

# In settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'

# In your spider
def parse_item(self, response):
    image_urls = response.css('img::attr(src)').getall()
    image_urls = [response.urljoin(url) for url in image_urls]
    item = ImageDownloaderItem()
    item['image_urls'] = image_urls
    yield item
Adding Download Limits and Filters
Limiting File Size
Use Scrapy's DOWNLOAD_MAXSIZE setting to abort any download that exceeds a size limit; DOWNLOAD_WARNSIZE logs a warning instead of aborting:
# In settings.py
# Abort any download larger than 100 MB
DOWNLOAD_MAXSIZE = 100 * 1024 * 1024
# Log a warning for downloads larger than 32 MB
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024
Allowing Redirects for File URLs
By default, the media pipelines treat a redirected file URL as a failed download. If your download links redirect to the actual files, enable MEDIA_ALLOW_REDIRECTS:
# In settings.py
MEDIA_ALLOW_REDIRECTS = True
Filtering in the Spider
def parse_item(self, response):
    file_urls = response.css('a.download::attr(href)').getall()
    file_urls = [response.urljoin(url) for url in file_urls]
    # Filter by extension
    allowed_extensions = {'.zip', '.pdf', '.csv', '.xlsx'}
    file_urls = [
        url for url in file_urls
        if any(url.lower().endswith(ext) for ext in allowed_extensions)
    ]
    if file_urls:
        item = FileDownloaderItem()
        item['file_urls'] = file_urls
        yield item
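One caveat with the endswith check above: it misses URLs that carry query strings (e.g. data.zip?token=abc). A small helper based on urllib.parse (an illustrative function, not part of Scrapy) tests the URL path instead:

```python
from urllib.parse import urlsplit

ALLOWED_EXTENSIONS = {'.zip', '.pdf', '.csv', '.xlsx'}

def has_allowed_extension(url: str) -> bool:
    """Check the URL *path*, so a query string can't hide the extension."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS)
```

In the spider, the filter then becomes `file_urls = [url for url in file_urls if has_allowed_extension(url)]`.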
Exporting Download Results to JSON
To save a log of all downloaded files:
scrapy crawl example_spider -o download_log.json
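To make the export part of the project configuration rather than a command-line flag, you can use the FEEDS setting (available since Scrapy 2.1; the overwrite key requires a newer release):

```python
# settings.py -- write item metadata to a JSON file on every run
FEEDS = {
    'download_log.json': {
        'format': 'json',
        'overwrite': True,
    },
}
```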
This creates a JSON file with download metadata:
[
    {
        "file_urls": ["https://example.com/files/data.zip"],
        "original_file_name": "data.zip",
        "files": [
            {
                "url": "https://example.com/files/data.zip",
                "path": "data.zip",
                "checksum": "a1b2c3d4e5f6...",
                "status": "downloaded"
            }
        ]
    }
]
Complete Spider Example
Here's the full spider file for reference:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        """Extract and yield download links from crawled pages."""
        download_links = response.css('a.download-link::attr(href)').getall()
        for link in download_links:
            file_url = response.urljoin(link)
            # Filter by file type (strip any query string before checking)
            extension = file_url.split('.')[-1].split('?')[0].lower()
            if extension not in ('zip', 'exe', 'pdf', 'csv', 'msi'):
                continue
            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1].split('?')[0]
            yield item
Summary of Key Files
| File | Purpose |
|---|---|
| items.py | Define data fields (file_urls, files) |
| spiders/example_spider.py | Spider logic: which pages to crawl, what links to extract |
| pipelines.py | Custom download behavior (e.g., filename preservation) |
| settings.py | Enable pipeline, set download path, configure limits |
Conclusion
Scrapy makes file downloading efficient and scalable through its built-in FilesPipeline:
- Create a project and define item fields with file_urls and files.
- Build a crawl spider with rules to follow relevant links and a parse_item method to extract download URLs.
- Enable the pipeline in settings and specify a download directory.
- Customize the pipeline to preserve original filenames or add filtering logic.
Scrapy handles concurrent downloads, retries, deduplication, and checksum verification automatically, making it far more robust than manual download scripts for large-scale file retrieval tasks.