
How to Download Files with Scrapy in Python

Scrapy is a powerful, high-performance web crawling and scraping framework for Python. While it's commonly used for extracting structured data from websites, Scrapy also includes a robust file download pipeline that makes downloading files from the web efficient and scalable.

In this guide, you'll learn how to set up a Scrapy project, build a crawl spider that discovers download links, configure the file download pipeline, and customize file naming, all with step-by-step instructions.

Note: Always check a website's robots.txt file and terms of service before crawling. Not all websites permit automated access to their content.

Prerequisites

Install Scrapy using pip:

pip install scrapy

Step 1: Create a Scrapy Project

Create a directory for your project and initialize a new Scrapy project:

mkdir scrapy_downloads
cd scrapy_downloads

# Create a new Scrapy project called "file_downloader"
scrapy startproject file_downloader

This generates the following project structure:

file_downloader/
├── scrapy.cfg
└── file_downloader/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Navigate into the project directory:

cd file_downloader

Step 2: Define Item Fields

Edit file_downloader/items.py to define the data fields your spider will collect. The file_urls field is required by Scrapy's file pipeline:

import scrapy

class FileDownloaderItem(scrapy.Item):
    file_urls = scrapy.Field()           # URLs of files to download
    files = scrapy.Field()               # Metadata about downloaded files
    original_file_name = scrapy.Field()  # Original filename for reference

Required Fields

Scrapy's FilesPipeline expects a field called file_urls (a list of URLs). After downloading, it populates the files field with download results including file path, URL, and checksum.
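The shape of an item before and after the pipeline runs can be sketched with plain dicts (illustrative values, not real output):

```python
# What your spider yields: only file_urls is filled in.
item_before = {
    "file_urls": ["https://example.com/files/data.zip"],
}

# What the pipeline produces: a result dict per successfully downloaded URL.
item_after = {
    "file_urls": ["https://example.com/files/data.zip"],
    "files": [
        {
            "url": "https://example.com/files/data.zip",  # source URL
            "path": "full/0a79c461a4...e5d7.zip",         # path relative to FILES_STORE
            "checksum": "a1b2c3d4e5f6...",                # checksum of the file body
            "status": "downloaded",                        # or "uptodate" if already cached
        }
    ],
}

# Downstream code typically reads the stored path:
local_path = item_after["files"][0]["path"]
```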

Step 3: Create the Crawl Spider

Generate a crawl spider using Scrapy's template:

scrapy genspider -t crawl example_spider example.com

Replace example.com with the actual domain you want to crawl. Edit the generated spider file in file_downloader/spiders/example_spider.py:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    # Define rules for which links to follow
    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        # Extract download links from the page
        file_url = response.css('a.download-link::attr(href)').get()

        if file_url:
            # Convert relative URLs to absolute
            file_url = response.urljoin(file_url)

            # Filter by file extension
            file_extension = file_url.split('.')[-1].lower()
            if file_extension not in ('zip', 'exe', 'pdf', 'csv'):
                return

            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1]
            yield item

Understanding the Key Components

rules defines which links the spider should follow:

rules = (
    Rule(
        LinkExtractor(allow=r'/downloads/'),  # Only follow links matching this pattern
        callback='parse_item',                # Call this method for each matched page
        follow=True                           # Continue following links from matched pages
    ),
)

parse_item() processes each crawled page:

  1. Uses a CSS selector to find the download link element.
  2. Converts relative URLs to absolute with response.urljoin().
  3. Filters files by extension to download only desired types.
  4. Yields an item with the file URL for the pipeline to download.
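The URL-handling steps above can be sketched as a plain function using only the standard library (the allow-list and example URLs are illustrative):

```python
from urllib.parse import urljoin

ALLOWED_EXTENSIONS = ('zip', 'exe', 'pdf', 'csv')

def extract_file_url(page_url, href):
    """Resolve a link against its page URL and keep it only if the
    extension is on the allow-list. Returns None for filtered links."""
    if not href:
        return None
    absolute = urljoin(page_url, href)           # relative -> absolute
    extension = absolute.split('.')[-1].lower()  # filter by extension
    return absolute if extension in ALLOWED_EXTENSIONS else None

print(extract_file_url("https://example.com/downloads/", "files/report.pdf"))
# https://example.com/downloads/files/report.pdf
```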
Finding the Right CSS Selector

Use your browser's Inspect Element tool (Ctrl+Shift+C / Cmd+Shift+C) to examine download links on the target page. Look for the HTML element and its class or ID to build your CSS selector:

# Common patterns for download links
response.css('a.download-link::attr(href)').get()
response.css('a[download]::attr(href)').get()
response.css('.file-list a::attr(href)').getall()

Step 4: Configure Settings

Edit file_downloader/settings.py to enable the file pipeline and set the download location:

# Enable the file download pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# Set the destination folder for downloaded files
FILES_STORE = './downloads'

# Optional: Set a download delay to be polite
DOWNLOAD_DELAY = 1

# Optional: Respect robots.txt
ROBOTSTXT_OBEY = True

# Optional: Set a user agent
USER_AGENT = 'MyFileDownloader/1.0 (+https://example.com/bot)'

Step 5: Run the Spider

Execute the spider from the project directory (where scrapy.cfg is located):

scrapy crawl example_spider

Downloaded files will appear in the downloads/full/ directory.
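A quick way to check what was fetched is to list the store directory from Python (a small helper; the default path assumes FILES_STORE = './downloads' as configured above):

```python
from pathlib import Path

def summarize_downloads(store="downloads/full"):
    """Return (filename, size in bytes) for every file in the store directory."""
    return [(f.name, f.stat().st_size)
            for f in sorted(Path(store).iterdir()) if f.is_file()]
```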

Custom Pipeline: Preserving Original Filenames

By default, Scrapy saves downloaded files using their SHA1 hash as the filename (e.g., 0a79c461a4...e5d7.zip). To preserve the original human-readable filenames, create a custom pipeline.
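The default name can be approximated with the standard library: the pipeline hashes the request URL with SHA1 and keeps the original extension (a sketch of the naming scheme, not the pipeline's exact code):

```python
import hashlib
from os.path import splitext
from urllib.parse import urlparse

def default_file_path(url):
    """Approximate the default FilesPipeline naming: 'full/<sha1-of-url><ext>'."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    extension = splitext(urlparse(url).path)[1]  # e.g. '.zip'
    return f"full/{digest}{extension}"

print(default_file_path("https://example.com/files/data.zip"))
```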

Edit file_downloader/pipelines.py:

from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        """Override to use the original filename instead of SHA1 hash."""
        filename = request.url.split('/')[-1]
        # Remove query parameters if present
        filename = filename.split('?')[0]
        return filename
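Note that returning the bare filename means two different URLs ending in the same name will overwrite each other. One way to keep names readable while avoiding collisions is to append a short hash of the URL (a stdlib sketch of the naming logic, which you could drop into file_path the same way; the helper name is hypothetical):

```python
import hashlib
from os.path import basename, splitext
from urllib.parse import urlparse

def safe_filename(url):
    """Original name plus a short URL hash, e.g. 'tool_v2.1_<8 hex chars>.zip',
    so distinct URLs that share a filename don't overwrite each other."""
    stem, ext = splitext(basename(urlparse(url).path))
    short_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_{short_hash}{ext}"
```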

Update settings.py to use your custom pipeline instead of the default:

ITEM_PIPELINES = {
    'file_downloader.pipelines.CustomFilesPipeline': 1,
}

Now when you run the spider, files will be saved with their original names like tool_v2.1.zip instead of hash codes.

Downloading Images Instead of Files

Scrapy also includes an ImagesPipeline for downloading images with automatic thumbnail generation and image processing. It requires the Pillow library (pip install Pillow):

# In items.py
class ImageDownloaderItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

# In settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = './images'

# In your spider
def parse_item(self, response):
    image_urls = response.css('img::attr(src)').getall()
    image_urls = [response.urljoin(url) for url in image_urls]

    item = ImageDownloaderItem()
    item['image_urls'] = image_urls
    yield item

Adding Download Limits and Filters

Limiting File Size

Scrapy can cap the size of downloaded responses through its downloader settings; anything over the limit is aborted before the download completes:

# In settings.py

# Maximum response size in bytes (e.g., 100 MB); larger downloads are canceled
DOWNLOAD_MAXSIZE = 100 * 1024 * 1024

# Optional: log a warning for responses larger than this
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024

Allowing Redirects for Media Downloads

By default, the files and images pipelines do not follow HTTP redirects, so download URLs that redirect to the actual file will fail. Enable redirects for media downloads in settings:

# In settings.py

# Allow the media pipelines to follow redirects (disabled by default)
MEDIA_ALLOW_REDIRECTS = True

Filtering in the Spider

def parse_item(self, response):
    file_urls = response.css('a.download::attr(href)').getall()
    file_urls = [response.urljoin(url) for url in file_urls]

    # Filter by extension
    allowed_extensions = {'.zip', '.pdf', '.csv', '.xlsx'}
    file_urls = [
        url for url in file_urls
        if any(url.lower().endswith(ext) for ext in allowed_extensions)
    ]

    if file_urls:
        item = FileDownloaderItem()
        item['file_urls'] = file_urls
        yield item

Exporting Download Results to JSON

To save a log of all downloaded files:

scrapy crawl example_spider -o download_log.json

This creates a JSON file with download metadata:

[
  {
    "file_urls": ["https://example.com/files/data.zip"],
    "original_file_name": "data.zip",
    "files": [
      {
        "url": "https://example.com/files/data.zip",
        "path": "data.zip",
        "checksum": "a1b2c3d4e5f6...",
        "status": "downloaded"
      }
    ]
  }
]
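Because the export is ordinary JSON, you can post-process it with the standard library, e.g. to collect every stored path (assuming the structure shown above; the helper name is illustrative):

```python
import json

def downloaded_paths(log_text):
    """Collect the stored path of every file recorded in the export."""
    records = json.loads(log_text)
    return [f["path"] for record in records for f in record.get("files", [])]

sample = ('[{"file_urls": ["https://example.com/files/data.zip"], '
          '"files": [{"url": "https://example.com/files/data.zip", '
          '"path": "data.zip", "checksum": "abc", "status": "downloaded"}]}]')
print(downloaded_paths(sample))  # ['data.zip']
```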

Complete Spider Example

Here's the full spider file for reference:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        """Extract and yield download links from crawled pages."""
        download_links = response.css('a.download-link::attr(href)').getall()

        for link in download_links:
            file_url = response.urljoin(link)

            # Filter by file type
            extension = file_url.split('.')[-1].split('?')[0].lower()
            if extension not in ('zip', 'exe', 'pdf', 'csv', 'msi'):
                continue

            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1].split('?')[0]
            yield item

Summary of Key Files

| File | Purpose |
| --- | --- |
| items.py | Define data fields (file_urls, files) |
| spiders/example_spider.py | Spider logic: which pages to crawl, what links to extract |
| pipelines.py | Custom download behavior (e.g., filename preservation) |
| settings.py | Enable pipeline, set download path, configure limits |

Conclusion

Scrapy makes file downloading efficient and scalable through its built-in FilesPipeline:

  1. Create a project and define item fields with file_urls and files.
  2. Build a crawl spider with rules to follow relevant links and a parse_item method to extract download URLs.
  3. Enable the pipeline in settings and specify a download directory.
  4. Customize the pipeline to preserve original filenames or add filtering logic.

Scrapy handles concurrent downloads, retries, deduplication, and checksum verification automatically, making it far more robust than manual download scripts for large-scale file retrieval tasks.