How to Download Files with Scrapy in Python
Scrapy is a powerful, high-performance web crawling and scraping framework for Python. While it's commonly used for extracting structured data from websites, Scrapy also includes a robust file download pipeline that makes downloading files from the web efficient and scalable.
In this guide, you'll learn how to set up a Scrapy project, build a crawl spider that discovers download links, configure the file download pipeline, and customize file naming, all with step-by-step instructions.
Always check a website's robots.txt file and terms of service before crawling. Not all websites permit automated access to their content.
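Before pointing a spider at a site, you can check its robots.txt rules programmatically with Python's standard library. This is a minimal sketch using urllib.robotparser with an illustrative robots.txt body (in practice you would call rp.set_url(...) and rp.read() against the live site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content -- replace with the target site's actual rules
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths not matched by a Disallow rule are allowed
print(rp.can_fetch("MyFileDownloader", "https://example.com/downloads/data.zip"))  # True
print(rp.can_fetch("MyFileDownloader", "https://example.com/private/secret.pdf"))  # False
```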
Prerequisites
Install Scrapy using pip:
pip install scrapy
Step 1: Create a Scrapy Project
Create a directory for your project and initialize a new Scrapy project:
mkdir scrapy_downloads
cd scrapy_downloads
# Create a new Scrapy project called "file_downloader"
scrapy startproject file_downloader
This generates the following project structure:
file_downloader/
├── scrapy.cfg
└── file_downloader/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Navigate into the project directory:
cd file_downloader
Step 2: Define Item Fields
Edit file_downloader/items.py to define the data fields your spider will collect. The file_urls field is required by Scrapy's file pipeline:
import scrapy

class FileDownloaderItem(scrapy.Item):
    file_urls = scrapy.Field()           # URLs of files to download
    files = scrapy.Field()               # Metadata about downloaded files
    original_file_name = scrapy.Field()  # Original filename for reference
Scrapy's FilesPipeline expects a field called file_urls (a list of URLs). After downloading, it populates the files field with download results including file path, URL, and checksum.
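If you prefer different field names, Scrapy lets you remap them via the FILES_URLS_FIELD and FILES_RESULT_FIELD settings (the values shown here are the defaults):

```python
# settings.py -- optional: rename the fields the pipeline reads and writes
FILES_URLS_FIELD = 'file_urls'   # item field containing the list of URLs to fetch
FILES_RESULT_FIELD = 'files'     # item field the pipeline fills with download results
```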
Step 3: Create the Crawl Spider
Generate a crawl spider using Scrapy's template:
scrapy genspider -t crawl example_spider example.com
Replace example.com with the actual domain you want to crawl. Edit the generated spider file in file_downloader/spiders/example_spider.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    # Define rules for which links to follow
    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        # Extract download links from the page
        file_url = response.css('a.download-link::attr(href)').get()
        if file_url:
            # Convert relative URLs to absolute
            file_url = response.urljoin(file_url)
            # Filter by file extension
            file_extension = file_url.split('.')[-1].lower()
            if file_extension not in ('zip', 'exe', 'pdf', 'csv'):
                return
            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1]
            yield item
Understanding the Key Components
rules defines which links the spider should follow:
rules = (
    Rule(
        LinkExtractor(allow=r'/downloads/'),  # Only follow links matching this pattern
        callback='parse_item',                # Call this method for each matched page
        follow=True                           # Continue following links from matched pages
    ),
)
parse_item() processes each crawled page:
- Uses a CSS selector to find the download link element.
- Converts relative URLs to absolute with response.urljoin().
- Filters files by extension to download only desired types.
- Yields an item with the file URL for the pipeline to download.
Use your browser's Inspect Element tool (Ctrl+Shift+C / Cmd+Shift+C) to examine download links on the target page. Look for the HTML element and its class or ID to build your CSS selector:
# Common patterns for download links
response.css('a.download-link::attr(href)').get()
response.css('a[download]::attr(href)').get()
response.css('.file-list a::attr(href)').getall()
Step 4: Configure Settings
Edit file_downloader/settings.py to enable the file pipeline and set the download location:
# Enable the file download pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
# Set the destination folder for downloaded files
FILES_STORE = './downloads'
# Optional: Set a download delay to be polite
DOWNLOAD_DELAY = 1
# Optional: Respect robots.txt
ROBOTSTXT_OBEY = True
# Optional: Set a user agent
USER_AGENT = 'MyFileDownloader/1.0 (+https://example.com/bot)'
Step 5: Run the Spider
Execute the spider from the project directory (where scrapy.cfg is located):
scrapy crawl example_spider
Downloaded files will appear in the downloads/full/ directory.
Custom Pipeline: Preserving Original Filenames
By default, Scrapy saves downloaded files using their SHA1 hash as the filename (e.g., 0a79c461a4...e5d7.zip). To preserve the original human-readable filenames, create a custom pipeline.
Edit file_downloader/pipelines.py:
from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        """Override to use the original filename instead of the SHA1 hash."""
        filename = request.url.split('/')[-1]
        # Remove query parameters if present
        filename = filename.split('?')[0]
        return filename
Update settings.py to use your custom pipeline instead of the default:
ITEM_PIPELINES = {
    'file_downloader.pipelines.CustomFilesPipeline': 1,
}
Now when you run the spider, files will be saved with their original names like tool_v2.1.zip instead of hash codes.
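One caveat with this approach: two different URLs that end in the same basename will map to the same path and overwrite each other, and URLs may carry percent-encoding or query strings. A small helper (a sketch, not part of Scrapy's API) can derive a safer name to return from file_path():

```python
import hashlib
from urllib.parse import urlsplit, unquote

def filename_from_url(url: str) -> str:
    """Derive a readable, collision-resistant filename from a URL."""
    path = urlsplit(url).path                 # drops any query string and fragment
    name = unquote(path.rsplit('/', 1)[-1])   # last path segment, percent-decoded
    if not name:                              # URL ended in '/': fall back to a hash
        name = hashlib.sha1(url.encode()).hexdigest()[:12]
    # Prefix a short hash of the full URL so identical basenames don't collide
    prefix = hashlib.sha1(url.encode()).hexdigest()[:8]
    return f"{prefix}_{name}"
```

Inside the pipeline, the override then becomes `return filename_from_url(request.url)`.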
Downloading Images Instead of Files
Scrapy also includes an ImagesPipeline for downloading images, with automatic thumbnail generation and image processing. Note that it requires the Pillow library (pip install Pillow):
# In items.py
class ImageDownloaderItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

# In settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'

# In your spider
def parse_item(self, response):
    image_urls = response.css('img::attr(src)').getall()
    image_urls = [response.urljoin(url) for url in image_urls]
    item = ImageDownloaderItem()
    item['image_urls'] = image_urls
    yield item
Adding Download Limits and Filters
Limiting File Size
Use Scrapy's DOWNLOAD_MAXSIZE setting to abort any download that exceeds a size limit; DOWNLOAD_WARNSIZE logs a warning instead of aborting:
# In settings.py
# Abort any download larger than 100 MB
DOWNLOAD_MAXSIZE = 100 * 1024 * 1024
# Log a warning for downloads larger than 32 MB
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024
Allowing Redirects for File URLs
By default, the media pipelines treat a redirected file URL as a failed download. If your download links redirect to the actual files, enable MEDIA_ALLOW_REDIRECTS:
# In settings.py
MEDIA_ALLOW_REDIRECTS = True
Filtering in the Spider
def parse_item(self, response):
    file_urls = response.css('a.download::attr(href)').getall()
    file_urls = [response.urljoin(url) for url in file_urls]
    # Filter by extension
    allowed_extensions = {'.zip', '.pdf', '.csv', '.xlsx'}
    file_urls = [
        url for url in file_urls
        if any(url.lower().endswith(ext) for ext in allowed_extensions)
    ]
    if file_urls:
        item = FileDownloaderItem()
        item['file_urls'] = file_urls
        yield item
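One caveat with the endswith check above: it misses URLs that carry query strings (e.g. data.zip?token=abc). A small helper based on urllib.parse (an illustrative function, not part of Scrapy) tests the URL path instead:

```python
from urllib.parse import urlsplit

ALLOWED_EXTENSIONS = {'.zip', '.pdf', '.csv', '.xlsx'}

def has_allowed_extension(url: str) -> bool:
    """Check the URL *path*, so a query string can't hide the extension."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS)
```

In the spider, the filter then becomes `file_urls = [url for url in file_urls if has_allowed_extension(url)]`.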
Exporting Download Results to JSON
To save a log of all downloaded files:
scrapy crawl example_spider -o download_log.json
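To make the export part of the project configuration rather than a command-line flag, you can use the FEEDS setting (available since Scrapy 2.1; the overwrite key requires a newer release):

```python
# settings.py -- write item metadata to a JSON file on every run
FEEDS = {
    'download_log.json': {
        'format': 'json',
        'overwrite': True,
    },
}
```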
This creates a JSON file with download metadata:
[
    {
        "file_urls": ["https://example.com/files/data.zip"],
        "original_file_name": "data.zip",
        "files": [
            {
                "url": "https://example.com/files/data.zip",
                "path": "data.zip",
                "checksum": "a1b2c3d4e5f6...",
                "status": "downloaded"
            }
        ]
    }
]
Complete Spider Example
Here's the full spider file for reference:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from file_downloader.items import FileDownloaderItem


class ExampleSpider(CrawlSpider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/downloads/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/downloads/'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        """Extract and yield download links from crawled pages."""
        download_links = response.css('a.download-link::attr(href)').getall()
        for link in download_links:
            file_url = response.urljoin(link)
            # Filter by file type (strip any query string before checking)
            extension = file_url.split('.')[-1].split('?')[0].lower()
            if extension not in ('zip', 'exe', 'pdf', 'csv', 'msi'):
                continue
            item = FileDownloaderItem()
            item['file_urls'] = [file_url]
            item['original_file_name'] = file_url.split('/')[-1].split('?')[0]
            yield item
Summary of Key Files
| File | Purpose |
|---|---|
| items.py | Define data fields (file_urls, files) |
| spiders/example_spider.py | Spider logic: which pages to crawl, what links to extract |
| pipelines.py | Custom download behavior (e.g., filename preservation) |
| settings.py | Enable pipeline, set download path, configure limits |
Conclusion
Scrapy makes file downloading efficient and scalable through its built-in FilesPipeline:
- Create a project and define item fields with file_urls and files.
- Build a crawl spider with rules to follow relevant links and a parse_item method to extract download URLs.
- Enable the pipeline in settings and specify a download directory.
- Customize the pipeline to preserve original filenames or add filtering logic.
Scrapy handles concurrent downloads, retries, deduplication, and checksum verification automatically, making it far more robust than manual download scripts for large-scale file retrieval tasks.