How to Start Web Scraping with Scrapy in Python

When web scraping projects grow beyond simple page fetches, you need a framework that handles concurrency, retries, rate limiting, and data pipelines automatically. Scrapy is Python's premier web scraping framework, designed for extracting data from websites efficiently at scale. Unlike simpler libraries, Scrapy provides an asynchronous architecture, built-in data export, and extensive middleware support, making it the professional choice for production-grade web crawlers.

Installation and Project Setup

Scrapy enforces a structured project layout that organizes spiders, settings, and data pipelines:

# Install Scrapy
pip install scrapy

# Create a new project
scrapy startproject bookstore_scraper

# Navigate into the project
cd bookstore_scraper

This generates a project structure:

bookstore_scraper/
├── scrapy.cfg
└── bookstore_scraper/
    ├── __init__.py
    ├── items.py        # Data structure definitions
    ├── middlewares.py  # Request/response processing
    ├── pipelines.py    # Data processing pipelines
    ├── settings.py     # Project configuration
    └── spiders/        # Your spider classes
        └── __init__.py

Creating Your First Spider

Spiders define how to crawl websites and extract data. Create a new file in the spiders/ directory:

# bookstore_scraper/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # Unique identifier for this spider

    # URLs to begin crawling
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        """Extract data from each page."""
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

The yield Keyword

Scrapy spiders use yield to return data items and new requests. This enables asynchronous processing: Scrapy handles multiple requests simultaneously while processing your extracted data.
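The scheduling benefit comes from Python generators: a plain-Python sketch (no Scrapy needed, names are illustrative) shows how yielded values are produced lazily, one at a time, rather than collected into a list up front. This is what lets Scrapy interleave item processing with new requests.

```python
def parse_pages(pages):
    """Mimic a spider callback: yield one item at a time, lazily."""
    for page in pages:
        for word in page.split():
            yield {"word": word}  # like yielding a scraped item

items = parse_pages(["hello world", "more data"])
next(items)  # {'word': 'hello'}; later items are not computed yet
```

Because nothing runs until a value is requested, Scrapy's engine can pause your callback, send more requests, and resume it when needed.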

Running the Spider

Execute your spider and export results in various formats:

# Run spider and output to JSON
scrapy crawl quotes -o quotes.json

# Output to CSV
scrapy crawl quotes -o quotes.csv

# Output to JSON Lines (better for large datasets)
scrapy crawl quotes -o quotes.jsonl

Using the Interactive Shell

Test selectors interactively before writing spider code:

# Launch shell with a URL
scrapy shell "https://quotes.toscrape.com"

In the shell, experiment with selectors:

# Test CSS selectors
response.css("div.quote span.text::text").get()
response.css("small.author::text").getall()

# Test XPath selectors
response.xpath("//div[@class='quote']//span[@class='text']/text()").get()

Selector Debugging

The Scrapy shell is invaluable for developing and testing selectors. Use .get() to retrieve the first match and .getall() for all matches.

Defining Data Structures

For cleaner code and validation, define item classes in items.py:

# bookstore_scraper/items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    scraped_date = scrapy.Field()

Then use them in your spider:

from ..items import QuoteItem
from datetime import datetime

# Inside the QuotesSpider class:
def parse(self, response):
    for quote in response.css("div.quote"):
        item = QuoteItem()
        item["text"] = quote.css("span.text::text").get()
        item["author"] = quote.css("small.author::text").get()
        item["tags"] = quote.css("div.tags a.tag::text").getall()
        item["scraped_date"] = datetime.now().isoformat()
        yield item
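Once items flow out of the spider, the pipelines.py file generated earlier is where post-processing lives. A pipeline is a plain class with a process_item method; as an illustrative sketch (the class name and cleanup rule are ours, not part of the generated template), here is one that strips the decorative curly quotes quotes.toscrape.com wraps around each quote:

```python
# bookstore_scraper/pipelines.py (illustrative sketch)
class CleanTextPipeline:
    """Strip the curly quotation marks around each scraped quote."""

    def process_item(self, item, spider):
        if item.get("text"):
            # Remove leading/trailing curly quotes and stray spaces
            item["text"] = item["text"].strip("\u201c\u201d ")
        return item
```

Enable it in settings.py with ITEM_PIPELINES = {"bookstore_scraper.pipelines.CleanTextPipeline": 300}; pipelines with lower numbers run first.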

Essential Settings

Configure your spider's behavior in settings.py:

# Respect robots.txt rules
ROBOTSTXT_OBEY = True

# Add delay between requests (seconds)
DOWNLOAD_DELAY = 1

# Limit concurrent requests
CONCURRENT_REQUESTS = 16

# Identify your bot
USER_AGENT = "MyBot/1.0 (+http://mywebsite.com/bot)"

# Enable auto-throttling
AUTOTHROTTLE_ENABLED = True

Scrapy vs. BeautifulSoup

Feature            | Scrapy                      | BeautifulSoup + Requests
-------------------|-----------------------------|-------------------------
Architecture       | Asynchronous                | Synchronous
Speed              | Very fast (concurrent)      | Sequential requests
Built-in features  | Retries, caching, exports   | Manual implementation
Learning curve     | Steeper                     | Simpler
Best for           | Large-scale projects        | Quick, simple tasks

Ethical Scraping Practices

Always check a website's robots.txt file before scraping. Keep ROBOTSTXT_OBEY = True in settings, add reasonable delays between requests, and identify your bot with a descriptive User-Agent string.
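Scrapy does this check for you when ROBOTSTXT_OBEY is on, but you can also inspect robots.txt rules yourself with the standard library. A stdlib sketch using urllib.robotparser; the rules below are a made-up example parsed inline (against a live site you would call rp.read() on the real robots.txt URL instead):

```python
from urllib.robotparser import RobotFileParser

# Example rules, inlined so the sketch runs offline.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("MyBot/1.0", "https://example.com/private/page")  # False
rp.can_fetch("MyBot/1.0", "https://example.com/public/page")   # True
rp.crawl_delay("MyBot/1.0")                                    # 1
```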

Quick Command Reference

Command                         | Purpose
--------------------------------|--------------------------
scrapy startproject name        | Create new project
scrapy genspider name domain    | Generate spider template
scrapy crawl spidername         | Run a spider
scrapy crawl name -o file.json  | Run and export data
scrapy shell "url"              | Interactive testing
scrapy list                     | List available spiders

By adopting Scrapy, you gain a robust framework for building maintainable, scalable web scrapers that handle real-world challenges like rate limiting, retries, and structured data export automatically.