How to Download Files via HTTP in Python
The requests library provides a simple and reliable interface for downloading files over HTTP. Whether you are fetching a small JSON response or downloading a multi-gigabyte dataset, choosing the right approach depends on file size, memory constraints, and how robust your error handling needs to be.
In this guide, you will learn how to download files using both simple and streaming methods, add progress bars for user-facing applications, implement retry logic and resumable downloads, and handle authenticated endpoints.
Installation
pip install requests
Simple Download for Small Files
For files that fit comfortably in memory, such as images, documents, or JSON responses, load the entire content at once with response.content:
import requests
url = "https://example.com/logo.png"
response = requests.get(url)
if response.status_code == 200:
    with open("logo.png", "wb") as f:
        f.write(response.content)
    print(f"Download complete ({len(response.content):,} bytes)")
else:
    print(f"Failed with status code: {response.status_code}")
Example output:
Download complete (24,576 bytes)
This approach is concise and works well for files under roughly 50 MB. The entire response body is stored in memory as a bytes object before being written to disk.
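A common companion task is deriving the local filename from the URL rather than hard-coding it. A minimal sketch using only the standard library (`filename_from_url` is a hypothetical helper; it assumes the URL path ends in a usable name):

```python
import os
from urllib.parse import urlparse

def filename_from_url(url, default="download.bin"):
    """Extract the last path segment of a URL as a filename."""
    path = urlparse(url).path          # drops the query string and fragment
    name = os.path.basename(path)      # last segment of the path
    return name or default             # fall back when the path ends in "/"

print(filename_from_url("https://example.com/images/logo.png?v=2"))  # logo.png
print(filename_from_url("https://example.com/"))                     # download.bin
```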
Streaming Download for Large Files
For large files, streaming prevents loading the entire content into RAM. The stream=True parameter tells requests to download the content in chunks rather than all at once:
import requests
url = "https://example.com/large_dataset.zip"
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("dataset.zip", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

print("Download complete")
The iter_content() method yields chunks of the specified size (8,192 bytes in this example), and each chunk is written to disk immediately. Only one chunk is held in memory at any given time.
Without stream=True, the entire file is loaded into memory before you can access it. Downloading a 4 GB file would consume over 4 GB of RAM, potentially crashing your application. Always use streaming for files of unknown or large size.
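Streaming also lets you abort early, for example to enforce a size cap, which is impossible once response.content has already buffered the whole body. A sketch of the idea (`save_chunks` is a hypothetical helper; it is written against a plain iterable of byte chunks so the same logic works with `iter_content()`):

```python
def save_chunks(chunks, filepath, max_bytes=None):
    """Write an iterable of byte chunks to disk, optionally enforcing a size cap."""
    written = 0
    with open(filepath, "wb") as f:
        for chunk in chunks:
            written += len(chunk)
            if max_bytes is not None and written > max_bytes:
                raise ValueError(f"Download exceeded {max_bytes} bytes, aborting")
            f.write(chunk)
    return written

# With a real response this would be:
#   with requests.get(url, stream=True) as r:
#       save_chunks(r.iter_content(chunk_size=8192), "dataset.zip", max_bytes=10**9)
print(save_chunks([b"abc", b"defg"], "demo.bin"))  # 7
```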
Adding a Progress Bar
For user-facing applications, displaying download progress provides a much better experience. The tqdm library creates a progress bar using the Content-Length header from the server:
pip install tqdm
import requests
from tqdm import tqdm
url = "https://example.com/video.mp4"
response = requests.get(url, stream=True)
response.raise_for_status()
total_size = int(response.headers.get("content-length", 0))
with open("video.mp4", "wb") as f:
    with tqdm(total=total_size, unit="B", unit_scale=True, desc="Downloading") as progress:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            progress.update(len(chunk))
Example output:
Downloading: 100%|██████████| 150M/150M [00:45<00:00, 3.33MB/s]
Not all servers include a Content-Length header, especially when using chunked transfer encoding. When the header is missing, total_size will be 0 and tqdm will display progress without a percentage or ETA. The download still works correctly.
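If you are rolling your own progress output instead of using tqdm, a running byte count is the usual fallback when the total is unknown. One way to render it (`format_bytes` is a hypothetical helper, not part of requests or tqdm):

```python
def format_bytes(n):
    """Render a byte count with a human-readable unit."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n} {unit}" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024

print(format_bytes(512))         # 512 B
print(format_bytes(1536))        # 1.5 KB
print(format_bytes(10_485_760))  # 10.0 MB
```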
Handling Errors and Retries
Production code should handle network failures gracefully. The requests library supports automatic retries by mounting an HTTPAdapter configured with urllib3's Retry class:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def download_with_retry(url, filepath, max_retries=3):
"""Download a file with automatic retry on failure."""
session = requests.Session()
retry_strategy = Retry(
total=max_retries,
backoff_factor=1, # Wait 1s, 2s, 4s between retries
status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
try:
with session.get(url, stream=True, timeout=30) as response:
response.raise_for_status()
with open(filepath, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
print(f"Downloaded: {filepath}")
return True
except requests.exceptions.RequestException as e:
print(f"Download failed: {e}")
return False
success = download_with_retry("https://example.com/data.zip", "data.zip")
With backoff_factor=1, urllib3 2.x waits 1 second before the first retry, 2 seconds before the second, and 4 seconds before the third (urllib3 1.x skipped the sleep before the first retry). The status_forcelist specifies which HTTP status codes should trigger a retry.
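The schedule follows urllib3's documented formula, backoff_factor * 2 ** (retry_number - 1), capped at a maximum (120 seconds by default). A small illustrative helper (`backoff_schedule` is hypothetical, written to match the urllib3 2.x behavior):

```python
def backoff_schedule(backoff_factor, retries, backoff_max=120):
    """Sleep times urllib3 2.x inserts before each of the first `retries` retries."""
    return [min(backoff_max, backoff_factor * 2 ** (n - 1)) for n in range(1, retries + 1)]

print(backoff_schedule(1, 3))    # [1, 2, 4]
print(backoff_schedule(0.5, 4))  # [0.5, 1.0, 2.0, 4.0]
```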
Resumable Downloads
Some servers support resuming interrupted downloads via the HTTP Range header. This is especially valuable for large files over unreliable connections:
import requests
import os
def resume_download(url, filepath):
"""Resume a download from where it was interrupted."""
headers = {}
initial_pos = 0
if os.path.exists(filepath):
initial_pos = os.path.getsize(filepath)
headers["Range"] = f"bytes={initial_pos}-"
print(f"Resuming from byte {initial_pos:,}")
response = requests.get(url, stream=True, headers=headers)
# 206 = Partial Content (resume supported)
# 200 = Full content (server does not support resume, start over)
if response.status_code == 200:
initial_pos = 0
mode = "ab" if initial_pos > 0 else "wb"
with open(filepath, mode) as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
print("Download complete")
resume_download("https://example.com/large_file.iso", "file.iso")
Example output (resumed):
Resuming from byte 52,428,800
Download complete
Before implementing resume logic, check whether the server supports range requests by looking for Accept-Ranges: bytes in the response headers. Not all servers support this feature.
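The check itself can be factored into a small predicate (`supports_resume` is a hypothetical helper; requests.head is a real API, and treat the answer as a hint, since some servers respond to HEAD differently than to GET):

```python
def supports_resume(headers):
    """Return True if response headers advertise byte-range support."""
    return headers.get("Accept-Ranges", "").lower() == "bytes"

# Against a live server (requests wraps headers in a CaseInsensitiveDict,
# so the lookup there is case-insensitive):
#   response = requests.head(url, allow_redirects=True, timeout=10)
#   if supports_resume(response.headers):
#       ...

print(supports_resume({"Accept-Ranges": "bytes"}))  # True
print(supports_resume({}))                          # False
```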
Downloading with Authentication
Many APIs and private servers require authentication before allowing downloads:
import requests
# Basic authentication (username and password)
response = requests.get(
"https://api.example.com/report.pdf",
auth=("username", "password"),
stream=True
)
# Bearer token authentication
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
response = requests.get(
"https://api.example.com/report.pdf",
headers=headers,
stream=True
)
# Download the file if authentication succeeded
if response.status_code == 200:
with open("report.pdf", "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
print("Download complete")
elif response.status_code == 401:
print("Authentication failed: check your credentials")
else:
print(f"Request failed with status: {response.status_code}")
Using urllib from the Standard Library
If you cannot install third-party packages, Python's built-in urllib module handles basic downloads without any external dependencies:
from urllib.request import urlretrieve
url = "https://example.com/data.csv"
filepath, headers = urlretrieve(url, "data.csv")
print(f"Downloaded to: {filepath}")
For more control over the download process:
from urllib.request import urlopen
url = "https://example.com/data.csv"
with urlopen(url) as response:
with open("data.csv", "wb") as f:
while True:
chunk = response.read(8192)
if not chunk:
break
f.write(chunk)
print("Download complete")
The requests library is preferred for most use cases because it provides a cleaner API, better error handling, and features like automatic retries and session management.
Method Comparison
| Method | Memory Usage | Best For |
|---|---|---|
| `response.content` | Entire file in RAM | Small files (under 50 MB) |
| `iter_content()` with `stream=True` | One chunk at a time | Large files, limited memory |
| With `tqdm` progress bar | One chunk + display overhead | User-facing downloads |
| With retry logic | One chunk at a time | Unreliable networks |
| `urllib.request` | Depends on usage | No external dependencies |
Conclusion
- Use `response.content` for small files where simplicity matters and memory is not a concern.
- For production scripts handling files of unknown or large size, always use `stream=True` with `iter_content()` to keep memory usage constant.
- Add retry logic with `HTTPAdapter` and `Retry` for reliability over unreliable networks, and include a progress bar with `tqdm` for better user experience in interactive applications.
- For interrupted downloads of very large files, implement resumable downloads using the `Range` header when the server supports it.