How to Uncompress a .tar.gz File Using Python
.tar.gz files (also called tarballs) combine TAR packaging with gzip compression and are widely used on Unix/Linux systems for distributing software, backups, and datasets. Python's built-in tarfile module handles these archives without external tools, so you can extract, inspect, and create .tar.gz files directly in your code. This guide covers the common operations, from full extraction to selective file retrieval.
Understanding .tar.gz Files
A .tar.gz file is a two-layer archive:
- TAR (Tape Archive): Bundles multiple files and directories into a single file while preserving file structure, permissions, and metadata.
- gzip: Compresses the TAR file to reduce its size.
Python's tarfile module handles both layers transparently: you don't need to decompress and unpack separately.
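To make the two layers concrete, the sketch below undoes them by hand (gzip first, then TAR) and compares the result with letting tarfile do both at once. The helper names and the in-memory sample archive are hypothetical and exist only for illustration.

```python
import gzip
import io
import tarfile


def build_sample_archive() -> bytes:
    """Build a small .tar.gz in memory (hypothetical sample data)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        payload = b'hello from inside the tarball\n'
        info = tarfile.TarInfo(name='data/sample.txt')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def names_two_step(archive_bytes: bytes) -> list:
    """Undo the gzip layer first, then parse the plain TAR stream."""
    raw_tar = gzip.decompress(archive_bytes)  # layer 1: gzip
    # layer 2: 'r:' reads an uncompressed TAR stream only
    with tarfile.open(fileobj=io.BytesIO(raw_tar), mode='r:') as tar:
        return tar.getnames()


def names_one_step(archive_bytes: bytes) -> list:
    """Let tarfile handle both layers in a single call."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz') as tar:
        return tar.getnames()


archive = build_sample_archive()
print(names_two_step(archive))
print(names_one_step(archive))
```

Both functions should report the same member names; the one-step form is the one used throughout the rest of this guide.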
Extracting All Contents
The most common operation is extracting everything from the archive into a directory:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall(path='./extracted_files')

print("Extraction complete.")
Output:
Extraction complete.
All files and directories from the archive are placed inside the extracted_files folder, which is created automatically if it doesn't exist.
Always use the with statement when working with tarfile. It automatically closes the archive when the block exits, even if an error occurs. This is cleaner than manually calling tar.close().
Understanding the Mode String
The second argument to tarfile.open() specifies how the file is opened:
| Mode | Description |
|---|---|
| 'r:gz' | Read a gzip-compressed tar archive |
| 'r:bz2' | Read a bzip2-compressed tar archive |
| 'r:xz' | Read an xz-compressed tar archive |
| 'r:' | Read an uncompressed tar archive only |
| 'r' or 'r:*' | Auto-detect the compression method (the default) |
If you're unsure of the compression type, use 'r:*' or 'r', or omit the mode entirely, and Python detects it automatically:
import tarfile

# Auto-detect compression
with tarfile.open('archive.tar.gz') as tar:
    tar.extractall(path='./output')
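As a rough check of the auto-detection behaviour, the following sketch writes the same tiny archive with several compression modes and reopens each one without specifying a mode. It also shows that 'r:' accepts only uncompressed archives. The helper names are hypothetical, and the example assumes your Python build includes the bz2 and lzma modules.

```python
import io
import tarfile


def make_archive(mode: str) -> bytes:
    """Create a tiny in-memory archive; mode is e.g. 'w:gz', 'w:bz2', 'w:xz', 'w'."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        payload = b'sample'
        info = tarfile.TarInfo(name='sample.txt')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_names(archive_bytes: bytes) -> list:
    """Open with the default mode, which detects the compression itself."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        return tar.getnames()


def opens_as_plain_tar(archive_bytes: bytes) -> bool:
    """Return True if the data can be read with 'r:' (no decompression)."""
    try:
        with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:'):
            return True
    except tarfile.ReadError:
        return False


for write_mode in ('w:gz', 'w:bz2', 'w:xz', 'w'):
    data = make_archive(write_mode)
    print(write_mode, list_names(data), opens_as_plain_tar(data))
```

Only the archive written with plain 'w' should open under 'r:'; the compressed ones raise tarfile.ReadError there but open fine with the default mode.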
Listing Files Before Extracting
To inspect the contents of an archive without extracting, use getnames() or getmembers():
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Get a list of file names
    file_names = tar.getnames()
    print(f"Archive contains {len(file_names)} items:")
    for name in file_names:
        print(f" {name}")
Output:
Archive contains 3 items:
 data
 data/sample.txt
 data/report.csv
Getting Detailed File Information
Use getmembers() to access metadata like file size, modification time, and type:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    for member in tar.getmembers():
        file_type = "DIR" if member.isdir() else "FILE"
        print(f" [{file_type}] {member.name} ({member.size} bytes)")
Output:
 [DIR] data (0 bytes)
 [FILE] data/sample.txt (1024 bytes)
 [FILE] data/report.csv (4096 bytes)
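The example above prints only name and size. A slightly fuller sketch, shown here under the assumption that you also want the modification time and a finer type breakdown, reads the mtime attribute and the isfile(), issym(), and islnk() checks of each TarInfo. The describe() helper and the in-memory archive are hypothetical.

```python
import io
import tarfile
from datetime import datetime, timezone


def describe(tar):
    """Return (kind, name, size, modified-UTC) tuples for every member."""
    rows = []
    for member in tar.getmembers():
        if member.isdir():
            kind = 'DIR'
        elif member.isfile():
            kind = 'FILE'
        elif member.issym() or member.islnk():
            kind = 'LINK'
        else:
            kind = 'OTHER'
        # mtime is stored as seconds since the Unix epoch
        modified = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
        rows.append((kind, member.name, member.size,
                     modified.strftime('%Y-%m-%d %H:%M')))
    return rows


# Hypothetical sample archive: one directory and one 1024-byte file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    folder = tarfile.TarInfo(name='data')
    folder.type = tarfile.DIRTYPE
    folder.mtime = 1704067200  # 2024-01-01 00:00 UTC
    tar.addfile(folder)

    payload = b'x' * 1024
    sample = tarfile.TarInfo(name='data/sample.txt')
    sample.size = len(payload)
    sample.mtime = 1704067200
    tar.addfile(sample, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode='r:gz') as tar:
    rows = describe(tar)

for row in rows:
    print(row)
```

Formatting the timestamp in UTC keeps the output independent of the machine's local time zone.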
Extracting a Single File
To extract only a specific file from the archive, use the extract() method:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extract('data/sample.txt', path='./output')

print("Extracted sample.txt successfully.")
Output:
Extracted sample.txt successfully.
Only sample.txt is extracted, to ./output/data/sample.txt; the other files in the archive are left untouched.
Extracting Multiple Selected Files
To extract a subset of files, pass a list of members to extractall() through its members argument:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Extract only .csv files
    csv_members = [m for m in tar.getmembers() if m.name.endswith('.csv')]
    tar.extractall(path='./csv_output', members=csv_members)

print(f"Extracted {len(csv_members)} CSV file(s).")
Output:
Extracted 1 CSV file(s).
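The same idea generalizes to any selection rule. The sketch below, which is one possible approach rather than a prescribed one, selects regular files by a shell-style pattern using fnmatch and extracts them into a temporary directory. The select_members() helper and the sample member names are hypothetical; the filter argument is passed only where the running Python supports it.

```python
import fnmatch
import io
import os
import tarfile
import tempfile


def select_members(tar, pattern):
    """Return regular-file members whose names match a shell-style pattern."""
    return [
        m for m in tar.getmembers()
        if m.isfile() and fnmatch.fnmatchcase(m.name, pattern)
    ]


# Build a small hypothetical archive in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    for name in ('data/sample.txt', 'data/report.csv', 'data/2024/sales.csv'):
        payload = name.encode('utf-8')
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Pass filter='data' only where this Python supports extraction filters.
extract_kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}

buffer.seek(0)
with tempfile.TemporaryDirectory() as dest:
    with tarfile.open(fileobj=buffer, mode='r:gz') as tar:
        chosen = select_members(tar, '*.csv')
        tar.extractall(path=dest, members=chosen, **extract_kwargs)
    # Collect what actually landed on disk, relative to the destination.
    extracted = sorted(
        os.path.relpath(os.path.join(root, filename), dest).replace(os.sep, '/')
        for root, _dirs, files in os.walk(dest)
        for filename in files
    )

print(extracted)
```

Note that '*' in an fnmatch pattern also matches '/', so '*.csv' picks up CSV files at any depth in the archive.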
Reading a File Directly Without Extracting to Disk
You can read file contents directly from the archive without writing them to disk:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Open a specific file within the archive
    member = tar.getmember('data/sample.txt')
    file_obj = tar.extractfile(member)
    if file_obj is not None:
        content = file_obj.read().decode('utf-8')
        print(f"Content of sample.txt ({len(content)} characters):")
        print(content[:200])  # Print first 200 characters
Output:
Content of sample.txt (1024 characters):
This is the content of sample.txt...
This is useful when you only need to read data from the archive without creating files on disk.
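extractfile() returns None for members that are neither regular files nor links, such as directories and device files. Symbolic and hard links are resolved to their target inside the archive. Always check for None before calling .read().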
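For large members, calling .read() loads the whole file into memory. One alternative, sketched here with a hypothetical iter_text_lines() helper, wraps the object returned by extractfile() in io.TextIOWrapper and iterates line by line. The generator must be consumed while the archive is still open.

```python
import io
import tarfile


def iter_text_lines(tar, member_name, encoding='utf-8'):
    """Yield decoded lines from one member without reading it all at once."""
    file_obj = tar.extractfile(member_name)
    if file_obj is None:
        return  # not a regular file or link (e.g. a directory)
    with io.TextIOWrapper(file_obj, encoding=encoding) as text:
        for line in text:
            yield line.rstrip('\n')


# Hypothetical in-memory archive holding a small CSV file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    payload = b'id,value\n1,10\n2,20\n'
    info = tarfile.TarInfo(name='data/report.csv')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode='r:gz') as tar:
    lines = list(iter_text_lines(tar, 'data/report.csv'))

print(lines)
```

If the member name does not exist in the archive, extractfile() raises KeyError, so callers that accept user-supplied names may want to catch it.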
Security: Avoiding Path Traversal Attacks
Archives can contain files with paths like ../../etc/passwd that attempt to write outside the intended extraction directory. Always validate or filter members before extracting:
import os
import tarfile

def safe_extract(tar, path='.'):
    """Extract tar archive safely, preventing path traversal."""
    abs_dest = os.path.abspath(path)
    for member in tar.getmembers():
        abs_path = os.path.abspath(os.path.join(abs_dest, member.name))
        # Ensure the file stays within the target directory.
        # commonpath() avoids the prefix pitfall of startswith(), where
        # a sibling such as './safe_output_evil' would pass the check.
        if os.path.commonpath([abs_dest, abs_path]) != abs_dest:
            print(f"Skipping dangerous path: {member.name}")
            continue
        tar.extract(member, path)

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    safe_extract(tar, path='./safe_output')

print("Safe extraction complete.")
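Never use extractall() on untrusted archives without validation. Malicious .tar.gz files can contain path traversal entries (e.g., ../../sensitive_file) that overwrite files outside the target directory. The manual check above covers member names only; it does not inspect symlink or hard-link targets. Extraction filters were added to tarfile in Python 3.12 and backported to security releases of earlier versions (3.8.17, 3.9.17, 3.10.12, 3.11.4): use tar.extractall(filter='data') for safe extraction. From Python 3.14, 'data' is the default filter.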
Python 3.12+ Safe Extraction
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # 'data' filter rejects absolute paths, path traversal, links that
    # point outside the destination, and device files
    tar.extractall(path='./output', filter='data')
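To see the filter in action, the sketch below builds a deliberately hostile archive whose only member is named ../escaped.txt and tries to extract it. The extract_untrusted() helper is hypothetical. It assumes that refusing to extract is the right response when the running Python has no extraction filters, which it detects by checking for tarfile.data_filter.

```python
import io
import os
import tarfile
import tempfile


def extract_untrusted(archive_path: str, dest: str) -> str:
    """Extract with the 'data' filter, or refuse if this Python lacks it."""
    if not hasattr(tarfile, 'data_filter'):
        # Assumption: refusing is preferable to extracting without a filter.
        return 'unsupported'
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(path=dest, filter='data')
    except tarfile.FilterError as exc:
        return f'rejected: {type(exc).__name__}'
    return 'ok'


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical hostile archive: one member tries to escape the destination.
    hostile = os.path.join(tmp, 'hostile.tar.gz')
    with tarfile.open(hostile, 'w:gz') as tar:
        payload = b'should never be written\n'
        info = tarfile.TarInfo(name='../escaped.txt')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    dest = os.path.join(tmp, 'output')
    outcome = extract_untrusted(hostile, dest)
    # If the member had escaped, it would sit next to 'output', not inside it.
    escaped = os.path.exists(os.path.join(tmp, 'escaped.txt'))

print(outcome, escaped)
```

With the default errorlevel, a rejected member raises a tarfile.FilterError subclass and aborts the extraction rather than being silently skipped.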
Creating a .tar.gz File
For completeness, here is how to create a .tar.gz archive:
import tarfile

with tarfile.open('new_archive.tar.gz', 'w:gz') as tar:
    tar.add('my_folder', arcname='my_folder')
    tar.add('single_file.txt')

print("Archive created successfully.")
The arcname parameter controls the name used inside the archive. Without it, the path passed to add() is used as the member name (with any leading slash removed).
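The effect of arcname is easiest to see when the source lives under a long absolute path. This sketch, using a hypothetical member_names() helper and a temporary folder, archives the same directory twice, once with the default name and once with arcname='backup', and lists the resulting member names.

```python
import os
import tarfile
import tempfile


def member_names(source_dir, archive_path, arcname=None):
    """Archive source_dir and return the member names stored in the archive."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(source_dir, arcname=arcname)
    with tarfile.open(archive_path, 'r:gz') as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source folder containing a single file.
    source = os.path.join(tmp, 'my_folder')
    os.makedirs(source)
    with open(os.path.join(source, 'notes.txt'), 'w') as fh:
        fh.write('hypothetical sample file\n')

    archive_path = os.path.join(tmp, 'demo.tar.gz')
    default_names = member_names(source, archive_path)
    renamed_names = member_names(source, archive_path, arcname='backup')

print(default_names)  # names mirror the temporary directory's path
print(renamed_names)
```

Passing arcname keeps machine-specific directory structure out of the archive, which is usually what you want when distributing it.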
Common Mistake: Using gzip Module Instead of tarfile
A frequent error is trying to use Python's gzip module to extract .tar.gz files. The gzip module only handles gzip compression; it doesn't understand the TAR format:
import gzip

# WRONG: gzip can decompress but doesn't handle the TAR archive structure
with gzip.open('archive.tar.gz', 'rb') as f:
    content = f.read()
    # content is raw TAR data, not individual files

print(type(content))  # <class 'bytes'>
This gives you the raw TAR binary data, not the individual files inside.
The correct approach
import tarfile

# CORRECT: tarfile handles both gzip decompression and TAR unpacking
with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall(path='./output')
Quick Reference
| Operation | Code |
|---|---|
| Extract all files | tar.extractall(path='./output') |
| Extract one file | tar.extract('filename.txt', path='./output') |
| List all files | tar.getnames() |
| Get file metadata | tar.getmembers() |
| Read file without extracting | tar.extractfile(member).read() |
| Create a .tar.gz archive | tarfile.open('out.tar.gz', 'w:gz') |
| Safe extraction (Python 3.12+) | tar.extractall(filter='data') |
Python's built-in tarfile module provides everything you need to work with .tar.gz files: no external tools or additional packages required. Whether you need to extract everything, inspect contents, or pull out a single file, the API is straightforward and handles compression transparently.