How to Uncompress a .tar.gz File Using Python

.tar.gz files (also called tarballs) combine TAR packaging with gzip compression and are widely used on Unix/Linux systems for distributing software, backups, and datasets. Python's built-in tarfile module handles these archives without requiring any external tools, so you can extract, inspect, and create .tar.gz files directly in your code. This guide covers the common operations, from full extraction to selective file retrieval.

Understanding .tar.gz Files

A .tar.gz file is a two-layer archive:

  1. TAR (Tape Archive): Bundles multiple files and directories into a single file while preserving file structure, permissions, and metadata.
  2. gzip: Compresses the TAR file to reduce its size.

Python's tarfile module handles both layers transparently: you don't need to decompress and unpack separately.
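
To make the two layers concrete, the following sketch peels them apart by hand and compares the result with the usual one-step call. It builds its own small archive in a temporary directory; the file name demo.tar.gz and the member data/sample.txt are illustrative assumptions, not part of any real project.

```python
import gzip
import io
import os
import tarfile
import tempfile

# Build a small sample archive so the sketch is self-contained.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.tar.gz")  # hypothetical name
payload = b"hello from inside the tarball\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="data/sample.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# gzip layer: decompressing yields the raw, uncompressed TAR stream.
with gzip.open(archive_path, "rb") as gz:
    raw_tar = gz.read()

# TAR layer: mode 'r:' reads an uncompressed archive only.
with tarfile.open(fileobj=io.BytesIO(raw_tar), mode="r:") as tar:
    two_step_names = tar.getnames()

# The usual one-step call handles both layers at once.
with tarfile.open(archive_path, "r:gz") as tar:
    one_step_names = tar.getnames()

print(two_step_names == one_step_names)
```

In practice the one-step form is preferable; the manual version also holds the whole decompressed archive in memory, which is wasteful for large files.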

Extracting All Contents

The most common operation is extracting everything from the archive into a directory:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall(path='./extracted_files')

print("Extraction complete.")

Output:

Extraction complete.

All files and directories from the archive are placed inside the extracted_files folder, which is created automatically if it doesn't exist.

tip

Always use the with statement when working with tarfile. It automatically closes the archive when the block exits, even if an error occurs. This is cleaner than manually calling tar.close().

Understanding the Mode String

The second argument to tarfile.open() specifies how the file is opened:

Mode          Description
'r:gz'        Read a gzip-compressed tar archive
'r:bz2'       Read a bzip2-compressed tar archive
'r:xz'        Read an xz-compressed tar archive
'r:*' or 'r'  Auto-detect the compression method (the default)
'r:'          Read an uncompressed tar archive only

If you're unsure of the compression type, use 'r:*' or omit the mode entirely. The default mode, 'r', detects the compression automatically:

import tarfile

# Auto-detect compression
with tarfile.open('archive.tar.gz') as tar:
    tar.extractall(path='./output')
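
As a rough illustration of why auto-detection is the safer choice, the sketch below opens an uncompressed archive twice: once with an explicit 'r:gz' mode, which fails with tarfile.ReadError, and once with the default mode. The archive plain.tar and its member note.txt are made up for the demonstration.

```python
import io
import os
import tarfile
import tempfile

# Create an uncompressed archive ('w' writes without compression).
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.tar")  # hypothetical name
payload = b"example\n"
with tarfile.open(plain_path, "w") as tar:
    info = tarfile.TarInfo("note.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# An explicit mode that does not match the file raises ReadError.
try:
    with tarfile.open(plain_path, "r:gz") as tar:
        tar.getnames()
    explicit_mode_worked = True
except tarfile.ReadError:
    explicit_mode_worked = False

# The default mode detects that the archive is uncompressed.
with tarfile.open(plain_path) as tar:
    detected_names = tar.getnames()

# is_tarfile() is a quick check before attempting to open a file.
looks_like_tar = tarfile.is_tarfile(plain_path)

print(explicit_mode_worked, detected_names, looks_like_tar)
```

An explicit mode is still useful when a mismatch should count as an error, for example when validating uploads that must be gzip-compressed.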

Listing Files Before Extracting

To inspect the contents of an archive without extracting, use getnames() or getmembers():

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Get a list of file names
    file_names = tar.getnames()
    print(f"Archive contains {len(file_names)} items:")
    for name in file_names:
        print(f" {name}")

Output:

Archive contains 3 items:
 data
 data/sample.txt
 data/report.csv

Getting Detailed File Information

Use getmembers() to access metadata like file size, modification time, and type:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    for member in tar.getmembers():
        file_type = "DIR" if member.isdir() else "FILE"
        print(f" [{file_type}] {member.name} ({member.size} bytes)")

Output:

 [DIR] data (0 bytes)
 [FILE] data/sample.txt (1024 bytes)
 [FILE] data/report.csv (4096 bytes)
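
The listing above shows only type and size. A minimal sketch of reading the modification time as well might look like this; it builds a throwaway archive with a fixed, arbitrary timestamp (1700000000) so the result is reproducible, and every name in it is illustrative.

```python
import io
import os
import tarfile
import tempfile
from datetime import datetime, timezone

# Build a sample archive with one directory and one file.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.tar.gz")  # hypothetical name
payload = b"x" * 1024
with tarfile.open(archive_path, "w:gz") as tar:
    folder = tarfile.TarInfo("data")
    folder.type = tarfile.DIRTYPE
    folder.mode = 0o755
    folder.mtime = 1_700_000_000  # arbitrary fixed timestamp
    tar.addfile(folder)

    info = tarfile.TarInfo("data/sample.txt")
    info.size = len(payload)
    info.mtime = 1_700_000_000
    tar.addfile(info, io.BytesIO(payload))

rows = []
with tarfile.open(archive_path, "r:gz") as tar:
    for member in tar.getmembers():
        if member.isdir():
            kind = "DIR"
        elif member.isfile():
            kind = "FILE"
        else:
            kind = "OTHER"  # links, devices, FIFOs
        # mtime is stored as seconds since the Unix epoch.
        modified = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
        rows.append((kind, member.name, member.size, modified.isoformat()))

for kind, name, size, modified in rows:
    print(f"[{kind}] {name} ({size} bytes, modified {modified})")
```

Converting with an explicit UTC timezone avoids output that changes with the machine's local time settings.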

Extracting a Single File

To extract only a specific file from the archive, use the extract() method:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extract('data/sample.txt', path='./output')

print("Extracted sample.txt successfully.")

Output:

Extracted sample.txt successfully.

Only sample.txt is extracted, and it keeps its path from the archive, so it is written to ./output/data/sample.txt. Other files in the archive are ignored.
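
If the requested name is not in the archive, getmember() and extract() raise KeyError. One possible way to handle that is a small wrapper such as the hypothetical extract_one() below, which returns the output path or None. The sample archive is created on the fly and its names are assumptions made for the example.

```python
import io
import os
import tarfile
import tempfile


def extract_one(archive_path, member_name, dest):
    """Extract one member; return its output path, or None if it is absent."""
    with tarfile.open(archive_path, "r:gz") as tar:
        try:
            member = tar.getmember(member_name)
        except KeyError:
            return None
        tar.extract(member, path=dest)
    return os.path.join(dest, member.name)


# Build a sample archive to try the helper on.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.tar.gz")  # hypothetical name
payload = b"sample contents\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo("data/sample.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

dest = os.path.join(workdir, "output")
found = extract_one(archive_path, "data/sample.txt", dest)
missing = extract_one(archive_path, "data/nope.txt", dest)
print(found, missing)
```

Returning None keeps the caller simple; raising a more descriptive exception would be an equally reasonable design.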

Extracting Multiple Selected Files

To extract a subset of files, pass a list of members to extractall() through its members argument:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Extract only .csv files
    csv_members = [m for m in tar.getmembers() if m.name.endswith('.csv')]
    tar.extractall(path='./csv_output', members=csv_members)
    print(f"Extracted {len(csv_members)} CSV file(s).")

Output:

Extracted 1 CSV file(s).

Reading a File Directly Without Extracting to Disk

You can read file contents directly from the archive without writing them to disk:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # Open a specific file within the archive
    member = tar.getmember('data/sample.txt')
    file_obj = tar.extractfile(member)

    if file_obj is not None:
        content = file_obj.read().decode('utf-8')
        print(f"Content of sample.txt ({len(content)} characters):")
        print(content[:200])  # Print first 200 characters

Output:

Content of sample.txt (1024 characters):
This is the content of sample.txt...

This is useful when you only need to read data from the archive without creating files on disk.

info

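extractfile() returns a file object for regular files and for links (it follows the link to its target inside the archive). It returns None for other member types, such as directories and device files, and raises KeyError if the member doesn't exist. Always check for None before calling .read().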
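
For large text files, calling .read() loads the whole member into memory. A possible alternative, sketched here, wraps the object returned by extractfile() in io.TextIOWrapper and iterates line by line. The archive and its three-line member are created just for this example.

```python
import io
import os
import tarfile
import tempfile

# Build a sample archive containing a small text file.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.tar.gz")  # hypothetical name
payload = "first line\nsecond line\nthird line\n".encode("utf-8")
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo("data/sample.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

lines = []
with tarfile.open(archive_path, "r:gz") as tar:
    file_obj = tar.extractfile("data/sample.txt")
    if file_obj is not None:
        # TextIOWrapper decodes the binary stream incrementally.
        with io.TextIOWrapper(file_obj, encoding="utf-8") as text:
            for line in text:
                lines.append(line.rstrip("\n"))

print(lines)
```

The reading has to happen while the archive is still open, which is why the loop sits inside the with block.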

Security: Avoiding Path Traversal Attacks

Archives can contain files with paths like ../../etc/passwd that attempt to write outside the intended extraction directory. Always validate or filter members before extracting:

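import os
import tarfile


def safe_extract(tar, path='.'):
    """Extract a tar archive, skipping members that would escape `path`."""
    abs_dest = os.path.realpath(path)
    for member in tar.getmembers():
        # Resolve the final location, following any symlinks already on disk
        abs_path = os.path.realpath(os.path.join(abs_dest, member.name))

        # commonpath compares whole path components, so a sibling such as
        # ./safe_output_evil is not mistaken for a child of ./safe_output
        if os.path.commonpath([abs_dest, abs_path]) != abs_dest:
            print(f"Skipping dangerous path: {member.name}")
            continue

        # Link members can point outside the target directory, so skip them
        if member.issym() or member.islnk():
            print(f"Skipping link: {member.name}")
            continue

        tar.extract(member, path)


with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    safe_extract(tar, path='./safe_output')

print("Safe extraction complete.")

danger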

Never use extractall() on untrusted archives without validation. Malicious .tar.gz files can contain path traversal entries (e.g., ../../sensitive_file) that overwrite files outside the target directory. Python 3.12 added built-in extraction filters to tarfile (also backported to security releases of earlier versions): use tar.extractall(filter='data') for safe extraction. From Python 3.14 onward, 'data' is the default filter.

Python 3.12+ Safe Extraction

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    # The 'data' filter rejects absolute paths, paths and links that point
    # outside the destination, and device files
    tar.extractall(path='./output', filter='data')
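
Because the filter argument does not exist on every interpreter, code that must run on several Python versions can test for it with hasattr(tarfile, 'data_filter'), which is the feature check the tarfile documentation suggests. The sketch below is one way to apply it: extract_untrusted() is a hypothetical helper, and the hostile archive with a ../escape.txt member is constructed purely to show the filter rejecting it.

```python
import io
import os
import tarfile
import tempfile


def extract_untrusted(archive_path, dest):
    """Extract with the 'data' filter; refuse to run where it is unavailable."""
    if not hasattr(tarfile, "data_filter"):
        raise RuntimeError("this Python lacks tarfile extraction filters")
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest, filter="data")


# Build an archive whose only member tries to escape the destination.
workdir = tempfile.mkdtemp()
hostile_path = os.path.join(workdir, "hostile.tar.gz")  # hypothetical name
payload = b"should never land outside the destination\n"
with tarfile.open(hostile_path, "w:gz") as tar:
    info = tarfile.TarInfo("../escape.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

blocked = None  # stays None on interpreters without extraction filters
if hasattr(tarfile, "data_filter"):
    try:
        extract_untrusted(hostile_path, os.path.join(workdir, "out"))
        blocked = False
    except tarfile.FilterError:
        blocked = True

print("blocked:", blocked)
```

Refusing to extract on older interpreters is a deliberately strict choice; falling back to a manual check such as safe_extract() is another option.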

Creating a .tar.gz File

For completeness, here is how to create a .tar.gz archive:

import tarfile

with tarfile.open('new_archive.tar.gz', 'w:gz') as tar:
    tar.add('my_folder', arcname='my_folder')
    tar.add('single_file.txt')

print("Archive created successfully.")

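The arcname parameter controls the name used inside the archive. Without it, the path passed to add() is stored as given, with any leading slash removed, so adding /home/user/my_folder produces entries under home/user/my_folder.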
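
The effect of arcname is easiest to see by archiving the same directory twice and comparing the stored names. This sketch does that in a temporary directory; the build/my_folder layout and the file a.txt are invented for the illustration.

```python
import os
import tarfile
import tempfile

# Create a nested source directory with one file in it.
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "build", "my_folder")
os.makedirs(source_dir)
with open(os.path.join(source_dir, "a.txt"), "w") as f:
    f.write("a\n")

# With arcname: entries are stored under the short name.
short_path = os.path.join(workdir, "short.tar.gz")
with tarfile.open(short_path, "w:gz") as tar:
    tar.add(source_dir, arcname="my_folder")
with tarfile.open(short_path, "r:gz") as tar:
    names = tar.getnames()

# Without arcname: the path passed to add() is stored (leading slash removed).
full_path = os.path.join(workdir, "full.tar.gz")
with tarfile.open(full_path, "w:gz") as tar:
    tar.add(source_dir)
with tarfile.open(full_path, "r:gz") as tar:
    default_names = tar.getnames()

print(names)
print(default_names)
```

Setting arcname keeps machine-specific directory prefixes out of the archive, which usually makes it more portable.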

Common Mistake: Using gzip Module Instead of tarfile

A frequent error is trying to use Python's gzip module to extract .tar.gz files. The gzip module only handles gzip compression; it doesn't understand the TAR format:

import gzip

# WRONG: gzip can decompress but doesn't handle the TAR archive structure
with gzip.open('archive.tar.gz', 'rb') as f:
    content = f.read()
    # content is raw TAR data, not individual files
    print(type(content))  # <class 'bytes'>

This gives you the raw TAR binary data, not the individual files inside.

The correct approach

import tarfile

# CORRECT: tarfile handles both gzip decompression and TAR unpacking
with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall(path='./output')

Quick Reference

Operation                        Code
Extract all files                tar.extractall(path='./output')
Extract one file                 tar.extract('filename.txt', path='./output')
List all files                   tar.getnames()
Get file metadata                tar.getmembers()
Read file without extracting     tar.extractfile(member).read()
Create a .tar.gz archive         tarfile.open('out.tar.gz', 'w:gz')
Safe extraction (Python 3.12+)   tar.extractall(filter='data')

Python's built-in tarfile module provides everything you need to work with .tar.gz files: no external tools or additional packages required. Whether you need to extract everything, inspect contents, or pull out a single file, the API is straightforward and handles compression transparently.