How to Unzip .gz Files in Python
This guide explains how to unzip .gz files (GZIP compressed files) in Python. We'll cover both extracting the contents to a new file and reading the uncompressed data directly into memory, using the built-in gzip and shutil modules.
Unzipping a .gz File to a New File (Extraction)
To unzip a .gz file and save the uncompressed contents to a new file, use the gzip and shutil modules:
import gzip
import shutil
with gzip.open('example.json.gz', 'rb') as file_in:
    with open('example.json', 'wb') as file_out:  # wb = write bytes
        shutil.copyfileobj(file_in, file_out)
print('example.json file created')
- gzip.open('example.json.gz', 'rb'): Opens the .gz file in binary read mode ('rb'). gzip.open() handles the decompression.
- with open('example.json', 'wb') as file_out: Opens the output file in binary write mode ('wb'). Binary mode is required here, even if the uncompressed content is text, because gzip.open() in 'rb' mode yields bytes.
- shutil.copyfileobj(file_in, file_out): Copies the uncompressed data from the input file object (file_in) to the output file object (file_out). It reads and writes in chunks, so it works well even with large files.
- The code assumes there is a file named example.json.gz in the current working directory (usually the directory you run the script from), but you can use any other path instead.
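The extraction steps above can be wrapped in a small reusable function. The following is a minimal sketch, not a standard-library API: the name gunzip_file, its default of dropping the final .gz suffix, and the sample file it creates are all assumptions made for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_file(src, dst=None):
    """Decompress the gzip file at `src` and return the output path.

    Hypothetical helper: when `dst` is omitted, the output name is `src`
    with its final suffix removed (sample.json.gz -> sample.json).
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    if dst == src:
        raise ValueError("output path must differ from the input path")
    with gzip.open(src, "rb") as file_in, open(dst, "wb") as file_out:
        # copyfileobj streams in chunks, so the whole file is never in memory
        shutil.copyfileobj(file_in, file_out)
    return dst


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive = Path(tmp_dir) / "sample.json.gz"
    with gzip.open(archive, "wb") as f:
        f.write(b'{"name": "demo"}')
    extracted = gunzip_file(archive)
    extracted_name = extracted.name
    extracted_bytes = extracted.read_bytes()

print(extracted_name)
print(extracted_bytes)
```

Returning the output path lets the caller chain further processing without recomputing the file name.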
Reading the Uncompressed Contents of a .gz File
If you just want to read the uncompressed data into a Python variable (without creating a new file), you can use gzip.open() and read():
import gzip
with gzip.open('example.json.gz', 'rb') as file_in:
    file_contents = file_in.read()  # Reads as bytes
print(file_contents)  # Outputs a bytes object: b'...'
# If you're working with TEXT data, decode it:
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    file_contents = file_in.read()  # Reads as a string directly
print(file_contents)
file_in.read(): Reads the entire uncompressed content into the file_contents variable. If it is a text file, replace 'rb' with 'rt' and provide the encoding.
gzip.open() defaults to binary mode ('rb'), in which read() returns bytes. If you know the file contains text, open it in text mode ('rt') and specify the correct encoding (usually UTF-8): gzip.open('example.json.gz', 'rt', encoding='utf-8'). This decodes the content to a string as you read.
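To make the difference between the two modes concrete, the sketch below writes a small sample file and reads it back both ways. The file name and sample text are made up for illustration. Decoding the bytes from binary mode with the same encoding gives the string that text mode returns.

```python
import gzip
import tempfile
from pathlib import Path

# Sample text with non-ASCII characters, so the encoding actually matters.
sample_text = "café, naïve, 東京"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.txt.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(sample_text)

    # Binary mode: read() returns the raw uncompressed bytes.
    with gzip.open(path, "rb") as f:
        raw = f.read()

    # Text mode: read() returns a str decoded with the given encoding.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        text = f.read()

print(type(raw).__name__, type(text).__name__)
print(raw.decode("utf-8") == text)
```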
For line-by-line reading:
import gzip
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    for line in file_in:
        print(line.strip())
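Line-by-line iteration is useful when the file is too large to load at once, for example when scanning a compressed log. The following is a rough sketch under that assumption; the helper name count_lines_containing and the sample log contents are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def count_lines_containing(path, needle, encoding="utf-8"):
    """Count lines in a gzip text file that contain `needle` (hypothetical helper)."""
    count = 0
    with gzip.open(path, "rt", encoding=encoding) as file_in:
        # Lines are produced lazily, so the whole file is not held in memory.
        for line in file_in:
            if needle in line:
                count += 1
    return count


# Demo on a throwaway log file.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "app.log.gz"
    with gzip.open(log_path, "wt", encoding="utf-8") as f:
        f.write("INFO start\nERROR disk full\nINFO retry\nERROR timeout\n")
    error_count = count_lines_containing(log_path, "ERROR")

print(error_count)
```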
Reading CSV or JSON Data from .gz Files (using pandas)
If your .gz file contains structured data like CSV or JSON, the pandas library provides convenient functions to read them directly:
import gzip  # Still needed for decompression
import json  # Needed for the single-array example
import pandas as pd
# CSV Example:
with gzip.open('example.csv.gz', 'rt', encoding='utf-8') as file_in:  # 'rt' for text mode
    df = pd.read_csv(file_in)
print(df.head())
# JSON Example (one JSON object per line):
with gzip.open('example.json.gz', 'rt', encoding='utf-8') as file_in:
    df = pd.read_json(file_in, lines=True)  # Read line-delimited JSON
print(df.head())
# JSON Example (single JSON array):
with gzip.open('example.json.gz', 'rb') as file_in:
    data = json.load(file_in)  # Parse the whole JSON document
df = pd.DataFrame(data)  # Create the DataFrame
print(df.head())
- CSV: pd.read_csv() can read directly from a file-like object, so we pass the gzip.open() result to it.
- JSON (lines=True): For line-delimited JSON, use pd.read_json(..., lines=True).
- JSON (array): If the .gz file contains a single, large JSON array, use json.load() to load the data into a Python object, and then convert it into a DataFrame.
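The two JSON layouts above need different parsing, which is easier to see without pandas in the way. The sketch below uses only the standard library. The file names and sample records are invented for the demonstration. Both approaches produce a list of dicts, and that list can be passed to pd.DataFrame().

```python
import gzip
import json
import tempfile
from pathlib import Path

# Made-up sample records used for both layouts.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    lines_path = Path(tmp_dir) / "records.jsonl.gz"
    array_path = Path(tmp_dir) / "records.json.gz"

    # Write the same data once as line-delimited JSON, once as a single array.
    with gzip.open(lines_path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    with gzip.open(array_path, "wt", encoding="utf-8") as f:
        json.dump(records, f)

    # Line-delimited: every non-empty line is its own JSON document.
    with gzip.open(lines_path, "rt", encoding="utf-8") as f:
        from_lines = [json.loads(line) for line in f if line.strip()]

    # Single array: one json.load() call parses the whole file.
    with gzip.open(array_path, "rt", encoding="utf-8") as f:
        from_array = json.load(f)

print(from_lines == from_array == records)
```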