How to Determine File Format in Python
Never trust a filename extension alone for identifying file types, especially in security-sensitive contexts. A file named image.jpg could actually be an executable, a script, or a completely different format. Proper file type detection requires inspecting the actual contents of the file, specifically the first few bytes known as the "magic number" or file signature.
In this guide, you will learn how to determine file formats using content-based inspection with python-magic, extension-based guessing with the built-in mimetypes module, and how to combine both approaches for secure file upload validation.
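To make the idea of a file signature concrete, the following minimal sketch prints the leading bytes of some data as hex. The helper names `hex_preview` and `file_signature` are illustrative and not part of any library.

```python
def hex_preview(data: bytes, count: int = 8) -> str:
    """Return the first `count` bytes as space-separated hex pairs."""
    return data[:count].hex(" ")


def file_signature(path: str, count: int = 8) -> str:
    """Read only the leading bytes of a file and format them as hex."""
    with open(path, "rb") as f:
        return hex_preview(f.read(count), count)


# The eight-byte PNG signature, used here as in-memory sample data
png_header = b"\x89PNG\r\n\x1a\n"
print(hex_preview(png_header))  # 89 50 4e 47 0d 0a 1a 0a
```

Calling `file_signature("image.jpg")` on a real JPEG should print a value starting with `ff d8 ff`; anything else suggests the extension does not match the content.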
Inspecting Magic Numbers with python-magic
The most reliable method for determining file format examines the file header, a sequence of bytes at the beginning of the file that identifies the format. The python-magic library wraps the widely used libmagic library to perform this detection:
pip install python-magic-bin # Windows (includes libmagic)
pip install python-magic # Linux/macOS (requires libmagic installed)
import magic
# Get the MIME type
mime_type = magic.from_file("unknown_file.dat", mime=True)
print(mime_type)
# Get a detailed human-readable description
description = magic.from_file("unknown_file.dat")
print(description)
Example output:
application/pdf
PDF document, version 1.4
For uploaded files or in-memory data where you have raw bytes instead of a file path, use from_buffer():
import magic
with open("unknown_file.dat", "rb") as f:
    file_data = f.read(2048)  # Read enough bytes for detection
mime_type = magic.from_buffer(file_data, mime=True)
print(mime_type)
Example output:
application/pdf
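Upload frameworks often hand you a file-like object rather than bytes. One possible pattern, sketched below, reads only the head of the stream for detection and then rewinds it so later code can still read the whole upload. The `sniff_stream` helper and its `detect` parameter are hypothetical; with python-magic installed, `detect` could be `lambda head: magic.from_buffer(head, mime=True)`. The demo uses a stand-in detector so it runs without libmagic.

```python
import io
from typing import BinaryIO, Callable


def sniff_stream(
    stream: BinaryIO,
    detect: Callable[[bytes], str],
    head_size: int = 2048,
) -> str:
    """Detect the type of a seekable binary stream without consuming it."""
    start = stream.tell()
    head = stream.read(head_size)
    stream.seek(start)  # rewind so the caller can still read the full upload
    return detect(head)


def pdf_or_unknown(head: bytes) -> str:
    """Stand-in detector for the demo; swap in magic.from_buffer in practice."""
    return "application/pdf" if head.startswith(b"%PDF") else "application/octet-stream"


upload = io.BytesIO(b"%PDF-1.4 example body")
print(sniff_stream(upload, pdf_or_unknown))  # application/pdf
print(upload.tell())                         # 0
```

This assumes the stream is seekable; for non-seekable streams you would need to buffer the head yourself.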
python-magic requires the libmagic system library. Install it for your platform:
- Debian/Ubuntu: sudo apt install libmagic1
- macOS: brew install libmagic
- Windows: Use pip install python-magic-bin instead, which bundles the library.
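If your code has to run in environments where libmagic may be missing, a guarded import lets you fall back to another detection method instead of crashing at startup. This is a rough sketch: the `magic_available` helper is hypothetical, and it assumes that a missing libmagic surfaces as an error when the module is imported.

```python
import importlib.util


def magic_available() -> bool:
    """Return True if the `magic` module imports cleanly in this environment."""
    if importlib.util.find_spec("magic") is None:
        return False  # the Python package itself is not installed
    try:
        import magic  # noqa: F401
    except (ImportError, OSError):
        # Assumption: a missing libmagic shows up as an import-time error
        return False
    return True


print("python-magic usable:", magic_available())
```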
Using the Built-in mimetypes Module
The mimetypes module is part of Python's standard library and guesses file types based solely on the file extension. It requires no installation and is very fast, but it provides no security guarantees because it never examines the actual file content:
import mimetypes
mime_type, encoding = mimetypes.guess_type("document.pdf")
print(f"MIME type: {mime_type}")
print(f"Encoding: {encoding}")
# Also works with URLs
mime_type, _ = mimetypes.guess_type("https://example.com/image.png")
print(f"MIME type: {mime_type}")
Output:
MIME type: application/pdf
Encoding: None
MIME type: image/png
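The encoding value is None for most files. For a compressed file such as data.csv.gz it is "gzip" (or "compress" for .Z files), and the MIME type describes the underlying file rather than the compression wrapper.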
You can also reverse the lookup to find the appropriate extension for a MIME type:
import mimetypes
ext = mimetypes.guess_extension("image/jpeg")
print(ext)
Output (on recent Python versions; older releases may return a different extension such as .jpe):
.jpg
This module is suitable for setting HTTP headers when serving files you already trust, but it should never be relied upon for validating untrusted input.
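As an illustration of the header use case, the sketch below builds response headers for a trusted file. The `content_headers` helper is hypothetical, and falling back to application/octet-stream for unknown extensions is a common convention rather than something mimetypes does for you (guess_type returns None in that case).

```python
import mimetypes


def content_headers(filename: str) -> dict[str, str]:
    """Build Content-Type (and Content-Encoding) headers from a filename."""
    mime_type, encoding = mimetypes.guess_type(filename)
    # guess_type returns None for unknown extensions, so supply a default
    headers = {"Content-Type": mime_type or "application/octet-stream"}
    if encoding:
        headers["Content-Encoding"] = encoding
    return headers


print(content_headers("report.pdf"))
print(content_headers("backup.tar.gz"))
print(content_headers("mystery.unknownext"))
```

Whether to send Content-Encoding for a .gz file depends on whether you want clients to decompress it transparently, so treat that line as a design choice.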
Why Extension-Based Detection Is Dangerous
Here is a concrete example of why relying on file extensions alone is a security risk:
import magic
import mimetypes
# A file named "photo.jpg" that is actually a PHP script
claimed_type, _ = mimetypes.guess_type("photo.jpg")
actual_type = magic.from_file("photo.jpg", mime=True)
print(f"Extension claims: {claimed_type}")
print(f"Content actually: {actual_type}")
Possible output:
Extension claims: image/jpeg
Content actually: text/x-php
The extension says JPEG, but the actual content is a PHP script. Without content-based detection, this file could be accepted and executed on a web server.
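You can reproduce this mismatch without libmagic. The sketch below writes a small PHP snippet to a temporary file named photo.jpg, then compares what the extension claims with a direct check of the JPEG signature. The `looks_like_jpeg` helper is illustrative and only checks the first three bytes.

```python
import mimetypes
import os
import tempfile

JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(path: str) -> bool:
    """Return True if the file starts with the JPEG signature bytes."""
    with open(path, "rb") as f:
        return f.read(len(JPEG_SIGNATURE)) == JPEG_SIGNATURE


with tempfile.TemporaryDirectory() as tmp:
    fake_path = os.path.join(tmp, "photo.jpg")
    with open(fake_path, "wb") as f:
        f.write(b"<?php echo 'hello'; ?>")  # script content, image name

    claimed_type, _ = mimetypes.guess_type("photo.jpg")
    content_is_jpeg = looks_like_jpeg(fake_path)

print(f"Extension claims: {claimed_type}")       # image/jpeg
print(f"Content is JPEG:  {content_is_jpeg}")    # False
```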
Validating Uploaded Files Securely
Combine both methods for defense in depth when handling user uploads:
import magic
import mimetypes
ALLOWED_TYPES = {
    "image/jpeg",
    "image/png",
    "image/gif",
    "application/pdf",
}
def validate_upload(filename: str, file_data: bytes) -> bool:
    """Validate an uploaded file type using content inspection."""
    # Check actual content (the authoritative check)
    actual_type = magic.from_buffer(file_data, mime=True)
    if actual_type not in ALLOWED_TYPES:
        print(f"Rejected: actual type is {actual_type}")
        return False
    # Optionally verify that the extension matches the content
    claimed_type, _ = mimetypes.guess_type(filename)
    if claimed_type and claimed_type != actual_type:
        print(f"Warning: extension claims {claimed_type}, content is {actual_type}")
    return True
# Example usage
with open("user_upload.jpg", "rb") as f:
    data = f.read()
if validate_upload("user_upload.jpg", data):
    print("File accepted")
else:
    print("File rejected")
Example output (legitimate file):
File accepted
Example output (disguised file):
Rejected: actual type is text/x-python
File rejected
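The validator above only warns when the extension disagrees with the content. If you would rather reject such files, one option is to check the extension against every extension mimetypes knows for the detected type. The `extension_matches` helper below is a hypothetical sketch of that stricter policy, and it expects the detected MIME type to be passed in (for example, the value returned by magic.from_buffer).

```python
import mimetypes
import os


def extension_matches(filename: str, actual_type: str) -> bool:
    """Return True if the filename's extension is a known one for actual_type."""
    ext = os.path.splitext(filename)[1].lower()
    # guess_all_extensions lists every extension registered for the MIME type
    return ext in mimetypes.guess_all_extensions(actual_type)


print(extension_matches("photo.JPG", "image/jpeg"))  # True
print(extension_matches("photo.png", "image/jpeg"))  # False
```

Files with no extension are treated as a mismatch here, which may be too strict for some applications.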
For image files specifically, consider going a step further and opening the file with a library like Pillow. A valid magic number in the header does not guarantee that the rest of the file is well-formed. Pillow's verify() method checks the file for damage without decoding the pixel data; reopen the file afterwards if you need to use the image:
from PIL import Image
try:
    with Image.open("user_upload.jpg") as img:
        img.verify()  # Check file integrity without decoding the image data
    print("Valid image file")
except Exception:
    print("Corrupted or invalid image")
Manual Magic Number Checking
For situations where you cannot install python-magic or only need to detect a few specific formats, you can check magic bytes directly:
def detect_format(filepath: str) -> str:
    """Detect file format by reading magic bytes."""
    signatures = {
        b"\xff\xd8\xff": "image/jpeg",
        b"\x89PNG": "image/png",
        b"GIF8": "image/gif",
        b"%PDF": "application/pdf",
        b"PK\x03\x04": "application/zip",
        b"\x1f\x8b": "application/gzip",
    }
    with open(filepath, "rb") as f:
        header = f.read(8)
    for magic_bytes, mime_type in signatures.items():
        if header.startswith(magic_bytes):
            return mime_type
    return "application/octet-stream"  # Unknown format
print(detect_format("photo.jpg"))
print(detect_format("archive.zip"))
Example output:
image/jpeg
application/zip
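This approach is fast and dependency-free, but it only covers the specific formats you define. A short signature also cannot distinguish formats that share a container: DOCX, XLSX, and JAR files all begin with the ZIP signature.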
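Not every format can be identified by a prefix alone. WebP files start with "RIFF" but only reveal themselves with "WEBP" at byte offset 8, and tar archives carry "ustar" at offset 257. The sketch below extends the manual approach with offset-aware signatures; the table layout and the `detect_with_offsets` name are illustrative choices, and the caller is assumed to pass at least the first 262 bytes.

```python
# Each entry: (mime_type, [(offset, expected_bytes), ...]); all parts must match
OFFSET_SIGNATURES = [
    ("image/webp", [(0, b"RIFF"), (8, b"WEBP")]),
    ("application/x-tar", [(257, b"ustar")]),
]


def detect_with_offsets(header: bytes) -> str:
    """Match signatures that sit at fixed offsets inside the header."""
    for mime_type, parts in OFFSET_SIGNATURES:
        if all(header[off:off + len(sig)] == sig for off, sig in parts):
            return mime_type
    return "application/octet-stream"  # Unknown format


webp_header = b"RIFF" + b"\x00" * 4 + b"WEBP" + b"VP8 "
wave_header = b"RIFF" + b"\x00" * 4 + b"WAVE" + b"fmt "
print(detect_with_offsets(webp_header))  # image/webp
print(detect_with_offsets(wave_header))  # application/octet-stream
```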
Common File Signatures
| Format | Magic Bytes (Hex) | ASCII Representation | MIME Type |
|---|---|---|---|
| JPEG | FF D8 FF | (binary) | image/jpeg |
| PNG | 89 50 4E 47 | .PNG | image/png |
| GIF | 47 49 46 38 | GIF8 | image/gif |
| PDF | 25 50 44 46 | %PDF | application/pdf |
| ZIP | 50 4B 03 04 | PK.. | application/zip |
| GZIP | 1F 8B | (binary) | application/gzip |
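If you keep signatures in the hex notation used in the table, `bytes.fromhex` converts them into the byte strings a manual check needs. The sketch below builds a lookup from the table's rows; the `TABLE_ROWS` and `SIGNATURES` names are illustrative.

```python
# Hex strings copied from the table, keyed by MIME type
TABLE_ROWS = {
    "image/jpeg": "FF D8 FF",
    "image/png": "89 50 4E 47",
    "image/gif": "47 49 46 38",
    "application/pdf": "25 50 44 46",
    "application/zip": "50 4B 03 04",
    "application/gzip": "1F 8B",
}

# bytes.fromhex ignores the spaces between hex pairs
SIGNATURES = {bytes.fromhex(hex_str): mime for mime, hex_str in TABLE_ROWS.items()}

print(SIGNATURES[b"\x89PNG"])  # image/png
print(SIGNATURES[b"%PDF"])     # application/pdf
```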
Detection Methods Comparison
| Method | Accuracy | Speed | Dependencies | Security Use |
|---|---|---|---|---|
| python-magic | High | Medium | External library + libmagic | Recommended |
| mimetypes | Low (extension only) | Very fast | None (stdlib) | Not suitable |
| Manual header check | High (for known formats) | Very fast | None | Suitable for specific formats |
| Pillow verification | Very high (for images) | Slower | Pillow library | Recommended for images |
Conclusion
- For any security-sensitive context like file uploads, always use python-magic to inspect actual file contents rather than trusting extensions.
- The built-in mimetypes module is fine for convenience tasks like setting HTTP headers on files you already trust, but it provides no protection against malicious or mislabeled files.
- For maximum security, combine both approaches: use content inspection as the authoritative check and extension matching as a secondary verification.
- When you cannot install external libraries, manual magic byte checking covers the most common formats with minimal code.