How to Convert XML to Pandas DataFrame in Python
XML remains a common format for data exchange, configuration files, and API responses. Since Pandas 1.3, the read_xml() function provides native XML parsing, converting hierarchical structures into flat DataFrames efficiently.
Parse XML Directly with pd.read_xml()
The simplest approach uses Pandas' built-in XML parser with XPath selection.
import pandas as pd
# Sample XML structure:
# <catalog>
#   <book id="1">
#     <title>Python Basics</title>
#     <author>Alice</author>
#     <price>29.99</price>
#   </book>
#   <book id="2">
#     <title>Data Science</title>
#     <author>Bob</author>
#     <price>39.99</price>
#   </book>
# </catalog>
df = pd.read_xml("books.xml", xpath=".//book")
print(df)
Output:
   id          title author  price
0   1  Python Basics  Alice  29.99
1   2   Data Science    Bob  39.99
Parse XML from String
import pandas as pd
xml_string = """
<users>
  <user id="1">
    <name>Alice</name>
    <email>alice@example.com</email>
  </user>
  <user id="2">
    <name>Bob</name>
    <email>bob@example.com</email>
  </user>
</users>
"""
df = pd.read_xml(xml_string, xpath=".//user")
print(df)
Output:
   id   name              email
0   1  Alice  alice@example.com
1   2    Bob    bob@example.com
Install lxml for significantly faster parsing: pip install lxml. Pandas uses it automatically when available.
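You can also choose the backend explicitly via the parser argument. A minimal sketch with a made-up sample (the StringIO wrapper is what newer Pandas versions expect when passing a literal string):

```python
import pandas as pd
from io import StringIO

xml = """
<rows>
  <row><a>1</a><b>x</b></row>
  <row><a>2</a><b>y</b></row>
</rows>
"""

# parser="lxml" is the default when lxml is installed;
# parser="etree" forces the slower standard-library backend.
df = pd.read_xml(StringIO(xml), xpath=".//row", parser="etree")
print(df)
```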
Filter Data with XPath Expressions
XPath allows selective loading, reducing memory usage for large files.
import pandas as pd
xml_data = """
<products>
  <product category="electronics">
    <name>Laptop</name>
    <price>999</price>
  </product>
  <product category="books">
    <name>Python Guide</name>
    <price>49</price>
  </product>
  <product category="electronics">
    <name>Phone</name>
    <price>699</price>
  </product>
</products>
"""
# Load only electronics products
df = pd.read_xml(xml_data, xpath=".//product[@category='electronics']")
print(df)
Output:
      category    name  price
0  electronics  Laptop    999
1  electronics   Phone    699
Common XPath Patterns
| XPath Expression | Description |
|---|---|
| .//element | All elements with this name |
| .//element[@attr='value'] | Filter by attribute value |
| .//parent/child | Direct child elements |
| .//element[position()<=10] | First 10 elements |
Pandas uses XPath 1.0, which has limited filtering capabilities. For complex queries, load the data first and filter with Pandas operations.
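For example, a compound condition that XPath 1.0 expresses awkwardly is straightforward as a Pandas filter. A sketch reusing the products sample from above (StringIO wraps the literal string, as newer Pandas versions expect):

```python
import pandas as pd
from io import StringIO

xml_data = """
<products>
  <product category="electronics"><name>Laptop</name><price>999</price></product>
  <product category="books"><name>Python Guide</name><price>49</price></product>
  <product category="electronics"><name>Phone</name><price>699</price></product>
</products>
"""

# Load everything once, then filter with ordinary boolean indexing,
# which is more expressive than XPath 1.0 predicates.
df = pd.read_xml(StringIO(xml_data), xpath=".//product")
cheap_electronics = df[(df["category"] == "electronics") & (df["price"] < 800)]
print(cheap_electronics)
```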
Handle Nested XML Structures
Deeply nested XML requires specifying the correct path or flattening manually.
import pandas as pd
nested_xml = """
<company>
  <department name="Engineering">
    <employees>
      <employee>
        <name>Alice</name>
        <role>Developer</role>
      </employee>
      <employee>
        <name>Bob</name>
        <role>Designer</role>
      </employee>
    </employees>
  </department>
</company>
"""
# Target the deeply nested employee elements
df = pd.read_xml(nested_xml, xpath=".//employee")
print(df)
Output:
    name       role
0  Alice  Developer
1    Bob   Designer
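Note that the .//employee shortcut drops the department context. To keep a parent attribute such as the department name, flatten manually — a sketch using the standard library on the same sample:

```python
import xml.etree.ElementTree as ET
import pandas as pd

nested_xml = """
<company>
  <department name="Engineering">
    <employees>
      <employee><name>Alice</name><role>Developer</role></employee>
      <employee><name>Bob</name><role>Designer</role></employee>
    </employees>
  </department>
</company>
"""

root = ET.fromstring(nested_xml)
rows = []
for dept in root.findall(".//department"):
    for emp in dept.findall(".//employee"):
        rows.append({
            "department": dept.get("name"),  # carried down from the parent
            "name": emp.findtext("name"),
            "role": emp.findtext("role"),
        })
df = pd.DataFrame(rows)
print(df)
```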
Extract Attributes and Elements
import pandas as pd
xml_with_attrs = """
<orders>
  <order id="1001" status="shipped">
    <customer>Alice</customer>
    <total>150.00</total>
  </order>
  <order id="1002" status="pending">
    <customer>Bob</customer>
    <total>89.50</total>
  </order>
</orders>
"""
# Attributes become columns automatically
df = pd.read_xml(xml_with_attrs, xpath=".//order")
print(df)
Output:
     id   status customer  total
0  1001  shipped    Alice  150.0
1  1002  pending      Bob   89.5
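If you want only one side, read_xml's attrs_only and elems_only flags restrict the result to attributes or child elements. A sketch reusing the orders sample (StringIO wraps the literal string, as newer Pandas versions expect):

```python
import pandas as pd
from io import StringIO

xml_with_attrs = """
<orders>
  <order id="1001" status="shipped"><customer>Alice</customer><total>150.00</total></order>
  <order id="1002" status="pending"><customer>Bob</customer><total>89.50</total></order>
</orders>
"""

# attrs_only=True keeps only attributes; elems_only=True keeps only child elements.
attrs = pd.read_xml(StringIO(xml_with_attrs), xpath=".//order", attrs_only=True)
elems = pd.read_xml(StringIO(xml_with_attrs), xpath=".//order", elems_only=True)
print(attrs.columns.tolist())
print(elems.columns.tolist())
```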
Parse Complex XML with BeautifulSoup
For malformed XML, inconsistent structures, or custom extraction logic, use BeautifulSoup.
from bs4 import BeautifulSoup
import pandas as pd
xml_content = """
<items>
  <item id="1">
    <name>Widget A</name>
    <specs>
      <weight unit="kg">2.5</weight>
      <dimensions>10x20x5</dimensions>
    </specs>
  </item>
  <item id="2">
    <name>Widget B</name>
    <specs>
      <weight unit="kg">1.8</weight>
      <dimensions>8x15x4</dimensions>
    </specs>
  </item>
</items>
"""
soup = BeautifulSoup(xml_content, "xml")
data = []
for item in soup.find_all("item"):
    data.append({
        "id": item.get("id"),
        "name": item.find("name").text,
        "weight": float(item.find("weight").text),
        "weight_unit": item.find("weight").get("unit"),
        "dimensions": item.find("dimensions").text,
    })
df = pd.DataFrame(data)
print(df)
Output:
  id      name  weight weight_unit dimensions
0  1  Widget A     2.5          kg    10x20x5
1  2  Widget B     1.8          kg     8x15x4
Handle Missing Elements Safely
from bs4 import BeautifulSoup
import pandas as pd
xml_with_missing = """
<products>
  <product>
    <name>Item A</name>
    <price>10.00</price>
  </product>
  <product>
    <name>Item B</name>
    <!-- price is missing -->
  </product>
</products>
"""
soup = BeautifulSoup(xml_with_missing, "xml")
data = []
for product in soup.find_all("product"):
    name_tag = product.find("name")
    price_tag = product.find("price")
    data.append({
        "name": name_tag.text if name_tag else None,
        "price": float(price_tag.text) if price_tag else None,
    })
df = pd.DataFrame(data)
print(df)
Output:
     name  price
0  Item A   10.0
1  Item B    NaN
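For well-formed XML, pd.read_xml() gives the same safety with less code: a child element absent from a record simply becomes NaN. A sketch on the same sample (StringIO wraps the literal string, as newer Pandas versions expect):

```python
import pandas as pd
from io import StringIO

xml_with_missing = """
<products>
  <product><name>Item A</name><price>10.00</price></product>
  <product><name>Item B</name></product>
</products>
"""

# Columns are unioned across records; missing values become NaN automatically.
df = pd.read_xml(StringIO(xml_with_missing), xpath=".//product")
print(df)
```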
Use Standard Library for Zero Dependencies
The built-in xml.etree.ElementTree module requires no external packages.
import xml.etree.ElementTree as ET
import pandas as pd
xml_string = """
<records>
  <record>
    <id>1</id>
    <value>100</value>
  </record>
  <record>
    <id>2</id>
    <value>200</value>
  </record>
</records>
"""
root = ET.fromstring(xml_string)
data = []
for record in root.findall(".//record"):
    data.append({
        "id": record.find("id").text,
        "value": int(record.find("value").text),
    })
df = pd.DataFrame(data)
print(df)
Output:
  id  value
0  1    100
1  2    200
Handle Namespaces in XML
XML namespaces require special handling in XPath expressions.
import pandas as pd
namespaced_xml = """
<root xmlns:ns="http://example.com/ns">
  <ns:item>
    <ns:name>Product 1</ns:name>
  </ns:item>
  <ns:item>
    <ns:name>Product 2</ns:name>
  </ns:item>
</root>
"""
# Define namespace mapping
namespaces = {"ns": "http://example.com/ns"}
df = pd.read_xml(
    namespaced_xml,
    xpath=".//ns:item",
    namespaces=namespaces,
)
print(df)
Output:
        name
0  Product 1
1  Product 2
Namespace handling can be tricky. If read_xml() fails, try BeautifulSoup: with its xml parser, soup.find_all("item") matches tags by their local name, so namespace prefixes can be ignored entirely.
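For instance, the namespaced document above parses without any namespace map — a sketch, assuming beautifulsoup4 and lxml are installed:

```python
from bs4 import BeautifulSoup
import pandas as pd

namespaced_xml = """
<root xmlns:ns="http://example.com/ns">
  <ns:item><ns:name>Product 1</ns:name></ns:item>
  <ns:item><ns:name>Product 2</ns:name></ns:item>
</root>
"""

# BeautifulSoup's xml parser stores the prefix separately from the tag name,
# so "item" matches ns:item with no namespace mapping required.
soup = BeautifulSoup(namespaced_xml, "xml")
names = [item.find("name").text for item in soup.find_all("item")]
df = pd.DataFrame({"name": names})
print(df)
```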
Quick Reference
| Method | Best For | Speed | Dependencies |
|---|---|---|---|
| pd.read_xml() | Standard XML files | ⚡ Fast | pandas, lxml (optional) |
| BeautifulSoup | Malformed/complex XML | 🐢 Moderate | beautifulsoup4, lxml |
| xml.etree | Simple XML, no dependencies | 🚀 Fast | None (standard library) |
Conclusion
Use pd.read_xml() for most XML parsing tasks. It's fast, concise, and handles attributes automatically. Upgrade Pandas to the latest version and install lxml for optimal performance. For malformed XML or custom extraction logic, BeautifulSoup provides flexible parsing. Reserve xml.etree.ElementTree for environments where external dependencies aren't available.