
How to Convert XML to Pandas DataFrame in Python

XML remains a common format for data exchange, configuration files, and API responses. Since Pandas 1.3, the read_xml() function provides native XML parsing, converting hierarchical structures into flat DataFrames efficiently.

Parse XML Directly with pd.read_xml()

The simplest approach uses Pandas' built-in XML parser with XPath selection.

import pandas as pd

# Sample XML structure:
# <catalog>
#   <book id="1">
#     <title>Python Basics</title>
#     <author>Alice</author>
#     <price>29.99</price>
#   </book>
#   <book id="2">
#     <title>Data Science</title>
#     <author>Bob</author>
#     <price>39.99</price>
#   </book>
# </catalog>

df = pd.read_xml("books.xml", xpath=".//book")

print(df)

Output:

   id          title author  price
0   1  Python Basics  Alice  29.99
1   2   Data Science    Bob  39.99
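read_xml() infers column types from the element text. When exactness matters, you can pin the types down explicitly after loading. A minimal sketch, using an inline document via StringIO and parser="etree" so it runs without lxml installed:

```python
import pandas as pd
from io import StringIO

xml = """
<catalog>
  <book id="1">
    <title>Python Basics</title>
    <price>29.99</price>
  </book>
  <book id="2">
    <title>Data Science</title>
    <price>39.99</price>
  </book>
</catalog>
"""

# parser="etree" uses the standard library, so lxml isn't required here
df = pd.read_xml(StringIO(xml), xpath=".//book", parser="etree")

# read_xml infers dtypes from the text; enforce them explicitly when needed
df = df.astype({"id": "int64", "price": "float64"})

print(df.dtypes)
```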

Parse XML from String

import pandas as pd
from io import StringIO

xml_string = """
<users>
  <user id="1">
    <name>Alice</name>
    <email>alice@example.com</email>
  </user>
  <user id="2">
    <name>Bob</name>
    <email>bob@example.com</email>
  </user>
</users>
"""

# Wrap literal XML in StringIO (passing a raw string is deprecated since pandas 2.1)
df = pd.read_xml(StringIO(xml_string), xpath=".//user")

print(df)

Output:

   id   name              email
0   1  Alice  alice@example.com
1   2    Bob    bob@example.com
Tip: Install lxml for significantly faster parsing: pip install lxml. Pandas uses it automatically when available.
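If lxml isn't available, you can also force the standard-library backend explicitly instead of installing anything. A small sketch with an inline document:

```python
import pandas as pd
from io import StringIO

xml_string = """
<users>
  <user id="1"><name>Alice</name></user>
  <user id="2"><name>Bob</name></user>
</users>
"""

# parser="etree" forces the built-in xml.etree backend instead of lxml
df = pd.read_xml(StringIO(xml_string), xpath=".//user", parser="etree")

print(df)
```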

Filter Data with XPath Expressions

XPath allows selective loading, reducing memory usage for large files.

import pandas as pd
from io import StringIO

xml_data = """
<products>
  <product category="electronics">
    <name>Laptop</name>
    <price>999</price>
  </product>
  <product category="books">
    <name>Python Guide</name>
    <price>49</price>
  </product>
  <product category="electronics">
    <name>Phone</name>
    <price>699</price>
  </product>
</products>
"""

# Load only electronics products
df = pd.read_xml(StringIO(xml_data), xpath=".//product[@category='electronics']")

print(df)

Output:

      category    name  price
0  electronics  Laptop    999
1  electronics   Phone    699

Common XPath Patterns

XPath Expression              Description
.//element                    All elements with this name
.//element[@attr='value']     Filter by attribute value
.//parent/child               Direct child elements
.//element[position()<=10]    First 10 elements
Note: Pandas uses XPath 1.0, which has limited filtering capabilities. For complex queries, load the data first and filter with Pandas operations.
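The load-then-filter pattern looks like this in practice. A sketch reusing the products document; the 500 threshold is just an illustration:

```python
import pandas as pd
from io import StringIO

xml_data = """
<products>
  <product category="electronics"><name>Laptop</name><price>999</price></product>
  <product category="books"><name>Python Guide</name><price>49</price></product>
  <product category="electronics"><name>Phone</name><price>699</price></product>
</products>
"""

# Load everything first, then filter with ordinary Pandas operations
df = pd.read_xml(StringIO(xml_data), xpath=".//product", parser="etree")
expensive = df[df["price"] > 500]

print(expensive)
```

Conditions like numeric comparisons, string methods, or cross-column logic are all easier this way than in XPath 1.0.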

Handle Nested XML Structures

Deeply nested XML requires specifying the correct path or flattening manually.

import pandas as pd
from io import StringIO

nested_xml = """
<company>
  <department name="Engineering">
    <employees>
      <employee>
        <name>Alice</name>
        <role>Developer</role>
      </employee>
      <employee>
        <name>Bob</name>
        <role>Designer</role>
      </employee>
    </employees>
  </department>
</company>
"""

# Target the deeply nested employee elements
df = pd.read_xml(StringIO(nested_xml), xpath=".//employee")

print(df)

Output:

    name       role
0  Alice  Developer
1    Bob   Designer
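Note that targeting the employee elements drops the surrounding context: the department name never reaches the DataFrame. One way to keep it is a manual walk with the standard library; a sketch assuming the same structure:

```python
import xml.etree.ElementTree as ET
import pandas as pd

nested_xml = """
<company>
  <department name="Engineering">
    <employees>
      <employee><name>Alice</name><role>Developer</role></employee>
      <employee><name>Bob</name><role>Designer</role></employee>
    </employees>
  </department>
</company>
"""

root = ET.fromstring(nested_xml)

rows = []
for dept in root.findall(".//department"):
    for emp in dept.findall(".//employee"):
        rows.append({
            "department": dept.get("name"),  # carry the parent attribute down
            "name": emp.find("name").text,
            "role": emp.find("role").text,
        })

df = pd.DataFrame(rows)
print(df)
```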

Extract Attributes and Elements

import pandas as pd
from io import StringIO

xml_with_attrs = """
<orders>
  <order id="1001" status="shipped">
    <customer>Alice</customer>
    <total>150.00</total>
  </order>
  <order id="1002" status="pending">
    <customer>Bob</customer>
    <total>89.50</total>
  </order>
</orders>
"""

# Attributes become columns automatically
df = pd.read_xml(StringIO(xml_with_attrs), xpath=".//order")

print(df)

Output:

     id   status customer  total
0  1001  shipped    Alice  150.0
1  1002  pending      Bob   89.5
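read_xml() also takes attrs_only and elems_only flags to restrict which side gets parsed. A sketch on the same orders structure, keeping only the attributes:

```python
import pandas as pd
from io import StringIO

xml_with_attrs = """
<orders>
  <order id="1001" status="shipped"><customer>Alice</customer><total>150.00</total></order>
  <order id="1002" status="pending"><customer>Bob</customer><total>89.50</total></order>
</orders>
"""

# attrs_only=True keeps just attributes; elems_only=True would keep just child elements
df_attrs = pd.read_xml(StringIO(xml_with_attrs), xpath=".//order",
                       attrs_only=True, parser="etree")

print(df_attrs)
```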

Parse Complex XML with BeautifulSoup

For malformed XML, inconsistent structures, or custom extraction logic, use BeautifulSoup.

from bs4 import BeautifulSoup
import pandas as pd

xml_content = """
<items>
  <item id="1">
    <name>Widget A</name>
    <specs>
      <weight unit="kg">2.5</weight>
      <dimensions>10x20x5</dimensions>
    </specs>
  </item>
  <item id="2">
    <name>Widget B</name>
    <specs>
      <weight unit="kg">1.8</weight>
      <dimensions>8x15x4</dimensions>
    </specs>
  </item>
</items>
"""

soup = BeautifulSoup(xml_content, "xml")

data = []
for item in soup.find_all("item"):
    data.append({
        "id": item.get("id"),
        "name": item.find("name").text,
        "weight": float(item.find("weight").text),
        "weight_unit": item.find("weight").get("unit"),
        "dimensions": item.find("dimensions").text
    })

df = pd.DataFrame(data)

print(df)

Output:

  id      name  weight weight_unit dimensions
0  1  Widget A     2.5          kg    10x20x5
1  2  Widget B     1.8          kg     8x15x4

Handle Missing Elements Safely

from bs4 import BeautifulSoup
import pandas as pd

xml_with_missing = """
<products>
  <product>
    <name>Item A</name>
    <price>10.00</price>
  </product>
  <product>
    <name>Item B</name>
    <!-- price is missing -->
  </product>
</products>
"""

soup = BeautifulSoup(xml_with_missing, "xml")

data = []
for product in soup.find_all("product"):
    name_tag = product.find("name")
    price_tag = product.find("price")

    data.append({
        "name": name_tag.text if name_tag else None,
        "price": float(price_tag.text) if price_tag else None
    })

df = pd.DataFrame(data)

print(df)

Output:

     name  price
0  Item A   10.0
1  Item B    NaN
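When the text content itself is unreliable (empty strings, "N/A", stray units), pd.to_numeric with errors="coerce" turns anything unparseable into NaN instead of raising. A small standalone sketch with illustrative messy values:

```python
import pandas as pd

# Values as they might come out of a messy XML feed
df = pd.DataFrame({
    "name": ["Item A", "Item B", "Item C"],
    "price": ["10.00", "", "N/A"],
})

# Unparseable entries become NaN instead of raising ValueError
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df)
```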

Use Standard Library for Zero Dependencies

The built-in xml.etree.ElementTree module requires no external packages.

import xml.etree.ElementTree as ET
import pandas as pd

xml_string = """
<records>
  <record>
    <id>1</id>
    <value>100</value>
  </record>
  <record>
    <id>2</id>
    <value>200</value>
  </record>
</records>
"""

root = ET.fromstring(xml_string)

data = []
for record in root.findall(".//record"):
    data.append({
        "id": record.find("id").text,
        "value": int(record.find("value").text)
    })

df = pd.DataFrame(data)

print(df)

Output:

  id  value
0  1    100
1  2    200
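For files too large to hold in memory at once, ElementTree's iterparse() processes records as they stream by. This sketch feeds it a string via StringIO; in practice you would pass a file path:

```python
import xml.etree.ElementTree as ET
from io import StringIO
import pandas as pd

xml_string = """
<records>
  <record><id>1</id><value>100</value></record>
  <record><id>2</id><value>200</value></record>
</records>
"""

rows = []
# "end" events fire once an element and all its children are fully parsed
for event, elem in ET.iterparse(StringIO(xml_string), events=("end",)):
    if elem.tag == "record":
        rows.append({
            "id": elem.find("id").text,
            "value": int(elem.find("value").text),
        })
        elem.clear()  # release the processed element to keep memory flat

df = pd.DataFrame(rows)
print(df)
```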

Handle Namespaces in XML

XML namespaces require special handling in XPath expressions.

import pandas as pd
from io import StringIO

namespaced_xml = """
<root xmlns:ns="http://example.com/ns">
  <ns:item>
    <ns:name>Product 1</ns:name>
  </ns:item>
  <ns:item>
    <ns:name>Product 2</ns:name>
  </ns:item>
</root>
"""

# Define namespace mapping
namespaces = {"ns": "http://example.com/ns"}

df = pd.read_xml(
    StringIO(namespaced_xml),
    xpath=".//ns:item",
    namespaces=namespaces
)

print(df)

Output:

        name
0  Product 1
1  Product 2
Warning: Namespace handling can be tricky. If read_xml() fails, try BeautifulSoup with soup.find_all("item"), which matches elements regardless of their namespace prefix.
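The same namespace mapping also plugs into the standard library's findall(). A sketch on the document above:

```python
import xml.etree.ElementTree as ET
import pandas as pd

namespaced_xml = """
<root xmlns:ns="http://example.com/ns">
  <ns:item><ns:name>Product 1</ns:name></ns:item>
  <ns:item><ns:name>Product 2</ns:name></ns:item>
</root>
"""

ns = {"ns": "http://example.com/ns"}
root = ET.fromstring(namespaced_xml)

# Prefixes in the path are resolved through the mapping
rows = [{"name": item.find("ns:name", ns).text}
        for item in root.findall(".//ns:item", ns)]

df = pd.DataFrame(rows)
print(df)
```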

Quick Reference

Method           Best For                      Speed         Dependencies
pd.read_xml()    Standard XML files            ⚡ Fast       pandas, lxml (optional)
BeautifulSoup    Malformed/complex XML         🐢 Moderate   beautifulsoup4, lxml
xml.etree        Simple XML, no dependencies   🚀 Fast       None (standard library)
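The table covers reading; the reverse direction exists too. DataFrame.to_xml() (also pandas 1.3+) serializes a frame back to XML, and it likewise accepts parser="etree" when lxml isn't installed. A quick round-trip sketch:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# index=False drops the row index; parser="etree" avoids the lxml dependency
xml_out = df.to_xml(index=False, parser="etree")

print(xml_out)
```

By default rows are written as `<row>` elements under a `<data>` root; the root_name and row_name parameters rename them.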

Conclusion

Use pd.read_xml() for most XML parsing tasks. It's fast, concise, and handles attributes automatically. Upgrade Pandas to the latest version and install lxml for optimal performance. For malformed XML or custom extraction logic, BeautifulSoup provides flexible parsing. Reserve xml.etree.ElementTree for environments where external dependencies aren't available.