XML Data Processing

What is XML?

XML stands for eXtensible Markup Language. It's a text format for storing structured data using tags.

What XML looks like:

code.txtXML

<person>
  <name>John</name>
  <age>25</age>
  <city>New York</city>
</person>

Where XML is used:

Configuration files
Data exchange between systems
Web services (SOAP APIs)
Office documents (docx, xlsx)
RSS feeds

XML vs JSON:

XML: More verbose, wraps every value in opening and closing tags, common in older and enterprise systems
JSON: Simpler and lighter, the usual choice for modern web APIs
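
To make the comparison concrete, here is a small sketch that reads the same record from both formats using only the standard library. The record and the variable names are made up for illustration, and the XML side uses the ElementTree module that the next section introduces.

```python
import json
import xml.etree.ElementTree as ET

# The same (made-up) record in both formats.
xml_text = "<person><name>John</name><age>25</age></person>"
json_text = '{"name": "John", "age": 25}'

# XML: every value comes back as text, so numbers need converting.
person_xml = ET.fromstring(xml_text)
xml_name = person_xml.find("name").text
xml_age = int(person_xml.find("age").text)

# JSON: the parser returns Python types directly.
person_json = json.loads(json_text)
json_name = person_json["name"]
json_age = person_json["age"]

print(xml_name, xml_age)
print(json_name, json_age)
```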

The xml.etree.ElementTree Module

Python has built-in XML support in its standard library, so there is nothing extra to install.

code.pyPython

import xml.etree.ElementTree as ET

What this does: Imports the XML parser module under the shorter name ET.

Reading XML String

code.pyPython

import xml.etree.ElementTree as ET

xml_string = """
<person>
    <name>John</name>
    <age>25</age>
    <city>New York</city>
</person>
"""

root = ET.fromstring(xml_string)

print("Tag:", root.tag)
print("Name:", root.find("name").text)
print("Age:", root.find("age").text)

What this does:

fromstring() parses XML text and returns the root element
root is the top element (here, <person>)
find() locates a child element by tag name
.text gets the content inside the tags, always as a string
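
Because .text always returns a string, numeric values need converting before you can calculate with them. A minimal sketch, reusing the person example:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<person><name>John</name><age>25</age></person>")

age_text = root.find("age").text   # "25" (a str, not an int)
age = int(age_text)                # convert before doing arithmetic

print(type(age_text).__name__)     # str
print(age + 1)                     # 26
```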

Reading XML File

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("data.xml")
root = tree.getroot()

print("Root tag:", root.tag)

for child in root:
    print("Child:", child.tag, "Value:", child.text)

What this does:

parse() reads an XML file and returns a tree object
getroot() gets the top element of that tree
The for loop visits each direct child of the root
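
If you want to try this without creating data.xml first, parse() also accepts an open file object. The sketch below uses io.StringIO as a stand-in for a file; the XML content is invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the contents of data.xml (invented for this example).
fake_file = io.StringIO("<person><name>John</name><age>25</age></person>")

tree = ET.parse(fake_file)   # parse() accepts a filename or a file object
root = tree.getroot()

children = [(child.tag, child.text) for child in root]
print(children)
```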

Finding Elements

Find First Match

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("students.xml")
root = tree.getroot()

first_student = root.find("student")
name = first_student.find("name").text
print("First student:", name)

What find() does: Returns the first direct child whose tag matches, or None if there is no match.
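
Two behaviours of find() are easy to miss: with a plain tag name it only looks at direct children, and it returns None when nothing matches. A short sketch with invented sample data shaped like students.xml:

```python
import xml.etree.ElementTree as ET

# Invented sample data shaped like the students.xml used above.
root = ET.fromstring("""
<students>
    <student><name>John</name></student>
    <student><name>Maria</name></student>
</students>
""")

first = root.find("student")      # first direct child named <student>
missing = root.find("teacher")    # no such child, so this is None
nested = root.find("name")        # <name> is a grandchild, so also None

print(first.find("name").text)    # John
print(missing, nested)            # None None
```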

Find All Matches

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("students.xml")
root = tree.getroot()

students = root.findall("student")

for student in students:
    name = student.find("name").text
    grade = student.find("grade").text
    print("Student:", name, "Grade:", grade)

What findall() does: Returns a list of all direct children whose tag matches. The list is empty if nothing matches.
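
Since findall() returns an ordinary list, it works well with list comprehensions, and a search with no results is safe to loop over. A minimal sketch with made-up student data:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<students>
    <student><name>John</name><grade>A</grade></student>
    <student><name>Maria</name><grade>B</grade></student>
</students>
""")

# Build a list of names from every <student> child.
names = [student.find("name").text for student in root.findall("student")]
print(names)

# No match gives an empty list, so a for loop simply runs zero times.
teachers = root.findall("teacher")
print(teachers)
```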

Reading XML Attributes

XML tags can have attributes.

XML with attributes:

code.txtXML

<student id="1" status="active">
    <name>John</name>
</student>

Reading attributes:

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("students.xml")
root = tree.getroot()

for student in root.findall("student"):
    student_id = student.get("id")
    status = student.get("status")
    name = student.find("name").text

    print("ID:", student_id)
    print("Status:", status)
    print("Name:", name)
    print()

What .get() does: Returns the value of an attribute as a string, or None if the element has no such attribute.
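
The sketch below shows two related tools: a fallback value as the second argument to .get(), and the .attrib dictionary that holds every attribute at once. The "email" attribute is hypothetical and deliberately absent from the data.

```python
import xml.etree.ElementTree as ET

student = ET.fromstring('<student id="1" status="active"><name>John</name></student>')

student_id = student.get("id")                   # "1" (attribute values are strings)
email = student.get("email")                     # None: the attribute is absent
email_or_default = student.get("email", "n/a")   # second argument is the fallback
all_attributes = student.attrib                  # plain dict of every attribute

print(student_id, email, email_or_default)
print(all_attributes)
```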

Nested XML

XML elements can be nested inside each other, several levels deep.

Example XML:

code.txtXML

<school>
    <classroom>
        <student>
            <name>John</name>
            <subjects>
                <subject>Math</subject>
                <subject>Science</subject>
            </subjects>
        </student>
    </classroom>
</school>

Reading nested data:

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("school.xml")
root = tree.getroot()

classroom = root.find("classroom")
student = classroom.find("student")
name = student.find("name").text

print("Student:", name)
print("Subjects:")

subjects = student.find("subjects")
for subject in subjects.findall("subject"):
    print("-", subject.text)

What this does: Navigates down one level at a time (school, classroom, student) to reach the data.
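
Chained find() calls can often be shortened. As a sketch of two alternatives, find() accepts a slash-separated path, and iter() walks the whole tree looking for a tag:

```python
import xml.etree.ElementTree as ET

school = ET.fromstring("""
<school>
    <classroom>
        <student>
            <name>John</name>
            <subjects>
                <subject>Math</subject>
                <subject>Science</subject>
            </subjects>
        </student>
    </classroom>
</school>
""")

# One path instead of three chained find() calls.
name = school.find("classroom/student/name").text

# iter() walks the whole tree and yields every <subject>, however deep.
subjects = [subject.text for subject in school.iter("subject")]

print(name, subjects)
```

The path version still returns None if any step along the way is missing, so the same existence check applies.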

Using XPath

XPath is a compact path syntax for finding elements. ElementTree supports a limited subset of it.

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("students.xml")
root = tree.getroot()

names = root.findall(".//name")

for name in names:
    print(name.text)

What .// means: Find all elements with this tag anywhere below the current element, at any depth.
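
The difference from a plain tag name is easiest to see side by side. In this invented example the root has its own <name> child in addition to the names nested inside each student:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<students>
    <name>Class 5A</name>
    <student><name>John</name></student>
    <student><name>Maria</name></student>
</students>
""")

direct = [e.text for e in root.findall("name")]       # direct children only
anywhere = [e.text for e in root.findall(".//name")]  # all descendants

print(direct)
print(anywhere)
```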

More XPath examples:

code.pyPython

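import xml.etree.ElementTree as ET

root = ET.parse("students.xml").getroot()

root.findall("./student")  # <student> elements directly under root

root.findall("./student/name")  # <name> children of those students

root.findall(".//student[@status='active']")  # students at any depth with status="active"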

XPath patterns:

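. the current element
.. the parent element (cannot go above the element you started the search from)
.// all descendants, at any depth
[@attr='value'] keep only elements whose attribute has this value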
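
The following sketch tries these patterns on made-up data, along with two other filters that ElementTree supports: [tag='text'] for matching a child's text and [1] for picking by position.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<students>
    <student id="1" status="active"><name>John</name></student>
    <student id="2" status="inactive"><name>Maria</name></student>
    <student id="3" status="active"><name>Li</name></student>
</students>
""")

# [@attr='value']: keep only elements whose attribute matches.
active = [s.get("id") for s in root.findall("./student[@status='active']")]

# [tag='text']: keep elements that have a child with this exact text.
maria = root.find("./student[name='Maria']")

# [1]: position filter, counting from 1.
first = root.find("./student[1]")

# ..: step from each <name> back up to its parent <student>.
parents = [p.get("id") for p in root.findall(".//name/..")]

print(active, maria.get("id"), first.get("id"), parents)
```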

Creating XML

You can also build XML in Python code and save it to a file.

code.pyPython

import xml.etree.ElementTree as ET

root = ET.Element("students")

student1 = ET.SubElement(root, "student")
student1.set("id", "1")

name1 = ET.SubElement(student1, "name")
name1.text = "John"

age1 = ET.SubElement(student1, "age")
age1.text = "20"

tree = ET.ElementTree(root)
ET.indent(tree, space="    ")  # Python 3.9+; without this, all elements end up on a single line
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

print("XML file created")

What this creates:

code.txtXML

<?xml version='1.0' encoding='utf-8'?>
<students>
    <student id="1">
        <name>John</name>
        <age>20</age>
    </student>
</students>
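
If you need the XML as a string rather than a file, for example to print it or send it somewhere, ET.tostring() is one option. This sketch also shows the effect of ET.indent(), which requires Python 3.9 or newer; the element names simply mirror the example above.

```python
import xml.etree.ElementTree as ET

root = ET.Element("students")
student = ET.SubElement(root, "student", id="1")   # attributes can be passed as keywords
ET.SubElement(student, "name").text = "John"

# encoding="unicode" returns a str; the default returns bytes.
compact = ET.tostring(root, encoding="unicode")
print(compact)

ET.indent(root, space="    ")   # Python 3.9+: adds line breaks and indentation in place
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The first print shows everything on one line; the second shows the same elements spread over indented lines.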

Practice Example

The scenario: Process a product catalog stored in an XML file.

Example XML (products.xml):

code.txtXML

<catalog>
    <product id="1" category="Electronics">
        <name>Laptop</name>
        <price>999.99</price>
        <stock>5</stock>
    </product>
    <product id="2" category="Electronics">
        <name>Phone</name>
        <price>599.99</price>
        <stock>10</stock>
    </product>
    <product id="3" category="Accessories">
        <name>Mouse</name>
        <price>25.99</price>
        <stock>50</stock>
    </product>
</catalog>

Python program:

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("products.xml")
root = tree.getroot()

print("Product Catalog")
print("=" * 40)

total_value = 0
product_count = 0

for product in root.findall("product"):
    product_id = product.get("id")
    category = product.get("category")
    name = product.find("name").text
    price = float(product.find("price").text)
    stock = int(product.find("stock").text)

    value = price * stock
    total_value += value
    product_count += 1

    print("Product ID:", product_id)
    print("Name:", name)
    print("Category:", category)
    print("Price:", price)
    print("Stock:", stock)
    print("Value:", round(value, 2))  # round away floating-point noise
    print()

print("=" * 40)
print("Total products:", product_count)
print("Total inventory value:", round(total_value, 2))

electronics = root.findall(".//product[@category='Electronics']")
print("Electronics count:", len(electronics))

What this program does:

Parses the XML file
Loops through all products
Extracts attributes and child elements
Converts price and stock from text to numbers and calculates inventory value
Uses XPath to filter by category
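
Prices stored as float can pick up tiny rounding errors when multiplied and summed. One possible alternative, sketched here as a variation rather than a replacement for the program above, is to build Decimal values straight from the XML text. The catalog is a trimmed copy of products.xml.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
    <product id="1"><price>999.99</price><stock>5</stock></product>
    <product id="2"><price>599.99</price><stock>10</stock></product>
    <product id="3"><price>25.99</price><stock>50</stock></product>
</catalog>
""")

total = Decimal("0")
for product in catalog.findall("product"):
    price = Decimal(product.find("price").text)   # built from the text, so no float rounding
    stock = int(product.find("stock").text)
    total += price * stock

print("Total inventory value:", total)
```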

Converting XML to Dictionary

code.pyPython

import xml.etree.ElementTree as ET

tree = ET.parse("student.xml")
root = tree.getroot()

student_dict = {}
for child in root:
    student_dict[child.tag] = child.text

print(student_dict)

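What this creates, assuming student.xml holds the person example shown earlier: {'name': 'John', 'age': '25', 'city': 'New York'}. Every value is a string, including the age.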
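
The same idea extends to a file with several records. The sketch below uses a hypothetical helper, element_to_dict (not part of ElementTree), that merges an element's attributes with the text of its children, and applies it to made-up student data.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<students>
    <student id="1"><name>John</name><age>20</age></student>
    <student id="2"><name>Maria</name><age>22</age></student>
</students>
""")

def element_to_dict(element):
    """Hypothetical helper: merge an element's attributes and its children's text."""
    record = dict(element.attrib)          # start with the attributes
    for child in element:
        record[child.tag] = child.text     # later duplicate tags overwrite earlier ones
    return record

students = [element_to_dict(s) for s in root.findall("student")]
print(students)
```

This flat approach only suits simple records. Repeated tags or deeper nesting would need a different structure, such as lists or nested dictionaries.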

Key Points to Remember

XML uses tags to structure data. Tags usually come in pairs, opening and closing; an empty element can be written as a single self-closing tag such as <note/>.

ET.parse() reads XML files, ET.fromstring() reads XML strings.

find() gets first match, findall() gets all matches. Use .text to get content, .get() for attributes.

XPath (the .// pattern) helps find elements anywhere in the tree.

XML is more verbose than JSON but still widely used in enterprise systems.

Common Mistakes

Mistake 1: Forgetting .text

code.pyPython

name = root.find("name")  # This is an Element object, not the text
name = root.find("name").text  # This is the actual text

Mistake 2: Wrong method

code.pyPython

students = root.find("student")  # Only gets first one
students = root.findall("student")  # Gets all

Mistake 3: Not checking if element exists

code.pyPython

name = root.find("name").text  # AttributeError if <name> is missing, because find() returns None

Better:

code.pyPython

name_element = root.find("name")
if name_element is not None:
    name = name_element.text
else:
    name = None  # or a default such as "Unknown"
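
A shorter option for the same check is findtext(), which looks up the element and returns its text in one step, falling back to a default when the element is missing. A minimal sketch; the "Unknown" default and the sample tags are just examples:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<person><name>John</name><nickname/></person>")

name = root.findtext("name", default="Unknown")          # "John"
city = root.findtext("city", default="Unknown")          # element missing, so the default is used
nickname = root.findtext("nickname", default="Unknown")  # element exists but is empty: ""

print(name, city, repr(nickname))
```

Note the last case: the default only applies when the element is absent, not when it is present but empty.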

What's Next?

You now know the basics of XML. Next comes Introduction to APIs: how to connect your Python programs to web services and get data from the internet.