Parsing XML in Python – The Ultimate Guide 2024
Standards are a means to clear and define communication between people and things in the world. For example, the human language, USB sockets on computers, or the fact that you must add cereal before pouring milk. When it comes to computer applications and systems, one standard stands out above the rest as the most popular choice for developers – XML (eXtensible Markup Language). In this article, we’ll explore how you can parse data from XML files using Python’s built-in libraries, see the best methods to do so, and understand the importance of effectively reading information.
What is XML?
XML (eXtensible Markup Language) is a markup language used in many applications designed to store and transport data. It’s a standard for creating structured documents and exchanging data between different systems and applications on the web. An XML file uses tags to define elements within a document, similar to HTML (Hypertext Markup Language). However, while HTML is designed to display data, XML is designed to store and use data interchangeably in various applications.
XML example
Let’s take a look at an example XML file:
<?xml version="1.0" encoding="UTF-8"?><animal><item><name>cat</name><description> Cats are small, carnivorous mammals that are often kept as pets. Known for their agility, flexibility, and independent nature, cats make delightful companions.</description><breeds><breed>Persian</breed><breed>Siamese</breed><breed>Maine Coon</breed><breed>Bengal</breed><breed>Ragdoll</breed></breeds></item></animal>
Here, you can see many elements are organized and structured hierarchically in the content within a document. In this example, these are the following:
- XML declaration – a brief overview of the XML document, version, and encoding;
- <animal> is the root element;
- <item> represents an instance of an animal, specifically a cat;
- <name> contains the name of the animal, which is “cat”;
- <description> provides a short description of the “cat”;
- <breeds> is a container for listing different cat breeds using the <breed> element.
These lists are easy for humans and computers to read and understand. That’s why they’re so popular in storing and exchanging data between different systems – it doesn’t overcomplicate things and allows complete freedom to create organized item lists in any way you want.
How do you parse XML with Python?
Python offers several built-in ways of parsing XML files, so you don’t need to rush to install any external libraries. The methods are the xml.dom.minidom and the xml.etree.ElementTree modules and XPath expressions. Let’s take a closer look at each of them.
xml.dom.minidom is a module in Python's standard library that provides a minimal implementation of the Document Object Model (DOM) for working with XML documents. The DOM represents the structure of XML documents as a tree of objects, making it simple to manage, traverse, and modify XML file content through code. It provides a full representation of the document, allowing you to manipulate and look through the XML data quickly. While it’s a very convenient way of doing so, this approach is memory-intensive and may not be efficient for large XML documents.
xml.etree.ElementTree provides a slightly different method to parse and manipulate XML documents. The module is based on the ElementTree API, which is more lightweight, efficient, and designed to be easy for everyday XML file processing tasks. It represents XML documents as a tree of Element objects. Each Element object represents an XML element in the document, and it can have child elements, attributes, and text content. While it lacks some of the features of the full DOM, it’s often faster and uses less memory, making it suitable for large documents.
XPath is another method you can use to navigate through an XML file. While it’s not a Python library, it’s a universal method to parse XML files that can be used in many other programming languages, such as JavaScript, Java, C, or C++. You can even use it to scrape the web with Google Sheets! It follows the logic of building an expression by writing a path from the root element to the desired destination node that details how an XML element should be reached. It’s not the easiest nut to crack, but it’s useful.
These are just a few ways of parsing XML files in Python. While there are many other methods to do it, the Python coding language stands out as a convenient and easy tool due to its lightweight, easy-to-use modules and simple implementation. In the following sections, we’ll explore the xml.etree.ElementTree module and XPath expression language in greater detail and see how they can be used with actual data to extract valuable data from XML files.
Parsing an XML document with ElementTree
Let’s begin with Python’s xml.etree.ElementTree module. Since data in an XML file is structured hierarchically, it makes the most sense to represent it as a tree of elements that branch out from one another. The module has two classes for this purpose – ElementTree represents the whole XML document as a tree, while Element is a single node in a tree. Think of it as a real tree – it starts at the roots and branches out in several ways, but there's an apple at the end of each branch. You need to navigate through the branches like a nimble squirrel to get a taste of that juicy fruit of XML data.
Using the ElementTree
For this example, we’ll be using the following XML file:
<?xml version="1.0" encoding="UTF-8"?><pets><cat><name>Jinx</name><age>2</age><color>Gray</color><breed>Maine Coon</breed></cat><cat><name>Kafka</name><age>3</age><color>White</color><breed>Siamese</breed></cat><cat><name>Mori</name><age>1</age><color>Orange</color><breed>Tabby</breed></cat></pets>
First, create a Python script file and import the xml.etree.ElementTree library. It’s commonly shortened with an ET alias.
import xml.etree.ElementTree as ET
Next, import the data from the XML file. You can directly paste the XML file string in the code and parse from it. Notice the triple quotation marks (""") that are used on both sides of the XML document string when assigning it to a variable:
import xml.etree.ElementTree as ETxml = """<?xml version="1.0" encoding="UTF-8"?><pets><cat><name>Jinx</name><age>2</age><color>Gray</color><breed>Maine Coon</breed></cat><cat><name>Kafka</name><age>3</age><color>White</color><breed>Siamese</breed></cat><cat><name>Mori</name><age>1</age><color>Orange</color><breed>Tabby</breed></cat></pets>"""root_element = ET.fromstring(xml)print(root_element)
Alternatively, a better practice is to save the XML in a separate file. Let’s name it data.xml. You can then parse information from it like this:
import xml.etree.ElementTree as ETtree = ET.parse("data.xml")root_element = tree.getroot()print(root_element)
Both code examples will return the same result – the root element of the XML tree. You can navigate through the entire tree to get the XML data you want from here.
Navigating the XML tree
Following our code example, we extracted the root element of the XML file. Every node will be a descendant of the root element pets. Keeping this in mind, you can easily navigate through the tree by specifying the nodes you want to get by iterating through child elements or by index.
Loop method
You’ll need to write a loop to navigate through the tree by searching for child elements:
for child in root_element:print(child.tag)
The script will only print the direct child elements from the <pets> tag, resulting in just several <cat> tags being printed. To iterate further, you’ll need to iterate through each <cat> and print the text information:
for child in root_element:for desc in child:print(desc.text)
Index method
While the loop method helps print as much information as possible, writing so many nested loops can be a hassle. To get to a specific node more quickly, you can use indexes to point to where the element is. For example, let’s get the color of the first cat. If you take a brief look at our XML, you can assign a number to each node based on where it’s located relevant to its parent element:
...<pets> #0<cat> #0<name>Jinx</name> #0<age>2</age> #1<color>Gray</color> #2 (the element you want to get)<breed>Maine Coon</breed> #3</cat>...
Python is a zero-based language, so you’ll need to remember to start counting from 0 instead of 1. Since you want to get the first cat, you’ll write the first index as 0. Next, you’ll need to see which node is the <color> tag – it’s the 3rd node under <cat>; therefore, assign the index of 2.
cat_info = root_element[0][2].textprint(cat_info)
Other methods
The Element class has several other valuable methods to get the desired information. One is Element.iter(), which iterates through every node below it. You can even specify which element you want to print. For example, let’s get information about how old each cat is by specifying the “age”:
for age in root_element.iter("age"):print(age.text)
Element.find() will find the first child of a node with the particular tag. Element.findall() will find all nodes that are direct child elements of a current node:
for animal in root_element.findall("cat"): # Loop and find all direct child elements of <pets>, which are <cat>name = animal.find("name").text # Find the first <name> element under a particular <cat>print(name) # Print the name
Using any of these methods, you can build a precise script to navigate to a piece of information in an XML file. In a real-world application, the XML files might be far more complex than the example here, and you’ll need to write longer loops, dynamic indexes, and take advantage of the various Element class methods.
Extracting data from XML
To extract data from an XML file, use one of the examples above to navigate the XML file and get only the needed XML data. The next step depends entirely on what you’re trying to do – it may be enough to print the information to the terminal, or you could save it to a CSV file for later use and analysis.
To print it to the Terminal or an equivalent command line tool, simply use the print() function as before.
Saving the information to a CSV file will require a little more effort. You’ll need to import the csv library to write the XML data to a file. Here, the script will only get each cat's name and age and save that information to a CSV file. Follow the comments in the code if you’re unsure what each line is for:
import xml.etree.ElementTree as ETimport csv # Import the CSV libraryimport xml.etree.ElementTree as ETtree = ET.parse("data.xml")root_element = tree.getroot()with open("cats_data.csv", 'w', newline='') as csvfile: # Open the CSV file to write incsv_writer = csv.writer(csvfile) # Create a new CSV writer objectcsv_writer.writerow(['Name', 'Age']) # Write the header informationfor cat in root_element.findall('cat'): # Iterate through all cat elements in the XMLname = cat.find('name').text # Extract the nameage = cat.find('age').text # Extract the agecsv_writer.writerow([name, age]) # Write the information into the CSV fileprint(f"Data saved to CSV file.")
After you run the script, you should see the cats_data.csv file appear in the same directory as your script file. If you check it, you’ll see it matches the data from the XML file:
Name,AgeJinx,2Kafka,3Mori,1
That’s it! This data can be used for further analysis with other tools and applications. No matter how much extra data will be added to the XML file, this script will always print the name and age of each cat and export it to CSV.
Parsing an XML document with XPath
XPath (XML Path Language) is a query language for selecting nodes from an XML document. It provides a way to navigate and query the hierarchical structure of XML data by specifying the location of elements and attributes. XPath uses a path notation, similar to file paths in a file system, to traverse the XML document tree and get to specific nodes.
XPath is a vibrant expression language featuring 200+ functions for various purposes, from simple to complex. For a more in-depth look into all the possible methods and ways XPath can be used, check out the W3Schools tutorial or refer to the devhints.io cheatsheet.
Using XPath
To use XPath, we’ll continue using the Python xml.etree.ElementTree library. This is because Python can’t read or understand XPath queries on its own. The Element class has a method we’ve briefly explored – Element.findall(). To write XPath queries, you’ll simply need to use this method with the paths inside it.
Searching XML with XPath
We’ll be working with the same XML data as before, so copy and save the XML file from above if you haven’t done so already. Let’s start with a familiar example – find the root element of the XML file tree. Begin by importing the xml.etree.ElementTree library and get the XML data from the data.xml file.
import xml.etree.ElementTree as ETtree = ET.parse("data.xml")root_element = tree.getroot()
The next step is to use the Element.findall() method and pass the “.” string to it. This is XPath syntax for selecting the current node. Since you haven’t navigated anywhere yet, the current node will always be the root of the tree. Here’s what it looks like in the code:
root = root_element.findall(".")print(root)
Those are the basics of using XPath queries with Python to get data from an XML file. You may also use libraries such as lxml or libxml2 that can do the same job for you. Read their dedicated documentation to learn more about using them together with Python to parse XML files.
Using XPath to parse XML
To get more specific data, the XPath becomes slightly more complex. Let’s say you want to extract the breeds of each cat. The XPath query will look something like this:
# Use .findall() with XPath to get all breedsbreeds = root_element.findall('.//cat/breed')# Print the breedsfor breed in breeds:print(breed.text)
This script first selects the current root node to start with. Then, with a double slash (//) it selects all subelements beneath the root. In this case, //cat selects all <cat> nodes. Finally, the single slash (/) will select one element from the parent node, so /breed gets the first <breed> element under <cat>. All of this information is stored in the breeds array, so for the final step, you’ll need to write a loop that will select each Element item from the array and print its text value. This is the result:
Maine CoonSiameseTabby
You can even write more complex XPath queries that only pick out items with specific parameters. Say you wanted to get only the names of cats that are three years old – this is how the script would look like:
# Use .findall() with XPath to get names of cats that are 3 years oldcat_names_3_years_old = root_element.findall('.//cat[age="3"]/name')# Print the namesfor cat_name in cat_names_3_years_old:print(cat_name.text)
This XPath is similar to the one previously used, but it uses brackets ([]) with a parameter to only select <cat> nodes that have the <age> value of 3. From here, it’ll select the names from the filtered results and print their names.
While XPath might initially seem daunting, it offers unlimited possibilities and flexibility when selecting nodes from a tree. Experiment with other queries, modify the example data or run the scripts through different XML files and see what they return.
Best practices
Now that you’ve learned the basics of parsing XML documents with Python, you can write your scripts and integrate them into your environment for effective XML data management. While trial and error are the best ways to succeed in writing scripts, here are some valuable tips to keep in mind that will help you out in the long run:
- Handle errors. When parsing an XML document, be prepared for errors such as invalid or missing elements. Implement error-handling mechanisms to handle these situations, preventing unexpected crashes and making your code more robust.
- Compatibility. Be mindful of XML encoding and ensure that your Python environment supports the encoding in the XML document. Additionally, consider using libraries that support the latest XML standards to ensure compatibility with various XML formats and specifications.
- Iterative parsing. For large XML documents, consider using iterative parsing to process the XML in a memory-efficient manner. This involves iterating through the XML elements one at a time instead of loading the entire document into memory, reducing the risk of memory issues.
- Close files. If you're parsing XML from a file, it's good to close it afterward. This ensures system resources are released promptly, preventing potential issues with file locks or resource leaks.
Wrapping up
In this comprehensive guide, you’ve learned how to use Python and its various libraries to parse XML documents. ElementTree and XPath are invaluable tools that do the job flawlessly, so choosing which to use is entirely up to you. Employ this knowledge by building upon the example scripts or writing your own from scratch to parse XML data from any XML file.
About the author
Zilvinas Tamulis
Technical Copywriter
Zilvinas is an experienced technical copywriter specializing in web development and network technologies. With extensive proxy and web scraping knowledge, he’s eager to share valuable insights and practical tips for confidently navigating the digital world.
All information on Smartproxy Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.