Fixing minidom.toprettyxml’s Silly Whitespace

Python’s xml.dom.minidom.toprettyxml has a feature/flaw that renders it useless for many common applications.

Someone was kind enough to post a hack which works around the problem. That hack had a small bug, which I’ve fixed; you’ll find the revised code below.

UPDATED Forget the hack; I’ve found another, better solution. See below. And please leave a comment if you find these workarounds helpful, or if you come across a better solution.

METAUPDATE Since writing this post, I’ve started using lxml for all my XML processing in Python. Incidentally, it fixes the pretty printing problem with toprettyxml, but the actual reason I switched was performance: lxml parses XML significantly faster (an order of magnitude faster) than minidom. So my new recommendation is: consider using lxml instead of minidom.

The Problem

First, a short summary of the problem. (Other descriptions can be found here and here.) Feel free to jump ahead to all the workarounds, or straight to my solution of choice.

toprettyxml adds extra whitespace when printing the contents of text nodes. This may not sound like a serious drawback, but it is. Consider a simple XML snippet:

<Author>Ron Rothman</Author>

This Python script:

# python 2.4
import xml.dom.minidom as dom
myText = '''<Author>Ron Rothman</Author>'''
print dom.parseString(myText).toprettyxml()

generates this output:

<?xml version="1.0" ?>
<Author>
        Ron Rothman
</Author>

Note the extra line breaks: the text “Ron Rothman” is printed on its own line, and indented. That may not matter much to a human reading the output, but it sure as hell matters to an XML parser. (Recall: whitespace is significant in XML.)

To put it another way: the DOM object that represents the output (with line breaks) is NOT identical to the DOM object that represented the input.

Semantically, the author in the original XML is "Ron Rothman", but the author in the “pretty” XML is [approximately] " Ron Rothman ".

This is devastating news to anyone who hopes to re-parse the “pretty” XML in some other context. It means that you can’t use minidom.toprettyxml() to produce XML that will be parsed downstream.
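To make the difference concrete, here is a minimal sketch (written for Python 3, unlike the 2.4 snippet above) that parses both the original snippet and a hand-typed copy of the “pretty” output, then compares the text a downstream parser would see. The helper name element_text and the tab indentation in the second string are assumptions made for illustration.

```python
import xml.dom.minidom


def element_text(xml_text):
    """Return the data of the first child node of the document element."""
    doc = xml.dom.minidom.parseString(xml_text)
    return doc.documentElement.firstChild.data


original = "<Author>Ron Rothman</Author>"
# Hand-written stand-in for the "pretty" output shown above (tab indent assumed).
pretty = "<Author>\n\tRon Rothman\n</Author>"

print(repr(element_text(original)))
print(repr(element_text(pretty)))
```

The two values only match after stripping whitespace, which is exactly the step a downstream consumer should not have to know about.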

Workarounds

UPDATED If you’re in a rush, skip ahead to the best solution, #4.

Sidebar: Things that don’t solve the problem:

normalize()
calling toprettyxml with “creative” (non-default) parameters

1. Don’t use minidom

There are plenty of other XML packages to choose from.
But: minidom is appealing because it’s lightweight, and is included with the Python distribution. Seems a shame to toss it for just one flaw.
UPDATE I’ve started using lxml and I highly recommend it as a replacement for minidom or PyXML.
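As one illustration of the “other package” route that needs no third-party install, here is a minimal sketch using the standard library’s xml.etree.ElementTree rather than lxml. It assumes Python 3.9 or later (for ElementTree.indent), and the Book/Author document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the post.
root = ET.fromstring("<Book><Author>Ron Rothman</Author></Book>")

# indent() (Python 3.9+) inserts whitespace between elements in place;
# the text of a leaf element such as <Author> is left untouched.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)

# Re-parse the pretty output and read the author text back.
reparsed_author = ET.fromstring(pretty).find("Author").text
print(repr(reparsed_author))
```

Whether this or lxml is the better fit depends on what else you need from the library; the sketch only shows the pretty-printing behaviour.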

2. Use minidom, but don’t use toprettyxml()

Use minidom.toxml(), which doesn’t suffer from the same problem (because it doesn’t insert any whitespace).
But: Your machine-readable XML will make heads spin, should someone be foolish enough to try to read it.
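A minimal sketch of this option, in Python 3 syntax and with a made-up Book/Author document: toxml() writes everything on one line, so the text node comes back unchanged when the output is re-parsed.

```python
import xml.dom.minidom

# Hypothetical sample document for illustration.
doc = xml.dom.minidom.parseString("<Book><Author>Ron Rothman</Author></Book>")

# toxml() serializes without inserting any indentation or newlines.
compact = doc.toxml()
print(compact)

# Re-parse the compact output and read the author text back.
reparsed = xml.dom.minidom.parseString(compact)
author = reparsed.getElementsByTagName("Author")[0]
print(repr(author.firstChild.data))
```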

3. Hack toprettyxml to do The Right Thing

Replace toprettyxml by using the code below.
But: It smells. Like a hack. Fragile; likely to break with future releases of minidom.
On the other hand: It’s not that bad. And hey, it does the trick. (But YMMV.)

import xml.dom.minidom

def fixed_writexml(self, writer, indent="", addindent="", newl=""):
    # indent = current indentation
    # addindent = indentation to add to higher levels
    # newl = newline string
    writer.write(indent+"<" + self.tagName)

    attrs = self._get_attributes()
    a_names = attrs.keys()
    a_names.sort()

    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        xml.dom.minidom._write_data(writer, attrs[a_name].value)
        writer.write("\"")
    if self.childNodes:
        if len(self.childNodes) == 1 \
          and self.childNodes[0].nodeType == xml.dom.minidom.Node.TEXT_NODE:
            writer.write(">")
            self.childNodes[0].writexml(writer, "", "", "")
            writer.write("</%s>%s" % (self.tagName, newl))
            return
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))
# replace minidom's function with ours
xml.dom.minidom.Element.writexml = fixed_writexml

I just copied the original Element.writexml code (which toprettyxml relies on) from /usr/lib/python2.4/xml/dom/minidom.py and added a special case: an element whose only child is a text node is written on a single line. It ain’t pretty, but it seems to work. (Suggestions for improvements (I’m a Python n00b) are welcome.)

[Credit to Oluseyi at gamedev.net for the original hack; I just fixed it so that it worked with character entities.]

UPDATE! 4. Use xml.dom.ext.PrettyPrint

Who knew? All along, an alternative to toprettyxml was available to me. Works like a charm. Robust. 100% Kosher Python. Definitely the method I’ll be using.
But: Need to have PyXML installed. In my case, it was already installed, so this is my method of choice. (It’s worth pointing out that if you already have PyXML installed, you might want to consider using it exclusively, in lieu of minidom.)

We just write a simple wrapper, and we’re done:

from xml.dom.ext import PrettyPrint
from StringIO import StringIO

def toprettyxml_fixed (node, encoding='utf-8'):
    tmpStream = StringIO()
    PrettyPrint(node, stream=tmpStream, encoding=encoding)
    return tmpStream.getvalue()

Conclusion

One lesson from all this: TMTOWTDI applies to more than just Perl. :)

Please–let me know what you think.

This entry was posted by Ron on Sunday, June 15th, 2008 and is filed under Python, Technology.

27 Responses to “Fixing minidom.toprettyxml’s Silly Whitespace”

June 15th, 2008 at 4:07 am

Ron [author of post] tracked back:

Yikes, I just came across a better solution on comp.lang.python: use xml.dom.ext.PrettyPrint.

I’ve updated the main post to reflect this option. (See #4.)

1
July 8th, 2008 at 12:03 pm

gmolleda tracked back:

I have found errors in fixed_writexml function:
1.- Errors in number %s into strings
2.- Errors in ending labels, the functions write label and not

The correct is:

def fixed_writexml(self, writer, indent=””, addindent=””, newl=””):
# indent = current indentation
# addindent = indentation to add to higher levels
# newl = newline string
writer.write(indent+””)
self.childNodes[0].writexml(writer, “”, “”, “” )
writer.write(“%s” % (self.tagName, newl))
return
writer.write(“>%s” % (newl))
for node in self.childNodes:
node.writexml(writer,indent+addindent,addindent,newl)
writer.write(“%s%s” % (indent,self.tagName,newl))
else:
writer.write(“/>%s”%(newl))

# replace minidom’s function with ours
xml.dom.minidom.Element.writexml = fixed_writexml

2
July 8th, 2008 at 7:05 pm

Ron [author of post] tracked back:

gmolleda, thanks for your feedback. You’re right, the code as published wasn’t working, because WordPress treated my “<” and “>” as HTML tag delimiters instead of converting them to “&lt;” and “&gt;”. (Notice that it did the same thing in your comment; your ‘writer.write(“%s%s”…’ should have a “</%s>”, but WordPress stripped the angle brackets in your comment as well.)

I’ve fixed it manually, and now WordPress is rendering it correctly. Thank you for alerting me!

3
July 9th, 2008 at 4:08 am

gmolleda tracked back:

Hi, if you add a conditional inside the for loop, the extra newline character is not added on subsequent calls to the function:

writer.write(">%s" % (newl))
for node in self.childNodes:
+    if node.nodeType is not xml.dom.minidom.Node.TEXT_NODE: # 3:
        node.writexml(writer,indent+addindent,addindent,newl)

4
July 9th, 2008 at 6:39 am

gmolleda tracked back:

Hi, I have fixed other problem [in minidom]:
When you use the writexml function to write a readable file, read that file back in, and then call writexml again, you get a lot of ugly newline characters.

By adding a new conditional inside the “for” loop, we can remove the blank lines that contain only indent and newline characters:

for node in self.childNodes:
+    if node.nodeType is not xml.dom.minidom.Node.TEXT_NODE: # 3:
+        node.writexml(writer,indent+addindent,addindent,newl)
-    #node.writexml(writer,indent+addindent,addindent,newl)

5
November 3rd, 2008 at 11:59 am

Thomas Pluempe tracked back:

Thanks for the fixed_writexml fix, made my day. I’m using this successfully with Python 2.6.

However, there’s a tiny bug there: In line 20, that’s the “writer.write(“<%s”%(newl))” line, the less than sign should be a greater than sign.

Only with this fix does it generate valid XML.

Response from Ron: Thomas, thanks for pointing out the typo! I’ve fixed it in the post. Much appreciated,
Ron

6
May 13th, 2009 at 6:11 am

sorin tracked back:

It seems that on Python 2.6.1 (Windows x64) xml.dom.ext does not exist!

7
May 13th, 2009 at 10:01 am

Thomas Pluempe tracked back:

As Ron states further up “Need to have PyXML installed.”

Also of note: Apparently pyxml doesn’t have a maintainer anymore (see https://mail.python.org/pipermail/xml-sig/2007-January/011642.html), although that may have changed since.

8
June 1st, 2009 at 1:18 pm

Ron [author of post] tracked back:

Thanks Thomas, you’re right. Indeed, I no longer use PyXML; I’ve been using lxml exclusively.

9
July 10th, 2009 at 6:50 am

Iñaki Silanes tracked back:

Interesting reading. Just one stupid question. If you do:

import xml.dom.minidom as dom

then why do you use:

xml.dom.minidom.parseString

instead of:

dom.parseString

10
September 29th, 2009 at 11:59 am

Tyler Mitchell tracked back:

Thanks for posting this. I was playing with “indent”, “addindent”, “newl”, changing the file mode during the open(…) etc – thinking the whole time I was just doing something wrong.

It’s unfortunate that a patch has been available for so long (and there are several patches floating around, some in the bugs database) but nothing fixed yet.

I find it irritating that I should have to use some external library, but maybe in future projects that’s the way to go. PyXML appears to be unmaintained(?) so maybe lxml is best. Honestly, having just completed another project with MSXML – which works wonderfully – I was very tempted to just use pythoncom and MSXML.

11
November 1st, 2009 at 7:41 pm

For Python, lxml superior to dom/minidom | JeLlib Journal tracked back:

[…] This is the source that pointed me to lxml. […]

12
June 1st, 2010 at 12:12 am

BrendanM tracked back:

I recently was working on a project where I ran into this problem and I was stuck with an older version of python so lxml and PyXML were not available. I believe I have come up with another solution. With due respect to Jamie Zawinski, you can fix it with regular expressions.

fix = re.compile(r'((?<=>)(\n[\t]*)(?=[^<\t]))|(?<=[^>\t])(\n[\t]*)(?=<)')
fixed_output = re.sub(fix, '', input_string)

Yes, it’s ugly as sin, but the regular expression above will match a newline character followed by tab characters if they come between “>” and other text, or between text and “<”. In other words, it lets you strip out the unwanted whitespace that toprettyxml() adds.

In case this comment form mangles the code with emoticons, I posted the regex here: https://pastebin.com/QH2YE8ki
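For readers wondering where this goes: a minimal sketch of applying such a regex as a post-processing step on the string returned by toprettyxml(). The pattern below is a reconstruction of the one in this comment and should be treated as an assumption (check it against the pastebin link); the sample input is hand-written, tab-indented text in the style of the old toprettyxml output, and the name strip_text_node_whitespace is made up for the example.

```python
import re

# Assumed reconstruction of the regex from the comment: a newline plus tabs
# is matched only between ">" and text, or between text and "<".
FIX = re.compile(r'((?<=>)(\n[\t]*)(?=[^<\t]))|(?<=[^>\t])(\n[\t]*)(?=<)')


def strip_text_node_whitespace(pretty_xml):
    """Remove the newline/tab runs that surround text nodes."""
    return FIX.sub('', pretty_xml)


# Hand-written stand-in for old-style toprettyxml() output (tab indent).
input_string = (
    '<?xml version="1.0" ?>\n'
    '<Book>\n'
    '\t<Author>\n'
    '\t\tRon Rothman\n'
    '\t</Author>\n'
    '</Book>\n'
)

fixed_output = strip_text_node_whitespace(input_string)
print(fixed_output)
```

In a real script, input_string would be the result of calling toprettyxml() on your document, and fixed_output is what you would write to the file. If you pass a space-based indent to toprettyxml, the tab characters in the pattern would need to change accordingly.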

13
October 13th, 2010 at 8:06 am

plok tracked back:

BrendanM’s solution is particularly amazing. Thanks.

14
October 29th, 2010 at 1:16 pm

Aaron Bell tracked back:

Install new libs to do something this simple? No way.

1. The writer hack only solves part of the problem. Text nodes are better, but we still get double newlines after every element, and the indents are 8 spaces wide. Am I doing it wrong?
2. BrendanM’s regex fix works perfectly for me – great stuff.

Aaron

15
December 1st, 2010 at 1:42 pm

Khiken Saken tracked back:

Pardon the newbie question, but how does one apply BrendanM’s regex?? I have created a program to pull the data from Excel into XML, and I save into a file. I have been trying to figure out how to use the regex fix, but I’m lost…

16
January 21st, 2011 at 5:33 pm

firebush tracked back:

BrendanM’s response worked for me, but use the link he included. My conf file used spaces instead of tabs, so I had to replace the \t characters with a space character in the regex.

17
March 9th, 2011 at 2:15 pm

reading/writing xml with python and maxscript « eduardo simioni tracked back:

[…] textNodes. And they are obviously read afterwards. A couple of different solutions are discussed on Ron Rothman’s blog, the easiest one using xml.dom.ext.PrettyPrint. :maxscript, python, reference, tips, […]

18
May 5th, 2011 at 1:17 am

Mark tracked back:

I threw this together https://pastebin.com/2zyVvBJ4, not sure if it’s particularly stable, but it seems to pass most of my quick tests.

19
May 25th, 2011 at 1:44 pm

ex4 tracked back:

Thank you very much! I really needed your hack.

20
July 1st, 2011 at 7:38 am

badas tracked back:

Works great for me, thanks

21
July 5th, 2011 at 3:21 pm

krazywar tracked back:

I am a noob as well to python programming, and i was wondering where i should put BrendanM’s piece of code?

22
June 28th, 2012 at 6:34 pm

Don tracked back:

Hey Ron, Thanks for creating this dialog. I used your fixed_writexml code a little over a year ago, and it’s worked great for us ever since. – Don

23
January 3rd, 2013 at 10:19 am

Jogi tracked back:

I think it is worth presenting a simple post production ;) solution for stripping whitespace out of the text nodes in toprettyxml output:

xml_string = xml_document.toprettyxml()
start = xml_string.find('>')
while start >= 0:
    end = xml_string.find('<', start + 1)
    if end - start > 2:
        space_between_elements = xml_string[start + 1:end]
        possible_text_node = space_between_elements.strip()
        if len(possible_text_node) > 0 and len(possible_text_node) != len(space_between_elements):
            # possible_text_node is a text node with whitespace!!!
            xml_string = xml_string.replace(space_between_elements, possible_text_node)
    start = xml_string.find('>', start + 1)

24
January 9th, 2013 at 3:50 pm

Bill H tracked back:

I found a simple fix. It does require that all the XML text is in-memory, although that’s usually not a big issue.

For my problem, I had to print the text to the display using python 2.7. The fix might have to be different for unicode. Also, note that the check for chr(9) probably really should be to check whether that string from the list only contains chr(9) character – but I have found no cases where it doesn’t work the way it is now:

pretty_text = dom.toprettyxml()
pretty_text_list = pretty_text.split(chr(10))
pretty_text = ''
for text in pretty_text_list:
    if chr(9) in text:
        pretty_text = '{0}{1}'.format(pretty_text, text)
    else:
        pretty_text = '{0}{1}\n'.format(pretty_text, text)
print pretty_text

25
January 17th, 2014 at 3:23 am

Reblog: Python Minidom and Whitespace | Acodemics tracked back:

[…] https://www.ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace/ […]

26
March 25th, 2022 at 1:11 pm

Problem with newlines when I use toprettyxml() tracked back:

[…] Some workarounds available at http://ronrothman.com/public/leftbraned/xml-dom-minidom-toprettyxml-and-silly-whitespace […]

27