{"id":252,"date":"2008-06-15T03:21:52","date_gmt":"2008-06-15T08:21:52","guid":{"rendered":"http:\/\/ronrothman.com\/public\/leftbraned\/?p=252"},"modified":"2019-02-15T20:51:22","modified_gmt":"2019-02-16T01:51:22","slug":"xml-dom-minidom-toprettyxml-and-silly-whitespace","status":"publish","type":"post","link":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/","title":{"rendered":"Fixing minidom.toprettyxml&#8217;s Silly Whitespace"},"content":{"rendered":"<p>Python&#8217;s xml.dom.minidom.toprettyxml has a feature\/flaw that renders it useless for many common applications.<br \/>\n<!--more--><\/p>\n<p><span style=\"text-decoration: line-through;\">Someone was kind enough to post a hack which works around the problem.  That hack had a small bug, which I&#8217;ve fixed; you&#8217;ll find the revised code below.<\/span><\/p>\n<p><span class=\"newButton\">UPDATED<\/span> Forget the hack; I&#8217;ve found another, better solution.  See <a href=\"#best-solution\">below<\/a>.  And please <a href=\"#commentform\">leave a comment<\/a> if you find these workarounds helpful, or if you come across a better solution.<\/p>\n<p><span class=\"newButton\">METAUPDATE<\/span> Since writing this post, I&#8217;ve started using <a href=\"https:\/\/codespeak.net\/lxml\/\">lxml<\/a> for all my XML processing in Python.  Incidentally, it avoids the pretty-printing problem that toprettyxml has, but the actual reason I switched was performance: lxml parses XML significantly faster (an order of magnitude faster) than minidom.  So my new recommendation is: consider using lxml instead of minidom.<\/p>\n<h3>The Problem<\/h3>\n<p>First, a short summary of the problem.  (Other descriptions can be found <a href=\"https:\/\/groups.google.com\/group\/comp.lang.python\/browse_thread\/thread\/fe11b9ba4d3b0120\">here<\/a> and <a href=\"https:\/\/www.velocityreviews.com\/forums\/t541462-toprettyxml-messes-up-with-whitespaces.html\">here<\/a>.)  
Feel free to jump ahead to all the <a href=\"#workarounds\">workarounds<\/a>, or straight to my <a href=\"#best-solution\">solution of choice<\/a>.<\/p>\n<p>toprettyxml adds <strong>extra whitespace<\/strong> when printing the contents of text nodes.  This may not sound like a serious drawback, but it is.  Consider a simple XML snippet:<\/p>\n<pre class=\"codeoutput\" style=\"width: 40%\">&lt;Author&gt;Ron Rothman&lt;\/Author&gt;\n<\/pre>\n<p>This Python script:<\/p>\n<pre class=\"code\"><span class=\"comment\"># python 2.4<\/span>\nimport xml.dom.minidom\nmyText = '''&lt;Author&gt;Ron Rothman&lt;\/Author&gt;'''\nprint xml.dom.minidom.parseString(myText).toprettyxml()\n<\/pre>\n<p>generates this output:<\/p>\n<pre class=\"codeoutput\" style=\"width: 40%\"><span style=\"color: #aaa\">&lt;?xml version=\"1.0\" ?&gt;<\/span>\n&lt;Author&gt;\n        Ron Rothman\n&lt;\/Author&gt;\n<\/pre>\n<p>Note the extra line breaks: the text &#8220;Ron Rothman&#8221; is printed on its own line, and indented.  That may not matter much to a human reading the output, but it sure as hell matters to an XML parser.  (Recall: <a href=\"https:\/\/www.oracle.com\/technology\/pub\/articles\/wang-whitespace.html\">whitespace <em>is significant<\/em><\/a> in XML.)<\/p>\n<p>To put it another way: the DOM object that represents the output (with line breaks) is NOT identical to the DOM object that represented the input.<\/p>\n<p>Semantically, the author in the original XML is <code>\"Ron Rothman\"<\/code>, but the author in the &#8220;pretty&#8221; XML is [approximately] <code>\"&nbsp;&nbsp;&nbsp;&nbsp;Ron Rothman&nbsp;&nbsp;&nbsp;&nbsp;\"<\/code>.<\/p>\n<p>This is devastating news to anyone who hopes to re-parse the &#8220;pretty&#8221; XML in some other context.  
It means that <strong>you can&#8217;t use minidom.toprettyxml() to produce XML that will be parsed downstream<\/strong>.<\/p>\n<p><a name=\"workarounds\"><\/a><\/p>\n<h3>Workarounds<\/h3>\n<p><span class=\"newButton\">UPDATED<\/span> If you&#8217;re in a rush, skip ahead to the <a href=\"#best-solution\">best solution<\/a>, #4.<\/p>\n<div class=\"callout1 callout-r\">\n<div class=\"inner\" style=\"text-align: left;\">\nSidebar: Things that <em>don&#8217;t<\/em> solve the problem:<\/p>\n<ul>\n<li>normalize()<\/li>\n<li>calling toprettyxml with &#8220;creative&#8221; (non-default) parameters<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<h4>1. Don&#8217;t use minidom<\/h4>\n<p>There are plenty of other XML packages to choose from.<br \/>\n<strong>But<\/strong>: minidom is appealing because it&#8217;s lightweight, and is included with the Python distribution.  Seems a shame to toss it for just one flaw.<br \/>\n<span class=\"newButton\">UPDATE<\/span> I&#8217;ve started using <a href=\"https:\/\/codespeak.net\/lxml\/\">lxml<\/a> and I highly recommend it as a replacement for minidom or PyXML.<\/p>\n<h4>2. Use minidom, but don&#8217;t use toprettyxml()<\/h4>\n<p>Use minidom.toxml(), which doesn&#8217;t suffer from the same problem (because it doesn&#8217;t insert any whitespace).<br \/>\n<strong>But<\/strong>: Your machine-readable XML will make heads spin, should someone be foolish enough to try to read it.<\/p>\n<h4>3. Hack toprettyxml to do The Right Thing<\/h4>\n<p>Replace Element.writexml, the method toprettyxml relies on, with the code below.<br \/>\n<strong>But<\/strong>: It smells.  Like a hack.  Fragile; likely to break with future releases of minidom.<br \/>\nOn the other hand: It&#8217;s not <em>that<\/em> bad.  And hey, it does the trick.  
(But YMMV.)<\/p>\n<pre class=\"code\">def fixed_writexml(self, writer, indent=\"\", addindent=\"\", newl=\"\"<!-- html comment to prevent wordpress from inserting emoticon -->):\n    # indent = current indentation\n    # addindent = indentation to add to higher levels\n    # newl = newline string\n    writer.write(indent+\"&lt;\" + self.tagName)\n\n    attrs = self._get_attributes()\n    a_names = attrs.keys()\n    a_names.sort()\n\n    for a_name in a_names:\n        writer.write(\" %s=\\\"\" % a_name)\n        <span style=\"color: #cc3;\">xml.dom.minidom.<\/span>_write_data(writer, attrs[a_name].value)\n        writer.write(\"\\\"\"<!-- html comment to prevent wordpress from inserting emoticon -->)\n    if self.childNodes:\n        <span style=\"color: #cc3;\">if len(self.childNodes) == 1 \\\n          and self.childNodes[0].nodeType == xml.dom.minidom.Node.TEXT_NODE:\n            writer.write(\"&gt;\"<!-- html comment to prevent wordpress from inserting emoticon -->)\n            self.childNodes[0].writexml(writer, \"\", \"\", \"\"<!-- html comment to prevent wordpress from inserting emoticon -->)\n            writer.write(\"&lt;\/%s&gt;%s\" % (self.tagName, newl))\n            return<\/span>\n        writer.write(\"&gt;%s\"%(newl))\n        for node in self.childNodes:\n            node.writexml(writer,indent+addindent,addindent,newl)\n        writer.write(\"%s&lt;\/%s&gt;%s\" % (indent,self.tagName,newl))\n    else:\n        writer.write(\"\/&gt;%s\"%(newl))\n<span style=\"color: #cc3;\"># replace minidom's function with ours\nxml.dom.minidom.Element.writexml = fixed_writexml<\/span>\n<\/pre>\n<p>I just copied the original toprettyxml code from <code>\/usr\/lib\/python2.4\/xml\/dom\/minidom.py<\/code> and made the modifications that are highlighted in <span style=\"background: #000; color: #cc3;\">yellow<\/span>.  It ain&#8217;t pretty, but it seems to work.  
(Suggestions for improvements (I&#8217;m a Python n00b) are welcome.)<\/p>\n<p>[Credit to <a href=\"https:\/\/www.gamedev.net\/community\/forums\/topic.asp?topic_id=497185\">Oluseyi at gamedev.net<\/a> for the original hack; I just fixed it so that it worked with character entities.]<\/p>\n<p><a name=\"best-solution\"><\/a><\/p>\n<h4><span class=\"newButton\" style=\"margin-left: 0;\">UPDATE!<\/span> 4. Use xml.dom.ext.PrettyPrint<\/h4>\n<p>Who knew?  All along, an alternative to toprettyxml was available to me.  Works like a charm.  Robust.  100% Kosher Python.  Definitely the method I&#8217;ll be using.<br \/>\n<strong>But<\/strong>: Need to have PyXML installed.  In my case, it was already installed, so this is my method of choice.  (It&#8217;s worth pointing out that if you already have PyXML installed, you might want to consider using it exclusively, in lieu of minidom.)<\/p>\n<p>We just write a simple wrapper, and we&#8217;re done:<\/p>\n<pre class=\"code\">from xml.dom.ext import PrettyPrint\nfrom StringIO import StringIO\n\ndef toprettyxml_fixed (node, encoding='utf-8'):\n    tmpStream = StringIO()\n    PrettyPrint(node, stream=tmpStream, encoding=encoding)\n    return tmpStream.getvalue()\n<\/pre>\n<h3>Conclusion<\/h3>\n<p>One lesson from all this: TMTOWTDI applies to more than just Perl. 
:)<\/p>\n<p>Please&#8211;<a href=\"#commentform\">let me know<\/a> what you think.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python&#8217;s xml.dom.minidom.toprettyxml has a feature\/flaw that renders it useless for many common applications.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[52,5],"tags":[47,46,54],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Fixing minidom.toprettyxml&#039;s Silly Whitespace - \u00ableftbraned<\/title>\n<meta name=\"description\" content=\"Fixing minidom.toprettyxml&#039;s Silly Whitespace\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fixing minidom.toprettyxml&#039;s Silly Whitespace - \u00ableftbraned\" \/>\n<meta property=\"og:description\" content=\"Fixing minidom.toprettyxml&#039;s Silly Whitespace\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/\" \/>\n<meta property=\"og:site_name\" content=\"\u00ableftbraned\" \/>\n<meta property=\"article:published_time\" content=\"2008-06-15T08:21:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-02-16T01:51:22+00:00\" \/>\n<meta name=\"author\" content=\"Ron\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" 
content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ron\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/\",\"url\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/\",\"name\":\"Fixing minidom.toprettyxml's Silly Whitespace - \u00ableftbraned\",\"isPartOf\":{\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/#website\"},\"datePublished\":\"2008-06-15T08:21:52+00:00\",\"dateModified\":\"2019-02-16T01:51:22+00:00\",\"author\":{\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/86056901135a054d2fa85dcd1f43555a\"},\"description\":\"Fixing minidom.toprettyxml's Silly Whitespace\",\"breadcrumb\":{\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fixing minidom.toprettyxml&#8217;s Silly Whitespace\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/#website\",\"url\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/\",\"name\":\"\u00ableftbraned\",\"description\":\"go. figure. 
\u00ab\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/86056901135a054d2fa85dcd1f43555a\",\"name\":\"Ron\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a060c4a10d3c4fecd6555ad6b7e9c08d?s=96&d=identicon&r=pg\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a060c4a10d3c4fecd6555ad6b7e9c08d?s=96&d=identicon&r=pg\",\"caption\":\"Ron\"},\"description\":\"https:\/\/www.ronrothman.com\/public\/about+me.shtml\",\"sameAs\":[\"https:\/\/www.ronrothman.com\/\"],\"url\":\"https:\/\/www.ronrothman.com\/public\/leftbraned\/author\/ron\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Fixing minidom.toprettyxml's Silly Whitespace - \u00ableftbraned","description":"Fixing minidom.toprettyxml's Silly Whitespace","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/","og_locale":"en_US","og_type":"article","og_title":"Fixing minidom.toprettyxml's Silly Whitespace - \u00ableftbraned","og_description":"Fixing minidom.toprettyxml's Silly Whitespace","og_url":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/","og_site_name":"\u00ableftbraned","article_published_time":"2008-06-15T08:21:52+00:00","article_modified_time":"2019-02-16T01:51:22+00:00","author":"Ron","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Ron","Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/","url":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/","name":"Fixing minidom.toprettyxml's Silly Whitespace - \u00ableftbraned","isPartOf":{"@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/#website"},"datePublished":"2008-06-15T08:21:52+00:00","dateModified":"2019-02-16T01:51:22+00:00","author":{"@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/86056901135a054d2fa85dcd1f43555a"},"description":"Fixing minidom.toprettyxml's Silly 
Whitespace","breadcrumb":{"@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/xml-dom-minidom-toprettyxml-and-silly-whitespace\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.ronrothman.com\/public\/leftbraned\/"},{"@type":"ListItem","position":2,"name":"Fixing minidom.toprettyxml&#8217;s Silly Whitespace"}]},{"@type":"WebSite","@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/#website","url":"https:\/\/www.ronrothman.com\/public\/leftbraned\/","name":"\u00ableftbraned","description":"go. figure. \u00ab","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.ronrothman.com\/public\/leftbraned\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/86056901135a054d2fa85dcd1f43555a","name":"Ron","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.ronrothman.com\/public\/leftbraned\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a060c4a10d3c4fecd6555ad6b7e9c08d?s=96&d=identicon&r=pg","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a060c4a10d3c4fecd6555ad6b7e9c08d?s=96&d=identicon&r=pg","caption":"Ron"},"description":"https:\/\/www.ronrothman.com\/public\/about+me.shtml","sameAs":["https:\/\/www.ronrothman.com\/"],"url":"https:\/\/www.ronrothman.com\/public\/leftbraned\/author\/ron\/"}]}},"_links":{"self":[{"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/posts\/252"}],"collection":[{"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/comments?post=252"}],"version-history":[{"count":11,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/posts\/252\/revisions"}],"predecessor-version":[{"id":493,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/posts\/252\/revisions\/493"}],"wp:attachment":[{"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/media?parent=252"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/categories?post=252"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ronrothman.com\/public\/leftbraned\/wp-json\/wp\/v2\/tags?post=252"}],"curies":[{"name":"wp","href":"https:\/\/ap
i.w.org\/{rel}","templated":true}]}}