While lxml has some excellent benchmarks about the speed of lxml.etree vs. ElementTree, I wanted to run some tests that were as close as possible to my own use case (fairly simple multi-megabyte XML files).
Here are the results of my little test script lxml-v-etree.py (times are in milliseconds):
name            generate | tostring | total | write | parse | find | total
------------------------+----------+-------+-------+-------+------+------
xml.cElementTree    132 |     2430 |  2562 |  2433 |   158 |   58 |   216
xml.cElementTree    112 |     2384 |  2497 |  2387 |   158 |   25 |   183
xml.cElementTree    113 |     2393 |  2507 |  2396 |   161 |   25 |   187
xml.ElementTree     591 |     2571 |  3163 |  2574 |  3613 |   25 |  3638
xml.ElementTree     619 |     2567 |  3187 |  2570 |  3589 |   55 |  3644
xml.ElementTree     609 |     2578 |  3188 |  2581 |  3564 |   55 |  3619
lxml                333 |       75 |   409 |    82 |   200 |    0 |   201
lxml                355 |       93 |   448 |    95 |   182 |   32 |   214
lxml                310 |       94 |   404 |    96 |   156 |   56 |   213
------------------------+----------+-------+-------+-------+------+------
Note that the first “total” is “generate + tostring” while the second “total” is for the 2 parsing related tests (previous 2 columns summed).
My parsing tests are basically “etree.parse” and then running “Element.getchildren()” 3 times, which is ridiculously simplistic and should probably be ignored. My writing tests are far more thorough/realistic.
I’m running Python 2.6.2 with lxml 2.1.5 and libxml2 2.6.32 on Ubuntu 9.04 x86_64.
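For anyone who wants to repeat a test along these lines, here is a rough sketch of a timing harness. It is not the lxml-v-etree.py script itself: the document shape, the function names (build_tree, timed_ms, run_benchmark) and the item count are all assumptions made for illustration. It targets Python 3 and the standard library's xml.etree.ElementTree, and it uses list(element) because Element.getchildren() was removed in Python 3.9.

```python
import io
import time
import xml.etree.ElementTree as ET


def build_tree(n_items=1000):
    """Generate a simple, flat document (hypothetical shape, not the original test data)."""
    root = ET.Element("root")
    for i in range(n_items):
        item = ET.SubElement(root, "item", id=str(i))
        item.text = "value %d" % i
    return root


def timed_ms(func, *args):
    """Call func(*args) once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1000.0


def run_benchmark(n_items=1000):
    """Time generate, tostring, parse and a trivial child lookup."""
    timings = {}
    root, timings["generate"] = timed_ms(build_tree, n_items)
    data, timings["tostring"] = timed_ms(ET.tostring, root)
    tree, timings["parse"] = timed_ms(ET.parse, io.BytesIO(data))

    def find_children():
        # Fetch the root's children three times, mirroring the simplistic
        # "find" test described in the post.
        for _ in range(3):
            children = list(tree.getroot())
        return children

    children, timings["find"] = timed_ms(find_children)
    return timings, len(children)


if __name__ == "__main__":
    timings, count = run_benchmark()
    for name, ms in timings.items():
        print("%-10s %8.2f ms" % (name, ms))
    print("children found: %d" % count)
```

To compare against lxml, swapping the import for "from lxml import etree as ET" should work for this particular sketch, since it only uses calls that both libraries share.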
Tags: benchmarks, ElementTree, lxml, xml
Thanks for doing this – I use lxml and ElementTree quite a bit and, maybe just for the sake of simplicity, have stuck to “from lxml import etree”. For apps that do a bunch of reading and writing, that seems like a pretty reasonable bet, but it is interesting that cElementTree is faster under these real-world circumstances.
@Jon:
The first 4 columns (with the 3rd being the most important, as an aggregate of the first 2) are the only ones generated using what I would consider a real-world use case.
Those show lxml to be a very clear winner at XML generation, which lxml’s benchmarks have already shown.
The last couple of columns are the most interesting, as they show lxml and cElementTree being awfully similar for parsing XML, which isn’t what lxml’s benchmarks say. Unfortunately my parsing tests are nearly worthless and not at all real-world unless you’re opening XML files to read out single values.
I hope this post is still interesting and not misleading.
Hmm, now I’m starting to wonder if using lxml in my tools that read XML is going to be of any benefit over cElementTree. I read XML for configuration and read XML generated by other systems/tools. Sticking with cElementTree would be better for compatibility/portability, since it is part of the standard lib. I guess I’m going to have to benchmark this on my own and find out. Thanks for posting this!
The big win of lxml is that you can actually get at the DOCTYPE node using the native API. cElementTree has no way of doing this (you have to write a custom subclass using ElementTree). Looking at the source, you can see it just throws DOCTYPE away. Why this node wouldn’t be considered important is beyond me.
My general impression is that lxml is much more actively maintained.
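As a small illustration of the DOCTYPE point, here is a sketch of reading it back through lxml's docinfo object. The sample document and the "note.dtd" name are made up for the example, and nothing is fetched because lxml's default parser does not load external DTDs.

```python
import io

from lxml import etree

# Hypothetical sample document with a DOCTYPE declaration.
xml_bytes = b'<!DOCTYPE note SYSTEM "note.dtd"><note>hello</note>'

tree = etree.parse(io.BytesIO(xml_bytes))
info = tree.docinfo  # document-level information kept by lxml

doctype = info.doctype        # the DOCTYPE declaration as a string
root_name = info.root_name    # root element name given in the DOCTYPE
system_url = info.system_url  # the SYSTEM identifier

if __name__ == "__main__":
    print(doctype)
    print(root_name, system_url)
```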
Don’t forget that lxml also supplies XML Schema, RELAX-NG, DTD, and Schematron validation.
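To give a feel for the validation side, here is a minimal sketch using lxml's RelaxNG class. The tiny schema and the "config"/"name" element names are invented for this example; XML Schema, DTD and Schematron have their own classes in lxml that are used in a similar way.

```python
from lxml import etree

# Hypothetical RELAX NG schema: a <config> element containing one <name> with text.
schema_root = etree.fromstring(
    '<element name="config" xmlns="http://relaxng.org/ns/structure/1.0">'
    '<element name="name"><text/></element>'
    '</element>'
)
relaxng = etree.RelaxNG(schema_root)

good_doc = etree.fromstring("<config><name>demo</name></config>")
bad_doc = etree.fromstring("<config><other/></config>")

good_ok = relaxng.validate(good_doc)  # True when the document matches the schema
bad_ok = relaxng.validate(bad_doc)    # False when it does not

if __name__ == "__main__":
    print("good document valid:", good_ok)
    print("bad document valid:", bad_ok)
    # error_log holds the reasons for the most recent failed validation
    print(relaxng.error_log)
```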
Enumerating every feature lxml has that ElementTree lacks would take some time, but thanks for trying, Jon.
IIRC XML Schema support is incomplete, but don’t quote me on that as I’ve never used it.
Feature-wise lxml definitely wins, and it even improves on the standard ElementTree interface with features such as easy iteration over all descendants of an element and parent node traversal (instead of just forward/child traversal).
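A short sketch of those two traversal features, using a made-up four-element document. getparent(), iterdescendants() and iterancestors() are lxml additions; the standard library's Element objects do not keep a reference to their parent.

```python
from lxml import etree

# Hypothetical document: <a> contains <b> (which contains <c>) and <d>.
root = etree.fromstring("<a><b><c/></b><d/></a>")
c = root.find(".//c")

# Parent traversal: walk upward from a node.
parent_tag = c.getparent().tag
ancestor_tags = [el.tag for el in c.iterancestors()]

# Descendant iteration: every element below the root, in document order.
descendant_tags = [el.tag for el in root.iterdescendants()]

if __name__ == "__main__":
    print(parent_tag)       # b
    print(ancestor_tags)    # ['b', 'a']
    print(descendant_tags)  # ['b', 'c', 'd']
```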
CSS selectors are my favorite lxml feature, and for the jQuery enthusiast, pyquery is a dream come true: it builds a jQuery-like interface on top of lxml.
Thanks for the comments all!
Every time I see a table of numbers in a blog post I wish the author used something like Google Charts to make a pretty picture.