Seeking Elegant Pythonic Solution

So, I have some code that queries a data source, and that data source sends me back an XML message. I have to parse the XML message so I can store information from it into a relational database. So, let’s say my XML response looks like this:

<xml>
<response>
<results=2>
  <result>
    <fname>Brian</fname>
    <lname>Jones</lname>
    <gender>M</gender>
    <office_phone_ext>777</office_phone_ext>
    <mobile_phone>201-555-1212</mobile_phone>
  </result>
  <result>
    <fname>Molly</fname>
    <lname>Jones</lname>
    <home_phone>201-555-1234</home_phone>
  </result>
</results>
</xml>

So, as you can see, the attributes for each result returned for a query can differ, and if a result doesn’t have a value for some attribute, the corresponding xml element isn’t included at all for that result. If it were just 2 or 3 attributes, I could easily enough get around it by doing something like this:

def __init__(self, xmlresult):
  self.xmlresult = xmlresult
  if self.xmlresult.xpath('fname') is not None:
    self.fname = self.xmlresult.xpath('fname')
  if self.xmlresult.xpath('lname') is not None:
    self.lname = self.xmlresult.xpath('lname')

Like I said, if it were just a few things I needed to check for, I’d do it this way and be done with it. It’s not just a few though — it’s like 50 attributes. Now what?

I decided lxml.objectify would be a great way to go. It would allow me to access these things as object attributes, which should mean I can do something like this:

self.fname = getattr(self.xmlresult, 'fname', None)
self.lname = getattr(self.xmlresult, 'lname', None)
...

So, you *can* do this, technically speaking. Trouble is, you’re asking for an attribute of an ObjectifiedElement object, and when you do that, it returns an object that is not a native Python datatype, which I did not realize when I first started using lxml.objectify. So, in the above, ‘self.fname’ will not be a Python string — it’ll be an lxml.objectify.StringElement object. Of course, my database driver, my ‘join()’ operations, and everything else in my code that relies on native Python datatypes is now broken.

What I actually need to do is get the ‘.pyval’ attribute of self.xmlresult.fname, if that attribute exists at all. So, something that does what I mean, which is “self.fname = getattr(self.xmlresult, ‘fname.pyval’, None). And, of course, doing ‘getattr(self.xmlresult, ‘fname’, None).pyval’ doesn’t work because None has no attribute ‘pyval’. I’ve tried a couple of other hacks too, but I’ve learned enough Python to know that if it feels like a hack, there’s probably a better way. But I can’t find that better way. Ideas?

  • http://morgangoose.com Morgan Goose

    There is a cool module that xmonader made to turn a etree string into a pure python class: http://bitbucket.org/xmonader/happymapper/

    I have used it in a project in conjunction with the findall() of etree, to drill down the xml tree so the mapper doesn’t have to much to parse. I then feed it like this:

    for account in server.findall(“from_api/listaccts/acct”):
    account_class = happymapper.get_root_document(etree.tostring(account))

    domain_name = account_class.domain[‘content’]

    Its pretty useful sometimes. I did a bit more with it and sqlalchemy so that I was using that class to populate an sqlalchemy class, so that it went straight from xml to rows in my database.

  • m0j0

    Thanks, @Morgan — funny thing is, I already have some code that looks like the mapper, but I was hoping to rely on code maintained elsewhere, and I thought lxml.objectify would be it. I’ll have a look at happymapper. Thanks.

    I sort of expected that there’d be something within Python I was overlooking, but I guess if others are looking elsewhere for this functionality then… not so much :)

    Thanks again.

  • http://www.aleax.it Alex Martelli

    Why is it a problem if there are 50 attributes, or 100? Just use a loop:

    for n in allthenames:
    v = self.xmlresult.xpath(n)
    if v is not None:
    setattr(self, n, v)

  • http://ionrock.org/blog/ Eric Larson

    You could write a little XSLT to add empty attributes. That might be a pain, but at least you’d have a single place to normalize/validate the XML, even if it is not Python.

  • m0j0

    @Alex — perhaps the problem is really that my brain got too close to the problem. That appears to make all the sense in the world. Shame on me for not taking enough breaks. :-/

    Thanks!

  • http://www.voidspace.org.uk/ Michael Foord

    Whilst you *should* use the solution suggested by Alex, I still like my horrid little one liner with reduce and a lambda:

    reduce(lambda result, attr: getattr(result, attr), [the_object] + list(the_string.split(‘.’)))

  • http://www.funsize.net Kevin Horn

    I probably would have used lxml.etree to get a list of “result” elements, and iterated over each of those to get a list of the children of each, then created a dictionary from the keys (elem.tag), and values (elem.text) and then proceeded from there. If this makes any sense to you, then you’ve probbaly stayed up too late. I know I have. :)

    Alex’s solution is probably better though.

  • http://listbot.org James

    While it looks like you already have a solution, I tend to use BeautifulStoneSoup for XML parsing like this:

    http://gist.github.com/294341

    BeautifulStoneSoup gives you Unicode *and* the class it gives you appears to inherit all the standard methods for type str.

    Hope that helps!

  • Tom Lynn

    import lxml.html

    xml = lxml.html.fromstring(xml)
    results = [dict((node.tag, node.text_content()) for node in result)
    for result in xml.xpath(‘//result’)]

  • Tom Lynn

    … or if you really want attribute access use the Python Cookbook Bunch recipe:

    import lxml.html

    class Result(object):
    def __init__(self, pairs):
    self.__dict__.update(pairs)

    results = [Result((node.tag, node.text_content()) for node in result)
    for result in lxml.html.fromstring(xml).xpath(‘//result’)]

    The choice between using node.text and node.text_content() depends how you want to handle unexpected child nodes. If you might get embedded HTML (e.g. a sup tag to superscript a trademark sign), use node.text_content() as above, otherwise you’re ok with node.text.