Python 3, Belatedly (52) - XML - Deutschina's Tech Diary

Chapter 8.

Authors: Bill Lubanovic, 斎藤康毅, 長尾高弘
Publisher: O'Reilly Japan
Release date: 2015/12/01
Format: paperback (softcover)

XML

I'll skip the "what is XML?" discussion. That said, it is worth getting familiar with industry-standard formats as the need arises.

I might use it later, so I went ahead and created a file called menu.xml.

<?xml version="1.0"?>
<menu>
	<breakfast hours="7-11">
		<item price="$6.00">breakfast burritos</item>
		<item price="$4.00">pancakes</item>
	</breakfast>
	<lunch hours="11-3">
		<item price="$5.00">hamburger</item>
	</lunch>
	<dinner hours="3-10">
		<item price="$8.00">spaghetti</item>
	</dinner>
</menu>

The experiments start here. Apparently ElementTree is a good choice for handling XML without much fuss.

>>> import xml.etree.ElementTree as et
>>> tree = et.ElementTree(file='menu.xml')
>>> root = tree.getroot()
>>> root.tag
'menu'
>>> for child in root:
...     print('tag:', child.tag, 'attributes:', child.attrib)
...     for grandchild in child:
...         print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib)
... 
tag: breakfast attributes: {'hours': '7-11'}
	tag: item attributes: {'price': '$6.00'}
	tag: item attributes: {'price': '$4.00'}
tag: lunch attributes: {'hours': '11-3'}
	tag: item attributes: {'price': '$5.00'}
tag: dinner attributes: {'hours': '3-10'}
	tag: item attributes: {'price': '$8.00'}
>>> 
>>> len(root)
3
>>> len(root[0])
2
>>> len(root[1])
1
>>> len(root[2])
1
>>>

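Something felt missing, and that's because the burritos and pancakes never showed up. According to the documentation, each element has a text attribute, so I tried it like this.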

20.5. xml.etree.ElementTree: The ElementTree XML API (Python 3.4.3 documentation)

>>> for child in root:
...     print('tag:', child.tag, 'attributes:', child.attrib, child.text)
...     for grandchild in child:
...         print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib, grandchild.text)
... 
tag: breakfast attributes: {'hours': '7-11'} 
		
	tag: item attributes: {'price': '$6.00'} breakfast burritos
	tag: item attributes: {'price': '$4.00'} pancakes
tag: lunch attributes: {'hours': '11-3'} 
		
	tag: item attributes: {'price': '$5.00'} hamburger
tag: dinner attributes: {'hours': '3-10'} 
		
	tag: item attributes: {'price': '$8.00'} spaghetti
>>>

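Extra line breaks crept in, so child.text probably wasn't needed. It holds no real value anyway, only the newline and tabs that sit before the first item.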

>>> for child in root:
...     print('tag:', child.tag, 'attributes:', child.attrib)
...     for grandchild in child:
...         print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib, grandchild.text)
... 
tag: breakfast attributes: {'hours': '7-11'}
	tag: item attributes: {'price': '$6.00'} breakfast burritos
	tag: item attributes: {'price': '$4.00'} pancakes
tag: lunch attributes: {'hours': '11-3'}
	tag: item attributes: {'price': '$5.00'} hamburger
tag: dinner attributes: {'hours': '3-10'}
	tag: item attributes: {'price': '$8.00'} spaghetti
>>>
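To make the source of those stray line breaks concrete, here is a minimal sketch. It inlines a cut-down copy of menu.xml as a string (the `MENU_XML` constant and the variable names are my own, not from the book) and looks at what `text` contains on a container element versus a leaf element.

```python
import xml.etree.ElementTree as et

# A cut-down copy of menu.xml, inlined so the sketch runs without a file.
MENU_XML = (
    '<menu>\n'
    '\t<breakfast hours="7-11">\n'
    '\t\t<item price="$6.00">breakfast burritos</item>\n'
    '\t\t<item price="$4.00">pancakes</item>\n'
    '\t</breakfast>\n'
    '</menu>'
)

root = et.fromstring(MENU_XML)
breakfast = root[0]

# The text of <breakfast> is only the indentation before its first <item>.
whitespace_only = breakfast.text
print(repr(whitespace_only))

# The <item> elements hold the real text, so collect (name, price) from them.
items = [(item.text.strip(), item.get('price')) for item in breakfast]
print(items)
```

Calling `strip()` on the text is a cheap way to ignore indentation when an element might contain only whitespace.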

The book wraps up the topic by doing nothing more than mentioning that the following two standard libraries exist.

20.6. xml.dom: The Document Object Model API (Python 3.4.3 documentation)
20.9. xml.sax: Support for SAX2 parsers (Python 3.4.3 documentation)

That's a little unsatisfying.
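For a rough feel of what those two modules look like in use, here is a small sketch of my own, not something from the book. It assumes the menu data is inlined as a string, uses `xml.dom.minidom` (the lightweight DOM implementation in the standard library) for the DOM side, and defines a hypothetical `ItemPriceHandler` class for the SAX side.

```python
import xml.dom.minidom
import xml.sax

MENU_XML = (
    '<menu>'
    '<breakfast hours="7-11">'
    '<item price="$6.00">breakfast burritos</item>'
    '<item price="$4.00">pancakes</item>'
    '</breakfast>'
    '<lunch hours="11-3">'
    '<item price="$5.00">hamburger</item>'
    '</lunch>'
    '</menu>'
)

# DOM: the whole document is loaded into a tree of node objects.
dom = xml.dom.minidom.parseString(MENU_XML)
dom_items = [
    (node.firstChild.data, node.getAttribute('price'))
    for node in dom.getElementsByTagName('item')
]
print(dom_items)


class ItemPriceHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records the price of every <item>."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def startElement(self, name, attrs):
        # SAX calls this once per opening tag while streaming the input.
        if name == 'item':
            self.prices.append(attrs['price'])


# SAX: nothing is kept in memory except what the handler chooses to store.
handler = ItemPriceHandler()
xml.sax.parseString(MENU_XML.encode('utf-8'), handler)
sax_prices = handler.prices
print(sax_prices)
```

The DOM version reads like ElementTree with longer method names, while the SAX version is event-driven, which tends to matter mainly for documents too large to load at once.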

So I decided to pull a few examples from the documentation and play around with them.

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Save this XML as country_data.xml before starting.

>>> import xml.etree.ElementTree as et
>>> tree = et.parse('country_data.xml')
>>> root = tree.getroot()
>>> root.tag
'data'
>>> root.attrib
{}

Note that the documentation opens the file differently too: it uses parse() instead of ElementTree(file=...).
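As far as I can tell, the two ways of loading a file end up with the same tree. The sketch below checks that on a tiny sample, and also tries `fromstring()` for XML that is already in memory. The sample XML, the temporary file name, and the variable names are assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as et

XML_TEXT = '<data><country name="Liechtenstein"><rank>1</rank></country></data>'

# Write the sample to a temporary file so the file-based loaders have input.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.xml')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(XML_TEXT)

    root_from_parse = et.parse(path).getroot()
    root_from_ctor = et.ElementTree(file=path).getroot()

# fromstring() skips the ElementTree wrapper and returns the root directly.
root_from_string = et.fromstring(XML_TEXT)

tags = [root_from_parse.tag, root_from_ctor.tag, root_from_string.tag]
print(tags)
```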

>>> for child in root:
...     print(child.tag, child.attrib)
... 
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
>>>

Because child nodes are nested, elements can be indexed like nested lists.

>>> root[0][1].text
'2008'
>>> root[0][1].tag
'year'

This printed the second child element of Liechtenstein, namely year.
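Indexing by position depends on the order of the children, so looking the element up by tag may be sturdier. A minimal sketch, assuming a trimmed-down copy of the country data inlined as a string (variable names are mine):

```python
import xml.etree.ElementTree as et

root = et.fromstring(
    '<data>'
    '<country name="Liechtenstein">'
    '<rank>1</rank><year>2008</year><gdppc>141100</gdppc>'
    '</country>'
    '</data>'
)

first_country = root[0]
by_index = first_country[1]            # second child, whatever its tag is
by_name = first_country.find('year')   # first direct child tagged 'year'

print(by_index.tag, by_index.text)

# List the child tags in document order to see what each index refers to.
child_tags = [child.tag for child in first_country]
print(child_tags)
```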

>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
... 
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
>>>

This picks up and prints the attributes of every 'neighbor' element anywhere under root.
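One thing worth checking is how `iter()` differs from `findall()`: `iter()` walks the whole subtree, while `findall()` with a bare tag name only looks at direct children. The sketch below compares them on a trimmed-down sample (the inlined XML and the variable names are my own).

```python
import xml.etree.ElementTree as et

root = et.fromstring(
    '<data>'
    '<country name="Singapore">'
    '<neighbor name="Malaysia" direction="N"/>'
    '</country>'
    '<country name="Panama">'
    '<neighbor name="Costa Rica" direction="W"/>'
    '<neighbor name="Colombia" direction="E"/>'
    '</country>'
    '</data>'
)

# iter() descends through every level below root.
via_iter = [n.get('name') for n in root.iter('neighbor')]

# findall() with a plain tag matches direct children of root only.
via_findall = [n.get('name') for n in root.findall('neighbor')]

# A path expression lets findall() reach the grandchildren explicitly.
via_path = [n.get('name') for n in root.findall('./country/neighbor')]

print(via_iter, via_findall, via_path)
```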

>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
... 
Liechtenstein 1
Singapore 4
Panama 68
>>>

For each country, this picks up the rank number from its rank child and the country name from the name attribute, then prints both.
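The loop above assumes every country has a rank. If one were missing, `find()` would return None and `.text` would raise an AttributeError. Here is a hedged sketch of a more defensive version using `findtext()` with a default. The second country entry is made up purely to trigger the missing-child case.

```python
import xml.etree.ElementTree as et

root = et.fromstring(
    '<data>'
    '<country name="Liechtenstein"><rank>1</rank></country>'
    '<country name="Nowhereland"></country>'  # hypothetical entry, no <rank>
    '</data>'
)

rows = []
for country in root.findall('country'):
    name = country.get('name')
    # findtext() returns the default instead of failing on a missing child.
    rank = country.findtext('rank', default='n/a')
    rows.append((name, rank))

print(rows)

# get() accepts a default as well, for attributes that may be absent.
missing_attr = root[1].get('capital', 'unknown')
print(missing_attr)
```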

>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
... 
>>> tree.write('output.xml')
>>>

This adds 1 to each rank and sets a new attribute called updated on it. I checked the output file just to be sure.

<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor direction="W" name="Costa Rica" />
        <neighbor direction="E" name="Colombia" />
    </country>
</data>
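The file above has no `<?xml ...?>` line, which matches the default behaviour of `write()`. The sketch below repeats the update in memory and serializes it two ways: `tostring()` for a quick look, and `write()` with `xml_declaration=True` when the declaration is wanted. The one-country sample and the `io.BytesIO` buffer standing in for a file are my own assumptions.

```python
import io
import xml.etree.ElementTree as et

root = et.fromstring(
    '<data><country name="Panama"><rank>68</rank></country></data>'
)

# Same update as in the session: bump each rank and mark it as updated.
for rank in root.iter('rank'):
    rank.text = str(int(rank.text) + 1)
    rank.set('updated', 'yes')

# encoding='unicode' makes tostring() return str instead of bytes.
as_text = et.tostring(root, encoding='unicode')
print(as_text)

# write() accepts any binary file-like object; ask for the declaration.
buffer = io.BytesIO()
et.ElementTree(root).write(buffer, encoding='utf-8', xml_declaration=True)
as_bytes = buffer.getvalue()
print(as_bytes.decode('utf-8'))
```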

Next, let's delete the countries ranked lower than 50th (rank greater than 50), node and all.

>>> for country in root.findall('country'):
...     rank = int(country.find('rank').text)
...     if rank > 50:
...         root.remove(country)
... 
>>> tree.write('output.xml')
>>>

I forgot to mention it earlier, but write() is what saves the tree to a file.

<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    </data>

Panama is gone.
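A detail that is easy to miss: the loop iterates over `root.findall('country')`, which returns a list, so removing elements from `root` inside the loop does not disturb the iteration. Changing an element's children while looping over the element itself can cause trouble, much like changing a Python list while looping over it. A minimal sketch of the safe pattern, with made-up country names and ranks:

```python
import xml.etree.ElementTree as et

# Alpha, Beta and Gamma are hypothetical entries used only for this sketch.
root = et.fromstring(
    '<data>'
    '<country name="Alpha"><rank>69</rank></country>'
    '<country name="Beta"><rank>75</rank></country>'
    '<country name="Gamma"><rank>2</rank></country>'
    '</data>'
)

# findall() hands back a separate list, so root can be modified in the loop.
for country in root.findall('country'):
    if int(country.findtext('rank')) > 50:
        root.remove(country)

remaining = [country.get('name') for country in root]
print(remaining)
```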

(To be continued)