We are ready to extract some content from XML files!
We will again use file-a.xml, or something similar, as our XML input
file. For the purpose of this lesson, it looks like:
<?xml version="1.0" encoding="UTF-8"?>
...
<html ... >
<head>
...
<title>what is here?</title>
...
and the parse tree looks like:
---XTag "/"
   |
   .
   |
   +---XTag "html"
       |
       .
       +---XTag "head"
       |   |
       .   .
       |   +---XTag "title"
       |   |   |
       |   |   +---XText "what is here?"
       ...
We are interested in the text content of the title
element, i.e., whatever appears in place of "what is here?".
The following program will find out. You can also download it as
lesson-2.hs.

    import Text.XML.HXT.Core

    processor :: FilePath -> IOSArrow XmlTree String
    processor filename =
        readDocument [withValidate no] filename >>>
        getChildren >>> isElem >>> hasName "html" >>>
        getChildren >>> isElem >>> hasName "head" >>>
        getChildren >>> isElem >>> hasName "title" >>>
        getChildren >>> getText

How do you run this program? At a GHCi prompt:

    Prelude> :load lesson-2.hs
    *Main> play "file-a.xml"

Try to run it, modify it for variations, run it with a slightly different XML file, look up the HXT documentation for the functions used... until you're thoroughly satisfied or utterly confused. Then you're ready to read on.
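If you do not have lesson-2.hs at hand, a file along the following lines should load in GHCi. This is a minimal sketch assuming HXT 9 (module Text.XML.HXT.Core). The definition of play here is an assumption (run the arrow with runX, then print the resulting list); the downloadable file may define it differently.

```haskell
import Text.XML.HXT.Core

-- Walk / -> html -> head -> title and output the text of each title found.
processor :: FilePath -> IOSArrow XmlTree String
processor filename =
  readDocument [withValidate no] filename >>>
  getChildren >>> isElem >>> hasName "html" >>>
  getChildren >>> isElem >>> hasName "head" >>>
  getChildren >>> isElem >>> hasName "title" >>>
  getChildren >>> getText

-- Assumed driver: run the arrow and print the list of strings it outputs.
play :: FilePath -> IO ()
play filename = do
  results <- runX (processor filename)
  print results
```

After loading it, evaluating play "file-a.xml" prints the list of strings found, one per matching title.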
Now let's walk through the program.

    readDocument [withValidate no] filename >>> ...

As we learned from the previous two lessons, this parses the XML document specified by the file name and passes on the parse tree to the next arrow.

    ... >>> getChildren

As the name suggests, getChildren is an arrow that takes a parse tree
as input and outputs the child nodes/subtrees of the root. Now, the
root has an arbitrary number of children, and getChildren needs to
output all of them. This is the main reason why an HXT arrow is capable
of outputting multiple values, passing them on internally as a list: it
can output the list of children.
I will use "node", "subtree", and "subtree rooted at node" interchangeably; distinguishing them too finely is counterproductive in a high-level language.
Specifically, at this stage, the children of the document root may be: processing instructions, comments, whitespace text, and the (one and only) top-level element. The list of these things (more precisely, the list of subtrees rooted at these things) is passed on.

    getChildren >>> ...

Recall that the >>> operator chains two arrows by taking the list from
the upstream arrow and calling the downstream arrow multiple times,
once for each item in the list. This is the right thing to do if we
want to apply the same downstream arrow operation to all items. (We
will quickly see that this is the case for our task.)

    getChildren >>> isElem

isElem passes the input on to the output (as a singleton list) if the
input's root is an element; otherwise it outputs nothing, i.e., the
empty list.
Recall that from upstream we may receive a processing instruction, a
comment, a segment of whitespace text, or an element; for our task we
only care about the last case. So isElem is a great way to discard the
irrelevant cases and let through the relevant case for further
processing. It can be thought of as a filter or test.
Now we have the element, but we also want to make sure it is html.
This is accomplished by:

    getChildren >>> isElem >>> hasName "html"

hasName passes the input on to the output if the input's root is an
element or an attribute with the desired name; otherwise it outputs
nothing. So this serves as a test that our element has the right name.
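To watch the filters narrow things down, you can count how many values survive each stage. The sketch below is not from the lesson files: it builds a small stand-in tree by hand with HXT's root, selem, cmt, and txt constructor arrows (so no file is read), and runs the pure LA arrow with runLA. The names sampleDoc and afterXXX are made up for illustration.

```haskell
import Text.XML.HXT.Core

-- Hand-built stand-in for a parsed document: the root "/" holding a comment,
-- a whitespace text node, and the single top-level element.
sampleDoc :: LA () XmlTree
sampleDoc =
  root []
    [ cmt " generated file "
    , txt "\n"
    , selem "html" [ selem "head" [ selem "title" [ txt "what is here?" ] ] ]
    ]

-- Number of values that survive each stage of the chain.
afterGetChildren, afterIsElem, afterHasHtml, afterHasBody :: Int
afterGetChildren = length (runLA (sampleDoc >>> getChildren) ())
afterIsElem      = length (runLA (sampleDoc >>> getChildren >>> isElem) ())
afterHasHtml     = length (runLA (sampleDoc >>> getChildren >>> isElem >>> hasName "html") ())
afterHasBody     = length (runLA (sampleDoc >>> getChildren >>> isElem >>> hasName "body") ())
```

Here afterGetChildren is 3 (comment, text, element), afterIsElem and afterHasHtml are both 1, and afterHasBody is 0 because no child element is named body.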
Now we have the html element for sure, but it may have many possible
children: a head element, some other elements, and some text nodes. We
need to proceed to the head element and discard all the other cases.
But taking a hint from the whole ordeal above, we see the way: get all
children, then keep only elements, then keep only those with the right
name. And afterwards we can do the same to get to the title element
too! So here it goes.

    getChildren >>> isElem >>> hasName "html" >>>
    getChildren >>> isElem >>> hasName "head" >>>
    getChildren >>> isElem >>> hasName "title" >>> ...

In general, a strategy emerges for tasks of the form "walk a specific
path and ignore everything else": use getChildren to travel to the next
hop, and use test arrows to narrow down to the desired path.
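Since each hop repeats the same three arrows, the strategy can be packaged as a small helper. The following is only a sketch: childElem and titleElem are hypothetical names introduced here, not functions provided by HXT or by the lesson files.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: one hop down, keeping only child elements named n.
childElem :: ArrowXml a => String -> a XmlTree XmlTree
childElem n = getChildren >>> isElem >>> hasName n

-- The path walked so far in this lesson, one helper call per hop.
titleElem :: ArrowXml a => a XmlTree XmlTree
titleElem = childElem "html" >>> childElem "head" >>> childElem "title"
```

Because the helper is polymorphic in the arrow type, it can be chained after readDocument in an IOSArrow just like the spelled-out version.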
To obtain the text inside the title element, recall that the string is
stored in a text node as a child node of the owning element. (See the
parse tree above.) So, we will call getChildren one last time to get to
the child nodes of title, then apply an arrow getText that combines two
jobs into one: discard the input if it is not a text node, and output
the text string if the input is a text node. That is exactly the right
operation for our purpose!

    getChildren >>> isElem >>> hasName "html" >>>
    getChildren >>> isElem >>> hasName "head" >>>
    getChildren >>> isElem >>> hasName "title" >>>
    getChildren >>> getText

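To see both jobs of getText at once, here is a small sketch that feeds it a title element with mixed children. The tree is built by hand with HXT's selem, txt, and cmt (an illustration only, not one of the lesson files), and mixedTitle and titleStrings are made-up names.

```haskell
import Text.XML.HXT.Core

-- Hand-built title element whose children are a text node, a comment,
-- and a nested element.
mixedTitle :: LA () XmlTree
mixedTitle =
  selem "title" [ txt "what is here?", cmt " note ", selem "b" [ txt "bold" ] ]

-- getText keeps the text child and discards the comment and the element.
-- The string "bold" sits one level deeper, so this chain never reaches it.
titleStrings :: [String]
titleStrings = runLA (mixedTitle >>> getChildren >>> getText) ()
```

Evaluating titleStrings gives ["what is here?"].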
Whew! That's it!
In the beginning, I gave a mental model for such arrow chaining as

    f >>> g

Namely, f may output many values, so call g just as many times, once
for each value from f; at the end, pool together all output values from
all the calls of g.
This is an operational model, meaning it tells you how to execute
things. Many people love operational models as a first step towards
an understanding. But operational models are hard to keep track of
in the head once we lengthen the chain:

    f0 >>> f1 >>> f2 >>> f3 >>> f4 >>> ...

The cascade of multiple values is harder to imagine.
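The operational model can be written down directly as a toy, using only plain functions and lists. This is a sketch of the idea, not HXT's actual arrow types (which carry more machinery); ListFn, andThen, and the two example functions are all made up for illustration.

```haskell
-- Toy model only (not HXT's real types): an "arrow" is a function from one
-- input to a list of outputs.
type ListFn a b = a -> [b]

-- Chaining: call g once for each output of f, then pool all of g's outputs.
andThen :: ListFn a b -> ListFn b c -> ListFn a c
andThen f g = \x -> concatMap g (f x)

-- Hypothetical example arrows.
nextThree :: ListFn Int Int
nextThree n = [n, n + 1, n + 2]   -- always three outputs

tenfoldIfEven :: ListFn Int Int
tenfoldIfEven n
  | even n    = [n * 10]          -- lets the value through, transformed
  | otherwise = []                -- discards the value

-- A chain of three: [1,2,3], then [20], then [20,21,22].
chained :: [Int]
chained = (nextThree `andThen` tenfoldIfEven `andThen` nextThree) 1
```

Even with three links, tracing chained by hand means tracking a list at every stage, which is the bookkeeping that becomes tedious for longer chains.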
The strategy for the task of this lesson suggests an abstract model,
tractable for long chains: the chain selects paths in the tree and
walks them. Some of the arrows in the chain, such as getChildren, jump
hops; some others, such as isElem, decide whether to continue or not.
All in all, the chain is a path specification, and it applies to those
paths in the tree that satisfy the conditions on the chain. This model
invites you to think of one path at a time, and so it is easier to
reason with; it also fits the path-oriented paradigm of XML querying.
So for example, the solution program in this lesson picks out paths of
the pattern

    html -> head -> title -> text node

Then you may like to ask: what if the XML file contains many such
paths? The program will match all of them and report all of the
strings found (recall that the very end result is a list of strings
anyway). Similarly, if the file contains no such path, the result is
the empty list. I encourage you to modify the XML file to contain more
or fewer matching paths, or the program to match some other paths, and
verify the result.
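One quick way to try this without editing files is to build stand-in documents in memory. The sketch below assumes HXT's constructor arrows root, selem, and txt, and runs the same path with the pure runLA; titlePath, docWithTitles, and titlesOf are hypothetical names, not part of the lesson files.

```haskell
import Text.XML.HXT.Core

-- The lesson's path, minus readDocument, as a pure list arrow.
titlePath :: LA XmlTree String
titlePath =
  getChildren >>> isElem >>> hasName "html" >>>
  getChildren >>> isElem >>> hasName "head" >>>
  getChildren >>> isElem >>> hasName "title" >>>
  getChildren >>> getText

-- Stand-in document whose head holds one title element per given string.
docWithTitles :: [String] -> LA () XmlTree
docWithTitles ts =
  root [] [ selem "html" [ selem "head" [ selem "title" [ txt t ] | t <- ts ] ] ]

-- Run the path over such a document and collect every string it reports.
titlesOf :: [String] -> [String]
titlesOf ts = runLA (docWithTitles ts >>> titlePath) ()
```

With two matching paths, titlesOf ["one", "two"] gives ["one","two"]; with none, titlesOf [] gives [].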
The next question you may ask is: what if you don't like this? How can
you modify the program to reject files with no matching path or too
many? There are two ways. One is to write a DTD dictating existence and
uniqueness, and tell readDocument to validate. Another way will be
covered in a later lesson.
Name | from Module | Summary |
---|---|---|
getChildren | Control.Arrow.ArrowTree | outputs child nodes/subtrees |
isElem | Text.XML.HXT.Arrow.XmlArrow | lets through elements only |
hasName | Text.XML.HXT.Arrow.XmlArrow | lets through nodes with the given name only |
getText | Text.XML.HXT.Arrow.XmlArrow | extracts the text in the input node (only if it is a text node) |
There are more tree-traversing arrows in Control.Arrow.ArrowTree; they
are not XML-specific. There are more XML-specific arrows in
Text.XML.HXT.Arrow.XmlArrow: those named isXXX and hasXXX are filters,
and those named getXXX extract data (and usually double as filters too).