In this lesson, we extract and report slightly more structured data.
It may seem that we were already extracting data just fine in the last
lesson, but as you will see in a moment, this time we need something new.
In a file like file-a.xml, there are meta elements:
<html ... >
<head>
<meta attr="value" ... />
...
We want to fetch the attribute name and the value, turn the name into
uppercase, and put them in a tuple. (In general there may be none or
there may be many, so we actually return a list of tuples. But HXT
arrows do this for free anyway.)
Each HXT arrow we have learned before returns or processes just one datum. There is one arrow that returns a name (works for both tags and attributes). There is one arrow that returns an attribute value. If we just wanted one of them, the technique in the last lesson would suffice. But now we want both of them at the same time, and we want to process them both and combine them into a tuple. How to do that? This requires new operators. The purpose of this lesson is to introduce the new operators.
The following program performs this task. You can also download it as lesson-3.hs.
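    import Text.XML.HXT.Core
    import Data.Char (toUpper)

    -- Driver for GHCi (assumed definition): run the arrow and print the results.
    play :: FilePath -> IO ()
    play filename = do
        results <- runX (processor filename)
        print results

    -- Read the file and walk down to the meta elements under html/head.
    processor :: FilePath -> IOSArrow XmlTree (String,String)
    processor filename =
        readDocument [withValidate no] filename >>>
        getChildren >>> isElem >>> hasName "html" >>>
        getChildren >>> isElem >>> hasName "head" >>>
        getChildren >>> isElem >>> hasName "meta" >>>
        getTuple

    -- For each attribute: pair its name with its value, then uppercase the name.
    getTuple :: IOSArrow XmlTree (String,String)
    getTuple =
        getAttrl >>>
        getName &&& (getChildren >>> getText) >>>
        arr (map toUpper) *** returnA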
How to run this program? At a GHCi prompt:

    Prelude> :load lesson-3.hs
    *Main> play "file-a.xml"
Up to the call to getTuple, we are just reading a file and walking paths
to the meta nodes. This was covered in previous lessons. I now describe
what's new.
    getAttrl >>> ...
getAttrl gets all the attributes of the current node (which is a meta
node in this case). As usual, the >>> operator causes the next arrow
to receive these attributes, one at a time.
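To see getAttrl in isolation, here is a minimal sketch that runs it on an in-memory string rather than on file-a.xml. It assumes HXT's pure LA arrow with runLA and the xread string parser, neither of which is used in this lesson's program. The sample meta element and the names sample and attrNames are made up for illustration.

```haskell
import Text.XML.HXT.Core

-- Hypothetical sample input, not taken from file-a.xml.
sample :: String
sample = "<meta name=\"author\" content=\"Ann\"/>"

-- Parse the string, fetch every attribute node of the element,
-- and pass each attribute node on to getName, one at a time.
attrNames :: [String]
attrNames = runLA (xread >>> getAttrl >>> getName) sample

main :: IO ()
main = mapM_ putStrLn attrNames
```

Each attribute node flows separately into getName, so the result has one entry per attribute.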
    getName &&& (getChildren >>> getText) >>> ...
To get the name of an attribute, apply getName to the attribute node.
The value of an attribute is stored further down, in a child node that
is a text node, so getChildren >>> getText on the attribute node does
the job. Now, how do we do both to the same node? That is exactly what
the &&& operator does.
    (&&&) :: IOSArrow x y0 -> IOSArrow x y1 -> IOSArrow x (y0,y1)
It calls the arrow on its left, then it calls the arrow on its right
— both are called with the same input; then it tuples up the two
results.
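As a small sketch of this forking step, the following applies both arrows to each attribute node of an in-memory element. It assumes the pure LA arrow with runLA and the xread parser instead of IOSArrow and readDocument. The sample element and the names sample, nameAndValue and pairs are hypothetical.

```haskell
import Text.XML.HXT.Core

-- Hypothetical sample input, not taken from file-a.xml.
sample :: String
sample = "<meta name=\"author\" content=\"Ann\"/>"

-- Run getName and (getChildren >>> getText) on the same attribute node
-- and pair the two results.
nameAndValue :: LA XmlTree (String, String)
nameAndValue = getName &&& (getChildren >>> getText)

pairs :: [(String, String)]
pairs = runLA (xread >>> getAttrl >>> nameAndValue) sample

main :: IO ()
main = print pairs
```

Here each arrow yields a single result per attribute node, so each attribute contributes exactly one tuple.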
In general usage, each of the two arrows may output many results
(although in this lesson we get single results). &&& multiplies up all
combinations. E.g., if the left arrow produces ["a","b"], and the right
arrow ["x"], the overall result is [("a","x"),("b","x")]; likewise, if
one of the arrows produces the empty list, the overall result is also
the empty list.
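The combinations described above can be reproduced with a short sketch. It assumes HXT's pure LA arrow, run with runLA, and uses constL and none as stand-ins for arrows that produce several results or no result. The names combos and noCombos are made up for illustration.

```haskell
import Text.XML.HXT.Core

-- constL ignores its input and emits each list element as a result.
combos :: [(String, String)]
combos = runLA (constL ["a", "b"] &&& constL ["x"]) ()

-- none emits no result at all, so there is nothing to pair with.
noCombos :: [(String, String)]
noCombos = runLA (constL ["a", "b"] &&& none) ()

main :: IO ()
main = do
  print combos    -- [("a","x"),("b","x")]
  print noCombos  -- []
```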
Note: even if one of the two arrows produces the empty list, both arrows are still called regardless. The empty result of one of them does not short-circuit the other. This is important to know when you use HXT arrows with side effects.
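To observe this behaviour, here is a hedged sketch with two IOSArrow values that each print a message. It assumes arrIO0, none, constA and runX from HXT, and the names leftArrow and rightArrow are hypothetical. If the note above holds, both messages should appear even though the left arrow produces no result and the overall result is empty.

```haskell
import Text.XML.HXT.Core

-- Left arrow: prints a message, then discards its result with none.
leftArrow :: IOSArrow XmlTree String
leftArrow = arrIO0 (putStrLn "left arrow ran") >>> none

-- Right arrow: prints a message and then emits one result.
rightArrow :: IOSArrow XmlTree String
rightArrow = arrIO0 (putStrLn "right arrow ran") >>> constA "x"

main :: IO ()
main = do
  -- Both arrows receive the same input; their results are paired up.
  results <- runX (leftArrow &&& rightArrow)
  print results
```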
Now we have a tuple with the name and the value. We still want to
change the name component before outputting the result. The essence of
this is calling two arrows for the two components respectively, so
that one arrow changes the name to uppercase, and the other passes
through the value unchanged. This is done by the *** operator:
    (***) :: IOSArrow x0 y0 -> IOSArrow x1 y1 -> IOSArrow (x0,x1) (y0,y1)
It calls the arrow on the left with the left component as input, and
the arrow on the right with the right component as input; then it
tuples up the two results. So we just have to construct an arrow for
uppercasing a string (it can be done by a pure function, so we just
lift that to the arrow level), another to change nothing, and combine
them with ***. That is,

    arr (map toUpper) *** returnA
In general usage, each of the two component arrows may output none or
multiple results, and *** multiplies up all combinations, without
short-circuiting.
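A minimal sketch of *** on a plain tuple follows. It assumes HXT's pure LA arrow with runLA rather than IOSArrow, and the names upperName, single and manyResults, as well as the sample tuple, are made up for illustration.

```haskell
import Text.XML.HXT.Core
import Data.Char (toUpper)

-- Uppercase the first component, pass the second through unchanged.
upperName :: LA (String, String) (String, String)
upperName = arr (map toUpper) *** returnA

single :: [(String, String)]
single = runLA upperName ("name", "author")

-- Component arrows with several results: every combination is produced.
manyResults :: [(String, Int)]
manyResults = runLA (constL ["a", "b"] *** constL [1, 2]) ((), ())

main :: IO ()
main = do
  print single       -- [("NAME","author")]
  print manyResults  -- [("a",1),("a",2),("b",1),("b",2)]
```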
In practice, you probably go much further than this toy. Maybe you
compute something based on the name and the value; maybe you pass them
to another function or arrow for further processing; maybe in some
stage you need both of them and in some other stage you process them
separately. Still, the &&& and *** operators are an essential stepping
stone: they fork off dataflow, and then you can do whatever you want.
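As one illustration of going a step further, the sketch below processes the two components separately and then merges them into a single string. It is only an example under stated assumptions: it uses the pure LA arrow with runLA, and the names describe and report, the sample tuple, and the message format are all hypothetical.

```haskell
import Text.XML.HXT.Core
import Data.Char (toUpper)

-- Separate processing with ***, followed by a step that needs both parts.
describe :: LA (String, String) String
describe =
  arr (map toUpper) *** arr length
    >>> arr (\(name, len) ->
               name ++ " has a value of " ++ show len ++ " characters")

report :: [String]
report = runLA describe ("name", "viewport")

main :: IO ()
main = mapM_ putStrLn report  -- NAME has a value of 8 characters
```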
Name | from Module | Summary |
---|---|---|
&&& | Control.Arrow (GHC) | fork processing |
*** | Control.Arrow (GHC) | separate processing |
getAttrl | Text.XML.HXT.Arrow.XmlArrow | outputs attribute nodes |