Parsec Generally
Albert Y. C. Lai, trebla [at] vex [dot] netLet's begin with a very basic Parsec example. It parses a lot of numbers with + signs between them, e.g., 3+1+4, and computes the sum.
import Text.Parsec import Text.Parsec.String play :: String -> Either ParseError Integer play s = parse pmain "parameter" s pmain :: Parser Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum = read `fmap` many1 digit pplus = char '+' >> return (+)
Parser
is a special case. There are 3 directions of
generalization:
- user state (like StateT):
()
here - input type:
[Char]
here - combine with another monad:
Identity
here
And so Text.Parsec.String defines
type Parser a = ParsecT [Char] () Identity a
The next 3 sections show how to use other states, other input types, and other monads.
User State
This example shows using user state. I use Int
for my state,
and it counts how many numbers are successfully parsed.
import Text.Parsec play :: String -> Either ParseError (Integer,Int) play s = runParser pmain 0 "parameter" s -- 0 is the initial value -- type Parsec s u a = ParsecT s u Identity a pmain :: Parsec [Char] Int (Integer,Int) pmain = do x <- pnum `chainl1` pplus eof n <- getState -- obtain the user state value return (x,n) pnum = do x <- read `fmap` many1 digit modifyState (1 +) -- change the user state value return x pplus = char '+' >> return (+)
Input Type
Text and Bytestring
Besides [Char]
as input type, Text
(both lazy and
non-lazy) and Bytestring
(both lazy and non-lazy) are also
supported out of the box. This example shows non-lazy Text
;
feel free to try Bytestring
and lazy variants too.
import Text.Parsec import Text.Parsec.Text import Data.Text play :: Text -> Either ParseError Integer play s = parse pmain "parameter" s example :: Text example = pack "3+1+4" -- from Text.Parsec.Text: type Parser a = ParsecT Text () Identity a pmain :: Parser Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum = read `fmap` many1 digit pplus = char '+' >> return (+)
Almost no change from the basic example! This is because
Text.Parsec.Text provides an appropriate Stream
instance:
instance (Monad m) => Stream Text m Char
This instance says that Text
is an input type with
basic unit Char
. Then digit
and
char
work because they only require the basic unit to be
Char
, e.g., digit
has type:
digit :: Stream s m Char => ParsecT s u m Char
User Token Type
Another generalization: the basic unit of parsing does not have to be
Char
; it can be your own token type. Parsec already provides
this general Stream
instance:
instance Monad m => Stream [tok] m tok
Setting tok=Char
is exactly how [Char]
is a
supported input type with basic unit Char
.
We can set tok=Token
for our own Token
type,
and the input type is then [Token]
.
Our Token
type will have two cases: numbers and plus signs.
It also records line numbers and column numbers: Presumably, a tokenizer
converts a string or a file to these tokens, which is where line numbers
and column numbers come from. We have to record them because we have to pass
them to Parsec for error message purposes. Here is our Token
type:
data Token = N {lin,col :: Int, val :: Integer} | Plus {lin,col :: Int} instance Show Token where show N{val=x} = "number " ++ show x show Plus{} = "plus" -- instance Show Token is needed for eof so it can produce error messages
With our own basic unit instead of Char
, we cannot use
the nice basic parsers in Text.Parsec.Char; we must write our own. But we
can learn from how Text.Parsec.Char does it: by calling tokenPrim
.
tokenPrim
parses one token. It is customized by 3 parameters:
- a function for showing a token, for error messages
- a function for updating Parsec's internal state of file position (filename, line number, column number); this function receives the current position, the current token, and the input after; it should return the position of the next (not current) token
- a function to determine whether a token is accepted or reject, and
if accepted, what value to return; so the function returns
Nothing
to reject, aJust
to accept
tokenPrim :: (Stream s m t) => (t -> String) -> (SourcePos -> t -> s -> SourcePos) -> (t -> Maybe a) -> ParsecT s u m a
We use tokenPrim
to re-implement pnum
and
pplus
. pnum
accepts an N
token
and extracts its number value. pplus
accepts a Plus
token and returns (+)
.
Here is the complete code:
import Text.Parsec data Token = N {lin,col :: Int, val :: Integer} | Plus {lin,col :: Int} instance Show Token where show N{val=x} = "number " ++ show x show Plus{} = "plus" -- instance Show Token is needed for eof so it can produce error messages example, wrong, bad :: [Token] example = [N 1 3 111, Plus 1 10, N 2 1 222] wrong = [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333] bad = [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5] play :: [Token] -> Either ParseError Integer play s = parse pmain "parameter" s -- type Parsec s u a = ParsecT s u Identity a pmain :: Parsec [Token] () Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum = tokenPrim show update_pos get_num where get_num N{val=x} = Just x get_num _ = Nothing pplus = tokenPrim show update_pos is_plus where is_plus Plus{} = Just (+) is_plus _ = Nothing update_pos :: SourcePos -> Token -> [Token] -> SourcePos update_pos pos _ (tok:_) = setSourceLine (setSourceColumn pos (col tok)) (lin tok) update_pos pos _ [] = pos {- If there is no next token, I keep the position unchanged for simplicity. A better answer: use the length of the current token to deduce the ending position. -}
User Input Type
We now see how to support more input types. Still using Token
as the basic unit, let's just switch the container type: let's use
Vector
. We just need to provide a Stream
instance.
Its method uncons
simply needs to tell: Is the input empty?
If not, what are the head and the tail?
instance (Monad m) => Stream (Vector a) m a where uncons v | V.null v = return Nothing | otherwise = return (Just (V.head v, V.tail v))
With this instance, the parser needs little change, since it is not too sensitive to the input type. Here is the complete code:
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-} import Text.Parsec import qualified Data.Vector as V import Data.Vector(Vector) data Token = N {lin,col :: Int, val :: Integer} | Plus {lin,col :: Int} instance Show Token where show N{val=x} = "number " ++ show x show Plus{} = "plus" -- Show Token is needed for eof so it can produce error messages example, wrong, bad :: Vector Token example = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222] wrong = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333] bad = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5] play :: Vector Token -> Either ParseError Integer play s = parse pmain "parameter" s -- type Parsec s u = ParsecT s u Identity pmain :: Parsec (Vector Token) () Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum = tokenPrim show update_pos get_num where get_num N{val=x} = Just x get_num _ = Nothing pplus = tokenPrim show update_pos is_plus where is_plus Plus{} = Just (+) is_plus _ = Nothing update_pos :: SourcePos -> Token -> Vector Token -> SourcePos update_pos pos _ v | V.null v = pos -- for simplicity; there is a better answer | otherwise = setSourceLine (setSourceColumn pos (col tok)) (lin tok) where tok = V.head v instance (Monad m) => Stream (Vector a) m a where uncons v | V.null v = return Nothing | otherwise = return (Just (V.head v, V.tail v))
Monad
Parsec's monad is actually a monad transformer. In common usage it just
transforms Identity
to give a plain monad, but we can make it
transform other monads for extra features.
IO Monad
The first example transforms IO
. The parser can now print
messages after parsing each number. ParsecT
is also an instance
of MonadIO
when used this way, so we can just use
liftIO
.
import Text.Parsec import Control.Monad.IO.Class play :: String -> IO (Either ParseError Integer) play s = runParserT pmain () "parameter" s pmain :: ParsecT [Char] () IO Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum = do x <- read `fmap` many1 digit liftIO (putStrLn "bling!") return x pplus = char '+' >> return (+)
ContT Monad
If you know me, you know that I always sneak in ContT
whenever I get to choose a monad. Here it is: I make Parsec transform
Cont
(ContT
of Identity
).
ParsecT
is also an instance of MonadCont
when
used this way, so we can directly use callCC
.
In the following example, we add some bad code to pnum
, but we
use callCC
to skip the bad code. The parser behaves as original.
import Text.Parsec import Control.Monad.Cont play :: String -> Either ParseError Integer play s = runCont (runParserT pmain () "parameter" s) id pmain :: ParsecT [Char] () (Cont r) Integer pmain = do x <- pnum `chainl1` pplus eof return x pnum :: ParsecT [Char] () (Cont r) Integer pnum = callCC $ \k -> do x <- read `fmap` many1 digit k x -- the following two lines are skipped many1 letter return (x+10000) pplus :: ParsecT [Char] () (Cont r) (Integer -> Integer -> Integer) pplus = char '+' >> return (+)
Yielding Parser: Introduction
With call/cc and mutable references, we can make and use generators (the yield command), which is in many people's Parsec wish list! Because in common usage, Parsec parses the input completely and returns one complete answer, or parses until a parse error and returns no answer, rather than incrementally reporting intermediate results. This is not what some applications want: they may want the intermediate results up to the parse error, or they may want to avoid spending memory for the complete, huge answer. But now we wield the power to yield!
My basic implementation of generators is in the Appendix.
Here I just say how to use it. Use mkgen
to
create a generator from a body, like this:
g <- mkgen (\yield c0 -> do c1 <- yield a0 c2 <- yield a1 return b )
The body receives its yield
command and an initial value as
function parameters. Calling mkgen
does not start the body yet; it
only returns the generator. Calling the generator g
starts it
(providing the initial value c0
) and runs until hitting
yield
. Calling the generator again resumes (providing the return
values of yield
, here c1
or c2
). Like
this:
dot0 <- g c0 -- dot0 = More a0 dot1 <- g c1 -- dot1 = More a1 dot2 <- g c2 -- dot2 = End b
g
takes a parameter to be communicated to the body,
and returns a More
or End
depending on whether
the body is yielding or finishing. More
and End
are defined in the Yield module.
Here is a simple Parsec example with yield. The parser normally fetches a
character, yields it, increments a count, and repeats; but if it is told
False
when resuming, it finishes and returns the count as the
final answer. The caller of the parser just calls the generator a few times,
with the last time passing in False
and getting the count.
Therefore, though we use the infinite file /dev/zero as input, the parser
reads and finishes finitely.
import Text.Parsec import Control.Monad.Cont import Yield main :: IO (Either ParseError Int) main = runContT cmain return cmain :: ContT r IO (Either ParseError Int) cmain = do s <- liftIO (readFile "/dev/zero") g <- mkgen (\yield x -> runParserT (pmain (lift . yield) x) () "/dev/zero" s) g True >>= liftIO . print g True >>= liftIO . print End z <- g False return z pmain yield b = loop b 0 where loop False n = return n loop True n = do c <- anyChar b <- yield c loop b $! (n+1)
You may note that the yield
command and the generator do
not live in ParsecT
, but rather simply in ContT r IO
.
Accordingly, I need to add a level of lift
to the
yield
command for use inside the parser. I do this because it
suffices for my purpose. It can be done either way; there is no problem
creating and using a generator at the ParsecT
level, but it is
usually unnecessary because we usually do not use Parsec operations
between two invocations of the generator.
Yielding Parser: Tree Application
And now, we put Yield to good use. When we talk about incremental yielding parsers, we likely have XML parsing in mind! So here is a language much simpler than XML but with the same basic idea:
A tree is a tag, optionally followed by subtrees; subtrees must be enclosed in square brackets. A tag is one or more letters and/or digits. Spacing is usually optional, except that naturally two consecutive tags need spaces to separate.
Tree ::= Tag [ '[
' {Tree} ']
' ]
Tag ::= alphanum{alphanum}
An example is “f[x g[a b]]
”. The root tree has tag f and two
children. The first child has tag x and is a leaf. The second child has tag g
and two leaves, a and b.
The parser yields two kinds of events to mark the beginning and the end of
each subtree: after parsing a tag, yield a Start
event giving that
tag; after detecting the end of a subtree, yield a Stop
event
giving that subtree.
Here is the complete example:
import Control.Applicative((<*)) import Data.List(intercalate) import Text.Parsec import Control.Monad.Cont import Yield(mkgen) data Tree = Tree { tag :: String, sub :: [Tree] } instance Show Tree where show Tree{tag=t, sub=s} = case s of [] -> t _ -> t ++ "[" ++ intercalate " " (map show s) ++ "]" data Event = Start String | Stop Tree deriving Show pmain yield = spaces >> (ptree <* eof) where ptree = do name <- ptag yield (Start name) cs <- option [] (between popen pclose (many ptree)) let tree = Tree name cs yield (Stop tree) return tree ptag = (many1 alphaNum <?> "tagname") <* spaces popen = char '[' <* spaces pclose = char ']' <* spaces main :: IO () main = runContT cmain return cmain :: ContT r IO () cmain = do let s = "f [ x g[a b]" ++ cycle "huge[" g <- mkgen (\yield _ -> runParserT (pmain (lift . yield)) () "hardcoded" s) replicateM_ 12 (g () >>= liftIO . print)
Despite the infinite input, the whole program finishes because it does not resume the parser infinitely. It just obtains the first few events and quit.
This parser still spends memory to build the whole tree. You can change this
if you want and if the caller can live with it: do not build any subtree, do not
return any tree (return ()
instead), and at Stop
events, give just the tag and not a subtree.
Appendix: Generator and Yield
Here is the Yield module. It is explained in my Cont Monad article.
{-| Yield-style generators. -} module Yield where import Control.Monad.Cont import Data.IORef {-| Type of return values of generators. @More@ means the generator yields. @End@ means the generator finishes. See below for examples. -} data Dot a b = More a | End b deriving Show {-| Create a yield-style generator from a body. Example: > g <- mkgen (\yield c0 -> do > c1 <- yield a0 > c2 <- yield a1 > return b > ) Then we can use: > dot0 <- g c0 -- dot0 = More a0 > dot1 <- g c1 -- dot1 = More a1 > dot2 <- g c2 -- dot2 = End b Further calls to @g@ return @End b@ too. Although usually @m=n@, i.e., @g@ and the body are in the same monad as the @mkgen@ call, technically they can be different. The @mkgen@ call is in m with MonadIO for allocating IORef. The body and @g@ are in n with MonadIO and MonadCont for using IORef and callCC. -} mkgen :: (MonadIO m, MonadIO n, MonadCont n) => ((a -> n c) -> c -> n b) -> m (c -> n (Dot a b)) mkgen body = do inside <- liftIO (newIORef undefined) outside <- liftIO (newIORef undefined) let yield y = do o <- liftIO (readIORef outside) callCC (\ki -> do liftIO (writeIORef inside ki) o (More y) ) next x = do i <- liftIO (readIORef inside) callCC (\ko -> do liftIO (writeIORef outside ko) i x ) start x = do e <- body yield x liftIO (writeIORef inside (\_ -> return (End e))) o <- liftIO (readIORef outside) o (End e) undefined liftIO (writeIORef inside start) return next
I have more Haskell Notes and Examples