Parsec Generally

Albert Y. C. Lai, trebla [at] vex [dot] net

Let's begin with a very basic Parsec example. It parses a lot of numbers with + signs between them, e.g., 3+1+4, and computes the sum.

import Text.Parsec
import Text.Parsec.String

play :: String -> Either ParseError Integer
play s = parse pmain "parameter" s

pmain :: Parser Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum = read `fmap` many1 digit

pplus = char '+' >> return (+)

Parser is a special case. There are 3 directions of generalization:

user state (like StateT): () here
input type: [Char] here
combine with another monad: Identity here

And so Text.Parsec.String defines

type Parser a = ParsecT [Char] () Identity a

The next 3 sections show how to use other states, other input types, and other monads.

User State

This example shows using user state. I use Int for my state, and it counts how many numbers are successfully parsed.

import Text.Parsec

play :: String -> Either ParseError (Integer,Int)
play s = runParser pmain 0 "parameter" s  -- 0 is the initial value

-- type Parsec s u a = ParsecT s u Identity a
pmain :: Parsec [Char] Int (Integer,Int)
pmain = do
  x <- pnum `chainl1` pplus
  eof
  n <- getState  -- obtain the user state value
  return (x,n)

pnum = do
  x <- read `fmap` many1 digit
  modifyState (1 +)  -- change the user state value
  return x

pplus = char '+' >> return (+)

Input Type

Text and Bytestring

Besides [Char] as input type, Text (both lazy and non-lazy) and Bytestring (both lazy and non-lazy) are also supported out of the box. This example shows non-lazy Text; feel free to try Bytestring and lazy variants too.

import Text.Parsec
import Text.Parsec.Text
import Data.Text

play :: Text -> Either ParseError Integer
play s = parse pmain "parameter" s

example :: Text
example = pack "3+1+4"

-- from Text.Parsec.Text: type Parser a = ParsecT Text () Identity a
pmain :: Parser Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum = read `fmap` many1 digit

pplus = char '+' >> return (+)

Almost no change from the basic example! This is because Text.Parsec.Text provides an appropriate Stream instance:

instance (Monad m) => Stream Text m Char

This instance says that Text is an input type with basic unit Char. Then digit and char work because they only require the basic unit to be Char, e.g., digit has type:

digit :: Stream s m Char => ParsecT s u m Char

User Token Type

Another generalization: the basic unit of parsing does not have to be Char; it can be your own token type. Parsec already provides this general Stream instance:

instance Monad m => Stream [tok] m tok

Setting tok=Char is exactly how [Char] is a supported input type with basic unit Char. We can set tok=Token for our own Token type, and the input type is then [Token].

Our Token type will have two cases: numbers and plus signs. It also records line numbers and column numbers: Presumably, a tokenizer converts a string or a file to these tokens, which is where line numbers and column numbers come from. We have to record them because we have to pass them to Parsec for error message purposes. Here is our Token type:

data Token = N {lin,col :: Int, val :: Integer}
           | Plus {lin,col :: Int}
instance Show Token where
  show N{val=x} = "number " ++ show x
  show Plus{} = "plus"
-- instance Show Token is needed for eof so it can produce error messages

With our own basic unit instead of Char, we cannot use the nice basic parsers in Text.Parsec.Char; we must write our own. But we can learn from how Text.Parsec.Char does it: by calling tokenPrim.

tokenPrim parses one token. It is customized by 3 parameters:

a function for showing a token, for error messages
a function for updating Parsec's internal state of file position (filename, line number, column number); this function receives the current position, the current token, and the input after; it should return the position of the next (not current) token
a function to determine whether a token is accepted or reject, and if accepted, what value to return; so the function returns Nothing to reject, a Just to accept

tokenPrim :: (Stream s m t) =>
      (t -> String)
   -> (SourcePos -> t -> s -> SourcePos)
   -> (t -> Maybe a)
-> ParsecT s u m a

We use tokenPrim to re-implement pnum and pplus. pnum accepts an N token and extracts its number value. pplus accepts a Plus token and returns (+).

Here is the complete code:

import Text.Parsec

data Token = N {lin,col :: Int, val :: Integer}
           | Plus {lin,col :: Int}
instance Show Token where
  show N{val=x} = "number " ++ show x
  show Plus{} = "plus"
-- instance Show Token is needed for eof so it can produce error messages

example, wrong, bad :: [Token]
example = [N 1 3 111, Plus 1 10, N 2 1 222]
wrong = [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333]
bad = [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5]

play :: [Token] -> Either ParseError Integer
play s = parse pmain "parameter" s

-- type Parsec s u a = ParsecT s u Identity a
pmain :: Parsec [Token] () Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum = tokenPrim show update_pos get_num where
  get_num N{val=x} = Just x
  get_num _ = Nothing

pplus = tokenPrim show update_pos is_plus where
  is_plus Plus{} = Just (+)
  is_plus _ = Nothing

update_pos :: SourcePos -> Token -> [Token] -> SourcePos
update_pos pos _ (tok:_) = setSourceLine (setSourceColumn pos (col tok)) (lin tok)
update_pos pos _ [] = pos
{-
If there is no next token, I keep the position unchanged for simplicity.
A better answer: use the length of the current token to deduce the ending
position.
-}

User Input Type

We now see how to support more input types. Still using Token as the basic unit, let's just switch the container type: let's use Vector. We just need to provide a Stream instance. Its method uncons simply needs to tell: Is the input empty? If not, what are the head and the tail?

instance (Monad m) => Stream (Vector a) m a where
  uncons v | V.null v = return Nothing
           | otherwise = return (Just (V.head v, V.tail v))

With this instance, the parser needs little change, since it is not too sensitive to the input type. Here is the complete code:

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
import Text.Parsec
import qualified Data.Vector as V
import Data.Vector(Vector)

data Token = N {lin,col :: Int, val :: Integer}
           | Plus {lin,col :: Int}
instance Show Token where
  show N{val=x} = "number " ++ show x
  show Plus{} = "plus"
-- Show Token is needed for eof so it can produce error messages

example, wrong, bad :: Vector Token
example = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222]
wrong = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333]
bad = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5]

play :: Vector Token -> Either ParseError Integer
play s = parse pmain "parameter" s

-- type Parsec s u = ParsecT s u Identity
pmain :: Parsec (Vector Token) () Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum = tokenPrim show update_pos get_num where
  get_num N{val=x} = Just x
  get_num _ = Nothing

pplus = tokenPrim show update_pos is_plus where
  is_plus Plus{} = Just (+)
  is_plus _ = Nothing

update_pos :: SourcePos -> Token -> Vector Token -> SourcePos
update_pos pos _ v
  | V.null v = pos  -- for simplicity; there is a better answer
  | otherwise = setSourceLine (setSourceColumn pos (col tok)) (lin tok)
  where tok = V.head v

instance (Monad m) => Stream (Vector a) m a where
  uncons v | V.null v = return Nothing
           | otherwise = return (Just (V.head v, V.tail v))

Monad

Parsec's monad is actually a monad transformer. In common usage it just transforms Identity to give a plain monad, but we can make it transform other monads for extra features.

IO Monad

The first example transforms IO. The parser can now print messages after parsing each number. ParsecT is also an instance of MonadIO when used this way, so we can just use liftIO.

import Text.Parsec
import Control.Monad.IO.Class

play :: String -> IO (Either ParseError Integer)
play s = runParserT pmain () "parameter" s

pmain :: ParsecT [Char] () IO Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum = do
  x <- read `fmap` many1 digit
  liftIO (putStrLn "bling!")
  return x

pplus = char '+' >> return (+)

ContT Monad

If you know me, you know that I always sneak in ContT whenever I get to choose a monad. Here it is: I make Parsec transform Cont (ContT of Identity). ParsecT is also an instance of MonadCont when used this way, so we can directly use callCC.

In the following example, we add some bad code to pnum, but we use callCC to skip the bad code. The parser behaves as original.

import Text.Parsec
import Control.Monad.Cont

play :: String -> Either ParseError Integer
play s = runCont (runParserT pmain () "parameter" s) id

pmain :: ParsecT [Char] () (Cont r) Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x

pnum :: ParsecT [Char] () (Cont r) Integer
pnum = callCC $ \k -> do
  x <- read `fmap` many1 digit
  k x
  -- the following two lines are skipped
  many1 letter
  return (x+10000)

pplus :: ParsecT [Char] () (Cont r) (Integer -> Integer -> Integer)
pplus = char '+' >> return (+)

Yielding Parser: Introduction

With call/cc and mutable references, we can make and use generators (the yield command), which is in many people's Parsec wish list! Because in common usage, Parsec parses the input completely and returns one complete answer, or parses until a parse error and returns no answer, rather than incrementally reporting intermediate results. This is not what some applications want: they may want the intermediate results up to the parse error, or they may want to avoid spending memory for the complete, huge answer. But now we wield the power to yield!

My basic implementation of generators is in the Appendix. Here I just say how to use it. Use mkgen to create a generator from a body, like this:

g <- mkgen (\yield c0 -> do
               c1 <- yield a0
               c2 <- yield a1
               return b
           )

The body receives its yield command and an initial value as function parameters. Calling mkgen does not start the body yet; it only returns the generator. Calling the generator g starts it (providing the initial value c0) and runs until hitting yield. Calling the generator again resumes (providing the return values of yield, here c1 or c2). Like this:

dot0 <- g c0  -- dot0 = More a0
dot1 <- g c1  -- dot1 = More a1
dot2 <- g c2  -- dot2 = End b

g takes a parameter to be communicated to the body, and returns a More or End depending on whether the body is yielding or finishing. More and End are defined in the Yield module.

Here is a simple Parsec example with yield. The parser normally fetches a character, yields it, increments a count, and repeats; but if it is told False when resuming, it finishes and returns the count as the final answer. The caller of the parser just calls the generator a few times, with the last time passing in False and getting the count. Therefore, though we use the infinite file /dev/zero as input, the parser reads and finishes finitely.

import Text.Parsec
import Control.Monad.Cont
import Yield

main :: IO (Either ParseError Int)
main = runContT cmain return

cmain :: ContT r IO (Either ParseError Int)
cmain = do
  s <- liftIO (readFile "/dev/zero")
  g <- mkgen (\yield x -> runParserT (pmain (lift . yield) x) () "/dev/zero" s)
  g True >>= liftIO . print
  g True >>= liftIO . print
  End z <- g False
  return z

pmain yield b = loop b 0 where
  loop False n = return n
  loop True n = do
    c <- anyChar
    b <- yield c
    loop b $! (n+1)

You may note that the yield command and the generator do not live in ParsecT, but rather simply in ContT r IO. Accordingly, I need to add a level of lift to the yield command for use inside the parser. I do this because it suffices for my purpose. It can be done either way; there is no problem creating and using a generator at the ParsecT level, but it is usually unnecessary because we usually do not use Parsec operations between two invocations of the generator.

Yielding Parser: Tree Application

And now, we put Yield to good use. When we talk about incremental yielding parsers, we likely have XML parsing in mind! So here is a language much simpler than XML but with the same basic idea:

A tree is a tag, optionally followed by subtrees; subtrees must be enclosed in square brackets. A tag is one or more letters and/or digits. Spacing is usually optional, except that naturally two consecutive tags need spaces to separate.

Tree ::= Tag [ '[' {Tree} ']' ]
Tag ::= alphanum{alphanum}

An example is “f[x g[a b]]”. The root tree has tag f and two children. The first child has tag x and is a leaf. The second child has tag g and two leaves, a and b.

The parser yields two kinds of events to mark the beginning and the end of each subtree: after parsing a tag, yield a Start event giving that tag; after detecting the end of a subtree, yield a Stop event giving that subtree.

Here is the complete example:

import Control.Applicative((<*))
import Data.List(intercalate)
import Text.Parsec
import Control.Monad.Cont
import Yield(mkgen)

data Tree = Tree { tag :: String, sub :: [Tree] }
instance Show Tree where
  show Tree{tag=t, sub=s} = case s of
    [] -> t
    _ -> t ++ "[" ++ intercalate " " (map show s) ++ "]"

data Event = Start String | Stop Tree deriving Show

pmain yield = spaces >> (ptree <* eof)
  where
    ptree = do
      name <- ptag
      yield (Start name)
      cs <- option [] (between popen pclose (many ptree))
      let tree = Tree name cs
      yield (Stop tree)
      return tree
    ptag = (many1 alphaNum <?> "tagname") <* spaces
    popen = char '[' <* spaces
    pclose = char ']' <* spaces

main :: IO ()
main = runContT cmain return

cmain :: ContT r IO ()
cmain = do
  let s = "f [ x g[a b]" ++ cycle "huge["
  g <- mkgen (\yield _ -> runParserT (pmain (lift . yield)) () "hardcoded" s)
  replicateM_ 12 (g () >>= liftIO . print)

Despite the infinite input, the whole program finishes because it does not resume the parser infinitely. It just obtains the first few events and quit.

This parser still spends memory to build the whole tree. You can change this if you want and if the caller can live with it: do not build any subtree, do not return any tree (return () instead), and at Stop events, give just the tag and not a subtree.

Appendix: Generator and Yield

Here is the Yield module. It is explained in my Cont Monad article.

{-| Yield-style generators. -}
module Yield where

import Control.Monad.Cont
import Data.IORef

{-| 
Type of return values of generators. @More@ means the generator yields. @End@
means the generator finishes. See below for examples.
-}
data Dot a b = More a | End b deriving Show

{-|
Create a yield-style generator from a body. Example:

> g <- mkgen (\yield c0 -> do
>                c1 <- yield a0
>                c2 <- yield a1
>                return b
>            )

Then we can use:

> dot0 <- g c0  -- dot0 = More a0
> dot1 <- g c1  -- dot1 = More a1
> dot2 <- g c2  -- dot2 = End b

Further calls to @g@ return @End b@ too.

Although usually @m=n@, i.e., @g@ and the body are in the same monad as the
@mkgen@ call, technically they can be different. The @mkgen@ call
is in m with MonadIO for allocating IORef. The body and @g@ are in n with
MonadIO and MonadCont for using IORef and callCC.
-}
mkgen :: (MonadIO m, MonadIO n, MonadCont n) =>
         ((a -> n c) -> c -> n b) -> m (c -> n (Dot a b))
mkgen body = do
  inside <- liftIO (newIORef undefined)
  outside <- liftIO (newIORef undefined)
  let yield y = do
        o <- liftIO (readIORef outside)
        callCC (\ki -> do
                   liftIO (writeIORef inside ki)
                   o (More y)
               )
      next x = do
        i <- liftIO (readIORef inside)
        callCC (\ko -> do
                   liftIO (writeIORef outside ko)
                   i x
               )
      start x = do
        e <- body yield x
        liftIO (writeIORef inside (\_ -> return (End e)))
        o <- liftIO (readIORef outside)
        o (End e)
        undefined
  liftIO (writeIORef inside start)
  return next

I have more Haskell Notes and Examples

Contents