Sunday, March 16, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 4: Abstract Syntax Trees


Associated with parsing is the simultaneous construction of an abstract
syntax tree (AST). In this final article we will investigate how this
is achieved with combinator parsers.

When a parse succeeds the success object carries a result value that
represents the recognised element (together with the remaining input
to be parsed). In the case of the letter parser shown in part 1, the
recognised element is the single character:

assert letter('abc###') == ... success with 'a' as the result ...

Equally, the parser:

def letterAndDigit = seqC(letter, digit)
assert letterAndDigit('a123###') == ... success with ['a', '1'] as the result ...

Because we are using a seqC combinator, it returns a list of two elements
produced by the two parsers.

In part 2 we saw the parser for a Groovy identifier:

def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))

where the permissible characters that may start a Groovy identifier are
defined by the parser identifierStart, and all subsequent characters by
the parser identifierRest. The noneOrMoreC combinator will produce a
list of the result characters:

assert identifier('bbc2###') == ... success with ['b', ['b', 'c', '2']] ...

Here, the outer list is produced by seqC and the inner list through noneOrMoreC.
Normally, a lexer would wrap these individual characters into some token
that denotes the completed identifier. Consider that we have such a class:

class Identifier {
def identifier // String
}


The combinator mapC transform the value of a parser p into a different value:

def mapC = { p, f -> ... transform the result of parser p by f ... }

The parameter f is the closure that performs the transformation. Given:

def toIdentifier = { chList -> return new Identifier(identifier: chList[0] + chList[1].join()) }
def identifier = mapC(seqC(identifierStart, noneOrMoreC(identifierRest)), toIdentifier)
assert identifier('bbc2###') == ... success with Identifier object ...


A larger example would be the Groovy while statement:

whileStatement: "while" LPAREN strictContextExpression RPAREN compatibleBodyStatement


The corresponding definition is:

def whileStatement = mapC(seqCs([whileKeyword, lParen, expression, rParen, compatibleBodyStatement]), toWhileStatement)


with the mapping transformer constructing an object of the class WhileStatement:

class WhileStatement {
def expression
def body
}






Further Developments

The combinators described in this series of articles have been stable since the
summer of 2007. They have allowed me to develop a number of simple external DSLs.

Since a BNF grammar defining some language has a fixed structure, then it is
also possible to define combinators that will parse BNF production rules. Here
the BNF rules act as the external DSL. Action code added to the parser can
produce the combinators for the grammar automatically. In effect we have a
simple compiler-compiler.

Work is underway to produce a variant of the current combinators to give
support to error reporting when a parse detects an input error. This has
involved four small classes and relatively minor changes to the combinators.
This new work is presently being tested.

I am also investigating how the combinators scale when defining parsers for
large grammars. Specifically, I am using the combinators to produce parsers
for large subsets of Groovy. Currently I have successfully created parsers for
expressions, statements, declarations, class and interface definitions. I
anticipate being able to parse all the Groovy language, which will then be
used for various tools.

The work, however, is still very much a work-in-progress. For example, results
from developing a parser for a large grammar has revealed opportunities for
optimisations. So expect to see continual enhancements of the library.
The sources for this work will be posted shortly on the Groovy site. Users are welcome
to experiment and use the library. Feedback would be welcomed.

Ken Barclay

Sunday, March 9, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 3: Ambiguous Grammars

The previous article introduced the basic combinators provided by theGParsec library. We showed how they can be combined and how they define parsers that mirror the production rules of the grammar.

We regularly find that our production rules are mutually dependent. For example, with Groovy we have:

whileStatement : 'while' '(' expression ')' compatibleBodyStatement
ifStatement : ...
statement : whileStatement ifStatement ...
blockBody : ( statement separators )*
compoundStatement : '{' blockBody '}'
compatibleBodyStatement : compoundStatement statement





In this extract the compatibleBodyStatement is used to define a whileStatement. The compatibleBodyStatement is either a compoundStatement or a statement, and the latter is any of the permissible statements, including the whileStatement.

To define a Groovy parser for a whileStatement, we would first have to define a parser for compatibleBodyStatement. This would require a parser definition for statement which would, in turn, require whileStatement.

Mutual dependencies are readily handled using the lazyC combinator:


def lazyC = { p ->
return { inp ->
return p[0](inp)
}
}


Here p represents a proxy for the actual parser, held as the first element of a list. Our productions are now defined in Groovy as:


def compatibleBodyStatementProxy = []
def whileStatement = seqCs([whileKeyword, lParen, expression, rParen, lazyC(compatibleBodyStatementProxy)])
def ifStatement = ...
def statement = altCs([whileStatement, ifStatement, ...])
def blockBody = noneOrMoreC(seqC(statement, separators))
def compoundStatement = seq3C(lBrace, blockBody, rBrace)
def compatibleBodyStatement = altC(compoundStatement, statement)
compatibleBodyStatementProxy << compatibleBodyStatementProxy



that can then be used to define a whileStatement. At the end of the code we place the actual compatibleBodyStatement into the proxy.


Ambiguous Grammars

Many grammars are ambiguous and require an arbitrary lookahead to determine the correct alternative in a production rule to follow. For example, Groovy statements can begiven an identifying label. Potentially, that identifier may be a label on a statement or an identifier used on the left of an assignment statement. The Groovy production rule is:


statement : whileStatement ifStatement ... labelledStatement
labelledStatement : identifier ':' statement statement


The tryC combinator behaves like its parser parameter, but pretends its has notconsumed any input when it fails. Consider the parser altC(tryC(p), q). Even when the parser p fails while consuming some input, the choice combinator will try thealternate q. The resulting Groovy definitions are:


def statement = altCs([whileStatement, ifStatement, ..., labelledStatement])

def labelledStatement = altC(tryC(seq3C(identifier, colon, statement)), statement)





Compiling

In the final installment we shall consider how we can adapt our parsers to construct abstract syntax trees to represent the parsed input. We will also look at further developments of this work.

Ken Barclay


Sunday, March 2, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing

Part 2: Combinators

The first article introduced the satisfyC combinator. From it we are able to generate any number of parsers by providing a predicate test closure. For example, we were able to define simple parsers that recognised a letter or a digit in the inputstream.

The GParsec library comprises a number of such combinators. They are summarised in the table below.

satisfyC: The satisfyC combinator consumes a single input when the predicate succeeds.

altC (alt3C, altCs): is the choice combinator. Given two parsers it only looks at its second alternative if the first has not consumed any input - regardless of the final value. The two variants work the same way, alt3C is given three parsers and altCs a list of parsers.

seqC (seq3C, seqCs): is the sequencing combinator. It runs two parsers in succession and if successful, returns the result of the two parsers. Again, seq3C and seqCs are simple variants.


noneOrMoreC, oneOrMoreC: The combinator noneOrMoreC applies a parser zero or more times to an input stream. The result from each application of the parser are returned in a list. Not surprisingly, the combinator oneOrMoreC is used for non-empty sequences.

optionalC: The combinator optionalC may succeed in parsing some input. It always returns success.


Many of these combinators are programmed in a manner similar to combinator satisfyC. Some are defined in terms of the others. For example we might define:


def seq3C = { p, q, r -> return seqC(seqC(p, q), r) }
def oneOrMoreC = { p -> return seqC(p, noneOrMorec(p)) }


These definitions reveal how we might combine the combinators to construct useful parsers. For example, given the definition for the permissible characters that may start a Groovy identifier as identifierStart, and all subsequent characters by the parser identifierRest, then we might define the parser for an identifier as:


def identifierStart = altC(letter, charP('_'))
def identifierRest = altC(identifierStart, digit)
def identifier = seqC(identifierStart, noneOrMoreC(identifierRest))


Other simple parsers are:


def digits = oneOrMoreC(digit)
def integer = digits
def decimal = seqC(digits, seqC(period, digits))


A key characteristic of combinator parsers is they give rise to parser definitions that mimic the production rules of the grammar. For example, a variable definition in Groovy comprises one or more variable declarators separated by commas. A variable declarator is an identifier with an optional initializer. The productions are:


variableDeclarator : identifier ( '=' expression )?
variableDefinitions : variableDeclarator ( ',' variableDeclarator)*


Assuming we have parsers for single characters, identifier and expression, then we define the parsers as:


def variableDeclarator = seqC(identifier, optionalC(seqC(assign, expression)))
def variableDefinitions = seqC(variableDeclarator, noneOrMoreC(seqC(comma, variableDeclarator)))


Observe the one-to-one correspondence between the production rules and the parser definitions.


Problem grammars

We have shown how we define parsers using the range of combinators provided by GParsec. However, some grammars present problems as a result of including ambiguities. This is the subject of our next article.

Ken Barclay

Wednesday, February 27, 2008

Groovy Combinator Parsing

Groovy Combinator Parsing
Part 1: Combinator Parsing

This work was inspired by the many recent articles on Domain Specific Languages (DSLs). Most authors when describing DSLs consider only internal DSLs i.e. those embedded in a host language such as Groovy and exploiting the meta-programming facilities. Commonly, external DSLs are given little coverage due to the perceived difficulties. Generally the problems are due to external DSLs requiring input from compiler technology. For most application programmers this is considered a specialised area and requiring a steep learning curve.

Most DSL articles refer the reader to generator parsers such as Lex/Yacc or Antlr. Undoubtedly these are specialist tools that will be unfamiliar to most application developers.

In contrast to parser generators, the functional programming community have developed combinator parsers. They offer a set of combinators to express grammars. These combinators are manipulated as first class values and can be combined to define new combinators for the application. One advantage of combinator parsers is they avoid the integration of differing compiler tools.

The JParsec library offers a Java implementation for a combinator parser. This is a high quality product with some excellent features. However, it is a Java implementation constructed in a classical OO manner. It features a number of classes, some quite complex with a large number of methods.

The GParsec library described in this series of articles demonstrates the construction and application of a combinator parser library based on first principles. Further, in developing the library I showcase some of the excellent features of Groovy.


A Simple Combinator

A common requirement of a parser is to ensure the next part of the input satisfies some grammar rule. For example in Groovy we might expect an identifier to follow the class keyword in a class declaration. The combinator satisfyC is used for this purpose. It is developed as a Groovy closure and has the form:


def satisfyC = { pred ->
return { inp ->
if(inp.size() == 0)
return ... fail ...
else if(pred(inp[0]))
return ... success ...
else
return ... fail ...
}
}



The satisfyC closure accepts a single parameter and returns a closure object.
The returned closure is the required parser. Observe how calling this closure with differing actual parameters will deliver a family of different parsers.

The pred parameter of closure satisfyC is used in the definition of the returned closure. This is sometimes known as a free variable. It is bound to the actual parameter when satisfyC is called and is the reason for the differing parsers that can be delivered by this one closure definition.

The pred parameter of satisfyC is a predicate closure, one that is applied to a single parameter and returns a boolean result. It is used to test the first item in the input stream, denoted by the inp parameter. If the test succeeds then the parser reports success otherwise it reports failure when the test fails or when it encounters the end of the input.

The success and failure of a parse is captured by simple objects. A successful parse will generate and object that has the result of the parse and the remaining input as its state.

We can start to produce simple character parsers with satisfyC:


def isLetter = { ch -> return Character.isLetter(ch) }
def isDigit = { ch -> return Character.isDigit(ch) }
def letter = satisfyC(isLetter)
def digit = satisfyC(isDigit)
assert letter('abc###') == ... success with bc### still in the input stream ...
assert letter('123###') == ... fail ...
assert digit('123###') == ... success with 23### still in the input stream



Where the parse succeeds, then the character that has been recognised is maintained
as the result by the success object. The remainder of the input is also maintained
by the success object. This remaining input is then made available for processing
by another parser.


The GParsec Library

The library comprises a number of such combinators. Many are no more complex than satisfyC. Some are simply combinations of others. The complete library is only some 200 LOC! In the second article in this series we shall consider these combinators and how they are combined to construct useful parsers.

Ken Barclay



Tuesday, January 8, 2008

Google's MapReduce as Groovy closures

Google’s MapReduce programming model [http://labs.google.com/papers/mapreduce.html] serves for processing and generating large data sets. Users specify a mapper function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reducer function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Here we illustrate two simple examples: (a) for counting occurrences of words in documents; and (b) for creating an inverted index of the words in documents.

Both examples use the same mapReduce function, implemented as a Groovy closure. The function accepts two parameters representing, respectively, the mapper and reducer functions (also closures). The mapper, reducer and mapReduce functions exploit many of the functional programming techniques supported through Groovy closures. To aid the development we have prepared an abstract class ListFunctions which defines many of the standard list processing functions (closures) such as map, filter, composition, etc. The abstract class MapFunctions provides similar functions when applied to Maps.


Example 1: Word count

import com.adt.fp.ListFunctions as LF
import com.adt.fp.MapFunctions as MF

def mapReduce(mapper, reducer) {
def reducePerKey = LF.composition.curry(MF.mapWithKey.curry({ k, v -> v}),
LF.composition.curry(MF.filterWithKey.curry({ k, v -> (v != null)}),
MF.mapWithKey.curry(reducer)))

def groupByKey = { kvList ->
def insert = { mp, kv -> return MF.insertWith(LF.join, kv[0], [kv[1]], mp) }
return LF.foldL(insert, MF.empty(), kvList)
}

def mapPerKey = LF.composition.curry(LF.concat,
LF.composition.curry(LF.map.curry(LF.uncurry(mapper)), MF.toList))

return LF.composition.curry(reducePerKey, LF.composition.curry(groupByKey, mapPerKey))
}

def flip = { f, x, y -> return f(y, x) }
def mkPair = { x, y -> return [x, y] }

def tokenize = { str -> return str.tokenize() }
def sum = LF.foldL.curry({ x, y -> x + y }, 0)

def occurMapper = { key, words -> return LF.map(flip.curry(mkPair, 1), tokenize(words)) }
def occurReducer = { key, occurs -> sum(occurs) }

def wordOccurrenceCount = mapReduce(occurMapper, occurReducer)

def docs = ['doc1': 'fold the fold', 'doc2': 'appreciate the unfold']

assert wordOccurrenceCount(docs) == ["unfold":1, "the":2, "appreciate":1, "fold":2]

Example 2: Inverted index

import com.adt.fp.ListFunctions as LF
import com.adt.fp.MapFunctions as MF

def mapReduce(mapper, reducer) {
// …
}

def flip = { f, x, y -> return f(y, x) }
def mkPair = { x, y -> return [x, y] }

def tokenize = { str -> return str.tokenize() }

def indexMapper = { key, words -> return LF.map(flip.curry(mkPair, key), tokenize(words)) }
def indexReducer = { key, docs -> docs.unique().sort() }

def invertedIndex = mapReduce(indexMapper, indexReducer)

def docs = ['doc1': 'fold the fold', 'doc2': 'appreciate the unfold']

assert invertedIndex(docs) ==
["unfold":["doc2"], "the":["doc1", "doc2"], "appreciate":["doc2"], "fold":["doc1"]]


The mapReduce function returns the function (closure) that performs the required processing. It is defined in terms of its mapper function parameter and its reducer function parameter. The definition for mapReduce is in terms of its support functions mapPerKey, groupByKey and reducePerkey [http://www.cs.vu.nl/~ralf/MapReduce/]. The helper function mapPerKey exports the Map input data to a List then maps the mapper function over this List. The support function groupByKey groups intermediate values by intermediate keys. Finally, reducePerKey maps the reducer function over the groups of intermediate data.

In Example 1, the mapper function makes the pairs [word, 1] for every word in the source input. It is implemented by mapping the function flip.curry(mkPair, 1) over each word. This clever function uses mkPair, to produce the pair and flip ensures they are in the correct order. The reducer function for this example simply sums each of the 1s in the intermediate values.

In Example 2, the mapper function delivers the pairs [documentedID, word] for every word in the source input. The reducer function removes any duplicates and sorts the List of documentIDs from the intermediate values.

Sample code and a more expansive decription can be found at [Groovy MapReduce].