In addition to regular Packrat / Parsing Grammar / TDPL features Esrap supports
Esrap was originally written by Nikodemus Siivola. It is now developed and maintained by Jan Moringen.
Esrap is maintained in Git:
git clone -b stable git://github.com/scymtym/esrap.git
will get you a local copy (omit -b stable
to get the latest
development version).
is the GitHub project page.
Esrap is licenced under an MIT-style licence.
For more on packrat parsing, see http://pdos.csail.mit.edu/~baford/packrat/thesis/ for Bryan Ford’s 2002 thesis: “Packrat Parsing: a Practical Linear Time Algorithm with Backtracking”.
For left recursion support in packrat parsers, see http://www.vpri.org/pdf/tr2007002_packrat.pdf for A. Warth et al’s 2008 paper: “Packrat Parsers Can Support Left Recursion”.
Parsing proceeds by matching text against parsing expressions. Matching has three components: success vs failure, consumption of input, and associated production.
Parsing expressions that fail never consume input. Parsing expressions that succeed may or may not consume input.
A parsing expressions can be:
A terminal is a character or a string of length one, which succeeds and consumes a single character if that character matches the terminal.
Additionally, Esrap supports some pseudoterminals.
character
always succeeds, consuming
and producing a single character.
(character-ranges range ...)
match a
single character from the given range(s), consuming and producing that
character. A range can be either a list of the form (#\start_char
#\stop_char)
or a single character.
"foo"
succeeds and consumes input as if (and #\f #\o
#\o)
. Produces the consumed string.
(string length)
can be used to specify
sequences of arbitrary characters: (string 2)
succeeds and
consumes input as if (and character character)
. Produces the
consumed string.
Nonterminals are specified using symbols. A nonterminal symbol succeeds if the parsing expression associated with it succeeds, and consumes whatever the input that expression consumes.
The production of a nonterminal depends on the associated expression and an optional transformation rule.
Nonterminals are defined using defrule
.
Note: Currently all rules share the same namespace, so you should not use symbols in the COMMON-LISP package or other shared packages to name your rules unless you are certain there are no other Esrap using components in your Lisp image. In a future version of Esrap grammar objects will be introduced to allow multiple definitions of nonterminals. Symbols in the COMMON-LISP package are specifically reserved for use by Esrap.
(and subexpression ...)
A sequence succeeds if all subexpressions succeed, and consumes all input consumed by the subexpressions. A sequence produces the productions of its subexpressions as a list.
(or subexpression ...)
An ordered choice succeeds if any of the subexpressions succeeds, and consumes all the input consumed by the successful subexpression. An ordered choice produces whatever the successful subexpression produces.
Subexpressions are checked strictly in the specified order, and once a subexpression succeeds no further ones will be tried.
(not subexpression)
A negation succeeds if the subexpression fails, and consumes one character of input. A negation produces the character it consumes.
(* subexpresssion)
A greedy repetition always succeeds, consuming all input consumed by applying subexpression repeatedly as long as it succeeds.
A greedy repetition produces the productions of the subexpression as a list.
(+ subexpresssion)
A greedy repetition succeeds if subexpression succeeds at least once, and consumes all input consumed by applying subexpression repeatedly as long as it succeeds. A greedy positive repetition produces the productions of the subexpression as a list.
(? subexpression)
Optionals always succeed, and consume whatever input the subexpression
consumes. An optional produces whatever the subexpression produces, or
nil
if the subexpression does not succeed.
(& subexpression)
A followed-by predicate succeeds if the subexpression succeeds, and consumes no input. A followed-by predicate produces whatever the subexpression produces.
(! subexpression)
A not-followed-by predicate succeeds if the subexpression does not
succeed, and consumes no input. A not-followed-by predicate
produces nil
.
(< amount subexpression)
A lookbehind succeeds if subexpression
succeeds at the input
position reached by moving backward amount
, a positive integer,
characters from the current position and consumes no input. A
lookbehind produces whatever subexpression
produces.
(> amount subexpression)
A lookahead succeeds if subexpression
succeeds at the input
position reached by moving forward amount
, a positive integer,
characters from the current position and consumes no input. A
lookahead produces whatever subexpression
produces.
(predicate-name subexpression)
The predicate-name
is a symbol naming a global function. A
semantic predicate succeeds if subexpression succeeds and the
named function returns true for the production of the subexpression. A
semantic predicate produces whatever the subexpression produces.
Note: semantic predicates may change in the future to produce whatever the predicate function returns.
(function function-name)
function-name
is a symbol naming a global
function. function-name
’s lambda-list has to be compatible to
(text position end)
where text
is the whole input and
position
and end
indicate the maximal subsequence
function-name
should attempt to parse.
A function terminal succeeds if either
function-name
returns T
as its third value.
function-name
returns nil
as its third value (or returns
only two values) and nil
as its second value. This indicates that
the entire remaining input has been consumed.
function-name
returns nil
as its third value (or returns
only two values) and an integer > position
as its second value
indicating the position up to which text
has been consumed.
function-name
returns a value of type successful-parse
as
its first value.
When a function terminal succeeds, the first return value is an arbitrary production.
A function terminal fails if either
function-name
returns two values: an ignored value and
position
. Returning position
indicates that no progress
has been made.
function-name
returns three values: an ignored value, nil
or an integer >= position
and a string or a condition explaining
the failure. In this case, when the second value is not nil
, it
indicates the exact position of the failure.
function-name
returns a value of type error-result
as its
first value.
Note that rules which use functions as terminals do not automatically pick up redefinitions of the used functions. For that to happen, the rules have to be redefined as well.
See example-function-terminals.lisp for examples.
One aspect of designing Esrap rules is left recursion. A direct left recursive rule is of the form
(defrule left-recursion (or (and left-recursion STUFF) ALTERNATIVES))
The simplest indirect left recursive rule is of the form
(defrule left-recursion.1 left-recursion.2) (defrule left-recursion.2 (or (and left-recursion.1 STUFF) ALTERNATIVES))
Esrap can handle both kinds of left recursive rules, but the linear-time
guarantee generally no longer holds in such cases. The special variable
*on-left-recursion*
can be set to either nil
or
:error
to control Esrap’s behavior with respect to allowing left
recursion.
See example-left-recursion.lisp for examples.
Define symbol
as a nonterminal, using expression
as associated the parsing expression.
Multiple options
specifying transforms are composed in the order of
appearance:
(:text t) (:function parse-integer) => (alexandria:compose #'parse-integer #'text)
Following options
can be specified:
(:when test)
The rule is active only when test
evaluates to true. This can be used
to specify optional extensions to a grammar.
This option can only be supplied once.
(:constant constant)
No matter what input is consumed or what expression
produces, the production
of the rule is always constant
.
(:function function)
If provided the production of the expression is transformed using
function
. function
can be a function name or a lambda-expression.
(:identity boolean)
If true, the production of expression is used as-is, as if (:function identity)
has been specified. If no production option is specified, this is the default.
(:text boolean)
If true, the production of expression is flattened and concatenated into a string
as if by (:function text)
has been specified.
(:lambda lambda-list &body body)
If provided, same as using the corresponding lambda-expression with :function
.
As an extension of the standard lambda list syntax, lambda-list
accepts
the optional pseudo lambda-list keyword esrap:&bounds
, which (1)
must appear
after all standard lambda list keywords. (2)
can be followed by one or two
variables to which bounding indexes of the matching substring are bound.
Therefore:
lambda-list
::=
(standard-lambda-list-elements [&bounds start [end]])
(:destructure destructuring-lambda-list &body body)
If provided, same as using a lambda-expression that destructures its argument
using destructuring-bind
and the provided lambda-list with :function
.
destructuring-lambda-list
can use esrap:&bounds
in the same way
as described for :lambda
.
(:around ([&bounds start [end]]) &body body)
If provided, execute body
around the construction of the production of the
rule. body
has to call esrap:call-transform
to trigger the computation of
the production. Any transformation provided via :lambda
, :function
or :destructure
is executed inside the call to esrap:call-transform
. As a
result, modification to the dynamic state are visible within the
transform.
esrap:&bounds
can be used in the same way as described for :lambda
and :destructure
.
This option can be used to safely track nesting depth, manage symbol tables or for other stack-like operations.
:error-report
( t
| nil
| :context
| :detail
))
Defaults to t
if not provided. Controls whether and how the rule
is used in parse error reports:
t
The rule is used in parse error reports without restriction (i.e. when describing the context of a failure as well as listing failed rules and expected inputs).
nil
The rule is not used in parse error reports in any capacity. In particular, inputs expected by the rule are not mentioned.
This value is useful for things like whitespace rules since something like "expected space, tab or newline", even if it would have allowed the parser to continue for one character, is rarely helpful.
:context
The rule is used in the "context" part of parse error reports. The rule is neither mentioned in the list of failed rules nor are inputs expected by it.
:detail
The rule is not used in the "context" part of parse error reports, but can appear in the list of failed rules. Inputs expected by the rule are mentioned as well.
Parses text
using expression
from start
to end
.
Incomplete parses, that is not consuming the entirety of text
, are
allowed only if junk-allowed
is true.
Returns three values:
1) A production, if the parse succeeded, nil
otherwise.
2) The position up to which text
has been consumed or nil
if the
entirety of text
has been consumed.
3) If the parse succeeded, even if it did not consume any input, t
is
returned as a third value.
The third return value is necessary to distinguish successful and failed parses for cases like
(parse '(! #\a) "a" :junk-allowed t) (parse '(! #\a) "b" :junk-allowed t)
in which the first two return values cannot indicate failures.
raw
controls whether the parse result is interpreted and translated
into the return values described above. If raw
is true, a parse result
of type result
or error-result
is returned as a single value.
Note that the combination of arguments :junk-allowed t :raw t does not
make sense since the junk-allowed
parameter is used when parse results
are interpreted and translated into return values which does not
happen when :raw t.
Prints the grammar tree rooted at nonterminal symbol
to stream
for human
inspection.
Arguments must be strings, or lists whose leaves are strings. Catenates all the strings in arguments into a single string.
Associates rule
with the nonterminal symbol
. Signals an error if the
rule is already associated with a nonterminal. If the symbol is already
associated with a rule, the old rule is removed first.
Modifies the nonterminal symbol
to use expression
instead. Temporarily
removes the rule while it is being modified.
Returns rule designated by symbol
, if any. Symbol must be a nonterminal
symbol.
Makes the nonterminal symbol
undefined. If the nonterminal is defined an
already referred to by other rules, an error is signalled unless :force
is
true.
Returns the dependencies of the rule:
primary value is a list of defined
nonterminal symbols, and secondary value is a list of undefined nonterminal
symbols.
Return the parsing expression associated with the rule
.
Modify rule
to use expression
as the parsing expression. The rule must be
detached beforehand.
Returns the nonterminal associated with the rule
, or nil
if the
rule is not attached to any nonterminal.
automatically generated reader method
Turn on tracing of nonterminal symbol
.
If recursive
is true, turn on tracing for the whole grammar rooted at
symbol
. If recursive
is a positive integer, turn on tracing for all
rules reachable from the nonterminal symbol
in that number of steps.
If break
is true, break is entered when the rule is invoked.
If supplied, condition
has to be a function whose lambda-list is
compatible to (symbol text position end). This function is called to
determine whether trace actions should be executed for the traced
rule.
symbol
is the name of the rule being executed.
text
is the whole text being parsed.
position
is the position within text
at which the rule is executed.
end
is the end position of the portion of text
being parsed.
Turn off tracing of nonterminal symbol
.
If recursive
is true, turn off tracing for the whole grammar rooted at
symbol
. If recursive
is a positive integer, turn off tracing for all
rules reachable from the nonterminal symbol
in that number of steps.
break
and condition
are ignored, and are provided only for symmetry
with trace-rule
.
Return a list of terminals or tree of expressions with which a text
parsable by expression
can start.
A tree instead of a list is returned when expression
contains
semantic predicates, not
or !
. Elements in the returned list or
tree are
(predicate-name nested-elements)
where nested-elements
is the list of start terminals of the
expression to which predicate-name
is applied.
not
and !
expressions are represented as
({not,!} NESTED-ELEMENTS)
where nested-elements
is the list of start terminals of the
negated expression.
<
and >
expressions are represented as
({<,>} offset
NESTED-ELEMENTS)
where offset
is a positive integer and nested-elements
is the
list of start terminals of the expression that should match
offset
characters backward/forward from the current position.
The (outermost) list is sorted likes this:
1
. string terminals
2
. character terminals
3
. the character
wildcard terminal
4
. semantic predicates
5
. everything else
If supplied, when-rule-error-report
restricts processing of
nonterminals to rules whose :error-report
option is compatible with
the value of when-rule-error-report
.
Print a description of terminal
onto stream
.
In additional to actual terminals, terminal
can be of the forms
(PREDICATE-NAME TERMINALS) ({not,!} TERMINALS) ({<,>} OFFSET TERMINALS)
(i.e. as produced by EXPRESSION-START-TERMINALS).
This special variable controls Esrap’s behavior with respect to allowing left recursion.
When :error
, parse
signals a left-recursion
error when it encounters a
left recursive rule. Otherwise the rule is processed.
Note: when processing left recursive rules, linear-time guarantees generally no longer hold.
Return the input position at which the parse failure represented
by condition
occurred.
Return the result associated to the parse error represented by
condition
.
Return the context result associated to the parse error
represented by condition
.
Class precedence list: esrap-error, parse-error, error, serious-condition, condition, t
Signaled when an Esrap parse fails. Use esrap-error-text
to obtain the
string that was being parsed, and esrap-error-position
the position at which
the error occurred.
Class precedence list: left-recursion, esrap-error, parse-error, error, serious-condition, condition, t
May be signaled when left recursion is detected during Esrap parsing.
left-recursion-nonterminal
names the symbol for which left recursion
was detected, and left-recursion-path
lists nonterminals of which the
left recursion cycle consists.
Note: This error is only signaled if *on-left-recursion*
is bound
to :error
.
Class precedence list: esrap-parse-error, esrap-error, parse-error, error, serious-condition, condition, t
This error is signaled when a parse attempt fails in a way that .
Class precedence list: undefined-rule-error, undefined-rule, error, serious-condition, condition, t
Signaled when an undefined rule is encountered.