Ruby logo
In this article, I want to describe how to write a DSL / parser in Ruby with a treetop parser.

Writing the Grammar and Parser in Ruby first has the advantage of interactivity. Ruby is interpreted and has a very quick startup time.

To get going I’ll start writing a JSON parser. This may sound strange as there are already JSON parsers out there, but I don’t try to replace them, but rather show the concepts along that line rather simple grammar.

At the end, I will also give you a peek at a grammar for a more complex DSL for alert trigger creation.

To get an overview of the grammar, have a look at JSON.org.

The Ruby parser

For Ruby, I have chosen a treetop parser, which is a Parsing Expression Grammar (PEG).

Grammars in it look like this:

Basic setup of a grammar file.
grammar Grammarname (1)
  rule FirstRule (2)
   .. projections..
  end
  rule NextRule
   .. projections..
  end
end
1 The name of the grammar determines the name of the parser. See below
2 The first rule is the starting point of the grammar.

For our Json parser the top of the file json-simple.treetop would look like this:

# A treetop grammar for JSON
grammar JSONGrammar
  rule json            (1)
    array / object     (2)
  end
1 Topmost rule, parsing starts here
2 Content of the rule used for matching

A Json object would thus be either an array or an object. As both are not terminal symbols in a simple character or string (like 'this'), more rules are needed to resolve this. We will look at them below.

Before we continue with the grammar, let’s have a look at how to compile the grammar and use it.

Compiling the grammar

Treetop brings a command line tool called tt, which will compile the grammar into Ruby code.

Compiling the grammar using tt.
$ tt json-simple.treetop (1)
$ ls json.rb      (2)
json.rb
1 The input grammar file
2 The resulting Ruby parser

This worked well, but calling tt during a development cycle is a bit too heavy. Luckily, there is a way to compile grammar files on the fly when you first use them.

Here you see how a parser driver can load the definition and instantiate the parser.

json_parser.rb
require 'treetop'
class JsonParser
  Treetop.load(File.join(__dir__, 'json-simple.treetop'))
  @@parser = JSONGrammarParser.new
  def self.parse(data, env = {})
    # Pass the data over to the parser instance
    tree = @@parser.parse(data)
    # If the AST is nil then there was an error during parsing
    # we need to report a simple error message to help the user

The key part is those two lines:

  Treetop.load(File.join(base_path, 'json-simple.treetop'))  (1)
  @@parser = JSONGrammarParser.new    (2)
1 We are loading the .treetop grammar file
2 The name of the parser is determined from the grammar

The name of our parser is determined from the grammar instruction by appending Parser to the grammar name given there.

Let’s try our grammar.

Testing the Grammar with irb
$ irb
2.3.0 :001 > require_relative 'json_parser' (1)
 => true
2.3.0 :002 > JsonParser.parse '{}' (2)
NameError: undefined local variable or method `_nt_array' for #<JSONGrammarParser:0x007f895a9b31c0>
        from (eval):21:in `_nt_json'
1 Load the parser
2 Try to parse the empty object

You see that the parsing fails with an exception about an unknown method _nt_array. If you look at the grammar above you can see that we have said that a json object is either an array or an object but did not define them anywhere.

While the compiler is not telling us where in our source file it thinks the error is, the message above (which could be less cryptic) tells us that _nt_array is missing from _nt_json or in other words the rule json uses a non-terminal ( _nt ) named array, which is not defined.

TIP As the grammar is evaluated left to right, it also means that the error reporting is left to right.

 

In the above case, the missing object declaration is not reported, as the missing array already made the compiler fail.

Continuing with the grammar

So let’s continue with the grammar.

The definition of an array.
  rule array
    '[' space a_value? (space ',' space a_value)* space ']'
  end

In this example, you find two new things terminals enclosed in single quotes like the opening and
closing brackets, '[' and ']'. The parser expects such a symbol or keyword literally at this place.

So for the above, an array needs to start with a [ character and end with ] to be recognized.

The other new concept you see here are the quantifiers, that you probably already know from
regular expressions:

  • ? The item may show up zero or one time
  • + The item must show up at least one time (not shown above)
  • * The item may be given any number of times (including zero)

And last but not least items can be grouped together with parens ( and ).

So the array has to start with [ has a space rule, zero or one value and then zero or many times a block, which consists of a space, a literal colon, space and a value.

Now let’s look at the a_value.

The definition of a_value.
  rule a_value
    object / array / string / number / true / false / null
  end

Here you see the slash / which means or. A value can thus be an object, an array, a string, a number or one of true, false and null. As treetop is a PEG parser generator, it will evaluate the list from left to right.

Sometimes you want to have a rule match different options like the above, but where only a part of the matching expression is a choice. In this case, you can put a sub-expression in parens:

Grouping of sub-expressions.
  rule eval_boolean
    boolean ( 'and' / 'or' ) boolean
  end

The last excerpt from the grammar is the space.

The definition of a space.
  rule space
    [\s\n\t\r]*
  end

Here you see that the space consists of zero or more space, newline or tab characters.

TIP The full grammar up to this point is available in the file json-simple.treetop.

Generating values from the input

Up to this point, we have a grammar that can parse JSON input, but for the parser to be useful we actually want to transform the input into an object that contains nested lists and objects.

You may have seen above that the parser returns the abstract syntax tree (AST) it has generated:

    tree = @@parser.parse(data)

We could now walk the tree and determine the values from the walk. But there is a better way to do this in the treetop.

Individual grammar nodes can provide methods that can be called during the parsing run. There are several options to do this - I will show two.

Inline in the grammar

In this case, the function is directly inside the .treetop file. This is nice for a very simple function but otherwise is probably more confusing than helpful. Also, even if it is Ruby code, your IDE is most likely not helping you.

  rule val_in_parens
    '(' string ')'  (1)
    {  (2)
      def value (3)
        elements[1].text_value (4)
      end
    }
  end
1 The usual matching
2 The inline function definition is delimited by { and }
3 The name of the function can be anything with and without parameters.
4 Evaluation of the current node, see below
  Even if the name and signature of the function is arbitrary, it needs to be known to a
potential calling node.

Via Mixin

Here we externalize the functions into a Module and only provide a pointer to it inside the
grammar.

  rule val_in_parens
    '(' string ')' <ValueNode> (1)
  end
1 The match definition gets the name of the Module passed in angle brackets

We then have a Ruby file that contains the definition:

parser_extensions.rb
module ValueNode
  def value
    elements[1].text_value
  end
end

You see that this is the same as inside the grammar, but you get all the help from the IDE. It is now needed though to require that additional file from our parser driver.json_parser.rb

json_parser.rb
# Our JsonParser
require_relative 'parser_extensions'
require 'treetop'
class JsonParser

Evaluation from the parser driver

Now that we have the functions defined, we can start evaluating the input. This is simply done by calling a function on the generated AST:

json_parser.rb
  def self.parse(data, env = {})
    # Pass the data over to the parser instance
    tree = @@parser.parse(data)
[...]
    # Compute the hash and return it
    tree.value
  end

As we are calling the function value we also need to make sure that the topmost grammar node has it defined. Otherwise, an error like this is raised:

  In our case this is not needed, as our json rule only diverts to array or object and thus does not add any value (pun intended).
2.3.0 :003 > JsonParser.parse '{}'
NoMethodError: undefined method 'value' for #<Treetop::Runtime::SyntaxNode:0x007fc204068970>

Now the big question is how can we access the elements in a rule to evaluate them?

Internally the parsed result is out into a Ruby-array called elements. Data starts at index 0.

This is a bit tricky sometimes and I have added the element positions below the matches.

  rule kv_pair
     string space? ':' space? a_value
#     0      1      2     3    4
  end

To obtain the key (a string) you can thus use elements[0] and elements[4] for the value. Our value function for kv_pair could thus look like this:

  def value (parent)
    parent[elements[0].value.to_sym] = elements[4].value
  end

This works quite nicely and if you compile the grammar with tt as seen above, you can see that the compiled parser uses the same elements to provide some aliases:

Generated code in the parser.
  module KvPair0
    def string
      elements[0]
    end
    def a_value
      elements[4]
    end
  end

It is thus possible to use the aliases in the above for it to look like this:

  def value (parent)
    parent[string.value.to_sym] = a_value.value
  end

The parser gets a lot less brittle this way. If your rule now prepends a matching character literal,
the aliases stay the same while the indexes into the elements would have changed.

Now suppose you need another string in your input. Treetop would then name them string1 and string2. A more elegant solution here is to "tag" those input rules with the desired alias as in,

  rule kv_pair
     key:string space ':' space a_value
  end

So that you can access the key string as key:

parser_extensions.rb
module KVNode
  def value(parent)
    parent[key.value.to_sym] = a_value.value
  end
end

Still, the value part of the key-value-pair uses the un-aliased a_node. If you ever decide to
rename this, you also need to rename the ruby code.

  Treetop v1.6.8 seems to only provide the aliases for non-terminal symbols that are required. For cases where you have optional symbols, you still need to fall back to tags.

Handling of bad input

Now sometimes parsing an input fails because the input does not match the defined grammar. The parser does not return an AST in this case, which we can test for and return some debugging
information to the user:

json_parser.rb
    # If the AST is nil then there was an error during parsing
    # we need to report a simple error message to help the user
    if(tree.nil?)
      puts "Input : >>|#{data}|<<"
      puts @@parser.terminal_failures.join("\n")
      raise JsonParserException, "Parse error: #{@@parser.failure_reason}"
    end
    # Compute the hash and return it
    tree.value

Let’s see how this looks in irb.

Parser error as seen in irb
>> require_relative 'json_parser'
=> true
>> JsonParser.parse '{ x'
Input : >>|{ x|<<
String matching [\s\n\t] expected.
String matching '"' expected.
JsonParserException: Parse error: Expected one of [\s\n\t\r], '"' at line 1, column 3 (byte 3) after {
	from /Users/hrupp/src/dsl_writing/ruby/json_parser.rb:21:in `parse'
	from (irb):2

While working in your own code, you can change the error handling to your liking.

Traps

In this section, I want to point out a few traps that one can easily fall into.

Too much space

Suppose the following grammar

  rule object
    '{' space kv_pair? space '}'
  end
  rule kv_pair
     space string space ':' space value space
  end

Parsing or filling of the elements array may fail in unexpected ways. The parser is unable to
determine to which element the various space characters should be matched. This becomes a bit more obvious when we 'insert' the kv_pair ` declaration into the `object one:

  rule object
    '{' space space string space ':' space value space space '}' <ObjectNode>
#    0   1     2    3       4     5   6     7     8     9     10
  end

At positions 1,2 and 8,9, each time the parser is asked to (greedily) consume white space. One could argue that positions 1 and 8 do this and 2 and 9 could be no-ops, but apparently, this is not the case.

A solution can be to remove the surrounding space in kv_pair:

  rule kv_pair
     string space ':' space value
  end

Adjacent things that should not be

Take this grammar snippet:

  rule array
    '[' a_value?']'
  end

Treetop will fail to compile the grammar, as there is no whitespace between a_value? and ']', so the expression makes no sense to it.

Forgotten grouping

In the next snippet, we have two aBool that can be tied together via 'and' or 'or'.

  rule eval_boolean
    aBool 'and' / 'or' aBool
  end

Treetop will, in this case, try to match aBool 'and' or 'or' aBool, which is most of the time not what we intended. To fix this we need to put the alternatives into a group.

  rule eval_boolean
    aBool ( 'and' / 'or' ) aBool
  end

Naming things

Treetop will internally provide aliases for matches as we have seen before. Take this example:

  rule object
    '{' string ':' value '}'

It will create an alias 'value'. Now when you define a function value on it, you will get
a silent clash.

  rule object
    '{' string ':' value '}'
    {
       def value

Best here is to name your functions in a way that you would never name the rules or grammar elements.

Looking at a DSL for alert trigger creation

The Hawkular project also has an alerting component that can be used to inform the user when conditions are met. Those could be when certain values are larger than a threshold, when a string is matching or when the availability of a resource goes down. The definition of the alert trigger with its conditions etc. is done via REST-api.

With HawkFX, a pet project of mine, I have created an explorer for Hawkular. And in this regard, I have also started to add a DSL for trigger definition. I have blogged about it as well.

I will not show the full code here, but some excerpts to give you some ideas how a more complex grammar may look like. The full code is available in the HawkFX repo on GitHub.

Some examples

Trigger for availability, with high severity that is initially disabled
define trigger "MyTrigger"            (1)
  disabled                            (2)
  severity HIGH                       (3)
  ( availability "mymetric" is DOWN ) (4)
1 We define a new trigger with the name MyTrigger
2 It is disabled by at creation time
3 When it fires an alert, then the alert severity is High
4 It fires when the availability of mymetric is Down.
 
Trigger that fires when two conditions are true
define trigger "MyTrigger"
  AND(                                   (1)
    ( threshold counter "mycount" < 5 )  (2)
    ( string "mymetric" CO "ERROR" )     (3)
  )
1 The trigger only fires if all conditions are true
2 The condition is true if the counter-metric 'mycount' is less than 5
3 This condition is true if the string-metric 'mymetric' contains the string 'ERROR'
Trigger that disables itself after firing
define trigger "MyTrigger"
 ( threshold "myvalue" > 3 )
 auto-disable (1)
1 When this keyword is present, the Trigger will auto-disable after firing.

The grammar

Now that we have seen some examples, we can look at the grammar rules.

Start rule to define a trigger
   rule define
     'define trigger ' space? name:string space? (1)
     disabled:disabled? space?
     sev:severity? space?
     id:id? space? conditions space? act:action? (2)
     ae:auto_enable? space? ad:auto_disable? space?
     <DefineNode>                                (3)
   end
1 'define trigger' needs to be written as is
2 We need conditions. Pretty much everything else is optional
3 Evaluation happens in the DefineNode module

The rule for disabled is pretty simple and just checks if the keyword 'disabled' is present.

Rule for disabled
   rule disabled
     'disabled'
   end

Inside the Ruby code it is queried on the value method for the DefineNode module:

module DefineNode
  def value(env = {})
    t = Hawkular::Alerts::Trigger.new({})  (1)
    t.enabled = true if disabled.empty?    (2)
    t.name = name.value env
[..]
1 Trigger object that we want to prepare
2 We check if disabled is set.

Remember from above that disabled is available as a variable for us to use because we set it in the define rule via disabled:disabled?.

Rules for conditions
  rule conditions
    conds:(and_cond / or_cond / condition)  <ConditionsNode>
  end

Conditions can be either a single definition, or conditions joined by 'AND' or 'OR'. Next is the rule
for the 'AND' case:

Rules for conjunctions
  rule and_cond
    'AND(' space? first:condition more:( space condition )* space? ')' <AndConditionNode>
  end

Here you see a trick that is often used: we parse the first condition in first:condition and then define the remaining conditions as a list of any number of conditions in the more part.

In code this looks as follows:

The AndConditionNode evaluation in Ruby
module  AndConditionNode
  def value(env = {})
    first.value env                                      (1)
    more.elements.each {|elem| elem.condition.value env} (2)
    env[:trigger].firing_match = :ALL
  end
end
1 Evaluate the first condition
2 Evaluate the additional and optional conditions.

I am at the end of the example. Again, the full code is available in the HawkFX repository on GitHub.

References

Parsing time ranges in Treetop https://github.com/jcinnamond/treetop-time-range

Json.org http://json.org/index.html


Join Red Hat Developers, a developer program for you to learn, share, and code faster – and get access to Red Hat software for your development.  The developer program and software are both free!


Download Red Hat Enterprise Linux Developer Suite, which includes Red Hat Enterprise Linux 7 server, a collection of development tools, and much more.

Last updated: February 6, 2024