Argument Lexer

This is a fairly thorough description of the Argument Lexer. It is short because the lexer code is small. It is small because the Rulz language specification (grammar) is simple.

Introduction

First, this language is in two parts, Operators and Arguments; both have their own Lexer/Parser. Second, lexing and parsing occur at runtime. The loading of a Rulz program occurs first; that part strips comments and creates subroutines (if any) and data (if any). Then the Rules are run one after the other until no more are found.

This here is about the Argument lexer. Its goal is to turn a string like:

word$var123(1,2,"3")'a string'

into an array like:

word
$var
123
(1,2,"3")
'a string'

This is a character-by-character, straightforward lexer — meaning exactly what that says: one character at a time, from 0 to N.

The Grammar

The Rulz grammar is not at all like any other computer language's. To begin, separating the lexers means this here is about identifiers and literals — data — and not operators. (In the Rulz language, there are no keywords or separators, only operators and arguments.)

A few examples will suffice to start, defining the grammar for barewords and variables (barewords as in Perl; variables as in Perl/PHP, but either all lowercase or all uppercase — explained elsewhere):

[word]
abcdefghijklmnopqrstuvwxyz
[var]
$
abcdefghijklmnopqrstuvwxyz
[ivar]
$
ABCDEFGHIJKLMNOPQRSTUVWXYZ

That is the Grammar Definition Structure exactly: a name followed by one or more Character Lists.
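
Purely as an illustration (this is not the Rulz source, and the names are made up), that structure could be written down as plain data in Go:

type Definition struct {
	Name  string
	Lists []string // Lists[0] holds the characters a token can start with
}

var grammar = []Definition{
	{"word", []string{"abcdefghijklmnopqrstuvwxyz"}},
	{"var", []string{"$", "abcdefghijklmnopqrstuvwxyz"}},
	{"ivar", []string{"$", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}},
}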

The Lexing

There is a "first pass" on the first character that starts the lexing; it results in a list of the definitions whose character lists contain that character. E.g. lexing the example input string above, the first pass result for the character w is:

(word)

This is because w matches one first character list. (The result is a list, written in Rulz syntax, which for lists is Perl-like. Why it is a list will become obvious soon.)

There is a function associated with word which receives the input. That function counts the characters that match the word definition — the lowercase letters o, r, and d. The first token, from input positions 0 - 3, is word.
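
Continuing the Go sketch (firstPass is my name, not the Rulz API, and the strings package is assumed imported), the first pass is just a scan over the definitions:

func firstPass(c byte, grammar []Definition) []string {
	// Collect the name of every definition whose first character
	// list contains c; for 'w' this yields (word).
	var names []string
	for _, d := range grammar {
		if strings.IndexByte(d.Lists[0], c) >= 0 {
			names = append(names, d.Name)
		}
	}
	return names
}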

For $ — continuing with the input example above, after a token has been found it is removed from the input string and the first pass occurs again — the result is:

(var,ivar)

And this means that the next character is used to "narrow" the list; in this example v is found only in var, so ivar is eliminated from the list. Then the function for var finds the token in input positions 0 - 3 (relative to the remaining input, because the previous token was removed): $var.

For some inputs the first pass can result in two matches. This is a correct, or "fulfilled", condition (the lexer does not do evaluation). In this case the lexer assumes the first is the correct token.
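
The narrowing can be sketched the same way, again as hypothetical code continuing the fragments above:

func narrow(candidates []string, next byte, grammar []Definition) string {
	// Keep a candidate only if its second character list contains
	// the next input character: 'v' is in var's but not ivar's.
	if len(candidates) == 1 {
		return candidates[0]
	}
	for _, d := range grammar {
		for _, name := range candidates {
			if d.Name == name && len(d.Lists) > 1 &&
				strings.IndexByte(d.Lists[1], next) >= 0 {
				return name // the first survivor wins
			}
		}
	}
	return "" // nothing matched; the character is discarded
}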

Where a token can be ambiguous it needs to be "forced" by using more delimiters, like inside quotes.

Tokenizing for a single character list — just a through z for words — is the simplest case and is found by a single "is in list" test.

For multiple character lists, such as for $ — which is a "leading character" token — or for quoted strings — which are "enclosing character" tokens — a few "state" flags may be required, as well as handling of an escaped enclosing character.
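
For example, a counting function for var needs only a "first character" flag. A sketch, with the reset parameter that is explained under Function Notes below:

var varFirst = true

func varToken(c byte, reset bool) bool {
	// Return true while c is still part of the token: '$' first,
	// then lowercase letters.
	if reset { // called once after the token has been eaten
		varFirst = true
		return false
	}
	if varFirst {
		varFirst = false
		return c == '$'
	}
	return c >= 'a' && c <= 'z'
}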

The tokenizer — the removing of characters from the input — is function based, which simply means that there is a named function for each token definition. (The function just "counts" token characters; the lexer does the token copying.)

That means that quoted strings have the simplest grammar, just the starting character (their functions are so similar as to be identical):

[quotes]
"'

And its function simply returns true until the ending quote is found (while properly handling escaped quotes).
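
A sketch of that function, with the two pieces of state it needs, the opening quote and an "escaped" flag:

var (
	quoteOpen byte // 0 means no opening quote seen yet
	escaped   bool // the previous character was a backslash
)

func quoteToken(c byte, reset bool) bool {
	if reset {
		quoteOpen, escaped = 0, false
		return false
	}
	if quoteOpen == 0 { // first character: the opening quote
		quoteOpen = c
		return c == '"' || c == '\''
	}
	if escaped { // this character is escaped, whatever it is
		escaped = false
		return true
	}
	if c == '\\' {
		escaped = true
		return true
	}
	return c != quoteOpen // false exactly at the closing quote
}

Returning false at the closing quote itself, rather than at a character that is merely not in a list, is the "end on a delimiter" case that the extra return parameter under Function Notes takes care of.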

Special Notes

Characters not in any definition will simply be discarded during the first pass — whether this results in an error, a diagnostic or silence is runtime dependent. The odd thing about this is that spaces are treated as a kind of error.

Definitions need not define all characters for a token — they can just define the minimum unique start character sequence, which means the token function must count all characters that make its token.

Function Notes

The goal for a token function is to return true while the character passed to it is part of the token.

Since a function might have state information — such as "first character", "previous character" and/or "end character" — after a token is found (which is "eaten" by the lexer) the function is called again with another parameter to tell it to "reset" any state data.

Not all functions retain state data; for those the extra call is a "No Op".

For a start/end delimited token there is an "extra" return parameter. The function's return value is a boolean that controls the lexer loop, with false meaning "token end has been reached" (technically a transition from true to false, but the lexer handles that, not the token function). To be able to end on a delimiter (as opposed to "character is not in charlist"; a slight but significant difference), the function returns false with the "extra character" set. The lexer completes the token by appending this extra character, and continues.

That allows this input:

'foo''bar'

to be two quoted strings.

(All that "extra" code is only a few lines, much less overall than the text describing it.)
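
Here is a sketch of the lexer side (my rendering, not the Rulz source). The token function's return is widened to (bool, byte), so quoteToken above would end with return false, c, and the loop does the appending:

func lexOne(input string, start int, fn func(byte, bool) (bool, byte)) (string, int) {
	pos := start
	for pos < len(input) {
		ok, extra := fn(input[pos], false)
		if !ok {
			token := input[start:pos]
			if extra != 0 { // ended on the delimiter itself
				token += string(extra)
				pos++
			}
			fn(0, true) // reset any state the function kept
			return token, pos
		}
		pos++
	}
	fn(0, true)
	return input[start:], pos
}

Called twice on 'foo''bar' this returns 'foo' and then 'bar', because the second call starts cleanly at the second opening quote.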

Conclusion

Currently there are only a dozen argument token types. The lexer code is a small API; it and the grammar are all that are needed — there is no "typical" lexer generator code.

Postscript

There is another way to implement the lexer: with a single regular expression and a single builtin function call. A PHP example:

$s = "word\$var123(1,2,\"3\")'a string'";
$r = '/[a-z]+|\$[a-z]+|\-?[0-9]+|\((?:[^()\\\\]+|.)*\)|\'(?:[^\'\\\\]+|.)*\'/';
preg_match_all($r,$s,$m);
var_export($m[0]);

array (
  0 => 'word',
  1 => '$var',
  2 => '123',
  3 => '(1,2,"3")',
  4 => '\'a string\'',
)

(Though only five argument types are coded for; there would be several more in the final code, but still not too many for a single expression to work. Written in a more readable way, of course.)

Here is a Go example:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	s := `word$var123(1,2,"3")'a string'`
	r := `[a-z]+|\$[a-z]+|-?[0-9]+|\((?:[^()\\]+|.)*\)|'(?:[^'\\]+|.)*'`
	rg := regexp.MustCompile(r)
	rs := rg.FindAllString(s, -1)
	fmt.Printf("%#v\n", rs)
}
[]string{"word", "$var", "123", "(1,2,\"3\")", "'a string'"}

Perl does not seem to have an "all" version of a match. As far as my limited understanding goes — and this is just me, I am sure — I can only see a loop for each token, removing each found token and trying again. (Though perhaps a //g match in list context does exactly that.)

Perl also states:

[T]he typical "match a double-quoted string" problem can be most 
efficiently performed when written as:

/"(?:[^"\\]++|\\.)*+"/

But the smaller version actually used in the above examples works in Perl as well. (Though I am not about to do any kind of performance analysis on any differences between the two.)

Post Postscript

What that means is that it would be silly for Rulz to NOT use the regular expression version of the lexer. And it would be INSANE for Rulz to use any "normal" programming lexer based on Yacc, Bison, Flex, or whatever, because such stuff would produce several files of incomprehensible code with a size greater than Rulz' entire code base.

Last Word

A Rule equivalent:

=s "word\$var123(1,2,\"3\")'a string'"
=r '[a-z]+|\$[a-z]+|-?[0-9]+|\((?:[^()\\]+|.)*\)|\'(?:[^\'\\]+|.)*\''
/$r/ $s
^