re2c input model: call for opinions

RE2C is a generator of lexical analyzers for C/C++. It takes a description of regular expressions, transforms it to deterministic finite automaton (DFA) and compiles DFA to C/C++ code, which is further compiled to binary. The resulting program — LEXER — accepts a string of characters and tries to match it against the regular expressions encoded in DFA. If the match is successful, LEXER performs some actions (predefined by the programmer).

image

current input model: user-defined pointers

LEXER operates on the input using four pointer-like primitives:

YYCURSOR —> current character
YYLIMIT —> end of input
YYMARKER —> backtrack position
YYCTXMARKER —> backtrack position for trailing context

and a function-like primitive YYFILL(n), which is invoked when LEXER needs more input.

image

RE2C assumes that these primitives are defined by the programmer so that LEXER can use them in the following contexts:

++YYCURSOR;
yych = *YYCURSOR;
yych = *++YYCURSOR;
yych = *(YYMARKER = YYCURSOR);
yych = *(YYMARKER = ++YCURSOR);
YYMARKER = YYCURSOR;
YYMARKER = ++YYCURSOR;
YYCURSOR = YYMARKER;
YYCTXMARKER = YYCURSOR + 1;
YYCURSOR = YYCTXMARKER;
if (YYLIMIT <= YYCURSOR) ...
if ((YYLIMIT - YYCURSOR) < n) ...
YYDEBUG (label, *YYCURSOR);

Each particular LEXER uses a subset of these cases, which depends on the complexity of regular expressions and can be partially controlled by the programmer.

usual input: pointers to plain buffer

The most obvious and efficient way to define pointer-like primitives is to make them simple pointers to a plain buffer:

image

Like this:

bool lex (const char * YYCURSOR, const char * const YYLIMIT)
{
    const char * YYMARKER;
    const char * YYCTXMARKER;
#   define YYCTYPE   char
#   define YYFILL(n) {}
    /*!re2c
        "int buffer " / "[" [0-9]+ "]" { return true; }
        *                              { return false; }
    */
}

int main ()
{
    const char buffer [] = "int buffer [1024]";
    return !lex (buffer, buffer + sizeof (buffer));
}

unusual input: STL stream

Sometimes programmers want to give up efficiency for memory safety, simplicity and whatever else. Why mess around with pointers to raw memory when we have STL streams?

RE2C can be used with any input model which can be expressed in terms of pointer-like primitives. Say, we want to change the above example to work with std::istringstream. All we need is an iterator with the following operations:

And the following constraint:

None of standard iterators supports all the operations we need, so we can either derive our iterator from a standard one or write it from scratch. I did the latter:

#include <sstream>
class ptr_t
{
    std::istringstream * is_;
    std::streampos p_;

  public:
    ptr_t (std::istringstream * is, std::streampos p)
        : is_ (is)
        , p_ (p)
    {}

    int operator * ()
    {
        is_->seekg (p_);
        return is_->peek ();
    }

    ptr_t & operator ++ ()
    {
        p_ += 1;
        return * this;
    }

    ptr_t operator + (std::streamoff off)
    {
        return ptr_t (is_, p_ + off);
    }

    friend bool operator <= (const ptr_t & ptr1, const ptr_t & ptr2)
    {
        return ptr1.p_ <= ptr2.p_;
    }

    friend std::streamoff operator - (const ptr_t & ptr1, const ptr_t & ptr2)
    {
        return ptr1.p_ - ptr2.p_;
    }
};

That’s how we use it:

bool lex (ptr_t & YYCURSOR, ptr_t & YYLIMIT)
{
    ptr_t YYMARKER    = YYCURSOR;
    ptr_t YYCTXMARKER = YYCURSOR;
#   define YYCTYPE   char
#   define YYFILL(n) {}
    /*!re2c
        "int buffer " / "[" [0-9]+ "]" { return true; }
        *                              { return false; }
    */
}

int main ()
{
    const char buffer [] = "int buffer [1024]";
    std::istringstream is (buffer);
    ptr_t cursor (&is, 0);
    ptr_t limit  (&is, sizeof (buffer));
    return !lex (cursor, limit);
}

re2c input API: critique

Have a look again at how RE2C uses pointer-like primitives:

++YYCURSOR;
yych = *YYCURSOR;
yych = *++YYCURSOR;
yych = *(YYMARKER = YYCURSOR);
yych = *(YYMARKER = ++YCURSOR);
YYCURSOR = YYMARKER;
YYCURSOR = YYCTXMARKER;
YYMARKER = YYCURSOR;
YYMARKER = ++YYCURSOR;
YYCTXMARKER = YYCURSOR + 1;
if (YYLIMIT <= YYCURSOR) ...
if ((YYLIMIT - YYCURSOR) < n) ...
YYDEBUG (label, *YYCURSOR);

It seems that some cases are redundant: they could be expressed via other cases and reduced. In fact, RE2C only needs these basic operations:

++YYCURSOR get current character
*YYCURSOR advance to the next character
YYMARKER = YYCURSOR backup input position
YYCTXMARKER = YYCURSOR backup input position of trailing context
YYCURSOR = YYMARKER restore input position
YYCURSOR = YYCTXMARKER restore input position of trailing context
if ((YYLIMIT - YYCURSOR) < n) ... ensure that enough characters are left

And all the cases above could be expressed via them:

++YYCURSOR; —> ++YYCURSOR;
yych = *YYCURSOR; —> yych = *YYCURSOR;
yych = *++YYCURSOR; —> ++YYCURSOR; yych = *YYCURSOR;
yych = *(YYMARKER = YYCURSOR); —> YYMARKER = YYCURSOR; yych = *YYCURSOR;
yych = *(YYMARKER = ++YCURSOR); —> ++YYCURSOR; YYMARKER = YYCURSOR; yych = *YYCURSOR;
YYMARKER = YYCURSOR; —> YYMARKER = YYCURSOR;
YYMARKER = ++YYCURSOR; —> ++YYCURSOR; YYMARKER = YYCURSOR;
YYCURSOR = YYMARKER; —> YYCURSOR = YYMARKER;
YYCTXMARKER = YYCURSOR + 1; —> YYCTXMARKER = YYCURSOR;
YYCURSOR = YYCTXMARKER; —> YYCURSOR = YYCTXMARKER; ++YYCURSOR;
if (YYLIMIT <= YYCURSOR) ... —> if ((YYLIMIT - YYCURSOR) < 1) ...
if ((YYLIMIT - YYCURSOR) < n) ... —> if ((YYLIMIT - YYCURSOR) < n) ...
YYDEBUG (label, *YYCURSOR); —> YYDEBUG (label, *YYCURSOR);

Then we could write a much better implementation of ‘ptr_t’:

#include <sstream>
class ptr_t
{
    std::istringstream & is_;

  public:
    explicit ptr_t (std::istringstream & is)
        : is_ (is)
    {}

    ptr_t & operator ++ ()
    {
        is_.ignore ();
        return * this;
    }

    int operator * ()
    {
        return is_.peek ();
    }

    operator std::streampos ()
    {
        return is_.tellg ();
    }

    ptr_t & operator = (std::streampos p)
    {
        is_.seekg (p);
        return *this;
    }
};

And use it like this:

bool lex (ptr_t & YYCURSOR, std::streampos YYLIMIT)
{
    std::streampos YYMARKER;
    std::streampos YYCTXMARKER;
#   define YYCTYPE   char
#   define YYFILL(n) {}
    /*!re2c
        "int buffer " / "[" [0-9]+ "]" { return true; }
        *                              { return false; }
    */
}

int main ()
{
    const char buffer [] = "int buffer [1024]";
    std::istringstream is (buffer);
    ptr_t cursor (is);
    return !lex (cursor, sizeof (buffer));
}

This implementation is clearly more efficient (no need to manually track and adjust input position) and simpler (we got rid of some operators).

what I suggest

So my first attempt was that simple: rewrite RE2C code generator to express all input actions in terms of basic operations. There is a whole article about why I abandoned this idea. In short, it might introduce slight performance regressions in rare cases.

Still I felt bad about current state of affairs (simplified API was requested by one of RE2C users). Then at some point I realized that there’s no need to change default code generation at all. We can simply hide the new API behind a switch:

--input < default | custom | istream >

Exposing input API this way has another important advantage over pointer-like API: compilation errors about undefined symbols are much shorter and easier to comprehend than multi-page compiler dumps about missing or ambiguous operators.

a call for opinions

So what do you think of it? Any ideas? Any suggestions? Any esoteric input models that don’t fit into this API? I’d be glad to hear.

The best place for discussion is re2c-devel and re2c-general mailing lists, but you can also e-mail me in private or drop a comment below.

Meanwhile, I created a branch for the new API, so you can try it out yourself.

thanks to

updates

Posted on January 13, 2015



comments powered by Disqus