Tuesday, March 15, 2011

Parsers and Compilers for Dummies. Where to start?

This is a good listing, but what is the best one for a complete newb in this area. One for someone coming from a higher level background (VB6,C#,Java,Python) - not to familiar with C or C++. I'm much more interested in hand-written parsing versus Lex/Yacc at this stage.

If I had just majored in Computer Science instead of Psychology I might have taken a class on this in college. Oh well.

From stackoverflow
  • Please have a look at: learning to write a compiler

    Also interesting:

    And there are more on the topic. But I can give a short introduction:

    The first step is the lexical analysis. A stream of characters is translated into a stream of tokens. Tokens can be simple like == <= + - (etc) and they can be complex like identifiers and numbers. If you like I can elaborate on this.

    The next step is to translate the tokenstream into a syntaxtree or an other representation. This is called the parsing step.

    Before you can create a parser, you need to write the grammar. For example we create an expression parser:

    Tokens

    addOp = '+' | '-';
    mulOp = '*' | '/';
    parLeft = '(';
    parRight = ')';
    number = digit, {digit};
    digit = '0'..'9';
    
    Each token can have different representations: + and = are both addOp and 
    23 6643 and 223322 are all numbers.
    

    The language

    exp = term | exp, addOp, term;  
    // an expression is a series of terms separated by addOps.
    term = factor | term, mulOp, factor;
    // a term is a series of factors separated by mulOps
    factor = addOp, factor | parLeft, exp, parRight | number;
    // a factor can be an addOp followed by another factor, 
    // an expression enclosed in parentheses or a number.
    

    The lexer

    We create a state engine that walks through the char stream, creating a token

    s00 
      '+', '-' -> s01       // if a + or - is found, read it and go to state s01.
      '*', '/' -> s02
      '('      -> s03
      ')'      -> s04
      '0'..'9' -> s05
      whitespace -> ignore and retry  // if a whitespace is found ignore it
      else ERROR      // sorry but we don't recognize this character in this state.
    s01
      found TOKEN addOp     // ok we have found an addOp, stop reading and return token
    s02 
      found TOKEN mulOp
    s03
      found TOKEN parLeft
    s04
      found TOKEN parRight
    s05
      '0'..'9'     -> s05    // as long as we find digits, keep collecting them
      else found number      // last digit read, we have a number
    

    Parser

    It is now time to create a simple parser/evaluator. This is complete in code. Normally they are created uusing tables. But we keep it simple. Read the tokens and calculate the result.

    ParseExp
      temp = ParseTerm // start by reading a term
      while token = addOp do
        // as long as we read an addop keep reading terms
        if token('+') then temp = temp + ParseTerm  // + so we add the term
        if token('-') then temp = temp - ParseTerm  // - so we subtract the term
      od
      return temp // we are done with the expression
    
    ParseTerm
      temp = ParseFactor
      while token = mulOp do
        if token('*') then temp = temp * ParseFactor
        if token('/') then temp = temp / ParseFactor
      od
      return temp
    
    ParseFactor
      if token = addOp then
        if token('-') then return - ParseFactor  // yes we can have a lot of these
        if token('+') then return ParseFactor
      else if token = parLeft then
        return ParseExpression
        if not token = parRight then ERROR
      else if token = number then
        return EvaluateNumber   // use magic to translate a string into a number
    

    This was a simple example. In real examples you will see that error handling is a big part of th

    tyndall : Thanks. This is great stuff. I guess a few of the stages were confusing me. Now I can go read up more on each stage.
  • If you're a complete n00b, the most accessible resource (in both senses of the term) is probably Jack Crenshaw's tutorial. It's nowhere near comprehensive but for getting started, I can't think of anything close except for books that are long out of print.

0 comments:

Post a Comment