Using regular expressions and strings to generate an AST

Using regular expressions and strings to generate an AST

Edgar Reynaldo

Major Reynaldo

May 2007

I'm writing a compiler, and I have all of the lexical and parsing phase done so I now have parse tokens.

What I want to do is this :

ParseToken -> "${PTYPE}={VALUE}"

so the parse token becomes a string starting with $, followed by a parse type and value separated by equals.

and then I want to use regular expressions to parse the tokens into an abstract syntax tree. Follow me? I have the BNF of the grammar I'm using, and I know the start sets of each word type.

For example say I have the following code :

SOURCE CODE
while ~eof(input)
   putchar(getchar(input),output)
end

LEXING STAGE
<while>< ><~><eof><(>< ><input>< ><)><>
<    ><putchar><(>< ><getchar><[>< ><input>< ><]><,>< ><output>< ><)><>
<end>

PARSING STAGE
<KW=21,KW_LOOP_WHILE><WS><OP_NOT><ID=eof><BLK='('><WS><ID=input><WS><BLK=')'><WS>
<WS><ID=putchar><BLK='('><WS><ID=getchar><BLK='['><WS><ID=input><WS><BLK=']'><OP_COMMA><WS><ID=output><WS><BLK=')'><WS>
<KW=18,KW_CONTROL_END><WS>

ENVISIONED ANALYTIC STRING STAGE
$KW=21$ OP=NOT $ID=eof $BLK=( $ID=input $BLK=)
$ID=putchar $BLK=( $ID=getchar $BLK=[ $ID=input $BLK=] $OP=COMMA $ID=output $BLK=)
$KW=END

And then I would use regular expressions to scan the string output, replacing intermediates with new nodes, reducing them to ${NODE#} and storing them in a table. That's what I did with an old command line function parser, but now you have to deal with multiple types and it's all getting a little bit crazy.

Frick I don't know what I'm doing. We have a month and a half to have working output and I don't know if we can finish this.

Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide
King Piccolo will make it so Then I could just flip it with some sweet, sweet matrix math. In either case, I’ll ride this wagon until the wheels fall off. Thanks to those that keep it running

bamccaig

Member #7,536

July 2006

I don't know jack about parsing or ASTs, but I do know that regular expressions are often abused for parsing when in reality they're not very good at it.

For example, it's impossible to parse any arbitrary HTML code with a single regular expression. You can parse a specific tag or segment of the code, but you cannot express the entire language as a single regular expression. To parse the entire document would require many, many passes, and it'll be a less efficient solution than parsing the text character by character and building a tree as you go.

I don't really understand what all that code is. To me, it looks like needlessly transforming binary data that you've already parsed back into text so that you have to parse it again? I don't get it.

Writing a compiler is hard. Why are you writing a compiler with only a month to go? Aren't there existing, generic tools that can be used for the parsing/lexing? Forgive me for saying it, but this feels very Monday project.

--
I mean the best with what I say. It doesn't always sound that way.

Edgar Reynaldo

Major Reynaldo

May 2007

The output I showed you above is just the lexing and parsing tags. The actual data is stored in a vector of parse tokens, and now I have to build the AST, or else generate code directly. I'm just trying to think of a way to automate it, or at least streamline the process somehow, because I'm crunched for time.

Like I said, it doesn't need to support all the features of the source language, just some for demonstration purposes. It definitely won't be complete by the end of the semester, but it should be a working demo.

EDIT
What I basically want is to turn a parse token into a 'letter' that a regular expression library can parse.

I don't know maybe this is a bad idea.

bamccaig

Member #7,536

July 2006

I won't bullshit as if I know what I'm doing parsing code. I've followed a sort of tutorial to do it once for a very simple Lisp, written in JavaScript which made things way easier, but that's the extent of my knowledge.

You first probably need to decide upon a parser design:

https://en.wikipedia.org/wiki/Parser#Types_of_parsers

I think I'm only familiar with the recursive decent parser.

It might help if you further define what your vector of parse tokens is (are they strings, individual characters, structures that represent code elements, etc.). And perhaps what language you're trying to parse.

--
I mean the best with what I say. It doesn't always sound that way.

Edgar Reynaldo

Major Reynaldo

May 2007

The source language is Kestrel : https://homepage.cs.uiowa.edu/~jones/compiler/kestrel/definition.shtml

I think printing to strings and then matching strings is a terrible idea, upon further reflection. EDIT While it worked for my function parser, inserting and erasing elements until they are reduced to a single expression is hard. There are much easier ways to do things.

What I really need is a better way to match and extract tokens from the stream, and I just started working on a token stream that matches and extracts tokens as you go.

I'm writing a recursive descent parser, so it's top down.

EDIT
My professor advises a 'stack' machine with push pull add sub mult div semantics as a layer between the parse tree and the assembly code. To be honest I'm not sure I know how that works.

I think I'm only going to support one or two sizes of integer words. Maybe long and long long and then make everything else use that. I just can't fathom how many functions it would take to have the 8 or 16 different types of integers there are in Kestrel. If I support strings, and ints I think that would be enough to provide proof of concept, which is all he really seems to be after.

I'm going to have to know enough assembly to provide a stack frame and function calls. How are things generally stored? I allocate a fixed size stack and then store things on it? I was going to pass parameters on the stack, because there are only so many registers to go around, 4 to be specific, that are designated for arguments specifically. Functions easily surpass a 4 argument limit quite often though. Just look at allegro.

So, any advice, hints, helpful tutorials on learning about the stack frame and registers?

As for my parsing question, I've mostly solved it. But it's going to take a while to get through all the BNF I have to code.