- Implementing Stack Oriented Languages – Part 1
- Implementing Stack Oriented Languages – Part 2
- Implementing Stack Oriented Languages – Part 3
- Implementing Stack Oriented Languages – Part 4
Part 2 – The Parser
It’s been some time since the first installment. It amazes me how quickly time flies by when you have so many projects at hand. So let’s jump right in and get coding!
Before we get started though, let’s make some changes to our keywords file. First, we won’t need the PSH (Push) keyword. The same is true for the ROT keyword. So remove them both.
    # Stack Operators
    'POP', 'PSH', 'SWAP', 'DUMP', 'DROP', 'ROT',
    'OVER', 'DUP', 'DUP2', 'DUP3', 'CLEAR',
We will now add a few new keywords in the file. Add the SWAP2, ROL, ROR, and DUP4 to the stack operators section of the keyword file as shown below:
    # Stack Operators
    'POP', 'SWAP', 'SWAP2', 'DUMP', 'DROP', 'ROL', 'ROR',
    'OVER', 'DUP', 'DUP2', 'DUP3', 'DUP4', 'CLEAR',
The ROT keyword was going to be a stack rotation command. After some thought, however, I realized that having the ability to rotate left and right would be better. So, I decided to remove ROT and replace it with ROL and ROR for rotate left and rotate right. I’m building this language on the fly. So, there may be more changes to come.
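Since ROL and ROR don't exist in the parser yet, here is a rough sketch of the semantics I have in mind, using a plain Python list as the stack (the rol/ror helper names are placeholders, not final project code):

```python
def rol(stack):
    """Rotate left: the top item moves to the bottom of the stack."""
    if stack:
        stack.insert(0, stack.pop())

def ror(stack):
    """Rotate right: the bottom item moves to the top of the stack."""
    if stack:
        stack.append(stack.pop(0))

s = [1, 2, 3]   # 3 is the top of the stack
rol(s)          # top (3) rotates to the bottom -> [3, 1, 2]
print(s)
ror(s)          # bottom (3) rotates back to the top -> [1, 2, 3]
print(s)
```

Note that the two operations are inverses of each other, which makes them easy to test in pairs.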
Now that we have the changes out of the way, let’s move on to the parser. Create a new file called parser.py and add the code below:
The Parser
# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *


#######################################
#               Parser                #
#######################################
class Parser:
    pass
Our parser will ask the lexer for tokens. The lexer will take the input text and generate tokens to pass to the parser for further processing. The parser will recognize the token type and handle it, ensuring it is an expected type given the current context of the program. This all sounds rather complicated. However, once you get a feel for how to implement this, you'll find it rather easy to understand. Let's see some code. We will begin by adding an __init__ method to our parser.
... ...

#######################################
#               Parser                #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)
Notice the __init__ method takes in a lexer instance and an optional stack instance. We will need a method to get the next token in the input stream from the lexer, so add the next_token() method shown below. Notice how we check the current_tok property and test whether it is an end of file "EOF" token. If so, we don't need to call the lexer any longer. If, however, current_tok is any other token, we call the lexer for the next token and save it in current_tok for use later.
    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()
The meat of our parser comes from the parse method. The parse method looks at the token stored in current_tok and decides how to process it. Let’s start simple. We will first parse integers. Add the following code to the parser.
    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
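The loop above is the standard pull-based token protocol: the parser repeatedly asks the lexer for the next token until EOF. To see that protocol in isolation, here is a minimal stand-in (both FakeLexer and this bare-bones Token are illustrative only, not the project's classes):

```python
class Token:
    """Bare-bones stand-in for the project's Token class."""
    def __init__(self, _type, _value):
        self.type = _type
        self.value = _value

class FakeLexer:
    """Stands in for the real lexer: hands out pre-made tokens one at a time."""
    def __init__(self, tokens):
        self.tokens = iter(tokens)

    def next_token(self):
        # Once the supply runs out, keep returning EOF, like the real lexer.
        return next(self.tokens, Token('EOF', None))

lexer = FakeLexer([Token('NUMBER', '1'), Token('NUMBER', '2')])
tok = lexer.next_token()
while tok.type != 'EOF':
    print(tok.type, tok.value)   # the parser would dispatch on tok.type here
    tok = lexer.next_token()
```

The real parse method does exactly this, with a dispatch step in place of the print.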
Next, we will need a driver program, including a source string to parse. The following code will do fine for our first foray into parsing.
if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
Just to ensure we're all on the same page, I'll show the complete parser file in its current state.
# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *


#######################################
#               Parser                #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)

    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()

    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue


if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
If we’ve done everything correctly, we can run the parser and get the following output:
token: type: NUMBER, value: 1, at line 1 in column 3
token: type: NUMBER, value: 2, at line 1 in column 5
token: type: NUMBER, value: 3, at line 1 in column 7
token: type: NUMBER, value: 5, at line 1 in column 9
token: type: NUMBER, value: 4, at line 1 in column 11
token: type: EOF, value: None, at line 1 in column 11

Process finished with exit code 0
The next step is to figure out what to do with each token. In the case of a stack-based language, we simply take constant values and place them on the stack. So, let’s do that next.
Add the following code at the end of the parse method. We will fill the parse method with tests for each token type and code to handle each token type.
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
The parse method now looks like the following:
    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
Recall that back in the __init__ method we set self.data to an empty stack. We will use this stack to store function parameters and results. Recall that stack-oriented languages primarily operate on values stored on the stack and place the results of operations back onto the stack.
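To make this concrete, here is the evaluation of the program `3 4 +` traced with a plain Python list standing in for the data stack:

```python
data = []
# constants are pushed as they are parsed
data.append(3)       # source: 3
data.append(4)       # source: 4
# '+' pops its two operands and pushes the result
right = data.pop()
left = data.pop()
data.append(left + right)
print(data)          # [7]
```

Every operator we add later follows this same pop-operands, push-result rhythm.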
Let’s test what we have now. But first, let’s add code to the driver to print the stack contents so we can see if our data made it onto the stack. Add the following print statement to the end of the driver:
print(parser.data)
The driver should now look like the following:
if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
    print(parser.data)
Now, if we run the parser driver we should get the following output:
token: type: NUMBER, value: 1, at line 1 in column 3
token: type: NUMBER, value: 2, at line 1 in column 5
token: type: NUMBER, value: 3, at line 1 in column 7
token: type: NUMBER, value: 5, at line 1 in column 9
token: type: NUMBER, value: 4, at line 1 in column 11
token: type: EOF, value: None, at line 1 in column 11
[ '1', '2', '3', '5', '4' ]

Process finished with exit code 0
As you can see, the stack now contains the data values from our source code. Also, note the order of these values. The right-most value is in the head position of the stack and will be the first value popped off the stack.
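A quick way to convince yourself of that ordering, using a bare Python list:

```python
stack = []
for n in (1, 2, 3, 5, 4):
    stack.append(n)      # push: values go on in source order
print(stack.pop())       # 4  (the right-most value is on top)
print(stack.pop())       # 5  (the next value down)
```

This last-in, first-out ordering is what makes postfix notation work.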
The next thing to do is add some code so we can do something useful with the data on the stack. The easiest operations to code and understand are the arithmetic operations. So let’s code those next.
Arithmetic Operators
First, we will need tests for the arithmetic operator token types. Looking at our tokens in tokens.py we see that we need tests in our parser for the TT_PLUS, TT_MINUS, TT_MUL, and TT_DIV token types. These will be simple if tests.
When we find an arithmetic operator token, we handle the operation by popping the values it needs off the stack, performing the arithmetic operation, and then placing the result back on the stack.
After adding these tests the parse method should look as follows:
    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
            elif tok.type == TT_PLUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left + right
                self.data.push(result)
            elif tok.type == TT_MINUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left - right
                self.data.push(result)
            elif tok.type == TT_MUL:
                right = self.data.pop()
                left = self.data.pop()
                result = left * right
                self.data.push(result)
            elif tok.type == TT_DIV:
                right = self.data.pop()
                left = self.data.pop()
                result = left // right
                self.data.push(result)
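All four arithmetic branches follow the same pop-pop-push shape, so they could later be collapsed into a dispatch table. Here is a sketch of that alternative design (the BINARY_OPS table and apply_binary helper are my own names, not part of the article's code):

```python
import operator

# Maps a token type to the function that implements it.
BINARY_OPS = {
    'PLUS': operator.add,
    'MINUS': operator.sub,
    'MUL': operator.mul,
    'DIV': operator.floordiv,   # integer division, matching the parser
}

def apply_binary(stack, tok_type):
    """Pop two operands, apply the operator, push the result."""
    right = stack.pop()
    left = stack.pop()
    stack.append(BINARY_OPS[tok_type](left, right))

stack = [4, 5]
apply_binary(stack, 'PLUS')
print(stack)   # [9]
```

The if/elif chain is easier to follow while learning, but the table scales better as we add operators.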
Let’s change the source code in the driver to:
source = """ 4 5 + """
Now run the driver. On exit, the stack should contain the value 9, since 4 + 5 = 9. Now change the operator from addition to subtraction and run the driver. The stack should now contain -1. Go on and test multiplication and division. If they work as expected, try some more complex problems. As an example:
4 5 * 2 / 5 -
127 16 - 2 *
197 2 * 4 /
For the above your results should be 5, 222, and 98 respectively.
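If you want to sanity-check those expected results without the parser, the sketch below evaluates the same postfix expressions against a bare list (a verification aid only, not project code):

```python
def eval_rpn(source):
    """Evaluate a whitespace-separated postfix expression of integers."""
    stack = []
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a // b}   # integer division, like the parser
    for word in source.split():
        if word in ops:
            right = stack.pop()
            left = stack.pop()
            stack.append(ops[word](left, right))
        else:
            stack.append(int(word))
    return stack[-1]

print(eval_rpn('4 5 * 2 / 5 -'))    # 5
print(eval_rpn('127 16 - 2 *'))     # 222
print(eval_rpn('197 2 * 4 /'))      # 98
```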
Bitwise Operators
Next, let’s add the code for raising a number to a power. This code follows the same pattern as the other arithmetic operators. So give it a try on your own. The truth is most of the operations we will perform will follow the same pattern. So let’s add the code to handle the bitwise operators:
            # Bitwise Operations
            elif tok.type == TT_INV:
                left = self.data.pop()
                left = ~left
                self.data.push(left)
            elif tok.type == TT_XOR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left ^ right))
            elif tok.type == TT_AND:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left & right))
            elif tok.type == TT_OR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left | right))
            elif tok.type == TT_LSL:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push(left << right)
            elif tok.type == TT_LSR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left >> right))
The INV (invert) operator simply flips all ones in a number to zeros and all zeros to ones. Most computers and computer languages handle integers as 2's complement values. See this Computerphile episode for more info on 2's complement representation, or just trust me when I say that inverting all the bits in the value 1 in Python should result in -2. So change the source in the driver to read:
    source = """ 1 ~ """
Then run the driver. You should receive a result of -2.
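The claim about -2 follows from the two's complement identity ~x == -x - 1, which is easy to verify in Python:

```python
# In two's complement, inverting every bit of x yields -x - 1,
# and Python's ~ operator follows this rule even though its
# integers are arbitrary precision.
print(~1)   # -2
print(~0)   # -1
for x in (0, 1, 5, 255, -3):
    assert ~x == -x - 1
```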
Now let's try shifting 1 left by 2. The result should be 4. However, when I tried this I found a small bug in the lexer: it returned the TT_LSR token type for the '<<' operator. This can be fixed by changing the token type to TT_LSL. Now, change the source line in the driver to read:
    source = """ 1 2 << """
If you run the driver now you should get a result of 4. Now change the shift distance from 2 to 3, 4, and 5. You should get results of 8, 16, and 32 respectively. If not, you have some debugging to do. Otherwise, try the other bitwise operators and ensure they work as expected. Remember that shifting a value right by 1 is the same as dividing it by 2, and shifting right by 2 is the same as dividing by 4.
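The shift/division correspondence is easy to verify directly in Python:

```python
x = 32
print(x << 1)   # 64: shifting left by 1 multiplies by 2
print(x >> 1)   # 16: shifting right by 1 divides by 2
print(x >> 2)   # 8:  shifting right by 2 divides by 4
assert x >> 2 == x // 4
```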
OK, we’ve covered enough ground for this post. Next time we will look at adding a front end. We will need this so that we can test our parser with real input source files. This will ease further development as we add more complex features like conditional statements, loops, variables, string handling, and input/output routines.
I hope you're enjoying this series of articles. If you want to develop your own languages, stack-oriented languages are some of the easiest to implement, and they are a great way to get your feet wet and learn the basics of language implementation.
As always, here’s the complete code for part 2 of the series:
Final Code
# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *


#######################################
#               Parser                #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)

    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()

    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
            # Arithmetic
            elif tok.type == TT_PLUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left + right
                self.data.push(result)
            elif tok.type == TT_MINUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left - right
                self.data.push(result)
            elif tok.type == TT_MUL:
                right = self.data.pop()
                left = self.data.pop()
                result = left * right
                self.data.push(result)
            elif tok.type == TT_DIV:
                right = self.data.pop()
                left = self.data.pop()
                result = left // right
                self.data.push(result)
            # Bitwise Operations
            elif tok.type == TT_INV:
                left = self.data.pop()
                left = ~left
                self.data.push(left)
            elif tok.type == TT_XOR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left ^ right))
            elif tok.type == TT_AND:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left & right))
            elif tok.type == TT_OR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left | right))
            elif tok.type == TT_LSL:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push(left << right)
            elif tok.type == TT_LSR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left >> right))


if __name__ == '__main__':
    source = """ 16 2 >> """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
    print(parser.data)
# keywords.py
KEYWORDS = [
    # Stack Operators
    'POP', 'SWAP', 'SWAP2', 'DUMP', 'DROP', 'ROL', 'ROR',
    'OVER', 'DUP', 'DUP2', 'DUP3', 'DUP4', 'CLEAR',

    # Conditional Operators
    'IF', 'ELSE', 'THEN', 'NOT',

    # Iterative
    'DO', 'LOOP', 'WHILE', 'WEND',

    # I/O Operators
    'TYPE',   # Outputs a string to stdout
    'EMIT',   # Outputs a character to stdout
    'KEY',    # Takes input character from stdin

    # Special Character Values
    # See ASCII Character Table for values
    'CR', 'LF', 'NULL', 'BELL', 'BS', 'TAB', 'VT', 'FF', 'ESC',

    # Logical Operators
    'NEQ', 'AND', 'OR',
]
# lexer.py
from keywords import KEYWORDS
from tokens import *


class Lexer:
    def __init__(self, text, line=None, col=None):
        self.text = text
        self.pos = 0
        self.line = line if line else 1
        self.col = col if col else 1
        self.current_char = self.text[0]

    #############################################
    #              Utility Methods              #
    #-------------------------------------------#
    # These methods help manage the internal    #
    # workings of the lexer.                    #
    #############################################

    def advance(self):
        """advance():
        Advances the current position in the character stream.
        """
        if not self.is_eoi():
            self.pos += 1
            self.current_char = self.text[self.pos]
            self.update_position()
        else:
            self.current_char = ''

    def update_position(self):
        """update_position
        Maintains the line and column positions for the current token scan.
        """
        if self.current_char == '\n':
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def is_eoi(self):
        """is_eoi():
        :return: True if current position is End Of Input stream.
                 Returns False otherwise.
        """
        return self.pos >= (len(self.text) - 1)

    def expected(self, expected, found):
        """expected
        Raises an error if expected and found are different.
        """
        if expected == found:
            return
        else:
            raise ValueError(f'[Error] Expected {expected} but found {found} '
                             f'in line: {self.line} position: {self.col}')

    def peek(self):
        """peek
        Returns next character in input stream after the current character.
        Note: If we are at EOI (End Of Input) then a space character is returned.
        """
        if not self.is_eoi():
            return self.text[self.pos + 1]
        else:
            return ' '  # space

    #############################################
    #            Recognizer Methods             #
    #-------------------------------------------#
    # These methods collect the characters that #
    # belong to a single multi-character token  #
    # into a single unit, and return that unit  #
    # to the caller.                            #
    #############################################

    def char(self):
        if self.current_char != "'":
            self.expected("'", self.current_char)
        self.advance()  # Consume opening single quote
        char = self.current_char
        self.advance()  # Consume character
        # Now we must have a closing single quote mark
        if self.current_char != "'":
            self.expected("'", self.current_char)
        self.advance()  # Consume closing single quote
        return char

    def string(self):
        if self.current_char != '"':
            self.expected('"', self.peek())
        self.advance()  # Consume leading '"'
        string = ''
        while self.current_char != '"':
            string += self.current_char
            self.advance()
        self.advance()  # Consume trailing '"'
        return string

    def number(self):
        number = ''
        while self.current_char.isdigit():
            number += self.current_char
            self.advance()
        return number

    def is_ident(self):
        return self.current_char.isalpha() or self.current_char == '_'

    def _ident(self):
        _id = ''
        # identifiers begin with alpha or underscores
        if self.current_char.isalpha() or self.current_char == '_':
            while self.current_char.isalnum() or self.current_char == '_':
                _id += self.current_char
                self.advance()
        return _id

    def is_keyword(self, kw):
        return kw.upper() in KEYWORDS

    def eat_comment(self):
        if self.current_char == '(':
            while self.current_char != ')' and not self.is_eoi():
                self.advance()
            self.advance()  # skip trailing )
        elif self.current_char == '\\':
            while self.current_char != '\n' and not self.is_eoi():
                self.advance()
            self.advance()  # skip new line character

    #############################################
    #               Tokenization                #
    #-------------------------------------------#
    # The method next_token() is called by the  #
    # client to obtain the next token in the    #
    # input stream.                             #
    #############################################

    def next_token(self) -> Token:
        if not self.is_eoi():
            if self.current_char.isspace():
                while self.current_char.isspace():
                    self.advance()
            if self.current_char.isdigit():
                return Token(TT_NUMBER, self.number(), self.line, self.col)
            elif self.current_char == '+':
                val = self.current_char
                self.advance()
                return Token(TT_PLUS, val, self.line, self.col)
            elif self.current_char == '-':
                val = self.current_char
                self.advance()
                return Token(TT_MINUS, val, self.line, self.col)
            elif self.current_char == '*':
                val = self.current_char
                if self.peek() == '*':
                    val += '*'
                    self.advance()
                    self.advance()
                    return Token(TT_POW, val, self.line, self.col)
                self.advance()
                return Token(TT_MUL, val, self.line, self.col)
            elif self.current_char == '/':
                val = self.current_char
                self.advance()
                return Token(TT_DIV, val, self.line, self.col)
            elif self.current_char == '&':
                val = self.current_char
                self.advance()
                return Token(TT_AND, val, self.line, self.col)
            elif self.current_char == '|':
                val = self.current_char
                self.advance()
                return Token(TT_OR, val, self.line, self.col)
            elif self.current_char == '^':
                val = self.current_char
                self.advance()
                return Token(TT_XOR, val, self.line, self.col)
            elif self.current_char == '~':
                val = self.current_char
                self.advance()
                return Token(TT_INV, val, self.line, self.col)
            elif self.is_ident():
                _id = self._ident()
                if self.is_keyword(_id):
                    return Token(TT_KEYWORD, _id, self.line, self.col)
                else:
                    return Token(TT_ID, _id, self.line, self.col)
            elif self.current_char == '<':
                val = self.current_char
                if self.peek() == '<':
                    val += '<'
                    self.advance()
                    self.advance()
                    return Token(TT_LSL, val, self.line, self.col)
                self.advance()
                return Token(TT_LESS, val, self.line, self.col)
            elif self.current_char == '>':
                val = self.current_char
                if self.peek() == '>':
                    val += '>'
                    self.advance()
                    self.advance()
                    return Token(TT_LSR, val, self.line, self.col)
                self.advance()
                return Token(TT_GREATER, val, self.line, self.col)
            elif self.current_char == '=':
                val = self.current_char
                if self.peek() == '=':
                    val += '='
                    self.advance()
                    self.advance()
                    return Token(TT_EQ, val, self.line, self.col)
                self.advance()
                return Token(TT_ASSIGN, val, self.line, self.col)
            elif self.current_char == ':':
                val = self.current_char
                self.advance()
                return Token(TT_STARTDEF, val, self.line, self.col)
            elif self.current_char == ';':
                val = self.current_char
                self.advance()
                return Token(TT_ENDDEF, val, self.line, self.col)
            elif self.current_char == '(' or self.current_char == '\\':
                self.eat_comment()
                return self.next_token()
            elif self.current_char == "'":
                return Token(TT_CHAR, self.char(), self.line, self.col)
            elif self.current_char == '"':
                string = self.string()
                return Token(TT_STRING, string, self.line, self.col)
            elif self.is_eoi():
                return Token(TT_EOF, None, self.line, self.col)
            else:
                raise ValueError(f'Unexpected character: {self.current_char} '
                                 f'in input stream at position: {str(self.pos)}.')
        else:
            return Token(TT_EOF, None, self.line, self.col)


def main(text: str):
    lexer = Lexer(text)
    tok = lexer.next_token()
    while tok.type != TT_EOF:
        print(tok)
        tok = lexer.next_token()


if __name__ == "__main__":
    text = """
    ( this is a
      multi-line comment. )
    \\ This is a line comment.
    _myIdent
    """
    main(text)
# stack.py
class Stack:
    def __init__(self, report_errors=False):
        self.store = []
        self.report_error = report_errors

    def push(self, item):
        self.store.append(item)

    def pop(self):
        if self.store:
            return self.store.pop()
        else:
            if self.report_error:
                raise IndexError('Attempt to pop a value from empty stack!')
            else:
                return None

    def tail(self):
        if self.store:
            return self.store.pop(0)
        else:
            return None

    def clear(self):
        self.store.clear()

    def count(self):
        return len(self.store)

    def is_empty(self):
        return not bool(self.store)

    def __str__(self):
        if not self.store:
            return '[ ]'
        s = '[ '
        for v in self.store:
            s += "'" + str(v) + "', "
        s = s[:-2]
        s += ' ]'
        return s

    def __repr__(self):
        return self.__str__()
# tokens.py

# Data
TT_CHAR = 'CHAR'        # 'c'
TT_STRING = 'STRING'    # "string"
TT_NUMBER = 'NUMBER'    # 1234 (integers only)
TT_ID = 'ID'

# Arithmetic Operators
TT_ASSIGN = 'ASSIGN'    # '='
TT_PLUS = 'PLUS'        # '+'
TT_MINUS = 'MINUS'      # '-'
TT_MUL = 'MUL'          # '*'
TT_DIV = 'DIV'          # '/'
TT_POW = 'POW'          # '**'

# Bitwise Operators
TT_LSR = 'LSR'          # '>>'
TT_LSL = 'LSL'          # '<<'
TT_AND = 'AND'          # '&' Bitwise AND
TT_OR = 'OR'            # '|' Bitwise OR
TT_XOR = 'XOR'          # '^' Bitwise XOR
TT_INV = 'INV'          # '~' Bitwise NOT

# Relational Operators
TT_EQ = 'EQ'            # '==' Is Equal?
TT_LESS = 'LESS'        # '<'  Is Less?
TT_GREATER = 'GREATER'  # '>'  Is Greater?

# Logical Operators
TT_LAND = 'LAND'        # AND Logical
TT_LOR = 'LOR'          # OR Logical
TT_NEQ = 'NEQ'          # NEQ Logical

# DEFINE Functions
TT_STARTDEF = 'STARTDEFINE'  # ':'
TT_ENDDEF = 'ENDDEFINE'      # ';'

# INTERNAL USE
TT_KEYWORD = 'KEYWORD'
TT_START = 'START'
TT_EOF = 'EOF'


class Token:
    def __init__(self, _type, _value, line=None, col=None):
        self.type = _type
        self.value = _value
        self.col = col
        self.line = line

    def __str__(self):
        return f'type: {self.type}, value: {self.value}, at line {self.line} in column {self.col}'

    def __repr__(self):
        return self.__str__()