Architecture of Thought, Part 1: Tanusha — how we taught Python to read runes

This is the first article in a new technical series, where we look "under the hood" of the Tengri-Lang language. Now that the project is open on GitHub, we can show in detail how its practical implementation began. In this part, we will examine the very first "master" of our compiler: a lexer implemented in Python.

In previous articles, we talked a lot about philosophy. Now let's talk about the code.

Before building a "production" compiler in Go, we needed to quickly test the idea itself. Could a machine recognize our runic syntax at all? Would the grammar logic work? There is no better tool for such tasks than Python. It let us create the first "master", "Tanusha" (the Discerner), without getting bogged down in details.

Its work resembles that of a reader who slides along a line of text, recognizing individual words without delving into the meaning of the whole sentence.

First of all, we decided what a "recognized word" (a token) would look like in our system. We created a simple Token class, a container for information about each element of the code.

  • token.py:

Python

# token.py
class Token:
    """Describes one 'recognized' code element."""
    def __init__(self, type, value, line=1, column=1):
        self.type = type      # Token type (for example, 'Runa_Const', 'Identifier')
        self.value = value    # Token value (for example, 'Λ', 'san')
        self.line = line      # Line number
        self.column = column  # Column number

    def __repr__(self):
        """Method for pretty-printing a token."""
        return f"Token({self.type}, '{self.value}')"

Each token has a type and a value. Simple and effective.
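
For instance, creating a token by hand and printing it shows the constructor and __repr__ at work (the values are taken from the comments in the class above):

Python

# Quick sanity check of the Token class
t = Token('Runa_Const', 'Λ', line=1, column=1)
print(t)  # -> Token(Runa_Const, 'Λ')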

Now it's "Tanusha" itself. In Python, it is very convenient to implement it as a Lexer class. The heart of this class is the token_map dictionary. It allows you to instantly, without complex checks, match a character from the code with its type.

  • lexer.py (key fragments):

Python

# lexer.py
from token import Token

class Lexer:
    """'Tanusha', the recognizer. The final version of the prototype."""
    def __init__(self, source_code):
        self.code = source_code
        self.position = 0
        # Full dictionary for instant recognition of single characters
        self.token_map = {
            'Π': 'Runa_Func_Def', '—': 'Runa_Var',
            'Λ': 'Runa_Const',    'Y': 'Runa_If',
            'Q': 'Runa_True',     'I': 'Runa_False',
            '↻': 'Runa_Loop',     '→': 'Runa_Return',
            # ... and so on for all runes and operators.
        }
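
That is the whole trick: recognizing a rune costs a single hash lookup instead of a chain of if/elif comparisons. A quick illustration (the source line here is an arbitrary example):

Python

# One dictionary lookup replaces a chain of comparisons:
lexer = Lexer('Λ san = 9')
print('Λ' in lexer.token_map)  # -> True
print(lexer.token_map['Λ'])    # -> 'Runa_Const'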

The main get_next_token method looks at the current character and first checks whether it is in our dictionary. If it is, the token is created instantly.

Python

# fragment of the get_next_token method in lexer.py
def get_next_token(self):
    # ... (skipping whitespace and comments)

    current_char = self.code[self.position]

    # Check the dictionary of single characters
    if current_char in self.token_map:
        token_type = self.token_map[current_char]
        token = Token(token_type, current_char)
        self.position += 1
        return token

    # If the character is not in the dictionary, check whether it is a number.
    if current_char.isdigit():
        return self._read_number()

    # Or maybe it is a name (an identifier)
    if current_char.isalpha():
        identifier = self._read_identifier()
        return Token('Identifier', identifier)

    # ...
The _read_identifier() method simply reads characters one after another for as long as they are letters or digits, and returns the resulting word (a sketch of this helper is shown below). This way, tengri is recognized as a single token rather than six separate letters.
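
The article does not show these helpers, so here is a minimal sketch of how _read_identifier and its numeric twin _read_number could look, based on the description above. Treat it as an illustration, not the exact prototype code; the 'Number' token type is our assumption:

Python

# Hypothetical sketch of the Lexer helper methods described above.
def _read_identifier(self):
    """Reads characters while they are letters or digits; returns the word."""
    start = self.position
    while self.position < len(self.code) and self.code[self.position].isalnum():
        self.position += 1
    return self.code[start:self.position]

def _read_number(self):
    """Reads a run of digits and returns a ready-made token."""
    start = self.position
    while self.position < len(self.code) and self.code[self.position].isdigit():
        self.position += 1
    return Token('Number', self.code[start:self.position])  # 'Number' type is assumed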

This simple Python lexer fulfilled its main task: it proved that our system of runes and syntax is viable. It correctly read the code we wrote for it and broke it into meaningful parts.
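
To make that proof tangible, here is roughly how the prototype can be exercised end to end. The source line, the EOF convention, and the 'Assign' and 'Number' type names are illustrative assumptions, not literal output from the repository:

Python

# Illustrative driver. Assumes the full get_next_token() returns
# a Token('EOF', None) once the input is exhausted.
lexer = Lexer('Λ san = 9')
while True:
    token = lexer.get_next_token()
    if token.type == 'EOF':
        break
    print(token)

# Schematic output:
#   Token(Runa_Const, 'Λ')
#   Token(Identifier, 'san')
#   Token(Assign, '=')   # assumed operator type
#   Token(Number, '9')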

After receiving this confirmation, we realized it was time to move to the next level. To build a real, fast, and reliable compiler, we needed a more suitable tool. That is why the next step in our story was porting this verified logic to a "production" language: Go.

In the next article, we will meet the heart of our compiler: "Kurastyrusha" (the Parser), which knows all the laws of the language. It is the parser that takes the stream of tokens and builds from it the majestic "Oh Baytergi", a Tree of Thought.

Stay tuned and check out our GitHub repository!
