The Subtle Ergonomics of the Python Parser

Ben Konz · September 28, 2019

Python has established itself as a popular programming language for both newcomers and veterans, in large part because it is easy to read and write. While writing a Python parser for Jeroo, I discovered many subtle things that the lexer and parser do to make the language ergonomic to use.

Dynamic Indentation Levels

Code blocks in Python start with a new indentation level and end when the indentation returns to an earlier level. For example:

if True:
    print("true")
print("always")

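An indentation level can be of any width, as long as it contains more whitespace than the previously seen indentation level. Additionally, a tab advances the indentation to the next multiple of eight columns, so the lexer can compare lines indented with tabs against lines indented with spaces. Note that Python 3 raises a TabError when a file mixes the two in a way that makes its meaning depend on the tab size, so mixing them is best avoided.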
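The rule above can be sketched with a stack of indentation widths. This is a minimal illustration, not the Jeroo parser or CPython's lexer: the names `indent_width` and `indentation_tokens` are hypothetical, and the sketch ignores comments, line continuations, and brackets.

```python
def indent_width(line, tab_size=8):
    """Width of the leading whitespace; a tab jumps to the next tab stop."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tab_size + 1) * tab_size
        else:
            break
    return col


def indentation_tokens(lines):
    """Yield INDENT, DEDENT, and LINE markers for a list of source lines."""
    stack = [0]  # widths of the currently open indentation levels
    for line in lines:
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = indent_width(line)
        if width > stack[-1]:
            # Any amount of extra whitespace opens a new level.
            stack.append(width)
            yield "INDENT"
        else:
            # Close every level that is deeper than this line.
            while width < stack[-1]:
                stack.pop()
                yield "DEDENT"
            if width != stack[-1]:
                raise IndentationError("dedent does not match any outer level")
        yield "LINE"


# Levels of 2 and then 7 columns are both accepted.
lines = ["if a:", "  if b:", "       x = 1", "  y = 2", "z = 3"]
print(list(indentation_tokens(lines)))
```

Because each new level only has to be wider than the one on top of the stack, the two-column and seven-column levels in the example coexist without any problem.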

Auto-Closing Dedentation Tokens

The Python language spec specifies that all open indentation levels are closed automatically before the lexer emits an EOF token. In the program below, the loop body is closed even though no dedented line follows it:

while True:
    print("loop")

This is quite useful. Without it, programs that end inside several nested code blocks would be tedious to write:

if True:
    if True:
        if True:
            if True:
                print("nested")

Without automatic closing, the programmer would have to end this file with four extra lines, each indented one level less than the one before, just to close the blocks.
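CPython's own tokenizer shows this behavior. The sketch below uses the standard-library `tokenize` module on a four-level nested program; the helper name `token_names` is made up for this example.

```python
import io
import tokenize


def token_names(source):
    """Return the name of every token the standard-library tokenizer emits."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


nested = (
    "if True:\n"
    "    if True:\n"
    "        if True:\n"
    "            if True:\n"
    "                print('nested')\n"
)

names = token_names(nested)
# The tail of the stream: one DEDENT per open level, then the end marker.
print(names[-5:])
```

The source never dedents, yet the token stream ends with four DEDENT tokens followed by ENDMARKER.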

Logical Newlines

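The Python language spec also describes logical newlines. They can be defined by the regular expression ((whitespace* comment? newline)* whitespace* comment?) newline, which says that a logical newline is one or more consecutive line endings, each optionally preceded by whitespace and a comment. The lexer can therefore hand the parser a single newline token no matter how many blank or comment-only lines separate two statements.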

Without this property, blank lines and comments would need to be handled in the parser, whose grammar would have to allow them between any two statements, and that could lead to ambiguities.
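As a rough check of this idea, the standard-library `tokenize` module can be run over a program with blank and comment-only lines. CPython takes a slightly different route from the regular expression above: it reports such lines as separate NL and COMMENT tokens, and only NEWLINE marks the end of a statement. The variable names here are illustrative.

```python
import io
import tokenize

source = (
    "x = 1\n"
    "\n"
    "# a comment on its own line\n"
    "   \n"
    "y = 2\n"
)

kinds = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

print(kinds)
# Two statements, so only two statement-ending NEWLINE tokens.
print("NEWLINE tokens:", kinds.count("NEWLINE"))
```

The three lines between the two assignments add no NEWLINE tokens, and the whitespace-only line does not open an indentation level.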

EOF as a Valid Newline

The Python lexer treats an EOF token as a valid end to a statement. As a result, a program does not need to end with a newline character, which also makes programs easier to write. For example, this:

while True:
    print("loop")

is a valid program, even though there is no newline after the print call.
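This can be checked with the built-in `compile()` function and the `tokenize` module. The sketch assumes a recent CPython release, where the tokenizer inserts an implicit NEWLINE token when the input lacks a trailing newline; the file name `<demo>` is arbitrary.

```python
import io
import tokenize

source = 'while True:\n    print("loop")'  # deliberately no trailing "\n"

# compile() only parses the program here; the loop is never executed.
code = compile(source, "<demo>", "exec")

kinds = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(kinds)
```

The program compiles without error, and the token stream contains a NEWLINE for the print call even though the source text has none.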
