
PEG at the Core Developer Sprint

September 23, 2019 — originally posted on Medium


This week I’m not showing any new code for the parser generator I’ve described in the previous parts. Instead, I’ll try to describe what I did at the Core Developer Sprint last week before it all evaporates from my memory. Most of it relates to PEG one way or another. Then I’ll show some code anyway, because I like to talk about code, and it roughly shows the path I see towards a PEG-based parser for CPython 3.9.

[This is part 9 of my PEG series. See the Series Overview for the rest.]

Every year for the past four years a bunch of Python core developers get together for a week-long sprint at an exotic location. These sprints are sponsored by the PSF as well as by the company hosting the sprint. The first two years were hosted by Facebook in Mountain View, last year was Microsoft’s turn in Bellevue, and this year’s sprint was hosted by Bloomberg in London. (I have to say that Bloomberg’s office looks pretty cool.) Kudos to core dev Pablo Galindo Salgado for organizing!

Over 30 core devs attended this time, as well as two hopefuls. People worked on many different projects, from 3.8 release blockers to new PEPs. I hope that the PSF will publish a blog post with our accomplishments. One of the highlights was pushing the number of open PRs below 1000 by merging over 100 PRs that had been waiting for some love from a core dev reviewer. There was even a friendly contest with a leaderboard showing the top 10 attendees who merged the most pull requests submitted by others.

For me, the main reason to attend events like this is always meeting people with whom I collaborate online all year long but whom I only get to see once or twice a year. Since this sprint was in Europe, we saw a slightly different mix of core devs, which made it all the more valuable. I also did quite a bit of hacking, and that’s what I’ll focus on below.

I spent most of my coding time at the sprint working with Pablo and Emily Morehouse on the PEG-based parser generator that I hope will some day replace the current pgen-based parser generator. It isn’t the same code as the generator I’ve been blogging about, but it’s pretty similar. Pablo had already contributed to this version.

The first day of the sprint, Monday, I spent mostly on parts 7 and 8 of this series. Originally I planned to publish all the material in a single post, but at the end of the day I wasn’t quite finished, so I cut things in two and posted the first half, on making a meta-grammar, as part 7. On Friday after dinner I finally found a little time to polish off part 8, but I still had to skip the “cut” operator because I realized I didn’t have a compelling example.

On Tuesday I started working on developing a PEG grammar for Python. PEG grammars, even before you add actions, are closer to code than to abstract specifications, and we realized that we needed to test the grammar under development against real Python code. So while I hacked on the grammar, Emily wrote a bulk testing script. Once that was completed, my workflow was roughly as follows:

  1. Run the bulk testing script over some directory with Python code
  2. Investigate the first issue reported by the script
  3. Tweak the grammar to deal with that issue
  4. Repeat until there are no issues left
  5. Repeat with a different directory

I started with the pegen project’s own code as the target. Eventually my grammar could parse all of the Python constructs used in pegen, and I moved on to parts of the standard library, first focusing on Lib/test, then Lib/asyncio, and finally Lib, i.e. effectively the entire standard library (or at least the part written in Python). By the end of the week I could declare victory: the only files in the standard library that the new parser rejects are a few files with bad encodings, which exist purely as test cases to verify that bad encodings are rejected, and some Python 2 code used as test cases for lib2to3.

Since then I’ve added some timing code to Emily’s script, and it looks like the new PEG-based Python parser is a tad faster than the old pgen-based parser. That doesn’t mean it will all be smooth sailing from here though! The grammar defines more than 100 rules and is over 160 lines long. To make it produce ASTs, we will have to add an action to every rule (see part 6).
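The timing code itself isn’t shown in this post; the kind of measurement it adds can be sketched like this (again with `ast.parse` as a placeholder for whichever parser is being timed, and best-of-N to damp out noise):

```python
import ast
import time

def time_parser(parse, sources, repeat=5):
    """Return the best wall-clock time to parse all sources once."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for src in sources:
            parse(src)
        best = min(best, time.perf_counter() - start)
    return best

# A synthetic workload; the real comparison ran over stdlib files.
sources = ["x = %d + %d\n" % (i, i) for i in range(100)]
elapsed = time_parser(ast.parse, sources)
print(f"parsed {len(sources)} snippets in {elapsed:.4f}s")
```

Running the same harness over both parsers on the same file set gives a like-for-like comparison.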

I did an earlier experiment, and adding actions will probably double or triple the size of the grammar file. Here’s the grammar from that experiment:

start[mod_ty]: a=stmt* ENDMARKER { Module(a, NULL, p->arena) }
stmt[stmt_ty]: compound_stmt | simple_stmt
compound_stmt[stmt_ty]: pass_stmt | if_stmt
pass_stmt[stmt_ty]: a='pass' NEWLINE { _Py_Pass(EXTRA(a, a)) }
if_stmt[stmt_ty]:
    | 'if' c=expr ':' t=suite e=[else_clause] {
        _Py_If(c, t, e, EXTRA(c, c)) }
else_clause[asdl_seq*]:
    | 'elif' c=expr ':' t=suite e=[else_clause] {
        singleton_seq(p, _Py_If(c, t, e, EXTRA(c, c))) }
    | 'else' ':' s=suite { s }
suite[asdl_seq*]:
    | a=simple_stmt { singleton_seq(p, a) }
    | NEWLINE INDENT b=stmt+ DEDENT { b }
simple_stmt[stmt_ty]: a=expr_stmt NEWLINE { a }
expr_stmt[stmt_ty]: a=expr { _Py_Expr(a, EXTRA(a, a)) }
expr[expr_ty]:
    | l=expr '+' r=term { _Py_BinOp(l, Add, r, EXTRA(l, r)) }
    | l=expr '-' r=term { _Py_BinOp(l, Sub, r, EXTRA(l, r)) }
    | term
term[expr_ty]:
    | l=term '*' r=factor { _Py_BinOp(l, Mult, r, EXTRA(l, r)) }
    | l=term '/' r=factor { _Py_BinOp(l, Div, r, EXTRA(l, r)) }
    | factor
factor[expr_ty]:
    | l=primary '**' r=factor { _Py_BinOp(l, Pow, r, EXTRA(l, r)) }
    | primary
primary[expr_ty]:
    | f=primary '(' e=expr ')' {
        _Py_Call(f, singleton_seq(p, e), NULL, EXTRA(f, e)) }
    | atom
atom[expr_ty]:
    | '(' e=expr ')' { e }
    | NAME
    | NUMBER
    | STRING

There are a bunch of things here that I should explain.

If some of this sounds a little dodgy, I’m the first to admit it. It’s a prototype, and its main point is to show that it is, in principle, possible to generate a working AST using a parser generated from a PEG grammar. All this works without any changes to the existing tokenizer or bytecode compiler. The prototype can compile simple expressions and if statements, and the resulting AST can be compiled to bytecode and executed, and it seems to work.
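The actions in the grammar above build the same AST node types that CPython exposes at the Python level through the standard ast module. Purely as an illustration (this uses the standard library, not the prototype), you can inspect the node shapes those actions produce and confirm that such an AST compiles to bytecode and runs:

```python
import ast

# '1 + 2 * 3' parses into nested BinOp nodes with the precedence the
# grammar encodes: '+' at the expr level, '*' at the term level.
tree = ast.parse("1 + 2 * 3", mode="eval")
node = tree.body
print(type(node).__name__, type(node.op).__name__)  # BinOp Add
print(type(node.right).__name__)                    # BinOp -- the 2 * 3 subtree

# An AST object can be handed straight to compile() and executed,
# the same hand-off the prototype relies on with the existing
# bytecode compiler.
print(eval(compile(tree, "<example>", "eval")))     # 7
```

This is exactly why no changes to the tokenizer or bytecode compiler are needed: as long as the parser emits the established AST node types, everything downstream works unchanged.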

Other stuff I did at the core development sprint:

In the next episode I hope to share some progress with the actions to generate AST nodes.

License for this article and the code shown: CC BY-NC-SA 4.0
