Writing a Postgres SQL Pretty Printer in Rust: Part 1.5

Last week I wrote the first post in this series, where I introduced the project and wrote about generating Rust code for the parsed Postgres AST.

I also wrote about the need for wrapper enums in the generated code, but I don’t think I went into enough detail, based on questions and discussions I had after I shared that post in /r/rust.

So this week I will go into more detail on exactly why I had to do this.

Series Links

Part 1: Introduction to the project and generating Rust with Perl
Part 1.5: More about enum wrappers and serde’s externally tagged enum representation
Part 2: How I’m testing the pretty printer and how I generate tests from the Postgres docs

A Tagged Enum Example

I’ve made an example crate with all of the code I walk through below at https://github.com/autarch/tagged-enum-example.

In order to make this simpler, I’ll use some very simple JSON, as opposed to the rather complex JSON we get back from the Pg parser. However, I cannot change the JSON to make parsing easier, just like I cannot do that with the Pg parser’s output¹.

{
  "Root": {
    "first": {
      "Foo": {
        "size": 42,
        "color": "blue"
      }
    },
    "second": {
      "Bar": {
        "mood": "indigo",
        "car": "Super"
      }
    },
    "actions": [
      {
        "Run": {
          "speed": 84
        }
      },
      {
        "Sleep": {
          "hours": 8
        }
      }
    ]
  }
}

I’ll use JSONPath to refer to parts of the document. You can see that every object in the JSON is “tagged” with its type. Those are the title case keys: $.Root, $.Root.first.Foo, $.Root.second.Bar, $.Root.actions[0].Run, and $.Root.actions[1].Sleep.

Let’s assume that the $.Root.second key is optional, so it could be entirely omitted in some documents.

The Naive Approach

Now let’s make some Rust structs that correspond to this JSON. This corresponds to the naive directory in my example repo.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Root {
    first: Foo,
    second: Option<Bar>,
    actions: Vec<Action>,
}

#[derive(Debug, Deserialize)]
struct Foo {
    size: i8,
    color: String,
}

#[derive(Debug, Deserialize)]
struct Bar {
    mood: String,
    car: String,
}

#[derive(Debug, Deserialize)]
enum Action {
    Run { speed: i64 },
    Sleep { hours: i8 },
}

This is all pretty straightforward. We have a Root struct that contains a Foo, an optional Bar, and zero or more Action values (Action is an enum, since an action can be either a Run or a Sleep).

And here’s our parsing code:

fn main() {
    // DOC is assumed to be a &str constant holding the JSON document shown above.
    let output: Root = serde_json::from_str(DOC).expect("parsed");
    println!("{:#?}", output);
}

So what happens when we run this?

We get this error:

... 'parsed: Error("missing field `first`", line: 29, column: 1)', ...

The important bit is "missing field `first`", line: 29, column: 1. What’s at line 29, column 1 of our JSON document? That’s the end of the document, actually.

So basically we’re seeing that the serde JSON parser looked through the entire top-level object for a first key but could not find one. That makes sense, since the top-level object in the actual document only contains a key named Root.

Fortunately, serde has a solution to this, in the form of its “externally tagged enum representation” handling. For this type of JSON, each object is annotated with an extra “tag” indicating its type, just like we see with $.Root and $.Root.first.Foo and so on.

But the key word here is “enum”. Serde does not offer a way to handle this style of JSON without using enums. So I need to make a bunch of enums, one for each possible tag.

The So Many Enums Approach

This corresponds to the with-enums directory in my example repo.

And here are our structs and enums:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
enum RootWrapper {
    Root(Root),
}

#[derive(Debug, Deserialize)]
struct Root {
    first: FooWrapper,
    second: Option<BarWrapper>,
    actions: Vec<Action>,
}

#[derive(Debug, Deserialize)]
enum FooWrapper {
    Foo(Foo),
}

#[derive(Debug, Deserialize)]
struct Foo {
    size: i8,
    color: String,
}

#[derive(Debug, Deserialize)]
enum BarWrapper {
    Bar(Bar),
}

#[derive(Debug, Deserialize)]
struct Bar {
    mood: String,
    car: String,
}

#[derive(Debug, Deserialize)]
enum Action {
    Run { speed: i64 },
    Sleep { hours: i8 },
}

And our main() is:

fn main() {
    let output: RootWrapper = serde_json::from_str(DOC).expect("parsed");
    println!("{:#?}", output);
}

Note that the type of output is now RootWrapper instead of Root. This runs without an error, giving us:

Root(
    Root {
        first: Foo(
            Foo {
                size: 42,
                color: "blue",
            },
        ),
        second: Some(
            Bar(
                Bar {
                    mood: "indigo",
                    car: "Super",
                },
            ),
        ),
        actions: [
            Run {
                speed: 84,
            },
            Sleep {
                hours: 8,
            },
        ],
    },
)

Yay, it works! But it has tons of pointless enums. Boo!

The enums generally clutter up the code with a lot of destructuring. For example, if I want to get the struct corresponding to $.Root.first.Foo, I have to write this:

    let RootWrapper::Root(root) = output;
    let FooWrapper::Foo(foo) = root.first;

In my Pg formatting code, multiply that destructuring by a thousand.
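One way to at least hide that destructuring is to put it behind an accessor on each wrapper. This is a minimal sketch, not what the example repo does: the inner() methods and first_foo_size() are names I made up for illustration, and the values are built by hand so the snippet runs without serde.

```rust
// Hand-built stand-ins for the deserialized types (no serde needed here).
#[derive(Debug)]
struct Foo {
    size: i8,
    color: String,
}

#[derive(Debug)]
enum FooWrapper {
    Foo(Foo),
}

impl FooWrapper {
    // Hypothetical accessor: borrow the inner struct without consuming the wrapper.
    fn inner(&self) -> &Foo {
        // This `let` is irrefutable only because the enum has exactly one variant.
        let FooWrapper::Foo(foo) = self;
        foo
    }
}

#[derive(Debug)]
struct Root {
    first: FooWrapper,
}

#[derive(Debug)]
enum RootWrapper {
    Root(Root),
}

impl RootWrapper {
    // Same hypothetical accessor for the top-level wrapper.
    fn inner(&self) -> &Root {
        let RootWrapper::Root(root) = self;
        root
    }
}

// The two `let` lines from the text collapse into one method chain.
fn first_foo_size(output: &RootWrapper) -> i8 {
    output.inner().first.inner().size
}

fn main() {
    let output = RootWrapper::Root(Root {
        first: FooWrapper::Foo(Foo {
            size: 42,
            color: "blue".to_string(),
        }),
    });
    println!("{:?}", output.inner().first.inner());
    println!("{}", first_foo_size(&output));
}
```

The wrappers are all still there, and every one of them needs its own accessor, so this only moves the noise out of the formatting code rather than removing it.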

There must be some way out of here

When I shared this in /r/rust last week, /u/nicoburns had some helpful suggestions for working around this. We went back and forth a bit, and I was able to get something working, but only for simple cases. I couldn’t get it to work for fields like Option<Bar> or Vec<Action>. The Pg parser AST also contains fields like Option<Vec<Something>>, vectors of tuples like Vec<(Foo, Bar)>, and probably some other weird things.

What I would love is a solution that changes the code generated by the serde macros to just “skip over” the tag instead of creating an enum for it when the enum only has one variant.

A solution that still requires the wrappers and even more generated code for them would be fine, though I suspect it’d make the AST code’s slow compilation even slower.
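To make "wrappers plus more generated code" concrete, here is one shape it could take: keep the wrapper-laden types for parsing, then convert them once into wrapper-free types with From impls. This is a sketch under my own assumptions, not something I have implemented. RawRoot, the clean Root, and unwrap_all() are hypothetical names, and the values are built by hand instead of being deserialized so the snippet has no dependencies.

```rust
// Leaf types are shared by the raw (parsed) and clean trees.
#[derive(Debug, PartialEq)]
struct Foo {
    size: i8,
    color: String,
}

#[derive(Debug, PartialEq)]
struct Bar {
    mood: String,
    car: String,
}

#[derive(Debug, PartialEq)]
enum Action {
    Run { speed: i64 },
    Sleep { hours: i8 },
}

// Single-variant wrappers, as in the with-enums version.
enum FooWrapper {
    Foo(Foo),
}
enum BarWrapper {
    Bar(Bar),
}
enum RootWrapper {
    Root(RawRoot),
}

// Hypothetical: the wrapper-laden struct that serde would fill in.
struct RawRoot {
    first: FooWrapper,
    second: Option<BarWrapper>,
    actions: Vec<Action>,
}

// Hypothetical: the clean struct the formatting code would consume.
#[derive(Debug, PartialEq)]
struct Root {
    first: Foo,
    second: Option<Bar>,
    actions: Vec<Action>,
}

impl From<FooWrapper> for Foo {
    fn from(wrapper: FooWrapper) -> Self {
        let FooWrapper::Foo(foo) = wrapper;
        foo
    }
}

impl From<BarWrapper> for Bar {
    fn from(wrapper: BarWrapper) -> Self {
        let BarWrapper::Bar(bar) = wrapper;
        bar
    }
}

impl From<RootWrapper> for Root {
    fn from(wrapper: RootWrapper) -> Self {
        let RootWrapper::Root(raw) = wrapper;
        Root {
            first: raw.first.into(),
            // Option<Wrapper> unwraps with map().
            second: raw.second.map(Bar::from),
            // Action is a real multi-variant enum, so it passes through as-is.
            actions: raw.actions,
        }
    }
}

// Vec<Wrapper> unwraps element by element.
fn unwrap_all(wrapped: Vec<FooWrapper>) -> Vec<Foo> {
    wrapped.into_iter().map(Foo::from).collect()
}

fn main() {
    let raw = RootWrapper::Root(RawRoot {
        first: FooWrapper::Foo(Foo { size: 42, color: "blue".to_string() }),
        second: Some(BarWrapper::Bar(Bar {
            mood: "indigo".to_string(),
            car: "Super".to_string(),
        })),
        actions: vec![Action::Run { speed: 84 }, Action::Sleep { hours: 8 }],
    });
    println!("{:#?}", Root::from(raw));

    let foos = unwrap_all(vec![FooWrapper::Foo(Foo { size: 1, color: "red".to_string() })]);
    println!("{:?}", foos);
}
```

The cost is a second copy of every struct that contains a wrapper, plus one From impl per wrapper, which is exactly the kind of extra generated code that would add to compile times.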

I started digging into serde a bit to try to understand how I might do this, but it’s pretty complex, and I’m still pretty new to Rust.

For now, I have enough other things to work on with this project. For example, the way I generate formatted SQL is horrific and unscalable (lots of inline some_str.push_str("WHERE ") and format!). I’m starting on a refactor to generate some sort of intermediate representation of the AST that I can then turn into a string.
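To give a rough idea of what I mean by an intermediate representation, here is a toy sketch. It is not the design I will end up with: the Chunk enum and the render functions are hypothetical names invented for this example. The point is only that the formatter builds a small tree first, and a separate step decides how that tree becomes a string.

```rust
// A toy intermediate representation for formatted SQL (hypothetical design).
#[derive(Debug)]
enum Chunk {
    // A SQL keyword such as SELECT or WHERE.
    Keyword(&'static str),
    // Arbitrary text: an identifier, an expression, etc.
    Text(String),
    // Items rendered with ", " between them.
    CommaList(Vec<Chunk>),
    // Chunks rendered on one line with a single space between them.
    Line(Vec<Chunk>),
}

// Turn one chunk into a string by recursing into its children.
fn render(chunk: &Chunk) -> String {
    match chunk {
        Chunk::Keyword(keyword) => keyword.to_string(),
        Chunk::Text(text) => text.clone(),
        Chunk::CommaList(items) => items.iter().map(render).collect::<Vec<_>>().join(", "),
        Chunk::Line(parts) => parts.iter().map(render).collect::<Vec<_>>().join(" "),
    }
}

// Render each top-level chunk on its own line.
fn render_statement(lines: &[Chunk]) -> String {
    lines.iter().map(render).collect::<Vec<_>>().join("\n")
}

fn main() {
    let statement = vec![
        Chunk::Line(vec![
            Chunk::Keyword("SELECT"),
            Chunk::CommaList(vec![Chunk::Text("a".to_string()), Chunk::Text("b".to_string())]),
        ]),
        Chunk::Line(vec![Chunk::Keyword("FROM"), Chunk::Text("t".to_string())]),
        Chunk::Line(vec![Chunk::Keyword("WHERE"), Chunk::Text("a > 1".to_string())]),
    ];
    println!("{}", render_statement(&statement));
}
```

Compared to inline push_str calls, layout decisions like line breaks and separators live in one place, so changing them does not mean touching every function that emits SQL.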

Next up

Here’s a list of what I want to cover in future posts.

Diving into the Postgres grammar to understand the AST.
How I’m approaching tests for this project, and how I generate test cases from the Postgres documentation.
The benefits of Rust pattern-matching for working with ASTs.
How terrible my initial solution to generating SQL in the pretty printer is, and how I fixed it (once I actually fix it).
How the proc macro in the bitflags_serde_int crate works².
Who knows what else?

Stay tuned for more posts in the future.

¹ Ok, technically I could do that, but that would involve parsing the JSON and rewriting it in order to … make it easier to parse?
² Edit 2021-04-24: Nope, not gonna write about this. It turns out I was reimplementing the already existing #[serde(transparent)] feature.