Introduction to Code Generation

Code generation is a good way to automate writing repetitive code, also known as boilerplate. Programming languages keep getting better at reducing how much repetitive code you need to write, but boilerplate still comes up regularly.

Another use-case is generating the same code in a variety of languages, for instance when writing a software development kit (SDK) for an API. This is the idea behind Protocol Buffers and OpenAPI, which let you write your models and services in a domain-specific language (DSL).

An interesting use-case is writing a cross-language library with idiomatic interfaces in different languages. Normally when you write a library you choose the language up front, and that is the only language you consider supporting. If you do decide to make the library more widely accessible, the usual route is to write it in C (or at least expose a C interface). The problem with C is that, even though you can call it from practically any other language, the interface is going to be very clunky. With code generation you can provide nicer idiomatic interfaces for other languages. Watch out for an upcoming post about one such library 😉.

A more advanced use-case is generating code from a higher-level specification. This is what Flow does to generate animation code in Swift and HTML.

In this post I introduce some common approaches to code generation.

How to generate code?

There are a number of ways of generating code. They range from using print statements to using languages with built-in metaprogramming constructs. Each has its own pros and cons, and choosing should be based on your specific use-case. To evaluate each approach I am going to use these metrics:

Extensibility

How easy is it to add a new output language or format? Values range from 1 where you need to start from scratch for each new output language to 5 where you can add a new output language without having to modify any existing code.

Expressiveness

How expressive is the code generation language? Values range from 1 where there is no generation language to 5 where the generation language is a programming language in its own right.

Clarity

How clear is the generation code? Values range from 1 where everything is a messy mix of languages in the same file to 5 where everything is written in the same language and easily followed by developers new to the codebase.

Unintrusiveness

How unintrusive is the code generation? Values range from 1 where you need a completely new toolset to 5 where the code generation requires no additional dependencies.

Roll your own

This is the simplest form of code generation. Write a program that uses print statements to write another program. It may sound like a bad idea, but if you structure your code well it can actually be easier to maintain than most of the other approaches. It also has no extra dependencies and it doesn’t introduce any other languages to your codebase. The main drawback of this approach is that it doesn’t scale well to multiple output languages. In my experience you will end up writing a lot of code and having to refactor for every new output format.
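As a minimal sketch of this approach, the Python program below uses nothing but print calls to write a C header. The `MODEL` dictionary and the function names are made up for illustration; they are not taken from any real generator.

```python
import io

# Hypothetical model: struct name -> list of (field name, C type).
MODEL = {"Point": [("x", "double"), ("y", "double")]}


def generate_c_header(model, out):
    """Write a C struct definition for every model entry using plain print calls."""
    print("#pragma once", file=out)
    for name, fields in model.items():
        print(file=out)  # blank line between definitions
        print(f"typedef struct {name} {{", file=out)
        for field_name, c_type in fields:
            print(f"    {c_type} {field_name};", file=out)
        print(f"}} {name};", file=out)


def render(model):
    """Collect the generated header in a string instead of writing to stdout."""
    buffer = io.StringIO()
    generate_c_header(model, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render(MODEL), end="")
```

Notice that the C syntax is baked into the print calls. Supporting a second output language would mean writing a second function of roughly the same size, which is where this approach starts to hurt.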

Learn More

  • Protobuf plugins implement their code generation directly in C++. Have a look at their GitHub repo.

Model-based

Model-based code generation tools take a model as input and generate code. The input model can be a domain-specific language or a standard data format like JSON. The most common model-based generators are Protobuf and OpenAPI (Swagger). These are great for model and SDK generation across multiple languages but are very limited in their expressiveness. Although the model code is clear and concise, the generated code is often messier than necessary (due in part to that lack of expressiveness).
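To make the idea concrete, here is a toy sketch of a model-based generator in Python. It is not how Protobuf or OpenAPI work internally; the JSON schema (`name`, `fields`, `type`) and all function names are assumptions invented for this example.

```python
import json

# Hypothetical input model; this schema is made up for the sketch.
MODEL_JSON = """
{
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": "string"}
  ]
}
"""

# One type table per output language.
TS_TYPES = {"int": "number", "string": "string"}
PY_TYPES = {"int": "int", "string": "str"}


def emit_typescript(model):
    """Emit a TypeScript interface for the model."""
    lines = [f"export interface {model['name']} {{"]
    for field in model["fields"]:
        lines.append(f"  {field['name']}: {TS_TYPES[field['type']]};")
    lines.append("}")
    return "\n".join(lines) + "\n"


def emit_python(model):
    """Emit a Python dataclass for the same model."""
    lines = ["from dataclasses import dataclass", "", "", "@dataclass",
             f"class {model['name']}:"]
    for field in model["fields"]:
        lines.append(f"    {field['name']}: {PY_TYPES[field['type']]}")
    return "\n".join(lines) + "\n"


# Adding an output language means adding one emitter and one table entry here.
EMITTERS = {"typescript": emit_typescript, "python": emit_python}


def generate(model_json, language):
    """Parse the JSON model and run the emitter for the requested language."""
    return EMITTERS[language](json.loads(model_json))


if __name__ == "__main__":
    print(generate(MODEL_JSON, "typescript"))
    print(generate(MODEL_JSON, "python"))
```

The model only describes names and types, so that is all the emitters can work with. Anything the model cannot express (validation rules, custom methods) has no way into the generated code, which is the expressiveness limit mentioned above.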

Template engines

Template engines are the most common general-purpose generation mechanism. Templates are used extensively to generate HTML websites. Some common template engines are Jinja2, Mustache, PHP, eRuby, and Swift GYB. Templates work by interleaving verbatim output with generation instructions, which are marked off by special control sequences. When the template is executed, the template engine replaces the instructions with the generated code. See for instance the Mustache Demo.

Template engines shine when the generated code has a consistent structure, like in HTML. Most template engines (Mustache, Jinja2) are limited in expressiveness: they only allow a small set of basic constructs. Mustache takes this furthest with its logic-free (in its own words, "logic-less") templates; Jinja2 adds loops and conditionals but still restricts what template code can do. While this is desirable for simple “fill in the blanks” use-cases, it will limit what you can generate.
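A small sketch of the "fill in the blanks" style, using Python's standard-library `string.Template` as a stand-in for a logic-free engine (it is not one of the engines named above). The template text and the `render_getter` helper are hypothetical.

```python
from string import Template

# Verbatim Java output, with $placeholders as the only "instructions".
GETTER_TEMPLATE = Template(
    "public $type get$Name() {\n"
    "    return this.$name;\n"
    "}\n"
)


def render_getter(field_name, java_type):
    """Fill in the blanks of the getter template for one field."""
    return GETTER_TEMPLATE.substitute(
        type=java_type,
        name=field_name,
        # The template cannot capitalize by itself, so the caller does it.
        Name=field_name[0].upper() + field_name[1:],
    )


if __name__ == "__main__":
    print(render_getter("email", "String"), end="")
```

The capitalized name has to be computed outside the template because the template language has no way to express it. That is the trade-off in miniature: the template stays readable, but any logic moves into the calling code.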

The more expressive template engines (PHP, eRuby, GYB) are powerful but come at the price of introducing dependencies and separate languages to your codebase (unless, of course, your codebase happens to be in one of these languages). The template code can also end up being a jumbled mess of different languages.

[Figures: Logic-free templates; General templates]

Learn More

See the Wikipedia page for a full list of template engines and their capabilities.

Built-in Metaprogramming

Some languages, like C, C++, Julia, and Rust, have built-in metaprogramming constructs that let you mix normal code with meta-code. They are great for simple use-cases, but overuse often leads to hard-to-understand code. The main disadvantage is that you have to be using one of these languages to start with, and even then you can only generate more of the same language.
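Python is not on the list above, and its runtime metaprogramming is a different mechanism from C macros or Rust macros, but a class decorator gives a rough feel for mixing normal code with meta-code. In this sketch the `with_setters` decorator and the `Request` class are both invented for illustration.

```python
def with_setters(cls):
    """Class decorator: generate a fluent with_<field>() method per annotated field."""

    def make_setter(field_name):
        def setter(self, value):
            setattr(self, field_name, value)
            return self  # returning self allows call chaining

        setter.__name__ = f"with_{field_name}"
        return setter

    # Meta-code: inspect the class and attach the generated methods to it.
    for field_name in list(getattr(cls, "__annotations__", {})):
        setattr(cls, f"with_{field_name}", make_setter(field_name))
    return cls


@with_setters
class Request:
    # Normal code: the annotated fields drive what gets generated.
    url: str = ""
    timeout: float = 30.0


if __name__ == "__main__":
    request = Request().with_url("https://example.com").with_timeout(5.0)
    print(request.url, request.timeout)
```

Nothing in the class body mentions `with_url` or `with_timeout`, so a reader has to know what the decorator does to find them. That is the clarity cost of overusing meta-code, and the generated methods can only ever be Python.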

Conclusion

Code generation is a powerful tool. It lets you automate writing boring repetitive boilerplate code, write cross-language interfaces and SDKs, and generate code from higher-level specs.

But it's not without issues. One of the main issues with code generation is the need to mix the code that is doing the generation with the code that is being generated. In general these are two different languages. Better IDE support for this use case would be a good step in the right direction.

I will keep exploring options and sharing what I learn.