The Art of Transformer Programming


“How can modern large language models perform simple computations, like sorting a list, searching for a sequence, or adding two numbers in decimal representation?”

I spent a few days during last year’s holiday season answering this and related questions by writing a set of programs on a Transformer “computer”: that is, I manually set the weights of a Transformer to execute these programs. In the hope that this work might be interesting to others, I’ve spent some free weekend time compiling my results into book form.

You can find the full book online here: https://yanivle.github.io/taotp.html

This was one of the more fun things I’ve done in a while!

I also printed a small number of physical copies, and might add a way to order one if anyone is interested in reading it in bed :) I admit that I did not imagine holding a physical copy of my own book would feel so satisfying!

Below is a quick excerpt:

“How can modern large language models perform simple computations, like sorting a list, searching for a sequence, or adding two numbers in decimal representation?”

Surprisingly, several years and thousands of papers after the Transformer was invented, it is still hard to find a satisfying answer online to this seemingly simple question. Being a firm believer in Richard Feynman’s words “What I cannot create, I do not understand”, I decided to answer it myself.

In fact, I decided to choose a set of simple programs, beyond those above, and implement all of them by hand on top of a production-grade decoder-only Transformer. These programs included the canonical printing of the fixed string “Hello World!”, looking up values in a fixed memorized lookup-table, searching for an input pattern in a given longer sequence, sorting a list of numbers, and of course, decimal addition. Luckily, with some free holiday time, a few days later I found myself with several Python Colabs implementing these programs. While my curiosity was appeased, in the hope that others might find the work interesting, I have since dedicated some free weekend time to compiling the work into a short booklet. This manuscript is the result.
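To give a taste of what “setting the weights by hand” means, here is a minimal, hypothetical sketch in Python/NumPy (not the book’s actual construction): the fixed string “Hello World!” is emitted by a single hand-set weight matrix acting as a position-indexed lookup table, with no training involved.

import numpy as np

# Hypothetical toy sketch: emit a fixed string with hand-set weights.
# A one-hot "position embedding" times a hand-written matrix W acts as
# a lookup table from position to output token.
text = "Hello World!"
vocab = sorted(set(text))
tok = {c: i for i, c in enumerate(vocab)}
V, T = len(vocab), len(text)

W = np.zeros((T, V))          # W[p, j] = 1 means "at position p, emit token j"
for p, ch in enumerate(text):
    W[p, tok[ch]] = 1.0

def step(position):
    pos_embedding = np.eye(T)[position]   # one-hot position embedding
    logits = pos_embedding @ W            # the hand-set weights do all the work
    return int(np.argmax(logits))         # greedy decoding

print("".join(vocab[step(p)] for p in range(T)))   # -> Hello World!

The real programs in the booklet of course have to work through attention layers, MLPs, and residual streams, but the spirit is the same: every matrix is written down by hand rather than learned.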

I found this problem interesting for two reasons. First, creating these Transformer programs by hand was an extremely fun set of puzzles. The intellectual challenge felt similar to programming in an esoteric programming language, like Befunge, BCL, Brainfuck, Chef, INTERCAL, Malbolge, Piet, and others. Programming a Transformer manually, as is done in this work, amounts to writing in an esoteric programming language made mostly of a particular set of linear algebra operations. As a matter of fact, if you feel especially bored, Chapter 8 includes a simple interpreter for this esoteric language… but notice that most modern AI systems are such interpreters themselves! Which brings me to the second, and more important, reason.
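For readers curious about what such an “interpreter” boils down to, below is a rough, simplified sketch of a single decoder-only Transformer block in NumPy. The single head, ReLU MLP, missing normalization, and shapes are simplifying assumptions made here for brevity; the interpreter in Chapter 8 of the book is the complete version.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_block(X, Wq, Wk, Wv, Wo, W1, W2):
    # X: (seq_len, d_model); all W* are weight matrices we set by hand.
    T, _ = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # single-head projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    causal_mask = np.triu(np.full((T, T), -np.inf), k=1)
    A = softmax(scores + causal_mask)             # causal attention weights
    X = X + A @ V @ Wo                            # attention + residual
    X = X + np.maximum(X @ W1, 0.0) @ W2          # ReLU MLP + residual
    return X

# Example with arbitrary weights, just to show the shapes involved:
rng = np.random.default_rng(0)
d, d_ff, T = 8, 16, 5
X = rng.normal(size=(T, d))
Ws = [rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
print(decoder_block(X, *Ws).shape)   # (5, 8)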

The second reason this problem interested me is that this same esoteric programming language also happens to power one of the most popular “computers” in the world, namely the Transformer architecture behind modern AI systems. As of today, Transformers are amongst the most important backbones of AI systems, and understanding better how they work, and especially what is hard for them, is key to improving them. In spite of a tremendous amount of AI research since its invention, the Transformer architecture has proved extremely robust, with very few changes being adopted in practice. We’ll soon see that while many of the design choices made by the Transformer’s creators make things harder for our human brains, they don’t pose any issue for the optimization procedure or the Transformer architecture. The most interesting cases, though, are those that are hard both for us humans and for the optimizer or the architecture; understanding these better might be key to creating better AI systems.

Feel free to use the following if you’d like to cite the book:

@misc{Leviathan_2022,
  title={The Art of Transformer Programming},
  url={https://yanivle.github.io/taotp.html},
  journal={Yaniv Leviathan’s Blog},
  author={Leviathan, Yaniv},
  year={2022}
}


