Defensive programming and the rise of the machines
High-level and low-level languages
Historically, there has been a split between low-level languages (machine code and assembler) that were designed to be easy for machines to understand, and high-level languages (everything else) that were designed to be easier for humans to work with. Of course, it’s not as black and white as that, with a range of approachability on the “human” side, but it’s a reasonable summary.
For example, picking a small selection:
[low-level] machine code < assembler < C < C++ < C# < Python < R [high-level]
Like it or not, we are now in an era where large language models (LLMs) are being trained on the world’s codebases, the majority of which are in high-level languages of some sort. And, while I don’t think software developers have too much to fear from the current state of the art in terms of job security, it’s clear that some sort of LLM tooling support for software engineering is the new normal.
This post is a thought experiment: if we were to design a new programming language that was particularly well-suited to automatic code generation and maintenance by generative machine learning models, what would it look like?
This post is about:
- language features that will make it easier for LLMs to write “good” code
It is not about:
- whether it is good/right to use LLMs for writing code
- compatibility with legacy codebases
The inevitability of hallucination
In some ways, a suitable lens for looking at the code generated by an LLM is that, even when presented as code in a specific language and framework, it is more like very high-level pseudocode. After all, all the model can do is produce the statistically most likely response given the training data it has seen. It might contain references to libraries or language features that don’t exist. It might create syntax errors. Although the models are rapidly getting better, these inconsistencies are artefacts of their underlying architecture.
Accommodating such “hallucinations” effectively is an engineering challenge, like any other. In fact, in the domain of software development we have many tools at our disposal to cope with these imperfections, which wouldn’t work in more general generative text scenarios.
Keeping tests with the code under test
There is a lot to be said for keeping tests with the code being tested, at least for simple unit tests. It follows the sound principle that things which change together should live close together in the codebase.
Dlang example
The D programming language has first-class support for tests alongside code. Let’s borrow their example:
class Sum
{
    int add(int x, int y) { return x + y; }

    unittest
    {
        Sum sum = new Sum;
        assert(sum.add(3,4) == 7);
        assert(sum.add(-2,0) == -2);
    }
}
Python example
A similar concept is Python’s doctest, where simple tests are embedded in the documentation to act as examples, but can also be run as tests:
class Sum:
    def add(self, x, y):
        """
        Adds two integers x and y.

        >>> sum_instance = Sum()
        >>> sum_instance.add(3, 4)
        7
        >>> sum_instance.add(-2, 0)
        -2
        """
        return x + y

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Rust example
I’m no Rust expert, but “the book” is very specific (emphasis mine) in its section on test organisation:
The purpose of unit tests is to test each unit of code in isolation from the rest of the code to quickly pinpoint where code is and isn’t working as expected. You’ll put unit tests in the src directory in each file with the code that they’re testing. The convention is to create a module named tests in each file to contain the test functions and to annotate the module with cfg(test).
So the above example in Rust might look like this:
pub struct Sum;

impl Sum {
    pub fn add(&self, x: i32, y: i32) -> i32 {
        x + y
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_add() {
        let sum = Sum;
        assert_eq!(sum.add(3, 4), 7);
        assert_eq!(sum.add(-2, 0), -2);
    }
}
Caveats
There might be some downsides, such as when the production or redistributable form of the code deliberately excludes its tests. Let’s luxuriate in the hypothetical nature of this thought experiment, and mandate that this is a solved problem in our LLM coding language.
An alternative that is nearly as good is to have a parallel directory structure for tests that mirrors the files under test one to one. This means that there is one place to look for the corresponding tests for any piece of code.
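As a sketch of what that might look like in Python (the module and path names are hypothetical), a module at src/calculator/sum.py would be paired with a test module at tests/calculator/test_sum.py that mirrors its path exactly:

# tests/calculator/test_sum.py -- mirrors src/calculator/sum.py one to one
from calculator.sum import Sum

def test_add():
    calculator = Sum()
    assert calculator.add(3, 4) == 7
    assert calculator.add(-2, 0) == -2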
This only covers unit testing. Other types of test – integration tests, performance tests, end-to-end tests, etc. – might reasonably live at a greater distance from the code under test, although there’s still something to be said for making them as specific as possible (e.g. clearly related to a specific submodule if that is the focus of the test).
The case for declarative test declarations
TODO Frameworks such as Cucumber(?) and Jest (JavaScript) let you state what you are testing in plain language before you test it. It seems reasonable that this extra context would be useful to generative models.
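A rough Python analogue (the module and test names are hypothetical) is to spell out the behaviour in the test names and docstrings, so that a plain-language description sits directly above the assertions it corresponds to:

# test_sum_behaviour.py -- behaviour-style tests, run with pytest
from calculator.sum import Sum

def test_adding_two_positive_integers_returns_their_total():
    """Given two positive integers, add() returns their arithmetic sum."""
    assert Sum().add(3, 4) == 7

def test_adding_zero_leaves_a_negative_number_unchanged():
    """Adding zero to a negative number returns that number unchanged."""
    assert Sum().add(-2, 0) == -2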
Code contracts
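A minimal sketch of the idea in Python (the function and its arguments are purely illustrative), using plain assertions to express pre- and postconditions that any caller, human or machine, has to satisfy:

def withdraw(balance: int, amount: int) -> int:
    """Return the balance remaining after withdrawing amount."""
    # Preconditions: the caller must supply a sensible amount.
    assert amount >= 0, "amount must be non-negative"
    assert amount <= balance, "amount must not exceed the balance"

    new_balance = balance - amount

    # Postcondition: the books still balance.
    assert new_balance + amount == balance
    return new_balance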
Checking style / consistency
Type checking / compiling
Various languages aspire to the property of “if it compiles, it works”:
- Elm
- Haskell
- Rust
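Even without going that far, gradual typing gives an automated check that generated code has to pass. A minimal Python sketch (checked with a static type checker such as mypy; the names are illustrative):

def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

average: float = mean([1.0, 2.0, 3.0])   # accepted by the type checker
# label: str = mean([1.0, 2.0, 3.0])     # rejected: float is not a str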
Defensive programming
- Unpythonic
- Immutability
- Dependency injection?
- Using assertions
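A minimal Python sketch touching on some of these points (the class and method names are illustrative): an immutable value object, a collaborator that is injected rather than constructed internally, and an assertion guarding an assumption:

from dataclasses import dataclass

@dataclass(frozen=True)  # immutability: instances cannot be modified after creation
class Order:
    item: str
    quantity: int

class OrderService:
    def __init__(self, repository):
        # Dependency injection: the storage collaborator is passed in rather
        # than created here, which keeps the class easy to test in isolation.
        self._repository = repository

    def place(self, order: Order) -> None:
        assert order.quantity > 0, "quantity must be positive"  # defensive assertion
        self._repository.save(order)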
Plausible development cycle
TODO – maybe not suitable for this article
- Write a failing test
- Run the test; confirm it fails
- Write the code necessary for the test to pass
- Run the test; confirm it passes
- Run the full test suite; check everything passes
- Check in the change to change management
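By way of illustration (the file and class names are hypothetical), the first and third steps might look like this in Python, with the implementation written only once the test has been seen to fail:

# Step 1: write a failing test first (run with pytest).
def test_add():
    from calculator.sum import Sum  # fails until the module exists
    assert Sum().add(3, 4) == 7

# Step 3: the smallest implementation that makes the test pass,
# which would live in calculator/sum.py.
class Sum:
    def add(self, x: int, y: int) -> int:
        return x + y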
Software development as a sequence of commits
TODO move me
LLMs are particularly suited to tasks that are challenging to complete, but easy to verify. What strategies can we adopt in a low-trust environment to check that proposed code changes are valid?
A couple of things spring to mind that should help:
- making good use of a change management tool, such as Git
- test-driven development (TDD) and, as a consequence, high test coverage
- code that changes together should live together (including tests?)
- preferring an approach of “small pieces, loosely joined”
- code contracts
- a verifiable record of the functional requirements
The combination of the first two of these points leads to a codebase that is always close to a “known working” version.
Tooling support:
- lint
- type checking
- a compiler (for compiled code)
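A minimal sketch of such a verification gate in Python (the specific tools are illustrative; it assumes ruff, mypy and pytest are installed):

# check.py -- run the local verification gate over a proposed change
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],  # lint / style consistency
    ["mypy", "."],           # static type checking
    ["pytest"],              # the full test suite
]

def main() -> int:
    for command in CHECKS:
        print("Running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode  # reject the change at the first failure
    return 0

if __name__ == "__main__":
    sys.exit(main())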
How can we write code that is as easy as possible for an LLM to modify?
- clear, unambiguous naming
- things that change together should live together
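For example (a hypothetical before-and-after in Python), clear naming carries intent that a model can pick up on and preserve when it makes changes:

# Harder for a model (or a human) to modify safely:
def proc(d, n):
    return [x for x in d if x > n]

# Easier: the names state the intent.
def filter_values_above_threshold(values: list[float], threshold: float) -> list[float]:
    return [value for value in values if value > threshold]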
Third-party libraries – make it someone else’s problem. Code that isn’t there is the cheapest to maintain.
Of course, in practice, no one language will be “the best” for this thought experiment, just as no language is the best for human-only software development. There are many parameters in play and trade-offs to be considered.
Conclusion
It turns out that what is good for machines in this context is also good for humans.
Of course, you might prefer the route to greater job security by choosing to do the reverse of the suggestions presented here. As a bonus, your colleagues will find your code hard to work with too: double the job security. ’Twas ever thus.
Other aspects:
- beauty
- refactoring
- security
- maintainability
- efficiency
- accessibility
- readability
Git
- Mountain climbing
- Bisect
I hope this does not come across as a silver-bullet, tech-bro, “throw more tools at the problem” post. The reality is subtler than that. There are things that generative models can do very well, and developers who adapt to this reality will be more productive than those who don’t. But, as with all tools, large language models need care and skill to get the best out of them.
Many engineers are not particularly good at expressing themselves in natural language. When the cycle is “thing in the programmer’s brain” -> “natural language” -> “model-generated code”, any problems at the first interface will be amplified at the second.
Title
>>> answer = 42
>>> print(f"{answer=}")
answer=42