Genetic Improvement: the Story so far

This blog post is based on a seminar given to the Department of Computer Science at the University of Manchester in April 2016; it also builds on the ideas and talks of many fellow academics, who I acknowledge at the end of the article.

Currently, my colleagues and I are working in the area of “Genetic Improvement” (GI), an unfortunate moniker for a relatively new method of applying search to improve existing software. It emerged from the field of Genetic Programming (GP), which uses search to create software from scratch, usually by means of an evolutionary algorithm.

Traditionally, GP is a heuristic search method for generating programs, loosely inspired by biology and following the principles of Darwinian evolution. GP starts from scratch to create a program (or more likely, a symbolic expression in the form of a tree) that solves a problem encapsulated by a fitness function. In contrast, GI offers a different slant, taking human-written source code as its starting point and improving it, either by repairing the software (bug-fixing) or by optimising non-functional properties such as execution time, memory usage, or power consumption.

This post (a rather long one, be warned) outlines where Genetic Programming, and thus GI, came from; covers some of the recent developments and achievements in the field; and discusses the controversies surrounding it.

That Name

The name “Genetic Improvement” is an unfortunate one, as it conjures up connotations of biology and even eugenics, something of a departure for what is essentially a programming tool! Its origins in the field of “Genetic Programming” led to a natural etymological transition from “GP” to “GI”, which has made for a catchy acronym but more than a little confusion; this confusion is exacerbated when researchers try to pin down just exactly what the heck GI even is. On this subject I will say no more other than to offer an alternative and self-explanatory title:

GI == Software Improvement through Search

It’s worth noting that GP traditionally uses evolutionary search, but both GP and GI often involve other forms of search such as hill-climbing.

Prologue: Alan Turing, Computers and Programmers

Alan Turing, just prior to WWII

As discussed by GP pioneer Richard Forsyth in his EuroGP 2016 presentation, a good heuristic for judging the quality of a new idea is to check whether Alan Turing thought of it already. So it was with the idea of using some type of evolutionary process to generate programs: his thoughts on the topic can be found in Intelligent Machinery (1948) and Computing Machinery and Intelligence (1950), the latter containing the following quote:

“. . . the learning process may be regarded as a search for a form of behaviour which will satisfy the teacher (or some other criterion). Since there is probably a very large number of satisfactory solutions the random method seems to be better than the systematic. It should be noticed that it is used in the analogous process of evolution.”

Turing was discussing how to teach, train, or program the new-fangled computing machines, and provided a wonderful insight by making this comparison with biology.

Jump forward to modern-day programming practice, and we may be surprised to find that evolution is not implementing our software. Code is instead being written by millions of human programmers sat at terminals, typing away and experiencing the highs of elegant solutions and the lows of infuriating bug-chasing. As much as I enjoy programming (and that’s a statement that has to be qualified in a great number of ways), I can’t help but wonder whether this is a good use of those programmers’ time.

We can draw an analogy with the past by examining an illuminating quote that I stole directly from Wikipedia. It was written in the New York Times obituary of Sears Cook Walker, an astronomer who died in 1853:

“Mr. Walker was widely known as an accomplished Astronomer and a skillful Computer.”

A human computer! It may or may not be news to you that the word “computer” originally referred to a human carrying out repetitive arithmetic.

We may speculate: is the word “programmer” fated to become synonymous with a programming machine, in the same way that the word “computer” already has? Will today’s human programmers one day be just as obsolete: an esoteric historical anomaly from the era before the automated programming of machines?

The Story of Genetic Programming

Turing’s idea lay dormant until the birth of Genetic Algorithms (GAs) in the 1960s and 1970s. Actually, that’s untrue: as with most technological developments, things proceed incrementally, and the people we recognise as pioneers are those who popularised, combined, or refined existing ideas. There are exceptions (James Watt’s incredible innovations in steam power, for example), but on the whole, research proceeds in an incremental, messy, stop-start manner.

Putting that observation aside, and simplifying for the sake of a more graceful story, GAs materialised in the 1960s, and it was John Henry Holland in his book “Adaptation in Natural and Artificial Systems” who did most for their development and adoption.

A Canonical Genetic Algorithm, operating on a population of binary strings

GAs (as illustrated in the above diagram) evolve a population of binary strings. Each string encodes “decision variables” for an optimisation problem, such as the design of a physical artefact, a strategy for a game, or a suite of inputs for testing an application.

GAs use crossover and mutation to explore the space of decision variables: crossover is the splicing and recombination of strings that have been demonstrated to be effective in (at least partially) solving a problem; mutation is a simple process of bit flipping. “Parents” are chosen from the population by their fitness, and the generated offspring replace the incumbents.
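
To make those moving parts concrete, here is a minimal GA sketch in C. The parameter values, the binary tournament selection, and the OneMax-style fitness function (just count the 1-bits) are placeholder assumptions of mine, not canonical choices:

#include <stdio.h>
#include <stdlib.h>

#define POP  50   /* population size (arbitrary) */
#define LEN  32   /* bits per string (arbitrary) */
#define GENS 100  /* number of generations */

/* Placeholder fitness: count the 1-bits (OneMax). A real problem
   would first decode the string into decision variables. */
int fitness(const unsigned char *s) {
  int f = 0;
  for (int i = 0; i < LEN; i++) f += s[i];
  return f;
}

/* Binary tournament: pick two strings at random, keep the fitter. */
const unsigned char *tournament(unsigned char pop[POP][LEN]) {
  const unsigned char *a = pop[rand() % POP];
  const unsigned char *b = pop[rand() % POP];
  return fitness(a) >= fitness(b) ? a : b;
}

int main(void) {
  unsigned char pop[POP][LEN], next[POP][LEN];

  /* The forgotten stage: initialise with random strings. */
  for (int k = 0; k < POP; k++)
    for (int i = 0; i < LEN; i++)
      pop[k][i] = rand() % 2;

  for (int g = 0; g < GENS; g++) {
    for (int k = 0; k < POP; k++) {
      const unsigned char *mum = tournament(pop);
      const unsigned char *dad = tournament(pop);
      int xo = rand() % LEN;                /* one-point crossover */
      for (int i = 0; i < LEN; i++) {
        next[k][i] = (i < xo) ? mum[i] : dad[i];
        if (rand() % LEN == 0)              /* occasional bit-flip mutation */
          next[k][i] ^= 1;
      }
    }
    for (int k = 0; k < POP; k++)           /* offspring replace incumbents */
      for (int i = 0; i < LEN; i++)
        pop[k][i] = next[k][i];
  }

  int best = 0;
  for (int k = 0; k < POP; k++)
    if (fitness(pop[k]) > best) best = fitness(pop[k]);
  printf("best fitness after %d generations: %d/%d\n", GENS, best, LEN);
  return 0;
}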

All of this, as described by the brilliant Professor Susan Stepney (University of York – my internal examiner there and an inspirational academic), is very much the “Ladybird book” interpretation of the biological reality. Molecular Biology is much more complex than previously imagined, and yet GAs have largely retained their highly simplified and abstracted form.

In this canonical diagram of a GA, one crucial step is omitted: the creation of the initial population. This omission is deliberate on my part here, but look elsewhere on the web and you will often find initialisation is the forgotten stage of the process. Typically, initialisation involves the creation of a set of random binary strings, which appears counter-intuitive; when beginning a search for a solution, we know from experience that the point of departure for that search is important. For example, we do not begin looking for our house keys by searching our fridge. If you do, you should probably stop putting your belongings in the fridge.

So, back to the history: where did GP come from?

Richard Forsyth, GP Pioneer. Photo courtesy of Richard.

From the GA community, of course! GA -> GP -> GI, so we’re at the very least consistent when naming our algorithms. Once again simplifying the story: in 1981, Richard Forsyth developed (in Pascal!) a tool called BEAGLE, which used a GA operating on a population of binary trees to perform pattern recognition. Crossover is implemented by exchanging subtrees between expressions (see the figure below); mutation is similar, except that we first generate a random subtree to “exchange”.

Crossover in Genetic Programming: subtrees are exchanged between two fit parent individuals to generate two offspring.
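
In code, the operation in the figure might look something like the following C sketch. The node structure and the uniform random choice of crossover points are simplifying assumptions of mine; real GP frameworks add depth limits, type constraints, and biased node selection:

#include <stdlib.h>

/* A bare-bones binary expression tree node. */
typedef struct node {
  char op;                    /* operator or terminal label */
  struct node *left, *right;  /* NULL for terminals */
} node;

/* Count the nodes in a tree, so that we can index them uniformly. */
int count(node *t) {
  return t ? 1 + count(t->left) + count(t->right) : 0;
}

/* Return the i-th node of a pre-order walk. */
node *nth(node *t, int *i) {
  if (!t) return NULL;
  if ((*i)-- == 0) return t;
  node *found = nth(t->left, i);
  return found ? found : nth(t->right, i);
}

/* Crossover: pick a random subtree in each parent and swap them.
   Swapping node contents (rather than parent pointers) keeps the
   sketch short and handles the root case for free. Mutation is the
   same operation, but with a freshly generated random subtree in
   place of the second parent. */
void crossover(node *mum, node *dad) {
  int i = rand() % count(mum);
  int j = rand() % count(dad);
  node *x = nth(mum, &i), *y = nth(dad, &j);
  node tmp = *x;
  *x = *y;
  *y = tmp;
}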

In the 1980s, Nichael Cramer also published work on evolving symbolic expressions, and GP reached its popularising, conference-founding apogee when John Koza (pictured below) authored four books on the subject. The books are well-known for their robust and heavyweight construction; this, together with the sheer number of example applications Koza offered as evidence of the generality of the GP approach, immediately evokes the image of Koza bludgeoning critics into submission with these weighty tomes of practical experience.

John Koza, GP Pioneer. Photo courtesy of John.

Koza’s books have become the de facto standard for many decisions in GP: the choices in his books were incorporated into GP frameworks, and those frameworks influenced others, which in turn shaped the practice and perceptions of researchers in the field. It is interesting to reflect on the sheer weight of these cognitive biases: why are trees mostly used? Why do we not use graphs instead? Why do we employ conventional programming languages, rather than alternatives with properties that may be less friendly to humans but more amenable to evolutionary search? Note how the origins and early formation of a field can introduce hidden assumptions that live on long after they can be justified, or even recognised as assumptions.

Canonical GP Algorithm (with apologies for the graphic design)

The canonical Genetic Programming algorithm looks remarkably similar to a GA. This is no coincidence. And again, initialisation is the neglected member of the GP process: most frameworks use an initialisation method invented by Koza, intended to sample the search space in a way that produces a variety of tree shapes and sizes, without any clear justification as to why this should be the preferred method. But here we are.
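
For the curious, here is a rough C sketch of that method, commonly known as “ramped half-and-half”. It reuses the node struct from the crossover sketch above, and the function and terminal sets are placeholders of mine:

#include <stdlib.h>

/* Uses the node struct defined in the crossover sketch above. */
const char FUNCS[] = "+-*";  /* placeholder binary operators */
const char TERMS[] = "xy1";  /* placeholder variables and constants */

/* "full" grows every branch to the target depth; "grow" may stop a
   branch early by choosing a terminal (here with probability 1/2). */
node *tree(int depth, int full) {
  node *n = malloc(sizeof(node));
  if (depth == 0 || (!full && rand() % 2)) {
    n->op = TERMS[rand() % 3];
    n->left = n->right = NULL;
  } else {
    n->op = FUNCS[rand() % 3];
    n->left  = tree(depth - 1, full);
    n->right = tree(depth - 1, full);
  }
  return n;
}

/* Ramp the depth from 2 up to max_depth, building half the trees at
   each depth with "full" and half with "grow" -- hence the name. */
void initialise(node **pop, int pop_size, int max_depth) {
  for (int i = 0; i < pop_size; i++)
    pop[i] = tree(2 + i % (max_depth - 1), i % 2);
}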

A Cambrian Explosion

It is possibly the most tired metaphor in evolutionary computation, if not the whole of popular science writing, but the 1990s saw a rapid development of many new forms of GP: LinearGP promised the evolution of straight-line machine code (no trees!); Whigham suggested the use of grammars to structure the search space, an idea that led to a whole branch (pun intended) of GP now known as “Grammatical Evolution”, or GE; Montana introduced strong typing, one of the few developments to see widespread adoption; and Miller suggested stepping away from the world of trees and evolving generalised recursively connected graphs, an idea that has yet to garner the full attention it deserves.

The State of Play

So, are we done yet? Is GP rapidly replacing human programmers, gradually relegating them to the role of historical anachronism? Well, evidently not. With a lack of theoretical grounding (despite some impressive work from the likes of Poli, Langdon, and McPhee), it became difficult to separate the good from the bad amongst the plethora of new ideas emerging from the field. In addition, empirical results failed to scale: witness the fact that nowadays GP is typically used to evolve only small expressions, usually in a language that is not Turing complete, with the results subsequently incorporated into a larger, human-written system.

We see this point played out in two glaringly obvious ways. The first is the NASA antenna: talk to any practitioner in GP for long enough, ask for proof of its usefulness, and they will inevitably bring up the automated design of an antenna for NASA, a component that was actually used on an experimental space mission in 2006.

Antenna evolved at NASA, sent into orbit in 2006

As great an example as this is (and how wonderful to see the first artificially evolved artefact in orbit!), it does not look like the result of anything we would consider conventional programming.

The second is the Humies: among the most prestigious awards related to GP are the Humie (Human-Competitive) Awards held at the GECCO conference, where financial prizes are given to work that is demonstrably competitive with, or superior to, results generated by humans. Thus, GP has ceased to be considered a method of programming and is instead viewed as an invention machine, that is, a mechanism by which solutions outside of conventional consideration are produced.

The Rise of Genetic Improvement

To add a personal aside: in 2005 I was backpacking around the world after quitting my job in industry. People told me it was time to stop being a gadabout and to do something with my life. I disagreed, and instead I enrolled for a PhD.

At the same time, GP was becoming viewed more as an “invention machine” and less as a programming tool, and it declined in favour, entering the downward curve of the Gartner hype cycle. It seemed that most of the good ideas had been had, scalability was still a dream, and there were few new discoveries left to be made in the GP field; the gold rush was over.

What an opportune moment then, to commence a PhD in the subject. I joined the EPSRC-funded SEBASE project, and found myself as one of two PhD students working in GP, the other being Andrea Arcuri, then at the University of Birmingham. Andrea was tasked with developing the notion of applying GP to fix bugs, whilst I was assigned to explore the possibility of using GP to control the non-functional properties of software.

After an initial meeting (probably in a pub – most of our meetings were), we discussed our work and decided to collaborate. We literally integrated our two distinct Java systems to produce one large framework that would optimise the execution time of some simple programs running on a processor simulator. We took my simulation knowledge, the ECJ framework, and Andrea’s expertise in testing, and went to work.

Our key insight was: surely it’s got to be easier to optimise an existing program than to create one from scratch, right? I mean, why would you try to run before you could walk? And why did GP even begin that way in the first place?

My opinion on this matter is that GP emerged from the AI community, where open-ended learning and novelty were more important than generating the types of programs you might find in the commercial world of software development. Again: a cultural bias can push an entire field in a single direction. Who knows how differently (for better or worse) things would have turned out if Richard Forsyth had set out to optimise existing Fortran code, rather than evolve new symbolic expression trees?

As with any idea, we quickly found prior art: Conor Ryan’s work on evolving concurrent versions of sequential programs, and particularly his 2000 book on the subject, is the earliest work we consider to be in the same vein.

We pressed on and published Paper “0” at the SEAL conference, followed by a more detailed paper in IEEE Transactions on Evolutionary Computation (which suffered a long gestation period and was almost rejected, incidentally). Paper 0 was not our finest hour, but it got the idea out there, and the second paper made less of a ham-fisted job of explaining what the heck we were talking about.

However, Paper 0 was still the first recognisable GI work, in that it used a search algorithm (Genetic Programming) to search for improvements to existing source code, using test cases to ensure that the output exhibited the desired semantics.

In the journal paper, we experimented with a bunch of ideas that we thought might work, and tested them on a set of toy problems, programs too small to be useful. However, the optimisations we found were promising…

One example is that GP would often remove the extra counter in a for loop such that this code:

for (int i = 0; i < length; i++)

became this optimised code:

for (; 1 < length; length--)...

Simple, but neat.

Changing the base case for factorial from n <= 0 to n <= 1 was a similar minor tweak.
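
In code, the tweak looks like this (my reconstruction, not the exact benchmark source); testing n <= 1 skips the final, redundant recursive call (1 * factorial(0)) without changing the result for non-negative inputs:

int factorial(int n) {
  if (n <= 1) return 1;  /* was: if (n <= 0) return 1; */
  return n * factorial(n - 1);
}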

We decided to deliberately challenge the system with:

int swi10(int c) {
  int i;
  for (i = 0; i < 10; i++) {
    switch (i) {
      case 0: c++; break;
      case 1: c++; break;
      case 2: c++; break;
      case 3: c++; break;
      case 4: c++; break;
      case 5: c++; break;
      case 6: c++; break;
      case 7: c++; break;
      case 8: c++; break;
      case 9: c++; break;
      default: c--; break;
    }
  }
  return c;
}

…and we were not disappointed when we were rewarded with the following output:

int swi10(int c) {
  return c + 10;
}

A more complex example was the optimisation of a simple program solving the “triangle classification problem”: given three integers representing the lengths of the sides of a triangle, classify that triangle as either equilateral, isosceles, scalene, or invalid (“that can’t possibly be a triangle!”).

The typical way to solve the problem is to sort the sides, then check each case in turn: is the sum of the two smallest less than or equal to the largest (invalid)? Are all three sides the same (equilateral)? Are two the same (isosceles)? Otherwise, the triangle is scalene.
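
Here is a C sketch of that textbook routine; the variable names and the return encoding are my own, not those of the benchmark program:

/* Returns 0 for invalid, 1 for scalene, 2 for isosceles, 3 for equilateral. */
int classify(int a, int b, int c) {
  int t;
  if (a > b) { t = a; a = b; b = t; }  /* sort so that a <= b <= c */
  if (b > c) { t = b; b = c; c = t; }
  if (a > b) { t = a; a = b; b = t; }
  if (a + b <= c)       return 0;      /* two smallest can't reach the largest */
  if (a == b && b == c) return 3;
  if (a == b || b == c) return 2;
  return 1;
}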

GP did its usual micro-optimisations, including avoiding copying from variable to variable where it could just use the first variable instead. It then ordered only two of the variables, and duplicated the subsequent computation, specialised by whether the ordering had resulted in a swap. This allowed it to eliminate a few comparisons (no need to check for equilateral if there’s been a swap, for example), and make the whole process more efficient. Certainly a non-trivial optimisation, and one that gave us hope.

Andrea and I wrapped up our PhDs, he working more on bug-fixing and I looking at trading off functionality against non-functional qualities (“your program is a little bit broken, but I’ve saved a lot of battery power!”, and so on), and then both of us headed to pastures new: Andrea to the Simula Research Laboratory in Norway, and I to the glorious, friendly, rain-drenched streets of Scotland to work at the University of Glasgow.

Whilst at Glasgow, I attended the PLDI conference, and met a few people from Martin Rinard’s group at MIT. We were working on a line of research in memory management similar to something the MIT group had done before, and they were very helpful. Little did I know that in a coincidental turn of events the same team would end up working on a (controversial) contribution to GI… more on that shortly.

The Arrival of GenProg

Whilst Andrea and I worked on other things, along came Stephanie Forrest, Wes Weimer, Claire Le Goues and ThanhVu Nguyen in the USA who, together with many collaborators, developed the bug-fixing ideas into the GenProg tool. GenProg was the moment that GI became big news, and it won several awards, including at the aforementioned Humies, at GECCO itself, and at ICSE, one of the top software engineering conferences.

GenProg was more mature and sophisticated than our approach, operating at the level of source code lines and generating patches, as well as using profiling to focus the search and improve the scalability of GI. It really made waves when it successfully fixed a date-calculation bug in the ill-fated Microsoft Zune product (essentially “a rubbish iPod”, although one audience member took offence at this flippant characterisation of mine when I gave a related talk earlier this year). The bug made the Zune lock up at midnight on New Year’s Eve 2008, and GenProg was able to find a repair, garnering much attention.

Whilst the American troupe were marching forward with their work, in London researchers at UCL, led by Bill Langdon and Mark Harman, began to further develop the non-functional optimisation work. They developed a GI project called GISMOE and wrote a keynote ASE paper, which they kindly invited me to co-author. They were awarded several EPSRC grants; all in all, it was an exciting time for GI.

Significant results in the optimisation of open-source software followed: in particular, Langdon et al. authored two papers that found impressive speed-ups. Their optimisations to the BarraCUDA DNA Analysis program were so effective that they were adopted in the main project, deployed, and even ported by IBM to a supercomputer.

These were heady heights for GI, and it was almost inevitable that these early successes would soon be met with some productive scepticism.

The GenProg Controversy

The MIT Team I mentioned earlier, led by Martin Rinard, became interested in the GenProg work. After all, it was quite a remarkable development — automatically generated patches for bugs in real-world software, with a decent success rate. It was almost too good to be true, right?

And as the saying goes, it was, at least to some extent. A paper and an extended technical report revealed some of the problems with the GenProg work. The most glaring error was a bug in a script that effectively led the tool to treat “return code 0” from a shell script as equivalent to “the program is correct”. I should add that I haven’t had the opportunity to review the code personally, so I’m presenting the literature as-is.
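
To illustrate the pitfall (a hypothetical reconstruction of mine, not GenProg’s actual harness): if patch acceptance hinges solely on the exit status of a test script, then any path that exits zero, an empty script included, looks like “all tests passed”:

#include <stdio.h>
#include <stdlib.h>

/* run_tests.sh is a hypothetical test script. The check below
   conflates "the script exited 0" with "the patched program is
   correct" -- precisely the reported failure mode. */
int patch_passes_tests(const char *patched_binary) {
  char cmd[256];
  snprintf(cmd, sizeof cmd, "./run_tests.sh %s", patched_binary);
  return system(cmd) == 0;
}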

More worrying, in my opinion, is that a manual examination of a large number of patches showed that many were not general: whilst they fixed the problematic test cases used to reproduce the bug, they either did not fully repair the problem or broke existing functionality in an undesirable way. The acceptability of a patch is a somewhat subjective matter, but as published the results weren’t as promising as initially interpreted:

“Because of experimental error, the majority of the reported patches . . . do not produce correct outputs even for the inputs in the test suite . . . the overwhelming majority of the accepted patches are not correct and are equivalent to a single modification that simply deletes functionality.”

Qi et al. [2015]

Furthermore, Qi and his collaborators at MIT noticed a pattern: many of the patches were simply the removal of lines of code, rather than more complex exchange and relocation of source. This observation has also been made in some of the non-functional optimisation work.

So how does a community react to such criticism, aimed at the poster boy of the field? I think there were four main classes of reaction to these developments:

  1. Further examination of existing results. This was helpful and sensible, and is ongoing.
  2. A retreat to non-functional properties: as we have an oracle in the form of the existing program, we can generate as many test cases as we like, increasing our confidence in the preservation of existing functionality. This allows us to ameliorate the risk of somehow “breaking” the functionality of the original program (see the sketch after this list).
  3. A retreat to something more conservative, namely Deep Parameter Tuning: modifying variables within source code that do not appear to change functional behaviour, in an attempt to reduce execution time. There is only a single paper on this subject, but there are many academics now working on this idea.
  4. Incorporating work from other fields: this is perhaps the most positive outcome. Members of the MIT group attended a workshop at UCL on GI in January 2016, and the exchange of ideas and presentation of techniques from the related fields of program synthesis and transformation are now percolating through the GI community.
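
Here is the sketch promised in point 2, in C. The function-pointer interface and the input domain are placeholder assumptions of mine; the point is that the original program itself supplies the expected outputs:

#include <stdlib.h>

/* Accept a variant only if its outputs agree with the original's on
   a batch of freshly generated random inputs. Agreement on sampled
   inputs gives confidence, not proof, of preserved functionality. */
int variant_agrees(int (*original)(int), int (*variant)(int), int trials) {
  for (int i = 0; i < trials; i++) {
    int input = rand() % 1000;  /* arbitrary placeholder input domain */
    if (original(input) != variant(input))
      return 0;                 /* behavioural difference found */
  }
  return 1;
}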

All of this is well and good, but what happened to the vision of programming proposed by our friend Alan Turing? Where is his prophecy of evolutionary programming, in this period of retreat, re-examination, and soul-searching?

More worryingly, what does it really mean for a program to be correct? Consider that we have positive and negative test cases for a bug. We then generate a new input — which class of test cases does it belong to? We cannot be certain. Real-world software rarely contains specifications, contracts, or assertions: we must deal with the way the world is, even if it looks hard.

So, what now?

Firstly: an observation. The space of programs is big. Really big. The space of program transformations is also large, and both can be technically infinite, depending on our assumptions. Without a specification, and hence a constructive route to a solution, we must sample from this space. That sampling process is effectively search, even if you don’t necessarily call it that (e.g. SAT solving is also search). So I believe we are working along the right track.

Consider also that many of the results of GI still stand: GenProg did fix the Microsoft Zune bug, GI did optimise the triangle problem in a clever manner, and it did reduce the execution time of the programs Langdon et al. worked on. We should be careful not to abandon an idea simply because it falls short of our expectations in some cases.

Formal approaches to program synthesis and repair are limited in the types and sizes of programs they can generate, and in the changes they can make. A search-based approach is, in my opinion, currently the most promising line of attack.

But, having said that, I believe two major challenges face us:

  • How do we scale our methods to large programs, large (multi-line, multi-location) bugs, and complex optimisations?
  • How do we generate or optimise a program such that it has the desired semantics?

If I knew the answers, I’d be collecting my Turing Award. But in the meantime, here are some ideas to substitute for solid results.

Scaling

Ultimately, I believe the problem of scaling program synthesis and manipulation is about recursive structures. Trees in GP are mostly a historical accident, and we should abandon them! In a similar vein, in the 1990s GP researchers attempted to retrofit reuse into the tree structure, through “automatically defined functions” and “module acquisition”, but these methods relied on somehow delegating the process to evolution in a way that wasn’t entirely convincing.

Imagine we need to repair a bug that appears in the logic of a program in multiple locations. Through a recursive representation of the source code, it is much easier to search the space of multiple related simultaneous changes to the program.

Next, we need to match our search representation and methods to the problem at hand. It is not obvious that a patch-space, a space defined by humans for concisely representing their changes to source code, is a good space to search, and yet this decision has not been closely cross-examined. Allowing a combination of fine-grained and coarse-grained program manipulation will open up more possibilities for search to exploit.

Finally, we should prune the search space wherever possible using more formal techniques borrowed from the fields of traditional program synthesis and repair.

The Problem of Correctness

The first stage is to move away from syntactic manipulation. We cannot expect to preserve or create the correct behaviour by examining syntax alone.

What’s more, we should also consider the how as well as the what: that is, the process by which semantics are encapsulated in a program, as well as the desirability of those semantics.

To do this, we must abandon the long-held “black-box” view of Genetic Programming: using static and dynamic analysis, we must instead develop a picture of the computational process being enacted by the program under test. For example, potential solutions should be tested individually, using custom test data for every patched program. Similarly, we could use invariant-generation tools such as Daikon, and more formalised specification-deriving methods, to construct an abstract description of the computation as it executes.
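
As an illustration of the invariant idea (and emphatically not of Daikon’s actual interface), consider the swi10 example from earlier. A dynamic analysis might conjecture the likely invariant f(c) == c + 10 from traced executions, and we could then demand that any candidate patch preserves it:

/* Check a conjectured invariant for swi10-like functions: the output
   should always be the input plus ten. A patch that breaks this on
   any probed input is rejected. */
int invariant_holds(int (*f)(int), int trials) {
  for (int c = 0; c < trials; c++)
    if (f(c) != c + 10) return 0;
  return 1;
}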

All of this comes at great computational cost, which means abandoning another assumption: that our methods must be cheap. Amazing results often have amazing costs, at least at first; witness the success of Deep Blue (sheer scale of processing power) and AlphaGo (the emergence of GPGPU) in making their respective breakthroughs. We must be prepared to develop much more sophisticated frameworks that are far more computationally demanding, if we wish to deal with large-scale real-world software; we can simplify and optimise our approaches after we make progress. That is not to say that our scientific work requires large computational power: answering scientific questions and applying the derived methods to large-scale software are two related but distinct problems.

Winning Back Turing

On a final note, we should take heart from the success of biological evolution: the evidence is all around us. Even a sceptic of artificial evolution would find it difficult to deny that biological evolution has proven phenomenally successful. Just as neural networks have had their winters, so evolutionary methods, GP included, are right now awaiting further breakthroughs. We should be careful not to let specific challenges and temporary setbacks discourage us in the development of our ideas.

In addition, we have a great advantage over biology: we may augment the evolutionary process in order to focus and accelerate its progress.

All opinions expressed here are my own.
Acknowledgements to Professor Mark Harman, Dr Bill Langdon, and Dr Justyna Petke for their inspiring discussions and presentations, particularly with regards to the future of programming. Thanks to Justyna and Bill for proof-reading this post, and all their helpful corrections and suggestions.