Genetic Improvement: the Story so far

This blog post is based on a seminar given to the Department of Computer Science at the University of Manchester in April 2016; it also builds on the ideas and talks of many fellow academics, who I acknowledge at the end of the article.

Currently, my colleagues and I are working in the area of “Genetic Improvement”, an unfortunate moniker for a relatively new method of applying search to improve existing software. It emerged from the field of Genetic Programming (GP), which uses search to create new software from scratch; GP usually employs an evolutionary algorithm to do so.

Traditionally, GP is a heuristic search method of generating programs, loosely inspired by biology and following the principles of Darwinian evolution. GP starts from scratch to create a program (or more likely, a symbolic expression in the form of a tree), in order to solve a problem encapsulated by a fitness function. In contrast, GI offers a different slant, taking as its starting point human-written source code and improving it, either by repairing the software (bug-fixing) or by optimising non-functional properties such as execution time, memory usage, or power consumption.

This post (a rather long one, be warned) outlines where Genetic Programming and thus GI came from, covers some of the recent developments and achievements in the field, and controversies surrounding it.

That Name

The name “Genetic Improvement” is an unfortunate one, as it conjures up connotations of biology and even eugenics, something of a departure for what is essentially a programming tool! Its origins in the field of “Genetic Programming” led to a natural etymological transition from “GP” to “GI”, which has made for a catchy acronym but more than a little confusion; this confusion is exacerbated when researchers try to pin down just exactly what the heck GI even is. On this subject I will say no more other than to offer an alternative and self-explanatory title:

GI == Software Improvement through Search

It’s worth noting that GP traditionally uses evolutionary search, but both GP and GI often involve other forms of search such as hill-climbing.

Prologue: Alan Turing, Computers and Programmers


Alan Turing, just prior to WWII

As discussed by GP pioneer Richard Forsyth in his EuroGP 2016 presentation, a good heuristic to judge the quality of a new idea is to check whether Alan Turing thought of it already. So it was with the idea of using some type of evolutionary process to generate programs, and his thoughts on this topic can be found in Intelligent Machinery (1948) and Computing Machinery and Intelligence (1950), the latter containing the following quote:

“. . . the learning process may be regarded as a search for a form of behaviour which will satisfy the teacher (or some other criterion). Since there is probably a very large number of satisfactory solutions the random method seems to be better than the systematic. It should be noticed that it is used in the analogous process of evolution.”

Turing was holding a discourse on how to teach, train, or program the new-fangled computing machines, and provided a wonderful insight by making this comparison with biology.

Jump forward to modern day programming practices, and we may be surprised to find that evolution is not implementing our software. Code is instead being written by millions of human programmers sat at terminals, typing away and experiencing the highs and lows of the elegant solutions and infuriating bug-chasing that such a pursuit entails. As much as I enjoy programming (and that’s a statement that has to be qualified in a great number of ways), I can’t help but wonder if this is a good use of those programmers’ time.

We can draw an analogy with the past by examining an illuminating quote that I stole directly from Wikipedia. It was written in the New York Times obituary of Sears Cook Walker, an astronomer who died in 1853:

“Mr. Walker was widely known as an accomplished Astronomer and a skillful Computer.”

A human computer! It may or may not be news to you that the word “computer” originally referred to a human carrying out repetitive arithmetic.

We may speculate: is the word “programmer” fated to become synonymous with a programming machine, in the same way that the word “computer” already has? Will today’s human programmers one day be just as obsolete: an esoteric historical anomaly from the days before the automated programming of machines?

The Story of Genetic Programming

Turing’s idea lay dormant until the birth of Genetic Algorithms (GAs) in the 1960s and 1970s. Actually, that’s untrue: as with most technological developments, things proceed incrementally, and the people we recognise as pioneers are those that popularised, combined, or refined existing ideas. There are exceptions — James Watt’s incredible innovations in steam power, for example — but on the whole, research proceeds in an incremental, messy, stop-start manner.

Putting that observation aside, and simplifying for the sake of a more graceful story, GAs materialised in the 1960s, and it was John Henry Holland in his book “Adaptation in Natural and Artificial Systems” who did most for their development and adoption.

Genetic Algorithm

A Canonical Genetic Algorithm, operating on a population of binary strings

GAs (as illustrated in the above diagram) evolve a population of binary strings. Each string encodes “decision variables” for an optimisation problem, such as how to design a physical artefact, a strategy for a game, a suite of inputs to test an application, and so on.

GAs use crossover and mutation to explore the space of decision variables: crossover is the splicing and recombination of strings that have been demonstrated to be effective in (at least partially) solving a problem; mutation is a simple process of bit flipping. “Parents” are chosen from the population by their fitness, and the generated offspring replace the incumbents.
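For readers who prefer code to diagrams, the loop described above can be sketched as follows. This is my own minimal illustration, not code from any particular framework: individuals are fixed-length bit strings, fitness is the toy “OneMax” problem (simply the number of 1-bits), parents are chosen by three-way tournament, and offspring are produced by one-point crossover plus bit-flip mutation.

```c
#include <stdlib.h>
#include <string.h>

#define LEN  32
#define POP  50
#define GENS 100

static int fitness(const int *bits) {        /* OneMax: count the 1-bits */
    int sum = 0;
    for (int i = 0; i < LEN; i++) sum += bits[i];
    return sum;
}

static const int *tournament(int pop[POP][LEN]) {
    const int *best = pop[rand() % POP];     /* three-way tournament */
    for (int i = 0; i < 2; i++) {
        const int *cand = pop[rand() % POP];
        if (fitness(cand) > fitness(best)) best = cand;
    }
    return best;
}

static void breed(const int *a, const int *b, int *child) {
    int point = 1 + rand() % (LEN - 1);      /* one-point crossover */
    for (int i = 0; i < LEN; i++) {
        child[i] = (i < point) ? a[i] : b[i];
        if (rand() % 100 == 0) child[i] ^= 1;   /* ~1% bit-flip mutation */
    }
}

int best_of_run(void) {
    static int pop[POP][LEN], next[POP][LEN];
    for (int i = 0; i < POP; i++)            /* random initialisation */
        for (int j = 0; j < LEN; j++) pop[i][j] = rand() % 2;
    for (int g = 0; g < GENS; g++) {
        for (int i = 0; i < POP; i++)        /* offspring replace incumbents */
            breed(tournament(pop), tournament(pop), next[i]);
        memcpy(pop, next, sizeof pop);
    }
    int best = 0;                            /* report the fittest survivor */
    for (int i = 0; i < POP; i++)
        if (fitness(pop[i]) > best) best = fitness(pop[i]);
    return best;
}
```

A real GA would add elitism, a termination criterion, and more careful parameter choices, but the skeleton of selection, crossover, mutation, and replacement is all there.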

All of this, as described by the brilliant Professor Susan Stepney (University of York – my internal examiner there and an inspirational academic), is very much the “Ladybird book” interpretation of the biological reality. Molecular Biology is much more complex than previously imagined, and yet GAs have largely retained their highly simplified and abstracted form.

In this canonical diagram of a GA, one crucial step is omitted: the creation of the initial population. This omission is deliberate on my part here, but look elsewhere on the web and you will often find initialisation is the forgotten stage of the process. Typically, initialisation involves the creation of a set of random binary strings, which appears counter-intuitive; when beginning a search for a solution, we know from experience that the point of departure for that search is important. For example, we do not begin looking for our house keys by searching our fridge. If you do, you should probably stop putting your belongings in the fridge.

So, back to the history: where did GP come from?


Richard Forsyth, GP Pioneer. Photo courtesy of Richard.

From the GA community, of course! GA -> GP -> GI, so we’re at the very least consistent when naming our algorithms. Once again simplifying the story, in 1981 Richard Forsyth developed (in Pascal!) a tool called BEAGLE, which used a GA operating on a population of binary trees in order to perform pattern recognition. Crossover is implemented by exchanging subtrees between expressions (see the figure below), and mutation is similar, except we first generate a random subtree to “exchange”.

GP Crossover

Crossover in Genetic Programming: subtrees are exchanged between two fit parent individuals to generate two offspring.
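The subtree exchange in the figure can also be sketched in code. This is an illustrative fragment of my own (the representation and names are assumptions, not taken from BEAGLE or any other system): expressions are binary trees, and the operator picks one node in each parent uniformly at random and exchanges the subtrees rooted there.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char label[8];               /* operator or terminal, e.g. "+" or "x" */
    struct Node *left, *right;   /* NULL for terminals */
} Node;

Node *node(const char *label, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    strncpy(n->label, label, sizeof n->label - 1);
    n->label[sizeof n->label - 1] = '\0';
    n->left = left;
    n->right = right;
    return n;
}

int count(const Node *t) {       /* number of nodes in the subtree */
    return t ? 1 + count(t->left) + count(t->right) : 0;
}

Node *nth(Node *t, int k, int *seen) {   /* k-th node, preorder */
    if (!t) return NULL;
    if ((*seen)++ == k) return t;
    Node *found = nth(t->left, k, seen);
    return found ? found : nth(t->right, k, seen);
}

/* Exchange a random subtree of a with a random subtree of b, in place.
   A real GP system would clone the two fit parents first, so that the
   offspring join the population without destroying their parents. */
void crossover(Node *a, Node *b) {
    int sa = 0, sb = 0;
    Node *na = nth(a, rand() % count(a), &sa);
    Node *nb = nth(b, rand() % count(b), &sb);
    Node tmp = *na;     /* swapping the node contents swaps the subtrees */
    *na = *nb;
    *nb = tmp;
}
```

Note the invariant: crossover only moves material between the two trees, so the total number of nodes across both individuals is conserved, even though either tree may grow or shrink.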

In the 1980s, Nichael Cramer also published work on evolving symbolic expressions, and GP reached its popularising, conference-founding apogee when John Koza (pictured below) authored four books on the subject. The books are well known for their robust and heavyweight construction, and this, along with the sheer number of example applications Koza offered as evidence of the generality of the GP approach, immediately evokes the image of Koza bludgeoning critics into submission by wielding these weighty tomes of practical experience.

John Koza

John Koza, GP Pioneer. Photo courtesy of John.

Koza’s books have become the de facto standard for many decisions in GP: the decisions in his books were incorporated into GP frameworks, and those frameworks influenced others, which in turn shaped the practice and perceptions of researchers in the field. It is interesting to reflect on the sheer weight of these cognitive biases: why are trees mostly used? Why do we not use graphs instead? Why do we employ conventional programming languages, rather than alternatives with properties that may be less friendly to humans but more amenable to evolutionary search? Note how the origins and early formation of a field can introduce hidden assumptions that live on without ever being justified or even recognised.

GP Algorithm

Canonical GP Algorithm (with apologies for the graphic design)

The canonical Genetic Programming algorithm looks remarkably similar to a GA. This is no coincidence. And again, initialisation is the neglected member of the GP process: most frameworks use an initialisation method invented by Koza that intends to sample the search space in a way that produces a variety of tree shapes and sizes, without any clear justification as to why this might be the preferred method. But here we are.

A Cambrian Explosion

Whilst possibly the most tired metaphor in evolutionary computation, if not the whole of popular science writing, the 1990s saw a rapid development of many forms of GP: LinearGP promised the evolution of straight-line machine code (no trees!); Whigham suggested the use of grammars to structure the search space, an idea that led to a whole branch (pun) of GP now known as “Grammatical Evolution”, or GE; Montana introduced strong typing, one of the few developments that has seen widespread adoption; and Miller suggested stepping away from the world of trees and evolving generalised recursively connected graphs, an idea that has yet to garner the full attention it deserves.

The State of Play

So, are we done yet? Is GP rapidly replacing human programmers, gradually relegating them to the role of historical anachronism? Well, evidently not. With a lack of theoretical grounding (despite some impressive work from the likes of Poli, Langdon, and McPhee), it became difficult to separate the good from the bad amongst the plethora of new ideas emerging from the field. In addition, empirical results failed to scale: nowadays GP is typically used to evolve only small expressions, often in a language that is not Turing complete, and the results are subsequently incorporated into a larger system.

We see this point played out in two glaringly obvious ways: first, the NASA antenna. Talk to any practitioner in GP for long enough, ask for proof of its usefulness, and they will inevitably bring up the automated design of an antenna for NASA, a component that was actually used on an experimental space mission in 2006.


Antenna evolved at NASA, sent into orbit in 2006

As great an example as this is (and how wonderful to see the first artificially evolved artefact in orbit!), it does not look like the result of anything we would consider as conventional programming.

The second is the Humie (Human-Competitive) Awards, held at the GECCO conference and one of the most prestigious prizes related to GP. Financial awards are given to work that is demonstrably competitive with, or superior to, results generated by humans. Thus, GP has ceased to be considered a method of programming and is instead viewed as an “invention machine”, that is, a mechanism by which solutions outside of conventional consideration are produced.

The Rise of Genetic Improvement

To add a personal aside: in 2005 I was backpacking around the world after quitting my job in industry. People told me it was time to stop being a gadabout and to do something with my life. I disagreed, and instead I enrolled for a PhD.

At the same time, GP was becoming viewed more as an “invention machine” and less as a programming tool, and it fell out of favour, entering the downward curve of the Gartner hype cycle. It seemed that most of the good ideas had been had, scalability was still a dream, and there were few new discoveries left to be made in the GP field; the gold rush was over.

What an opportune moment then, to commence a PhD in the subject. I joined the EPSRC-funded SEBASE project, and found myself as one of two PhD students working in GP, the other being Andrea Arcuri, then at the University of Birmingham. Andrea was tasked with developing the notion of applying GP to fix bugs, whilst I was assigned to explore the possibility of using GP to control the non-functional properties of software.

After an initial meeting (probably in a pub – most of our meetings were), we discussed our work and decided to collaborate. We literally integrated our two distinct Java systems to produce one large framework that would optimise the execution time of some simple programs running on a processor simulator. We took my simulation knowledge, the ECJ framework, and Andrea’s expertise in testing, and went to work.

Our key insight was: surely it’s got to be easier to optimise an existing program rather than to start from scratch, right? I mean, why would you try to run before you could walk? And why did GP even begin that way in the first place?

My opinion on this matter is that GP emerged from the AI community, where open-ended learning and novelty were more important than generating the types of programs you might find in the commercial world of software development. Again: a cultural bias can push an entire field in a single direction. Who knows how differently — for better or worse — things would have turned out if in the 1970s Richard Forsyth had attempted to optimise existing Fortran code, rather than evolve new symbolic expression trees?

As with any idea, we quickly found prior art: Conor Ryan’s work on evolving concurrent versions of sequential programs, and particularly his 2000 book on the subject, is the earliest work we consider to be in the same vein.

We pressed on and published Paper “0” at the SEAL conference, before a more detailed paper in IEEE Transactions on Evolutionary Computation (which suffered a long gestation period and was almost rejected, incidentally). Paper 0 was not our finest hour, but it got the idea out there, and the second paper made less of a ham-fisted job of explaining what the heck we were talking about.

However, Paper 0 was still the first recognisable GI work, in that it used a search algorithm (Genetic Programming) to search for improvements to existing source code, using test cases to ensure the output exhibited the desired semantics.

In the journal paper, we experimented with a bunch of ideas that we thought might work, and tested them on a set of toy problems, programs too small to be useful. However, the optimisations we found were promising…

One example is that GP would often remove the extra counter in a for loop such that this code:

for (int i=0; i < length; i++)

became this optimised code:

for (; 1 < length; length--)...

Simple, but neat.

Changing the base case for factorial from n <= 0 to n <= 1 was a similar minor tweak.
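Spelled out, that tweak looks like this (a trivial reconstruction of mine, not the actual evolved code):

```c
/* Testing n <= 1 instead of n <= 0 as the base case returns one
   recursive call earlier, without changing any result. */
int factorial(int n) {
    if (n <= 1) return 1;    /* was: if (n <= 0) return 1; */
    return n * factorial(n - 1);
}
```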

We decided to deliberately challenge the system with:

int swi10(int c) {
  int i;
  for (i=0; i<10; i++) {
    switch (i) {
      case 0: c++; break;
      case 1: c++; break;
      case 2: c++; break;
      case 3: c++; break;
      case 4: c++; break;
      case 5: c++; break;
      case 6: c++; break;
      case 7: c++; break;
      case 8: c++; break;
      case 9: c++; break;
      default: c--; break;
    }
  }
  return c;
}

…and we were not disappointed when we were rewarded with the following output:

int swi10(int c) {
  return c + 10;
}

A more complex example was the optimisation of a simple program solving the “triangle classification problem”: given three integers representing the lengths of the sides of a triangle, classify that triangle as either equilateral, isosceles, scalene, or invalid (“that can’t possibly be a triangle!”).

The typical way to solve the problem is to sort the sides, then test the cases in order: is the sum of the two smallest sides less than or equal to the largest (invalid)? Are all three the same (equilateral)? Are two the same (isosceles)? Otherwise, the triangle is scalene.
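For concreteness, here is a direct, unoptimised rendering of that textbook approach (in C for brevity; our original experiments used Java):

```c
/* Sort the three sides, then test the cases in order. */
const char *classify(int a, int b, int c) {
    int t;                                   /* sort so that a <= b <= c */
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    if (a <= 0 || a + b <= c) return "invalid";   /* triangle inequality */
    if (a == b && b == c) return "equilateral";
    if (a == b || b == c) return "isosceles";
    return "scalene";
}
```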

GP did its usual micro-optimisations, including avoiding copying from variable to variable where it could just use the first variable instead. It then ordered only two of the variables, and duplicated the subsequent computation, specialised by whether the ordering had resulted in a swap. This allowed it to eliminate a few comparisons (no need to check for equilateral if there’s been a swap, for example), and make the whole process more efficient. Certainly a non-trivial optimisation, and one that gave us hope.

Andrea and I wrapped up our PhDs, with him working more on bug-fixing and me looking at trading off functionality for non-functional qualities (your program is a little bit broken but I’ve saved a lot of battery power! and so on), and then both of us headed to pastures new: Andrea to the Simula Research Laboratory in Norway, and myself off to the glorious, friendly, rain-drenched streets of Scotland to work at the University of Glasgow.

Whilst at Glasgow, I attended the PLDI conference, and met a few people from Martin Rinard’s group at MIT. We were working on a line of research in memory management similar to something the MIT group had done before, and they were very helpful. Little did I know that in a coincidental turn of events the same team would end up working on a (controversial) contribution to GI… more on that shortly.

The Arrival of GenProg

Whilst Andrea and I worked on other things, in the USA along came Stephanie Forrest, Wes Weimer, Claire Le Goues and ThanhVu Nguyen who, along with many collaborators, developed the bug-fixing ideas into the GenProg tool. GenProg was the moment that GI became big news, and it won several awards, including at the aforementioned Humies, at GECCO itself, and at ICSE, one of the top software engineering conferences.

GenProg was more mature and sophisticated than our approach, operating at source code line level and generating patches, as well as using profiling to focus the search and improve the scalability of GI. It really made waves when it successfully solved a date calculation bug in the ill-fated Microsoft Zune product (essentially “a rubbish iPod” – although one audience member took offence to this flippant characterisation of mine, when giving a related talk earlier this year). The bug made the Zune lock up at midnight on New Year’s Eve 2008, and GenProg was able to find a repair, garnering much attention.
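The culprit was a loop in the Zune’s clock driver, widely reported at the time. The reconstruction below is simplified and the names are mine, but it captures the failure mode: the loop converts a day count since 1 January 1980 into a year, and on the last day of a leap year, days is exactly 366, so the loop condition holds but neither inner branch makes progress.

```c
int is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

int year_from_days(int days) {   /* days since 1 January 1980 */
    int year = 1980;
    while (days > 365) {
        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            }
            /* bug: days == 366 falls through without changing days,
               so the loop spins forever on 31 Dec of a leap year */
        } else {
            days -= 365;
            year += 1;
        }
    }
    return year;
}
```

Calling year_from_days(366) hangs, which is essentially what the devices did; the repair task is to find a patch that passes the failing test without breaking the date logic for every other input.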

Whilst the American troupe were marching forward with their work, in London researchers at UCL began to further develop the non-functional optimisation work, led by Bill Langdon and Mark Harman. They developed a GI project called GISMOE, and wrote a keynote ASE paper, which they kindly invited me to co-author. They were awarded several EPSRC grants, all in all an exciting time for GI.

Significant results in the optimisation of open-source software followed: in particular, Langdon et al. authored two papers that found impressive speed-ups. Their optimisations to the BarraCUDA DNA Analysis program were so effective that they were adopted in the main project, deployed, and even ported by IBM to a supercomputer.

These were heady heights for GI, and it was almost inevitable that these early successes would soon be met with some productive scepticism.

The GenProg Controversy

The MIT Team I mentioned earlier, led by Martin Rinard, became interested in the GenProg work. After all, it was quite a remarkable development — automatically generated patches for bugs in real-world software, with a decent success rate. It was almost too good to be true, right?

And as the saying goes, it was, at least to some extent. A paper and an extended technical report revealed some of the problems with the GenProg work. The most glaring error was a bug in a script that effectively led the tool to treat “return code 0” from a shell script as equivalent to “the program is correct”. I should add that I haven’t had the opportunity to review the code personally, so I’m presenting the literature as-is.

More worrying, in my opinion, is that a manual examination of a large number of patches showed that many were not general: whilst they fixed the problematic test cases used to reproduce the bug, they either did not fully repair the problem or broke existing functionality in an undesirable way. The acceptability of a patch is a somewhat subjective matter, but as published the results were not as promising as initially interpreted:

“Because of experimental error, the majority of the reported patches . . . do not produce correct outputs even for the inputs in the test suite . . . the overwhelming majority of the accepted patches are not correct and are equivalent to a single modification that simply deletes functionality.”

Qi et al. [2015]

Furthermore, Qi and his collaborators at MIT noticed a pattern: many of the patches were simply the removal of lines of code, rather than more complex exchange and relocation of source. This observation has also been made in some of the non-functional optimisation work.

So how does a community react to such criticism, aimed at the poster boy of the field? I think there were four main classes of reaction to these developments:

  1. Further examination of existing results. This was helpful and sensible, and is ongoing.
  2. A retreat to non-functional properties: as we have an oracle in the form of the existing program, we can generate as many test cases as we like, increasing our confidence in the preservation of existing functionality. This allows us to mitigate the risk of somehow “breaking” the functionality of the original program.
  3. A retreat to something more conservative, namely Deep Parameter Tuning: modifying variables within source code that do not appear to change functional behaviour, in an attempt to reduce execution time. There is only a single paper on this subject, but there are many academics now working on this idea.
  4. Incorporating work from other fields: this is perhaps the most positive outcome. Members of the MIT group attended a workshop at UCL on GI in January 2016, and the exchange of ideas and presentation of techniques from the related fields of program synthesis and transformation are now percolating through the GI community.
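To illustrate point 2: because the original program defines correct behaviour, we can hammer an optimised variant with as many randomly generated inputs as we like and simply compare outputs. A minimal sketch, with illustrative function names of my own (original() stands in for the unmodified program, here the swi10 loop from earlier, and variant() for an evolved candidate):

```c
#include <stdlib.h>

static int original(int c) {
    for (int i = 0; i < 10; i++) c++;   /* the unoptimised program */
    return c;
}

static int variant(int c) {
    return c + 10;                      /* an evolved candidate */
}

/* Use the original program as an oracle: generate random inputs and
   reject the variant the moment its behaviour diverges. */
int variant_passes(int trials) {
    for (int t = 0; t < trials; t++) {
        int input = rand() % 1000 - 500;        /* arbitrary input domain */
        if (variant(input) != original(input))
            return 0;                           /* behaviour diverged */
    }
    return 1;                                   /* agreed on every input */
}
```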

All of this is well and good, but what happened to the vision of programming proposed by our friend Alan Turing? Where is his prophecy of evolutionary programming, in this period of retreat, re-examination, and soul-searching?

More worryingly, what does it really mean for a program to be correct? Consider that we have positive and negative test cases for a bug. We then generate a new input — which class of test cases does it belong to? We cannot be certain. Real-world software rarely contains specifications, contracts, or assertions: we must deal with the way the world is, even if it looks hard.

So, what now?

Firstly: an observation. The space of programs is big. Really big. The space of program transformations is also large, and both can be technically infinite, depending on our assumptions. Without a specification, and hence a constructive route to a solution, we must sample from this space. That sampling process is effectively search, even if you don’t necessarily call it that (e.g. SAT solving is also search). So I believe we are working along the right track.

Consider also that many of the results of GI still stand: GenProg did fix the Microsoft Zune bug, GI did optimise the triangle problem in a clever manner, and it did reduce the execution time of the programs Langdon et al. worked on. We should be careful not to abandon an idea simply because it falls short of our expectations in some cases.

Formal approaches to program synthesis and repair are limited in terms of the types and sizes of programs they can generate, and the changes they can make. A search-based approach is, in my opinion, currently the most promising line of attack.

But, having said that, I believe two major challenges face us:

  • How do we scale our methods to large programs, large (multi-line, multi-location) bugs, and complex optimisations?
  • How do we generate or optimise a program such that it has the desired semantics?

If I knew the answers, I’d be collecting my Turing Award. But in the meantime, here are some ideas to substitute for solid results.


Ultimately, I believe the problem of scaling program synthesis and manipulation is about recursive structures. Trees in GP are mostly a historical accident, and we should abandon them! In a similar vein, in the 1990s GP researchers attempted to retrofit reuse into the tree structure, through “automatically defined functions” and “module acquisition”, but these methods relied on somehow delegating the process to evolution in a way that wasn’t entirely convincing.

Imagine we need to repair a bug that appears in the logic of a program in multiple locations. Through a recursive representation of the source code, it is much easier to search the space of multiple related simultaneous changes to the program.

Next, we need to match our search representation and methods to the problem at hand. It is not obvious why searching a patch-space, a space defined by humans for concisely representing their changes to source code, is a good choice for search, and yet this decision has not been closely cross-examined. Allowing a combination of fine-grained and coarse program manipulation will open up more possibilities for search to exploit.

Finally, we should prune the search space wherever possible using more formal techniques borrowed from the fields of traditional program synthesis and repair.

The Problem of Correctness

The first stage is to move away from syntactic manipulation. We cannot expect to preserve or create the correct behaviour by examining syntax alone.

What’s more, we should also consider the how as well as the what, that is consider the process of how semantics are encapsulated in a program, as well as the desirability of those semantics.

To do this, we must abandon the long-held “black-box” view of Genetic Programming: instead using static and dynamic analysis, we must develop a picture of the computational process being enacted by the program under test. For example, potential solutions should be tested individually using custom test data for every patched program. Similarly, we could use invariant generation tools such as Daikon, and more formalised specification-deriving methods, to construct an abstract description of the computation as it executes.

All of this comes at great computational cost, and here is another assumption we must abandon: that such cost is unacceptable. Amazing results often have amazing cost, at least at first. Witness the success of Deep Blue (sheer scale of processing power) and AlphaGo (the emergence of GPGPU) in making their respective breakthroughs. We must be prepared to develop much more sophisticated frameworks that are far more computationally demanding, if we wish to deal with large-scale real-world software. We can simplify and optimise our approaches after we make progress. That is not to say that all of our scientific work requires large computational power: answering scientific questions and applying the derived methods to large-scale software are two related but distinct problems.

Winning Back Turing

On a final note, we should take heart in the success of biological evolution: the evidence is all around us. A sceptic of artificial evolution would find it difficult to reject the notion that biological evolution has proven to be phenomenally successful. Just as neural networks have had their winters, so right now evolutionary methods including GP are awaiting further breakthroughs. We should be careful not to let specific challenges and temporary setbacks discourage us in the development of our ideas.

In addition, we have a great advantage over biology: we may augment the evolutionary process in order to focus and accelerate its progress.

All opinions expressed here are my own.
Acknowledgements to Professor Mark Harman, Dr Bill Langdon, and Dr Justyna Petke for their inspiring discussions and presentations, particularly with regards to the future of programming. Thanks to Justyna and Bill for proof-reading this post, and all their helpful corrections and suggestions.






Lee Smolin on Academia

I recently came across this fantastic quote from Lee Smolin, author of “The Trouble with Physics”, on the nature of academia. Posting it here as a reminder to self.

“Like so many things important for life the issues you talk about come down to values. There is to put it simply, an ongoing fight between those of us who do science to satisfy our curiosity about nature and increase our knowledge and those who do it for careerist or egotistical reasons. We need you on our side in this fight. If you do not stand up for your values, who do you think will do this work for you?”

Genetic Programming has gone Backwards

When Genetic Programming (GP) first arose in the late 80s and early 90s, there was one very defining characteristic of its application, which was so widely accepted as to be left unsaid:

GP always starts from scratch

Sure, you might implement low-level functions in the function set, and even abstract some more complex functions in certain cases, but GP almost always (with some notable exceptions) started with a random population generated by a random number generator.

Later on, people tried to encourage some reuse in GP through the automatic discovery of reusable subtrees, but again the initial population was generated in a way that was not directly related to the problem at hand.

Perhaps from an AI point of view, this didn’t seem unusual. It’s a similar approach to the use of a neural net, which starts with a random set of weighted connections, and is then trained for a particular task. In fact, to have incorporated existing solutions or source code could have been perceived as “cheating” at the time! Surely any AI system should not need any hint from a human.

However, when you consider this situation from an automatic programming point of view, it quickly becomes apparent that Genetic Programming was trying to run before it could walk.

Rather than trying to create programs from scratch, wouldn’t it make more sense to try to reuse, tweak or augment existing code? Then, once we’d worked out the finer details of this less ambitious approach we could go on to generate larger and larger code fragments, and — hey! — maybe an entire program or algorithm.

Recently, some GP practitioners have pivoted towards this more incremental approach, which has become one of the most successful lines of attack in GP research, in the form of a subfield now referred to as Genetic Improvement. Led by researchers from the University of Virginia, CMU, and UCL, this field has changed the rules of the game: GP is now used to complement the efforts of human programmers, rather than completely replace them.

Such an approach now appears blindingly obvious! Why first try to generate code that has already been written by humans? Why not see whether GP can do better than humans, if given handwritten code as a starting point? Why not take advantage of the vast database of code that the internet has provided to us? It’s as if we have suddenly realised that it is much easier to evolve something in the context of an existing ecosystem, rather than starting entirely from scratch.

Excitingly, Genetic Improvement has additional benefits: the resulting programs are usually much more relevant to industry, and can excite and inspire researchers in other fields. Rather than battle with AI techniques to see who can solve an overly simplistic and artificial toy problem more rapidly, GI offers the opportunity to fix problems in open-source projects; to produce faster, more efficient software than can be achieved by manual means alone; and to explore power-efficiency and other neglected properties of software in a way that human programmers often overlook.

I recently spoke to one of the leaders of GI research, and I asked him if he felt any rivalry with others racing to push the boundaries of what GI can achieve. His reply?

“Not at all! There is simply so much work to be done. There is just so much to do!”

Time to roll up your sleeves and get involved! If we work hard enough on GI, perhaps one day we’ll get back to where we started — the dream of true automatic programming.

All opinions are my own!
Thanks to John Clark, John Woodward, and William B. Langdon for proof-reading a draft of this post.

Compiling the DaCapo 9.12 Bach Benchmark Suite in Ubuntu 12.04

Update 26/05/2015. Thanks to Peter Hofer at JKU for a useful addendum regarding further problems!

Recently I’ve been doing some unusual memory management research using AspectJ. As part of this research, I’ve had to recompile the DaCapo 9.12 Benchmark Suite, which is the standard suite in memory management research.

In this post, I’ll explain how to do this. I built the suite on Ubuntu 12.04, in both the 32-bit and 64-bit editions.

For the changes to files within the DaCapo file structure, I have created a patch covering 64-bit compilation.

Installing Java

The first thing you’ll need to do is install three different versions of the Java SDK: 1.4, 1.5, and 1.6.

These are available from the Oracle download archive:

The versions I installed were:

  • j2sdk1.4.2_19
  • jdk1.5.0_22
  • jdk1.6.0_45

If you have any problems installing the 64-bit versions, try using the 32-bit equivalent after installing ia32-libs using apt-get.

Assuming they’re in your ~/Downloads folder:

cd ~/Downloads
chmod +x j2sdk-1_4_2_19-linux-i586.bin jdk-1_5_0_22-linux-i586.bin jdk-6u45-linux-i586.bin

Then, one by one, run the scripts and agree to the license agreement:

./j2sdk-1_4_2_19-linux-i586.bin
./jdk-1_5_0_22-linux-i586.bin
./jdk-6u45-linux-i586.bin
The self-extracting binaries will have created subdirectories. These need to be moved:

sudo cp -r j2sdk1.4.2_19/ jdk1.5.0_22/ jdk1.6.0_45/ /usr/lib/jvm

Now configure the java installs using update-alternatives:

 sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/j2sdk1.4.2_19/bin/java" 1
 sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.5.0_22/bin/java" 2
 sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.6.0_45/bin/java" 3
 sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/j2sdk1.4.2_19/bin/javac" 1
 sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.5.0_22/bin/javac" 2
 sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.6.0_45/bin/javac" 3

Configuring Java

Note that you must have version 6 set as your default to compile the benchmarks, at least at the moment you initiate the build:

export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45/
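Because the build simply picks up whichever javac is first on the path, it’s worth sanity-checking the version before kicking things off. A minimal sketch — the is_jdk6 helper is my own, and it assumes javac’s usual "javac 1.6.0_xx" version string (printed on stderr, hence the redirect):

```shell
# is_jdk6: succeed only when a "javac -version" string reports a 1.6.x release
# (hypothetical helper, not part of the DaCapo build)
is_jdk6() {
  case "$1" in
    "javac 1.6."*) return 0 ;;
    *) return 1 ;;
  esac
}

export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45/
if ! is_jdk6 "$("${JAVA_HOME}bin/javac" -version 2>&1)"; then
  echo "warning: default javac is not 1.6"
fi
```

If the check fails, `sudo update-alternatives --config java` (and the equivalent for javac) will let you flip the system default interactively.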

Installing Other Packages

You’ll want to install Apache Ant, Subversion, CVS, and JavaCC:

sudo apt-get install ant subversion cvs javacc

Downloading the DaCapo Source

Change to the directory you want to work within.

Download the DaCapo source archive and save it in the working directory.


Configuring Ant Build Properties

Copy “” to “”:

cd benchmarks

Edit and add:




The failonerror property is useful for debugging.

Fix some Build Files


Fix the Eclipse URL in bms/eclipse/build.xml


Fix the svn repo URL in bms/jython/build.xml


Daytrader requires Java 1.5, but the rest of the suite uses 1.6. To get around this problem, follow the instructions as advised by Diogenes Nunez on the DaCapo mailing list.

[Updated 26/05/2015 – thanks to Peter Hofer]

The Daytrader benchmarks are built with Maven, which tries to download artifacts from Codehaus repositories. Unfortunately, Codehaus is shutting down their services, and instead of serving HTTP 404 for the missing artifacts, their web server serves an HTML page, which Maven happily accepts as POM and JAR file content. Obviously, this causes the build to fail. Since I could not find a way to prevent Maven from accessing specific repositories, I worked around this issue by adding the following to /etc/hosts:
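As a sketch of the workaround (the hostnames below are hypothetical placeholders — substitute whichever repository hosts Maven is actually contacting), each offending host is pointed at localhost, so the download fails fast rather than fetching an HTML error page:

```
# /etc/hosts -- hypothetical hostnames; replace with the repositories Maven contacts
127.0.0.1   repo.example.org
127.0.0.1   snapshots.example.org
```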

Checksums for Jython and DayTrader

Remove the jython checksum file:

rm bms/jython/downloads/jython-src-svn-6571.tar.gz.MD5

Similarly, remove the checksums for libs/daytrader:

 rm libs/daytrader/downloads/geronimo-jetty6-minimal-2.1.4-plugins.tar.gz.MD5
 rm libs/daytrader/downloads/jstl-impl-1.2.jar.MD5
 rm libs/daytrader/downloads/daytrader-svn-767347-src.tar.gz.MD5
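The same fix applies if other components trip over stale checksums. As a general sweep — the path pattern is my own assumption about the download layout, not something documented by DaCapo — you can list all bundled checksum files and then delete the offenders by hand:

```shell
# List every bundled .MD5 checksum file under a downloads directory,
# so stale ones can be spotted and removed individually.
find . -path '*/downloads/*.MD5' -print
```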


A further note from Peter Hofer:

“I encountered an error during the xalan build, where for a recursive invocation of ant, the environment variable ANT_HOME is set to the value of the “ant.install” ant variable. I had to explicitly set this ant variable with an additional entry: ant.install=/usr/share/ant”

Environment Variables

Peter Hofer had some problems with his environment:

“Setting JAVA_HOME was not sufficient in my environment. I found that various other Java-related variables were pre-set in my openSUSE installation (JAVA_BINDIR, JAVA_ROOT, JDK_HOME, JRE_HOME) and at some point the build favoured one of those over JAVA_HOME, leading to errors, for example when starting Geronimo. After unsetting those other variables, the build completed successfully.”
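In script form, Peter’s fix amounts to clearing the competing variables before exporting JAVA_HOME. The variable names are the ones from his report, and the JDK path is the one used earlier in this post:

```shell
# Unset the distro-provided Java variables that can override JAVA_HOME,
# then point JAVA_HOME at the 1.6 JDK before starting the build.
unset JAVA_BINDIR JAVA_ROOT JDK_HOME JRE_HOME
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45/
```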



The build itself took 25 minutes on my 64-bit Linux box with a quad-core Intel i5-3550 @ 3.3 GHz and 8GB of RAM.