Google’s DeepMind AI division has tackled everything from StarCraft to protein folding. So it’s probably no surprise that its creators have now turned to what is undoubtedly a personal interest: computer programming. In Thursday’s edition of Science, the company describes AlphaCode, a system it has developed that produces code in response to problems typical of those used in human programming contests.
On an average challenge, the AI system could place in roughly the top half of participants. But it had trouble scaling: it was less likely to produce a successful program on problems where more code is typically required. Still, the fact that it works at all, without having been given any structural information about algorithms or programming languages, is a bit of a surprise.
Rising to the challenge
Computer programming challenges are fairly simple: people are given a task and must produce code that performs it. In an example given in the new paper, programmers are given two strings and asked to determine whether the shorter of the two could be produced by substituting backspaces for some of the keypresses needed to type the longer one. Submitted programs are then checked to see whether they provide a general solution to the problem or fail when additional examples are tested.
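To make the example concrete, here is a minimal sketch of one common greedy approach to this kind of backspace problem, assuming the usual contest semantics (pressing backspace instead of typing a character deletes the most recently kept character, or does nothing if there isn't one). This is an illustration of the task itself, not AlphaCode's actual output:

```python
def can_obtain(s: str, t: str) -> bool:
    """Can the shorter string t be produced while typing the longer s?

    Illustrative sketch, not AlphaCode's solution. Assumes pressing
    backspace instead of a character deletes the last kept character.
    """
    i, j = len(s) - 1, len(t) - 1
    while i >= 0:
        if j >= 0 and s[i] == t[j]:
            # This character of s was typed and kept.
            i -= 1
            j -= 1
        else:
            # s[i] must go: pressing backspace here consumes s[i] and
            # also deletes one earlier kept character.
            i -= 2
    return j < 0  # success iff every character of t was matched

print(can_obtain("ababa", "ba"))  # True
print(can_obtain("aba", "b"))     # False
```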
Given enough examples of programs that solve a single problem, it would probably be possible for an AI system to infer the algorithmic structure needed to succeed. But that wouldn’t be a general solution capable of tackling arbitrary problems; an AI trained on one class of challenge would fail when asked to handle an unrelated one.
To make things more generalizable, the DeepMind team treated the task a bit like a language problem. To an extent, the description of a challenge is an expression of what the algorithm should do, while the code is an expression of the same thing, just in a different language. So the AI was designed with two parts: one that ingested the description and converted it to an internal representation, and a second that used the internal representation to generate functional code.
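As a rough illustration of that two-part shape (and only the shape; DeepMind's actual model, tokenizer, and sizes differ, and every name and hyperparameter below is an assumption), here is a minimal encoder-decoder sketch in PyTorch:

```python
import torch
import torch.nn as nn

class DescriptionToCode(nn.Module):
    """Toy encoder-decoder: problem description in, code tokens out.
    A real model would also add positional encodings; omitted here."""

    def __init__(self, vocab_size=50_000, d_model=512, nhead=8, layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=layers,
            num_decoder_layers=layers,
            batch_first=True,
        )
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, desc_ids: torch.Tensor, code_ids: torch.Tensor):
        # Part one: ingest the tokenized problem description and build
        # the internal representation.
        memory_in = self.embed(desc_ids)
        # Part two: generate code conditioned on that representation.
        # The causal mask keeps each code token from peeking ahead.
        causal = self.transformer.generate_square_subsequent_mask(code_ids.size(1))
        hidden = self.transformer(memory_in, self.embed(code_ids), tgt_mask=causal)
        return self.to_vocab(hidden)  # logits over the next code token
```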
Training the system was also a two-stage process. In the first stage, the system was simply asked to process a snapshot of material on GitHub, a total of over 700GB of code. (In an era when you can fit that much on a thumb drive, it may not sound like much, but remember that code is just raw text, so you get a lot of lines per gigabyte.) Note that this data also includes comments, which use natural language to explain what the nearby code is doing and so should help with both the input and output tasks.
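This first stage is conventional self-supervised language-model training. The sketch below shows the shape of one such step (standard next-token prediction with teacher forcing), reusing the toy DescriptionToCode model above; how DeepMind actually divided GitHub files between the encoder and decoder is a detail this glosses over:

```python
import torch.nn.functional as F

def pretrain_step(model, src_ids, code_ids, optimizer):
    """One self-supervised step: predict each token from the ones before it."""
    # Teacher forcing: the decoder reads tokens 0..n-1 and is scored on
    # how well it predicts tokens 1..n.
    logits = model(src_ids, code_ids[:, :-1])
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        code_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```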
Once the system was trained, it went through a period of tuning. DeepMind set up its own programming contests and then fed the results into the system: the problem descriptions, working code, failing code, and the test cases used to check them.
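The pieces of tuning data the article describes might be bundled into records shaped something like the following; the field names here are invented for illustration and aren't from the paper:

```python
from dataclasses import dataclass

@dataclass
class ContestRecord:
    """One hypothetical fine-tuning example (field names are assumptions)."""
    description: str              # natural-language problem statement
    submission: str               # source code of one attempted solution
    passed: bool                  # whether it passed the judge's tests
    tests: list[tuple[str, str]]  # (stdin, expected stdout) pairs
```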
Similar approaches had been tried previously, but DeepMind indicated that it was simply able to throw more resources at the problem. “A key driver of AlphaCode’s performance,” the paper indicates, “came from scaling the number of model samples to orders of magnitude more than previous work.”
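Concretely, scaling the number of samples means drawing a very large batch of candidate programs for each problem and keeping only those that reproduce the example outputs. Here is a hedged sketch of that filter, where sample_program stands in for whatever generation routine the model exposes (an assumption, not a real API):

```python
import subprocess
import sys

def passes_examples(source: str, tests: list[tuple[str, str]]) -> bool:
    """Run a candidate program against the problem's example I/O pairs.
    (A real system would sandbox untrusted generated code.)"""
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=2,  # contest-style time limit
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

def solve(description, tests, sample_program, num_samples=100_000):
    # Draw many candidates, then filter: only programs that reproduce
    # the example outputs survive to be submitted.
    return [
        code
        for code in (sample_program(description) for _ in range(num_samples))
        if passes_examples(code, tests)
    ]
```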