IBM Releases Its Brand-New CodeNet Dataset for AI Coding
The Siliconreview
28 May, 2021

At a recent press conference, IBM released Project CodeNet, a dataset aimed at teaching AI to translate code from one programming language to another. The dataset consists of 14 million code samples, totaling around 500 million lines of code in 55 programming languages, from C++, Java, Python, and Go to COBOL, Pascal, and Fortran.

IBM Research added that CodeNet could be used to train machine learning models to translate code. The code samples were taken from entries to open programming competitions, and according to the company, over 90 percent of them come with a description of what the code does, including a precise problem statement and specifications of the input and output formats.

IBM hopes that Project CodeNet will "drive algorithmic innovation" in extracting more complex code with sequence-to-sequence models, similar to those currently used to translate between human languages. The company's aim is to make a more significant dent in machine understanding of code, as opposed to machine processing of code.