What would you do if you wanted to port a piece of code from one programming language to another? It’s simple enough if the program is small: you can just rewrite the core logic in the new language. But what about large companies whose codebases run to millions of lines? They can’t just hire someone to convert all of it by hand. Such a migration is expensive, and it can take several years and millions of dollars. Sometimes it is also unavoidable. For example, if a company’s codebase is written in an older language, it may need to move to a newer and more relevant one. In fact, the Commonwealth Bank of Australia spent $750 million over five years, starting in 2012, to convert its codebase from COBOL to Java.
It would have been much easier if they could have used a transcompiler to convert the code automatically from one programming language to another instead of rewriting it from scratch. But that’s not easy! Programming languages differ in syntax, variable types, standard library functions, and more, so converting code automatically is no piece of cake. Luckily for companies looking to move their codebase from a legacy language to a more modern one, Facebook has announced TransCoder, a system that can translate code between programming languages such as C++, Java, and Python.
What is TransCoder AI?
TransCoder AI uses unsupervised machine learning to translate code between C++, Java, and Python, so it does not need pairs of matching programs written in both languages. The algorithm identifies elements, known as tokens, that are common to the input and output languages. These tokens include keywords such as “for”, “if”, “else”, “while”, and “try”, as well as digits and mathematical operators that look the same no matter the language. Other shared tokens are the common strings that appear in the code itself.
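To make the idea of shared tokens concrete, here is a minimal sketch in Python. It assumes a crude regular-expression tokenizer and two made-up snippets that implement the same loop; neither is taken from TransCoder, whose real tokenizers are not described in this article.

```python
import re

# Crude tokenizer for illustration only (an assumption, not TransCoder's):
# identifiers/keywords, integers, two-character operators such as "==" or
# "+=", and any other single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[<>=!+\-*/]=|[^\s\w]")


def tokenize(code):
    """Return the set of distinct tokens found in a code string."""
    return set(TOKEN_PATTERN.findall(code))


# Hypothetical snippets: the same loop written in Python and in Java.
python_snippet = """
total = 0
for i in range(10):
    if i % 2 == 0:
        total += i
"""

java_snippet = """
int total = 0;
for (int i = 0; i < 10; i++) {
    if (i % 2 == 0) {
        total += i;
    }
}
"""

# Tokens that appear in both languages can act as anchor points.
shared_tokens = tokenize(python_snippet) & tokenize(java_snippet)
print(sorted(shared_tokens))
```

In this toy case the shared set contains keywords such as for and if, the digits, and operators such as == and +=, while language-specific tokens such as range and int drop out.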
The algorithm also uses back-translation to improve translation quality. A source-to-target model and a target-to-source model are trained at the same time, and each one produces training data for the other. The target-to-source model takes real code in the target language and translates it into the source language; this translation is usually noisy. The source-to-target model is then trained to turn that noisy translation back into the target language. The same happens in the opposite direction, and the process is repeated with the goal that the code recovered in the target language matches the original target-language code.
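The data flow of one back-translation round can be sketched with a toy example in Python. Everything here is an assumption made for illustration: the TableTranslator class is a hypothetical token-by-token lookup table standing in for a neural model, the corpora are tiny made-up token sequences, and the Java-to-Python table is seeded by hand so the toy has something to start from. It only shows how each model generates training pairs for the other without any human-written parallel data.

```python
from collections import Counter, defaultdict


class TableTranslator:
    """Toy token-by-token translator (hypothetical stand-in for a model)."""

    def __init__(self, table=None):
        self.table = dict(table or {})

    def translate(self, tokens):
        # Tokens with no known translation are copied unchanged.
        return [self.table.get(tok, tok) for tok in tokens]

    def train(self, pairs):
        # For each input token, keep the output token most often aligned
        # with it (position by position) in the training pairs.
        counts = defaultdict(Counter)
        for input_tokens, output_tokens in pairs:
            for in_tok, out_tok in zip(input_tokens, output_tokens):
                counts[in_tok][out_tok] += 1
        for in_tok, counter in counts.items():
            self.table[in_tok] = counter.most_common(1)[0][0]


# Monolingual data only: no Java/Python pairs are provided.
java_corpus = [["a", "&&", "!", "b"], ["x", "||", "true"], ["y", "==", "null"]]
python_corpus = [["flag", "or", "not", "done"], ["value", "==", "None"]]

# Assumption: the Java-to-Python table starts with some hand-seeded
# knowledge; the Python-to-Java table starts empty.
java_to_python = TableTranslator(
    {"&&": "and", "||": "or", "!": "not", "true": "True", "null": "None"}
)
python_to_java = TableTranslator()

# Direction 1: back-translate real Java into synthetic Python, then train
# Python-to-Java to recover the real Java from it.
synthetic_python = [java_to_python.translate(seq) for seq in java_corpus]
python_to_java.train(zip(synthetic_python, java_corpus))

# Direction 2: back-translate real Python into synthetic Java, then train
# Java-to-Python to recover the real Python from it.
synthetic_java = [python_to_java.translate(seq) for seq in python_corpus]
java_to_python.train(zip(synthetic_java, python_corpus))

print(python_to_java.translate(["c", "and", "not", "d"]))
# prints ['c', '&&', '!', 'd']
```

A real system would replace the lookup tables with neural sequence-to-sequence models and repeat these two steps many times; the toy only illustrates which model feeds which.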
Performance of TransCoder AI
Facebook researchers trained TransCoder AI on code from GitHub: the training data came from more than 2.8 million open-source repositories. From this corpus, the researchers built a set of around 6,000 tokens, or common elements across the programming languages, and used it to train the algorithm to translate functions.
After training, the researchers tested its accuracy on 852 parallel functions in C++, Java, and Python taken from neveropen. Since neveropen offers implementations of most functions in several languages, such as C, C++, Java, C#, and Python, it was a convenient place to take a function in the source language and compare TransCoder AI’s translation with the reference version in the target language. To measure this, the researchers introduced a new metric called computational accuracy, which checks whether the translated function returns the same output as the reference function when given the same input.
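Computational accuracy compares what functions do rather than how their code looks. The sketch below shows one way such a score could be computed in Python; the helper names, the example functions, and the test inputs are all hypothetical, and the researchers' actual evaluation harness is not described in this article.

```python
def same_behaviour(reference, translated, inputs):
    """True if both functions return equal results on every input.

    Any exception raised while running either function counts as a failure.
    """
    for args in inputs:
        try:
            if translated(*args) != reference(*args):
                return False
        except Exception:
            return False
    return True


def computational_accuracy(function_pairs, test_inputs):
    """Fraction of translations that behave like their reference."""
    correct = sum(
        same_behaviour(reference, translated, test_inputs[name])
        for name, (reference, translated) in function_pairs.items()
    )
    return correct / len(function_pairs)


# Hypothetical reference functions and their "translations".
def ref_gcd(a, b):
    while b:
        a, b = b, a % b
    return a


def translated_gcd(a, b):
    # Written differently from the reference, but behaves the same.
    return a if b == 0 else translated_gcd(b, a % b)


def ref_sum_to(n):
    return n * (n + 1) // 2


def translated_sum_to(n):
    # Deliberately buggy translation: range(n) leaves out n itself.
    return sum(range(n))


function_pairs = {
    "gcd": (ref_gcd, translated_gcd),
    "sum_to": (ref_sum_to, translated_sum_to),
}
test_inputs = {
    "gcd": [(12, 18), (7, 0), (0, 5)],
    "sum_to": [(0,), (3,), (10,)],
}

accuracy = computational_accuracy(function_pairs, test_inputs)
print(f"Computational accuracy: {accuracy:.1%}")
# prints Computational accuracy: 50.0%
```

Note that the buggy translation agrees with the reference on the input 0, which is why each function is checked on several inputs rather than one.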
The computational accuracy obtained for each translation direction between C++, Java, and Python is given here:
- Computational Accuracy of C++ to Java: 74.8%
- Computational Accuracy of C++ to Python: 67.2%
- Computational Accuracy of Java to C++: 91.6%
- Computational Accuracy of Java to Python: 68.7%
- Computational Accuracy of Python to Java: 56.1%
- Computational Accuracy of Python to C++: 57.8%
Facebook researchers concluded that although many functions translated by TransCoder AI were not perfectly accurate, its computational accuracy was still high compared with previous attempts. TransCoder picked up the syntax of each language and mapped the data structures, methods, and libraries used in the source language to their counterparts in the target language. The researchers also claimed that TransCoder could easily be generalized to programming languages other than C++, Java, and Python without any expert knowledge. All in all, the experiment was a big success, and TransCoder outperformed the current commercial solutions for converting code from one language to another.