Project Title: ADVANCING NOVEL TEXTUAL SIMILARITY-BASED SOLUTIONS IN SOFTWARE DEVELOPMENT
PI: Dr Boško Nikolić, full professor, School of Electrical Engineering, University of Belgrade
SROs: School of Electrical Engineering, University of Belgrade; Innovation Center of the School of Electrical Engineering; Faculty of Philology, University of Belgrade; University of Zurich
The result of close cooperation between researchers from seemingly distant scientific fields will be a new system which will facilitate the work of software engineers and linguists who study the Serbian language.
An interdisciplinary research team will develop an intelligent tool for recognizing semantic similarity between parts of a software system written in programming languages and comments written in natural language. Research will focus on cross-level semantic textual similarity (similarity between texts of different lengths), primarily in Serbian, with results compared against those obtained for English. The system will also be able to recognize code clones. The project will apply new methods for program code analysis based on machine learning and artificial intelligence techniques.
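To make the idea of code clone recognition concrete, the following is a minimal sketch of a token-based baseline, not the project's method. It assumes Python source as input, replaces identifiers and literals with placeholders so that renamed copies look identical, and compares the resulting token 3-grams with Jaccard similarity. The function names `normalized_tokens`, `shingles`, and `clone_similarity` are hypothetical and chosen only for illustration.

```python
import io
import keyword
import tokenize


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, abstracting away identifier names and literal values."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keep language keywords, replace every other name with a placeholder.
            result.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            result.append("NUM")
        elif tok.type == tokenize.STRING:
            result.append("STR")
        elif tok.type == tokenize.OP:
            result.append(tok.string)
        # Comments, newlines, and indentation tokens are ignored.
    return result


def shingles(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def clone_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of normalized token n-grams, between 0.0 and 1.0."""
    sa = shingles(normalized_tokens(a), n)
    sb = shingles(normalized_tokens(b), n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two functions that differ only in the names of their variables receive a score of 1.0 under this scheme, while structurally different functions score lower. A purely token-based comparison of this kind cannot recognize fragments that do the same thing with different structure, which is where learned models become relevant.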
In addition to the tool for determining the similarity between parts of the software and the comments, a group of software engineers and linguists will develop a new semantic search algorithm for exploring code using natural language input. Another goal is to establish a database and a model for automatic processing of the Serbian language.
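As an illustration of what searching code with natural language input involves, here is a minimal lexical baseline, offered as a sketch under stated assumptions rather than the algorithm the project will develop. It splits identifiers and comments into lowercase words, weights them with TF-IDF, and ranks snippets by cosine similarity to the query. All names (`subtokens`, `build_index`, `search`) and the sample snippets are hypothetical.

```python
import math
import re
from collections import Counter


def subtokens(text: str) -> list[str]:
    """Split text into lowercase words, breaking camelCase and snake_case names."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as š or č.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", spaced)]


def build_index(snippets: dict[str, str]):
    """Return per-snippet term counts and smoothed inverse document frequencies."""
    docs = {name: Counter(subtokens(code)) for name, code in snippets.items()}
    df = Counter(term for counts in docs.values() for term in counts)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return docs, idf


def weighted(counts: Counter, idf: dict[str, float]) -> dict[str, float]:
    """TF-IDF vector; terms unknown to the index are dropped."""
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query: str, snippets: dict[str, str]) -> list[tuple[str, float]]:
    """Rank snippet names by similarity to the query, best match first."""
    docs, idf = build_index(snippets)
    q = weighted(Counter(subtokens(query)), idf)
    scored = [(cosine(q, weighted(c, idf)), name) for name, c in docs.items()]
    return [(name, score) for score, name in sorted(scored, reverse=True)]


# Hypothetical sample data for illustration only.
SNIPPETS = {
    "read_lines": "def readFileLines(path):\n    # read all lines from a text file\n"
                  "    with open(path) as f:\n        return f.readlines()",
    "sort_users": "def sort_users_by_age(users):\n    # sort the list of users by age\n"
                  "    return sorted(users, key=lambda u: u.age)",
}
```

Calling `search("read lines from file", SNIPPETS)` ranks `read_lines` first because the query shares words with that snippet. A baseline like this only matches shared vocabulary, so a query in Serbian against English identifiers would find nothing, which illustrates why a genuinely semantic approach is needed.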
The AVANTES project is of great importance for Serbia because researchers will create complex datasets and introduce innovations into existing technologies for processing the Serbian language, for which far fewer resources are currently available than for widely used languages such as English. This will facilitate the work not only of software engineers in Serbia, but also of linguists who study the Serbian language.
PROJECT OBJECTIVE: Develop annotated datasets and improve existing technologies for advanced automatic processing of the Serbian language.
METHODOLOGY: New methods for code analysis will be implemented using machine learning and artificial intelligence techniques. The project will analyse the recognition of semantic similarity between parts of software and their comments, as well as similarity between texts of different lengths.
EXPECTED RESULTS: A new software similarity tool. A novel semantic code search algorithm for exploring code using natural language input. Datasets and models for automatic processing of the Serbian language.
Illustration: Ivana Bugarinovic