Project Title: SPEAKER/STYLE ADAPTATION FOR DIGITAL VOICE ASSISTANTS BASED ON IMAGE PROCESSING METHODS
Acronym: S-ADAPT
PI: Dr Vlado Delić, full professor, Faculty of Technical Sciences, University of Novi Sad
SROs: Faculty of Technical Sciences, University of Novi Sad; Mathematical Institute, Serbian Academy of Sciences and Arts, Belgrade
Speech communication between a human and a machine relies on automatic speech recognition in one direction and text-to-speech synthesis in the other, and it is not as flexible or as natural as speech communication between people. It does not function quickly or accurately enough in ambient noise, or when speaker voices and speaking styles vary. The S-ADAPT project will improve these aspects of human-machine speech communication by applying advanced methods of artificial intelligence.
Automatic speech recognition will become more accurate and robust because it will adapt to the gender and age of the speaker, to different speaking styles and to recording conditions. Text-to-speech synthesis will become more flexible in terms of the speaker’s voice and speaking style while requiring only a relatively small amount of adaptation data, which matters because in real-life scenarios little speech material is usually available.
As a direct result of the project, a wide range of users in Serbia and other countries will be able to communicate flexibly with their mobile phones, and later with other devices, in their mother tongue. The results of the research will be verified by incorporating them into an existing digital voice assistant app for mobile phones.
PROJECT OBJECTIVE: The goal is to integrate algorithms for speaker/style adaptation into the existing digital voice assistant for Serbian, which will be the primary technological result of the project.
METHODOLOGY: In a broad sense, the S-ADAPT project will be based on artificial intelligence technologies, machine learning concepts and algorithms based on deep neural networks.
EXPECTED RESULTS: Adaptation of selected image processing algorithms to speech processing is expected to improve the flexibility of speech recognition technologies and the usability of digital voice assistants, robots, smart homes, offices and cars.
Illustration: David Bilobrk