
October was a productive month for the Sustainable Innovation and Technology Foundation team implementing the KazLLM project

In October, the Sustainable Innovation and Technology Foundation team completed a significant amount of work in several areas. The main achievements are in natural language processing and artificial intelligence, along with optimization and functionality updates for existing products such as KazLLM and Soyle.

The team finished collecting data from various sources to build a training set for the KazLLM model, which is capable of checking spelling in Kazakh. In October, the dataset reached 409,586,135 tokens, an important foundation for further improving the model's accuracy and capabilities.

In addition, the team improved the new tokenizer, which allowed the KazLLM model to be trained more efficiently on Kazakh text. Vocoder training on 10 speakers continues and is showing intermediate results. The text machine translation model was retrained, which improved its BLEU score by more than 2 points. Training of a multi-task model for speech recognition and translation, as well as text translation, is ongoing. In the Soyle product, this work served to add voice selection (male or female) for text-to-speech, integrate the payment system in test mode, and increase response speed through parallel request processing and load distribution.

Another part of the team worked on the virtual avatar. Its appearance was improved, a virtual school office was added for user interaction, and face recognition technology was implemented to identify users. The virtual avatar was also integrated with the new version of KazLLM with 70 billion parameters, which provides higher accuracy in request processing and an improved user experience.

The work on the KazLLM development project was made possible with partial support from AstanaHub.