Work by Sustainable Innovation and Technology Foundation specialists on the KazLLM project in August

In August 2024, specialists of the Sustainable Innovation and Technology Foundation expanded the dataset used to build a spell-checking model for the Kazakh language. The dataset now contains more than 409 million tokens and is included in the KazLLM training set, which is expected to improve the quality of the model's answers in Kazakh. A language model with 8 billion parameters was trained, including an initial stage of additional pre-training on a Kazakh-language corpus and fine-tuning on a specialized dataset. The project specialists also optimized and accelerated the text generation process.

As part of the comparative analysis of models in English and Kazakh, a web interface was developed on the HuggingFace platform, and the first comparative analysis of English and Kazakh datasets was carried out. The Kazakh dataset was prepared using machine translation.

To improve the neural machine translation model, the parallel corpus for the Kazakh language was expanded to 470 thousand lines in several languages, the translation demo was improved, and new features were added to the web interface. In addition, the project team configured the SeamlessM4T model for simultaneous speech recognition and translation, and trained a vocoder for speech translation and speech synthesis tasks.

The work on the KazLLM development project was made possible with partial support from AstanaHub.
