Sustainable Innovation and Technology Foundation specialists continue to work on KazLLM

In June 2024, a group of specialists at the Sustainable Innovation and Technology Foundation continued work on a project to develop a Kazakh language model. They ran a series of LLM training experiments, in which a tokenizer with a vocabulary of tens of thousands of tokens was trained on Kazakh-language texts. In addition, the OLMo model is being trained on tokens from Kazakh-language texts.

To improve the neural machine translation model, the team expanded its parallel corpus of texts in Kazakh and other languages. The model is being further trained on this data. The project staff also integrated the dataset into a base speech model that can simultaneously perform text machine translation, with the aim of improving that model's results.

Another part of the group worked on virtual avatar visualization and created a prototype avatar that can present lecture material in Kazakh.

The project was made possible in part by support from AstanaHub.
