In June 2024, a group of specialists from the Sustainable Innovation and Technology Foundation worked on a project to develop a Kazakh language model. They conducted a series of LLM training experiments, during which a tokenizer with a vocabulary of tens of thousands of tokens was trained on Kazakh-language texts. In addition, the OLMo model is being trained on tokens from Kazakh-language texts.
To improve the neural machine translation model, the team expanded the parallel corpus of texts in Kazakh and other languages. The neural machine translation model is being further trained on this data. The project staff also integrated the dataset into a base speech model that can simultaneously perform text machine translation, with the aim of improving that model's results.
Another part of the group worked on visualizing virtual avatars and was developing a prototype avatar that could deliver lecture material in Kazakh.
This work was made possible in part by support from AstanaHub.