KazLLM project in 2024: results

Between May and November 2024, the SITF team trained a 70B-parameter large language model, completing the key stages of training. The model is currently undergoing final supervised fine-tuning, which uses instruction datasets to improve its performance. This process combines quantitative metrics with rigorous data analysis by a team of linguists, ensuring not only the model's accuracy but also its genuine command of the languages involved. The final training corpus contains 150 billion tokens in Kazakh, Russian, English, and Turkish, of which 7.5 billion are specifically prepared for fine-tuning. To evaluate and refine the model, the team uses a wide range of benchmarks, including the AI2 Reasoning Challenge (ARC), Grade School Math 8K (GSM8K), HellaSwag, Massive Multitask Language Understanding (MMLU), WinoGrande, HumanEval, and Discrete Reasoning Over Paragraphs (DROP). The training dataset was translated using GPT-4o machine translation. The results show that KazLLM significantly outperforms open-source models in Kazakh and holds a slight advantage in Russian and English, approaching the performance of OpenAI's models.
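Most of the benchmarks listed above (ARC, HellaSwag, MMLU, WinoGrande) are multiple-choice tasks, where evaluation reduces to scoring each answer choice with the model and measuring accuracy. A minimal sketch of that loop, assuming a hypothetical `model_score` callback that stands in for the model's actual answer-scoring (e.g. log-likelihood), which the source does not describe:

```python
def evaluate_multiple_choice(model_score, questions):
    """Accuracy over multiple-choice items: for each question, pick the
    choice the model scores highest and compare it to the gold answer.

    model_score(question, choice) -> float  # placeholder scoring function
    questions: list of {"question": str, "choices": [str], "answer": str}
    """
    correct = 0
    for q in questions:
        predicted = max(q["choices"], key=lambda c: model_score(q["question"], c))
        correct += predicted == q["answer"]
    return correct / len(questions)


# Illustration with a dummy scorer that always prefers the longer choice.
questions = [
    {"question": "2 + 2 = ?", "choices": ["3", "four"], "answer": "four"},
    {"question": "Capital of Kazakhstan?", "choices": ["Oslo", "Astana"], "answer": "Astana"},
]
dummy_score = lambda q, choice: len(choice)
print(evaluate_multiple_choice(dummy_score, questions))  # 1.0 with this toy scorer
```

In practice, translated benchmark items (as described above for the Kazakh versions) plug into the same loop unchanged; only the dataset contents differ.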

In the speech translation project, the foundation speech model demonstrated outstanding results, outperforming tools such as Google Translate, Yandex Translate, and GPT-4o. On the FLORES benchmark, the model achieved strong BLEU scores, demonstrating high translation accuracy between Kazakh, Russian, English, and Turkish. These results highlight significant progress in the development of machine translation and speech processing technologies and their competitiveness at a global level.
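BLEU, the metric reported on FLORES, measures n-gram overlap between a candidate translation and a reference, with a brevity penalty for short outputs. A minimal sentence-level sketch for illustration (real evaluations would use a standard implementation such as sacreBLEU, and the exact scoring setup used by the team is not stated in the source):

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram precisions,
    scaled by a brevity penalty. Whitespace tokenization, single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(cand_ng.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precision_sum += math.log(overlap / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_precision_sum)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Corpus-level BLEU, as used for FLORES leaderboard numbers, aggregates n-gram counts over all sentence pairs before computing the precisions rather than averaging per-sentence scores.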

This work on the KazLLM development project was made possible with the partial support of AstanaHub.
