KazLLM project in 2024: results
Between May and November 2024, the SITF team trained a 70B-parameter large language model and completed the key stages of training. The model is now in final supervised fine-tuning on instruction datasets to improve its performance. This stage combines quantitative metrics with rigorous data analysis by a team of linguists, so that the model is checked not only for accuracy but also for how well it handles the language. The final training corpus contains 150 billion tokens in Kazakh, Russian, English, and Turkish, of which 7.5 billion are specifically adapted for fine-tuning. To evaluate and refine the model, the team uses a wide range of benchmarks: AI2 Reasoning Challenge (ARC), Grade School Math (GSM8K), HellaSwag, Massive Multitask Language Understanding (MMLU), WinoGrande, HumanEval, and Discrete Reasoning Over Paragraphs (DROP). The training dataset was translated using GPT-4o machine translation. The results showed that KazLLM significantly outperforms open-source models in Kazakh and shows a slight advantage in Russian and English, approaching the performance of OpenAI's models.
In the speech translation project, the foundation speech model demonstrated strong results, outperforming tools such as Google Translate, Yandex Translate, and GPT-4o. On the FLORES benchmark, the model achieved high BLEU scores for translation between Kazakh, Russian, English, and Turkish. These results reflect significant progress in machine translation and speech processing technologies and indicate that they are competitive at a global level.
Work on the KazLLM development project was made possible in part by support from AstanaHub.
Comments 7
Ai Nur · Dec. 13, 2024 14:35
One more point
Ai Nur · Dec. 13, 2024 14:35
Interesting. I liked that the language is included.
Ai Nur · Dec. 10, 2024 21:06
A great result; the main thing is the support.
Balzhan I · Dec. 9, 2024 23:40
👍🏻👍🏻👍🏻
Balzhan I · Dec. 9, 2024 23:38
Excellent!
Dauren Bazilov · Dec. 6, 2024 09:43
I wonder where the developers intend to apply this LLM.
Ilias Zholaman · Dec. 5, 2024 21:43
🔥👏