SITF team actively worked on KazLLM in November

November 2024 was a landmark month for AI and technology development in Kazakhstan. The KazLLM team completed a large-scale data collection effort, adding 409 million tokens from sources such as the Kazakh Wikipedia and news outlets to the training set. Switching to the new Nemotron 70B model and supplementing it with synthetic data allowed KazLLM not only to outperform the original Llama 70B in three languages, but also to surpass OpenAI GPT-4o in Russian. The KazLLM corpus has also expanded significantly and now includes hundreds of thousands of parallel text lines in Kazakh, Russian, English, and Turkish.

Among the key achievements is the launch of the Soyle App, which is based on the SeamlessM4T model. The application now integrates the Halyk Epay payment system with new pricing plans, and it supports uploading and translating files while preserving their original format. The team held a hackathon at which developers built solutions on top of the application's API, and it presented the Soyle App to a wide audience in a press release. The team also began filming training videos, and meetings with customers provided valuable feedback for further improving the product.

Development of the speech recognition systems continues. The new Whisper Turbo ASR model demonstrates high accuracy in challenging conditions, including background noise and accented speech. Support for English, Russian, and Turkish was added, and data augmentation was carried out to make the model more robust in multilingual settings. Bug fixes and improved integration with modules such as Audio2Face made the system more flexible and reliable. These steps bring the technology closer to widespread use in real-world conditions.

Work on the KazLLM development project was made possible in part by support from AstanaHub.