Only RK

Price: 8000000

Number of applications: 3

Solution submission deadline

12.01.26 (inclusive)

Form of award

One-time payment

Product status

MVP

Task type

ICT tasks

Field of application

Media sphere

Task area

Virtual Presenter

Type of product

Software / IS

Problem description

Modern AI assistants communicate mostly through text or audio, while visual communication remains limited. The specific problems are as follows. No service generates natural facial movements that match speech in real time. Most solutions do not support HLS streaming, integration into web clients, or GPU inference. There are no ready-made libraries that connect TTS → audio tags → lipsync → HLS. Existing lipsync models do not provide a production-grade API service with statuses, sessions, and progress reporting. There is therefore a need for a single technological solution that provides stable video avatar generation with full automation of the TTS → Infer → Stream pipeline.

Expected effect

Functional effects: automatic generation of video avatars for any text; natural lip-sync that precisely matches the synthesized audio; HLS video that can be embedded in any web interface; scalability under heavy load through GPU inference.
Economic effects: lower cost of producing video content (no operators, studios, or actors required); faster development and deployment of new AI assistants; savings on staff training and explanatory videos.
Technological effects: a higher technological level of the company's products; a unique in-house competence in lipsync generation; a foundation for future 3D avatars and real-time digital humans.
Social effects: better public access to digital services; more convenient user interaction with AI bots.

Full name of responsible person

Akhmetov Beknazar Zhalgasbekovich

Purpose and description of task (project)

The goal of the project is to create a technology service that automatically generates video avatar responses with lip-sync from a user's text request. The system should: accept text; synthesize speech via TTS; perform lipsync inference (MuseTalk 1.5); generate an HLS video stream (init.mp4 + *.m4s segments); and return the ready-to-play stream to the client through an API. The project is a full-fledged information service built on GPU, PyTorch, FFmpeg, UNet/VAE models, Whisper audio processing, and custom avatar preprocessing. The work is aimed at creating a technological platform for interactive AI assistants with real-time video generation.
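
The pipeline described above can be illustrated with a minimal sketch; it is not the project's actual implementation. It assumes a hypothetical Session record that reports the current stage (TTS → Infer → Stream) and a coarse progress value, plus a helper that builds an FFmpeg command for HLS output with fragmented MP4 segments (init.mp4 + *.m4s). The names Stage, Session, hls_fmp4_command, the playlist name index.m3u8, the segment name pattern, the 25 fps frame rate, and the 2-second segment length are all assumptions made for illustration. The TTS and MuseTalk inference steps are deliberately not shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class Stage(str, Enum):
    """Hypothetical pipeline stages, in execution order."""
    QUEUED = "queued"
    TTS = "tts"
    INFER = "infer"
    STREAM = "stream"
    DONE = "done"


@dataclass
class Session:
    """Hypothetical per-request record that an API could expose as status."""
    session_id: str
    text: str
    stage: Stage = Stage.QUEUED

    @property
    def progress(self) -> float:
        # Coarse progress: position of the current stage among all stages.
        order = list(Stage)
        return order.index(self.stage) / (len(order) - 1)


def hls_fmp4_command(frames_pattern: str, audio_path: str, out_dir: str,
                     fps: int = 25, segment_seconds: int = 2) -> list[str]:
    """Build (but do not run) an FFmpeg command that muxes rendered frames
    and synthesized audio into an HLS stream with fMP4 segments."""
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,  # lipsync frames (assumed image sequence)
        "-i", audio_path,                              # TTS audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-f", "hls",
        "-hls_time", str(segment_seconds),
        "-hls_playlist_type", "event",
        "-hls_segment_type", "fmp4",                   # produces init.mp4 + .m4s segments
        "-hls_fmp4_init_filename", "init.mp4",
        "-hls_segment_filename", str(out / "seg_%05d.m4s"),
        str(out / "index.m3u8"),
    ]


session = Session(session_id="demo", text="Hello")
session.stage = Stage.INFER
cmd = hls_fmp4_command("frames/%08d.png", "speech.wav", "out")
```

In a real service the command would be passed to subprocess.run once frames and audio exist, and the session stage would be updated as each step completes so that a client can poll status and progress.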

Note

The terms and details of the implementation are determined by agreement of the parties without changing the essence of the task. The functionality can be expanded after the basic stage of development is complete. Additional improvements are to be covered by a separate technical specification.