Price: 8000000
Number of applications: 3
12.01.26 (inclusive)
One-time payment
MVP
ICT tasks
Media sphere
Virtual Presenter
Software / IS
Modern AI assistants communicate mostly through text or audio, while visual communication remains limited. The main problems are: there is no service that generates natural facial movements matching speech in real time; most existing solutions do not support HLS streaming, integration into web clients, or GPU inference; there are no ready-made libraries connecting TTS → audio tags → lipsync → HLS; existing lipsync models do not offer a production-grade API service with statuses, sessions, and progress reporting. There is therefore a need for a single technological solution that provides stable video avatar generation with a fully automated TTS → Infer → Stream pipeline.
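The requirement for an API service with statuses, sessions, and progress over the TTS → Infer → Stream pipeline can be illustrated with a minimal sketch. Everything named here (`Stage`, `Session`, `run_pipeline`, the progress values, and the three stage callables) is a hypothetical illustration, not part of the task specification; the real TTS, lipsync, and streaming stages would replace the stubs.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Stage(str, Enum):
    """Hypothetical session statuses following TTS -> Infer -> Stream."""
    QUEUED = "queued"
    TTS = "tts"
    INFER = "infer"
    STREAM = "stream"
    DONE = "done"
    FAILED = "failed"


# Assumed progress value reported when each stage begins (illustrative only).
STAGE_PROGRESS = {
    Stage.QUEUED: 0.0,
    Stage.TTS: 0.1,
    Stage.INFER: 0.4,
    Stage.STREAM: 0.8,
    Stage.DONE: 1.0,
}


@dataclass
class Session:
    text: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    stage: Stage = Stage.QUEUED
    progress: float = 0.0
    error: Optional[str] = None

    def advance(self, stage: Stage) -> None:
        """Move to the given stage and update the reported progress."""
        self.stage = stage
        self.progress = STAGE_PROGRESS[stage]

    def status(self) -> dict:
        """Payload a status endpoint could return to a polling client."""
        return {
            "session_id": self.session_id,
            "stage": self.stage.value,
            "progress": self.progress,
            "error": self.error,
        }


def run_pipeline(
    session: Session,
    tts: Callable,     # text -> audio
    infer: Callable,   # audio -> lip-synced frames
    stream: Callable,  # (frames, audio) -> playlist URL
) -> Optional[str]:
    """Run the three stages in order, recording status on the session."""
    try:
        session.advance(Stage.TTS)
        audio = tts(session.text)
        session.advance(Stage.INFER)
        frames = infer(audio)
        session.advance(Stage.STREAM)
        playlist_url = stream(frames, audio)
        session.advance(Stage.DONE)
        return playlist_url
    except Exception as exc:
        # Progress stays at the value of the stage that failed.
        session.stage = Stage.FAILED
        session.error = str(exc)
        return None
```

A client would create a `Session`, start `run_pipeline` in a background worker, and poll `session.status()` until the stage becomes `done` or `failed`. Keeping the stages as plain callables lets each one be swapped or tested separately.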
Functional effects: the ability to automatically generate video avatars for any text; natural lip-sync that precisely matches the synthesized audio; an HLS video stream that can be embedded in any web interface; scalability under heavy load through GPU inference.
Economic effects: lower video production costs (no operators, studios, or actors required); faster development and deployment of new AI assistants; savings on staff training and video explanations.
Technological effects: raising the technological level of the company's products; building a unique competence in lipsync generation; laying a foundation for future 3D avatars and real-time digital humans.
Social effects: improved public access to digital services; more convenient user interaction with AI bots.
Akhmetov Beknazar Zhalgasbekovich
Purpose and description of task (project)
The goal of the project is to create a technology service that automatically generates video avatar responses with lip-sync from a user's text request. The system should: accept text; synthesize speech via TTS; perform lipsync inference (MuseTalk 1.5); generate an HLS video stream (init.mp4 + *.m4s segments); return a ready-to-play stream to the client through the API. The project is a full-fledged information service built on GPU, PyTorch, FFmpeg, UNet/VAE models, Whisper audio processing, and custom avatar preprocessing. The work is aimed at creating a technology platform for interactive AI assistants with real-time video generation.
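The HLS output described above (an init.mp4 header plus *.m4s segments) corresponds to fragmented-MP4 packaging in FFmpeg's HLS muxer. The sketch below only builds the argument list and does not run FFmpeg. The function name `build_hls_command`, the file names `stream.m3u8` and `seg_%05d.m4s`, the 2-second segment length, and the libx264/AAC codec choice are assumptions made for illustration, not requirements of the task.

```python
from pathlib import Path


def build_hls_command(source: str, out_dir: str, segment_seconds: int = 2) -> list:
    """Return an FFmpeg argument list that packages `source` as fMP4 HLS.

    The result is an init.mp4 header, numbered .m4s segments, and a playlist.
    """
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-i", source,                          # rendered avatar video with audio
        "-c:v", "libx264",                     # assumed video codec
        "-c:a", "aac",                         # assumed audio codec
        "-f", "hls",
        "-hls_time", str(segment_seconds),     # target segment duration in seconds
        "-hls_playlist_type", "event",         # playlist grows while encoding runs
        "-hls_segment_type", "fmp4",           # .m4s segments instead of MPEG-TS
        "-hls_fmp4_init_filename", "init.mp4",
        "-hls_segment_filename", (out / "seg_%05d.m4s").as_posix(),
        (out / "stream.m3u8").as_posix(),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` on a host where FFmpeg is installed. The web client then loads `stream.m3u8`, which references `init.mp4` and the segments.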
Note
The terms and details of implementation are determined by agreement of the parties without changing the essence of the task. The functionality can be expanded after the basic stage of development is completed. Additional improvements are covered by a separate technical specification.