Making decisions before ...

12.01.26

Form of award

by agreement

Product status

Idea

Task type

ICT tasks

Scope of application

Media sphere

Task area

Neurotechnology and Artificial Intelligence

Type of product

Software / IS

Problem description

STT often produces "noise": a set of unrelated nouns, interference artifacts, and phrase fragments. This leads to false triggering of intent recognition, unnecessary calls to business logic, and wasted resources. Using ML/LLM for primary filtering is expensive and not always justified: it adds latency, cost, and dependence on external services. We need a fast, deterministic filter that discards obvious garbage and flags questionable cases.

Expected effect

The project involves the development of a lightweight multilingual rule-based NLP module for evaluating the meaningfulness of text obtained from STT systems. The module filters out noise, fragmentary phrases, and thematically disconnected utterances before the intent recognition stage or before transmission to AI. The solution is implemented in JavaScript, works deterministically, and does not require ML/LLM, while the use of existing NLP libraries is allowed and encouraged (for example, nlp.js, compromise, tokenizers, morphological utilities) to speed up development and improve the quality of analysis. The module outputs a numerical score from 0 to 1 and explainable features, allowing it to be applied reliably in multilingual voice and text systems in an industrial environment.
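As an illustration of the expected output contract, here is a minimal TypeScript sketch; the interface names and the specific feature set are assumptions for illustration, not a fixed specification:

// Illustrative sketch only: interface names and the feature set are assumptions.
export interface MeaningfulnessFeatures {
  tokenCount: number;            // number of tokens after normalization
  hasGrammaticalMarker: boolean; // e.g. a verb or connective word was detected
  bigramFrequencyScore: number;  // share of token bigrams found in a frequency list
  domainTermHits: number;        // matches against domain dictionaries
  fragmentPenalty: number;       // penalty for truncated or disconnected phrases
}

export interface MeaningfulnessResult {
  score: number;                       // 0..1, higher means more meaningful
  features: MeaningfulnessFeatures;    // explainable signals for monitoring and logs
  verdict: 'pass' | 'review' | 'drop'; // derived from configurable thresholds
}

export interface MeaningfulnessFilter {
  evaluate(text: string, lang?: string): MeaningfulnessResult;
}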

Full name of responsible person

Жанғазы Темірлан Маратұлы

Purpose and description of task (project)

Create a lightweight, deterministic TypeScript module that, given a fragment of speech already converted to text, outputs a numerical score (0-1) and diagnostic features, allowing reliable filtering (without AI/LLM) of STT noise and low-meaning speech before it is passed to the NLU/intent classifier. The module is designed for integration into voice-robot pipelines (STT → filter → intent/flow). It works offline/locally (without ML inference) and relies on a set of heuristics: grammatical markers, connective words, frequent bigrams, domain dictionaries, simple regular-expression patterns, and length/punctuation parameters. The score and features are also exposed for monitoring and logging. A sketch of this heuristic scoring is given below.
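The following is a minimal, self-contained sketch of such deterministic heuristic scoring; the word lists, weights, and feature names are illustrative assumptions only and would be replaced by real grammatical markers, frequency bigrams, and domain dictionaries:

// Minimal sketch of deterministic heuristic scoring; all lists and weights are illustrative.
const CONNECTIVES = new Set(['and', 'or', 'but', 'because', 'и', 'или', 'но', 'что']);
const DOMAIN_TERMS = new Set(['invoice', 'balance', 'order', 'счет', 'баланс', 'заказ']);

export function scoreMeaningfulness(text: string): { score: number; features: Record<string, number> } {
  // Unicode-aware tokenization into letter/digit runs.
  const tokens = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
  const features: Record<string, number> = {};

  // Length heuristic: very short fragments are likely noise.
  features.lengthScore = Math.min(tokens.length / 5, 1);

  // Connectives and other grammatical markers suggest a formed phrase.
  features.connectiveScore = tokens.some(t => CONNECTIVES.has(t)) ? 1 : 0;

  // Domain dictionary hits raise confidence.
  features.domainScore = Math.min(tokens.filter(t => DOMAIN_TERMS.has(t)).length / 2, 1);

  // Diversity heuristic: STT noise often repeats the same token.
  features.diversityScore = tokens.length ? new Set(tokens).size / tokens.length : 0;

  // Weighted combination; weights are placeholders to be tuned on real data.
  const score =
    0.35 * features.lengthScore +
    0.20 * features.connectiveScore +
    0.25 * features.domainScore +
    0.20 * features.diversityScore;

  return { score: Math.min(Math.max(score, 0), 1), features };
}

With these illustrative weights, a phrase such as "please check my order balance" scores around 0.8, while a fragment like "uh uh uh" stays below 0.3, so a threshold between the two can discard obvious garbage and route borderline cases to review.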

Note