Price: 0
Number of applications: 0
12.01.26
by agreement
Idea
ICT tasks
Media sphere
Neurotechnology and artificial intelligence
Software/IS
STT often produces "noise": a jumble of unrelated nouns, interference artifacts, and phrase fragments. This noise triggers false intent recognition, unnecessary calls to business logic, and wasted resources. Using ML/LLM for primary filtering is expensive and not always justified because of high latency, cost, and dependence on external services. A fast, deterministic filter is needed that discards obvious garbage and flags questionable cases.
The project develops a lightweight, rule-based multilingual NLP module that evaluates the meaningfulness of text produced by STT systems. The module filters out noise and fragmentary or thematically disconnected phrases before intent recognition or before the text is passed to an AI model. The solution is implemented in JavaScript, runs deterministically, and does not require ML/LLM; using existing NLP libraries (for example, nlp.js, compromise, tokenizers, morphological utilities) is allowed and encouraged to speed up development and improve analysis quality. The module outputs a numerical score from 0 to 1 and explainable features, so that it can be applied reliably in multilingual voice and text systems in production environments.
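As an illustration of how a 0-1 score with explainable features might drive the "discard obvious garbage, flag questionable cases" behaviour, here is a minimal sketch. The type names, the `decide` function, and both thresholds are assumptions made for this example; the task description does not fix any of them.

```typescript
// Hypothetical result shape: the task only specifies a 0-1 score
// and explainable features, not these exact names.
type Verdict = "discard" | "flag" | "pass";

interface FilterResult {
  score: number;                    // 0 = noise, 1 = meaningful
  features: Record<string, number>; // explainable diagnostics
}

// Assumed cut-offs; real values would be tuned on recorded STT output.
const DISCARD_BELOW = 0.3;
const PASS_FROM = 0.6;

// Maps a score to one of three deterministic outcomes.
function decide(result: FilterResult): Verdict {
  if (result.score < DISCARD_BELOW) return "discard"; // obvious garbage
  if (result.score < PASS_FROM) return "flag";        // questionable case
  return "pass";                                      // forward to intent recognition
}
```

With two thresholds instead of one, borderline utterances can be logged or handled separately instead of being silently dropped.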
Жанғазы Темірлан Маратұлы
Purpose and description of task (project)
Create a lightweight, deterministic TypeScript module that takes a speech fragment already converted to text and outputs a numerical score (0-1) plus diagnostic features. These allow STT noise and low-meaning speech to be filtered reliably, without AI/LLM, before the text is passed to the NLU/intent classifier. The module is designed for integration into voice-robot pipelines (STT → filter → intent/flow). It works offline/locally, without ML inference, and uses a set of heuristics: grammatical markers, linking words, frequent bigrams, domain dictionaries, simple regular-expression patterns, and length/punctuation parameters. The score is also emitted for monitoring and logging.
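The heuristics listed above could be combined roughly as in the following sketch. Everything in it is an assumption made for illustration: the tiny word lists, the weights, the noise pattern, and the names `scoreMeaningfulness` and `Meaningfulness` are hypothetical, and the tokenizer only handles Latin and Cyrillic letters. A real module would load per-language dictionaries and tune the weights on recorded STT output.

```typescript
// Illustrative sketch only: dictionaries, weights, and names are assumptions.
interface Meaningfulness {
  score: number; // 0 = noise, 1 = meaningful
  features: {
    tokenCount: number;
    functionWordRatio: number; // share of grammatical markers / linking words
    bigramHits: number;        // matches against a frequent-bigram list
    domainHits: number;        // matches against a domain dictionary
    noisePattern: boolean;     // e.g. one character repeated 4+ times
  };
}

const FUNCTION_WORDS = new Set(["i", "to", "the", "a", "is", "my", "and", "want", "я", "и", "в", "на", "не", "хочу"]);
const FREQUENT_BIGRAMS = new Set(["i want", "want to", "my account", "я хочу"]);
const DOMAIN_WORDS = new Set(["account", "balance", "payment", "счёт", "баланс"]);
const NOISE_PATTERN = /(.)\1{3,}/; // "hmmmm", "aaaa"

// Lowercases and splits on anything that is not a Latin/Cyrillic letter, digit, or apostrophe.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9\u0400-\u04ff']+/).filter((t) => t.length > 0);
}

function scoreMeaningfulness(text: string): Meaningfulness {
  const tokens = tokenize(text);
  const tokenCount = tokens.length;
  const functionWords = tokens.filter((t) => FUNCTION_WORDS.has(t)).length;
  const functionWordRatio = tokenCount === 0 ? 0 : functionWords / tokenCount;

  let bigramHits = 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    if (FREQUENT_BIGRAMS.has(`${tokens[i]} ${tokens[i + 1]}`)) bigramHits++;
  }
  const domainHits = tokens.filter((t) => DOMAIN_WORDS.has(t)).length;
  const noisePattern = NOISE_PATTERN.test(text);

  // Weighted sum of capped features; the weights are arbitrary placeholders.
  let score = 0;
  if (tokenCount >= 2) score += 0.2;
  score += 0.3 * Math.min(1, functionWordRatio * 2);
  score += 0.3 * Math.min(1, bigramHits / 2);
  score += 0.2 * Math.min(1, domainHits);
  if (noisePattern) score *= 0.5; // penalize obvious noise
  score = Math.min(1, Math.max(0, score));

  return { score, features: { tokenCount, functionWordRatio, bigramHits, domainHits, noisePattern } };
}

// Example usage: print the score and features for monitoring or logs.
console.log(JSON.stringify(scoreMeaningfulness("I want to check my account balance")));
console.log(JSON.stringify(scoreMeaningfulness("table window cat hmmmm")));
```

Because every feature is returned next to the score, a log entry shows why a phrase was rated low, which keeps the filter explainable without any ML inference.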