
Training Whisper Small to Recognize Kazakh Speech

Speech recognition (Speech-to-Text) systems are widely used today in voice assistants, chatbots, automatic translation services, and other solutions that simplify human interaction with computer systems. Support for local languages, such as Kazakh, is especially important because there are often not enough ready-made solutions for these languages.

In this article, we will look at how to fine-tune OpenAI's Whisper Small model on the Mozilla Common Voice dataset for Kazakh speech recognition, why such a model is useful, and what application scenarios it may have.

Whisper is a family of speech recognition models developed by OpenAI. They are known for high accuracy and the ability to work with many languages, but each model size has its own limitations. Whisper Small is one of the more compact versions; out of the box it performs best on English, so low-resource languages such as Kazakh benefit noticeably from fine-tuning.

Advantages of Whisper Small:

  • Relatively small size, which simplifies training and deployment.
  • Sufficient accuracy for a wide range of tasks (especially after fine-tuning).
  • The ability to adapt to new languages and domains.

The expansion of support for the Kazakh language in the field of speech recognition provides a number of advantages:

  1. Inclusivity: Convenience for native speakers, including people with disabilities.
  2. Automation: Fast transcription of meetings, lectures and telephone conversations in Kazakh.
  3. Education: Creating interactive applications for learning a language or conducting tests.
  4. Media content: Automatic generation of subtitles in Kazakh.
  5. Ecosystem development: Stimulating the development of local products and solutions.

The following methods were used for training:

  • Google Colab with a T4 GPU (about 5 hours of training; a quick GPU-availability check is shown after this list).
  • The Kazakh subset of the Mozilla Common Voice dataset.
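Before starting, it is worth confirming that a GPU runtime is actually enabled in Colab. A minimal check (PyTorch comes preinstalled in Colab, so nothing extra is required):

import torch

# Print the name of the visible CUDA device, if any
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found: enable one via Runtime -> Change runtime type")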

During the experiment, the model was trained for 4,000 steps, with an intermediate evaluation every 1,000 steps. Below is a summary of the key metrics: Training Loss, Validation Loss, and WER (Word Error Rate, the percentage of incorrectly recognized words):

Training Loss | Epoch   | Step | Validation Loss | WER
0.0059        | 6.0976  | 1000 | 0.0138          | 2.0531
0.0003        | 12.1951 | 2000 | 0.0006          | 1.8636
0.0001        | 18.2927 | 3000 | 0.0002          | 0.0
0.0001        | 24.3902 | 4000 | 0.0002          | 0.0

As the table shows, by the end of training the WER on the validation set reached 0%, meaning the model makes no errors on the test samples from this dataset. Results on real-world data may differ, so it is important to additionally check quality on more diverse examples.
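For readers less familiar with WER: it counts word-level substitutions, insertions, and deletions against the reference transcript. A tiny illustration with the evaluate library (installed in the setup step below) on two made-up Kazakh sentences, just to show how the metric behaves:

import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical example: one wrong word out of seven reference words
references  = ["мен қазақша сөйлеймін", "бүгін ауа райы жақсы"]
predictions = ["мен қазақша сөйлеймін", "бүгін ауа райы жаман"]

# The metric returns a fraction; multiply by 100 to get a percentage (~14.3 here)
print(100 * wer_metric.compute(predictions=predictions, references=references))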

Below is the code used to train the Whisper Small model on a Kazakh sample from Mozilla Common Voice. The code can be run in Google Colab or another environment that supports Python.

# Update pip to the latest version
!pip install --upgrade --quiet pip

# Installing libraries for audio processing, training, evaluation and the web interface
!pip install --upgrade --quiet datasets[audio] transformers accelerate evaluate jiwer tensorboard
!pip install gradio==3.41.0
  1. datasets[audio] — lets you download and process audio data from standard datasets (for example, Mozilla Common Voice).
  2. transformers — the Hugging Face library for working with transformer models (including Whisper).
  3. accelerate — speeds up training on GPUs/TPUs.
  4. evaluate, jiwer — tools for computing metrics, including WER.
  5. tensorboard — visualizes training metrics.
  6. gradio — builds a simple web application (demo).
from huggingface_hub import notebook_login

notebook_login()

To be able to upload the model to your Hugging Face repository, you need to log in with an access token.
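notebook_login() is convenient in Colab and Jupyter; in a plain Python script you can log in programmatically instead. A minimal alternative (the token string is a placeholder; create a real write token at https://huggingface.co/settings/tokens):

from huggingface_hub import login

# Equivalent to notebook_login(), but works outside notebooks
login(token="hf_...")  # placeholder, not a real token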

from datasets import load_dataset, DatasetDict

# Creating a DatasetDict object
common_voice = DatasetDict()

# Loading the training (train+validation+validated) and test splits
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "kk", split="train+validation+validated")
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "kk", split="test")

# Deleting unnecessary columns
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
  1. load_dataset — loads Mozilla Common Voice version 17.0 for the Kazakh language (the "kk" configuration).
  2. We combine three splits (train + validation + validated) into a single training set.
  3. We drop unused fields to reduce the dataset size and simplify preprocessing (a quick sanity check of the result is shown below).
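A quick sanity check of what we ended up with (the exact row counts depend on the dataset version that gets downloaded):

# Print the remaining columns and the number of rows in each split
print(common_voice)

# Look at one raw example: it should contain an "audio" dict and a "sentence" string
print(common_voice["train"][0]["sentence"])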
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Kazakh", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Kazakh", task="transcribe")
  • WhisperFeatureExtractor — converts the audio signal into a log-Mel spectrogram.
  • WhisperTokenizer — turns text into token ids.
  • WhisperProcessor — a wrapper that combines the feature extractor and the tokenizer (a small round-trip check is shown below).
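A small round-trip check of the tokenizer on an arbitrary Kazakh phrase (the phrase itself is just an example, not taken from the dataset):

sample_text = "Сәлем, қалайсың?"  # example phrase

# Encode into token ids (language/task special tokens are added automatically)
ids = tokenizer(sample_text).input_ids

# Decode back without special tokens; the result should match the original text
print(tokenizer.decode(ids, skip_special_tokens=True))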
from datasets import Audio

# Resampling the audio to 16 kHz, which the model expects
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Converting the audio signal into log-Mel features
    batch["input_features"] = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Tokenizing the text
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# Apply the function to the entire dataset
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
  1. cast_column — casts the audio column to the Audio feature with a 16 kHz sampling rate, so files are resampled on the fly.
  2. prepare_dataset — the main preprocessing step: we compute the spectrogram and tokenize the text.
  3. map — applies the function to every element of the dataset; num_proc=2 enables parallel processing (a quick check of the prepared features is shown below).
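After preprocessing, each example should contain a fixed-size log-Mel feature matrix (Whisper pads or truncates audio to 30 seconds) plus a list of label token ids. A quick check of the first training example:

import numpy as np

sample = common_voice["train"][0]

# For whisper-small this should be 80 Mel bins x 3000 frames (30 seconds of audio)
print(np.array(sample["input_features"]).shape)

# Number of label tokens for this sentence
print(len(sample["labels"]))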
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "Kazakh"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
  • WhisperForConditionalGeneration — the model that generates text from audio.
  • We set the language and task in generation_config and clear forced_decoder_ids so they do not override these settings (a quick parameter count is shown below).
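To get a feel for the model size, you can count its parameters; for Whisper Small the total should be roughly 240 million:

# Total number of parameters, in millions
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")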

Let's create a collator (a special object that will prepare batches for training):

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extracting and padding the audio features
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Extracting and padding the labels
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replacing padding tokens with -100 so they are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Removing the decoder start token if it was already added during tokenization
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
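Before training, it can be reassuring to run the collator on a couple of prepared examples and check the shapes; this is purely a sanity check, not part of the training pipeline:

# Build one small batch by hand from the first two training examples
sample_batch = data_collator([common_voice["train"][i] for i in range(2)])

# input_features: (batch, 80 Mel bins, 3000 frames); labels: (batch, max label length)
print(sample_batch["input_features"].shape)
print(sample_batch["labels"].shape)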
import evaluate
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # Decoding predictions and reference labels into text
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Computing WER (Word Error Rate)
    wer_value = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer_value}
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-kk",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

processor.save_pretrained(training_args.output_dir)
  • max_steps=4000 — the total number of training steps.
  • evaluation_strategy="steps" — we will evaluate the model after a certain interval of steps.
  • save_steps=1000 and eval_steps=1000 — every 1000 steps we save a checkpoint and evaluate the model.
  • logging_steps=25 — log metrics every 25 steps.
  • load_best_model_at_end=True — at the end we will load the best checkpoint according to the WER metric.
trainer.train()

At this stage, training begins. It is important to monitor the metrics in TensorBoard or directly in the logs (see the snippets below).
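In Colab, TensorBoard can be opened right next to the training cell, and if the session disconnects you can continue from the latest saved checkpoint (both snippets assume the output_dir used above):

# In a separate cell: open TensorBoard pointed at the output directory
%load_ext tensorboard
%tensorboard --logdir ./whisper-small-kk

# If training was interrupted, resume from the most recent checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)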

trainer.push_to_hub()

By default, Trainer pushes to a repository named after output_dir under your account (here, whisper-small-kk). To publish under a different name, for example "armanibadboy/whisper-small-kazakh", set hub_model_id in Seq2SeqTrainingArguments before training. Since we already logged in with notebook_login(), no explicit token is needed.

To test the model in real time, you can create a simple web interface with Gradio:

from transformers import pipeline
import gradio as gr

# Loading our fine-tuned model (replace with your repository name/path)
pipe = pipeline(model="armanibadboy/whisper-small-kk")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Whisper Small Kazakh",
    description="Transcription of audio into text for the Kazakh language (fine-tuning Whisper Small).",
)

iface.launch()

Now, when you run this cell, you get a web application that can record audio from the microphone; the model recognizes the speech and outputs the Kazakh text.
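The same pipeline can also be used without the web interface, for example to transcribe an existing recording (the file name below is a placeholder):

# Transcribe a local audio file; longer recordings are processed in 30-second chunks
result = pipe("example_kazakh_audio.wav", chunk_length_s=30)
print(result["text"])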

By training the Whisper Small model on the Kazakh Mozilla Common Voice dataset, we have obtained a system capable of effectively recognizing speech in Kazakh. This opens up broad prospects for the development of local applications, from voice assistants and educational platforms to automatic audio/video transcription services.

Fine-tuning a relatively small model (Whisper Small) turned out to be inexpensive (about 10 credits in Google Colab Pro) and reasonably fast (about 5 hours on a T4 GPU). The final results show good recognition accuracy (WER down to 0% on the test split), although this figure may vary under real-world conditions.

If you want to take this project further, add more data, improve the preprocessing pipeline, and experiment with hyperparameters. Good luck with your research!
