How to create a local offline voice assistant in Python with Faster Whisper and Ollama

In this article, we'll look at how to write a simple but completely offline voice assistant in Python. It can:

  1. Listen to the user's speech through the microphone.
  2. Recognize speech (speech-to-text) using Faster Whisper, a local implementation of the Whisper model that does not require an internet connection.
  3. Analyze the question and generate an answer using Ollama, a locally running LLM (large language model) that needs no cloud server.
  4. Speak the response text (text-to-speech) via the pyttsx3 library.

The main advantage of such an assistant is that it works without the internet: your personal data (voice recordings, transcribed text, dialogue history) is never transmitted to any external service.

First, make sure that you have the following packages installed:

  • PyAudio – records audio from the microphone.
  • wave – saves the recorded audio to a WAV file (part of the Python standard library, nothing to install).
  • Faster Whisper – a local speech-recognition library built on the Whisper model.
  • Ollama – a tool for running LLMs locally without an internet connection (the pip package below installs only its Python client; the Ollama runtime itself is installed separately).
  • pyttsx3 – a speech-synthesis (TTS) library that also works offline.
pip install pyaudio faster-whisper ollama pyttsx3
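
Note that the language model used later in this article has to be pulled once while you still have a connection; after that, everything runs offline. Assuming the Ollama runtime is already installed, this is a one-time step:

ollama pull qwen2.5-coder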

The main stages of the assistant's work:

  1. Initialization:
     • importing the libraries;
     • starting PyAudio and configuring the recording parameters;
     • loading the local Whisper model (Faster Whisper);
     • initializing the speech-synthesis engine (pyttsx3);
     • checking that Ollama is available (make sure it is running locally and does not require network access).
  2. Recording a voice command:
     • the code opens a PyAudio recording stream and "listens" to the microphone;
     • it detects when the user has stopped speaking from the volume level (RMS);
     • after a few seconds of silence it automatically stops recording and saves the audio to a WAV file.
  3. Speech recognition (speech-to-text):
     • the WAV file is passed to Faster Whisper (a local model);
     • the model returns the transcribed text without sending any data to the cloud.
  4. Generating a response:
     • the transcribed text (the user's question) is appended to the dialogue history;
     • the entire history is sent to Ollama, which runs locally without the internet;
     • Ollama generates a response using the loaded large language model.
  5. Speaking the response (text-to-speech):
     • the assistant reads the answer aloud through pyttsx3;
     • everything happens offline and does not depend on online services.

Below is an example of the code that implements the described functions.

import pyaudio
import wave
import struct
import math
from faster_whisper import WhisperModel
import ollama
import pyttsx3

engine = pyttsx3.init()

model_size = "large-v3"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

CHUNK = 1024 # Number of samples in one frame
RATE = 16000 # Sampling rate
FORMAT = pyaudio.paInt16
CHANNELS = 1

# The volume threshold below which we consider that the user is "silent"
THRESHOLD = 300

# How many consecutive seconds of "silence" before recording stops
SILENCE_LIMIT = 2.0

# The entire dialogue history is stored here
messages = []

def rms(data):
    """
    Calculate the approximate volume (RMS) of one block of audio bytes.
    data: byte string read from PyAudio
    """
    count = len(data) // 2  # number of int16 samples in the block
    if count == 0:
        return 0

    format_str = "<" + "h" * count
    shorts = struct.unpack_from(format_str, data)

    sum_squares = 0.0
    for sample in shorts:
        sum_squares += sample * sample

    return math.sqrt(sum_squares / count)

def record_once(filename="audio.wav"):
    """
    Record one fragment of speech (until silence) and save it to a WAV file.
    Returns the path to the recorded file.
    """
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    frames = []
    print("Recording started (speak now)...")

    silence_counter = 0  # counter of consecutive "silent" chunks
    while True:
        data = stream.read(CHUNK)
        frames.append(data)
        # Calculate the volume of the current chunk
        current_rms = rms(data)
        if current_rms < THRESHOLD:
            # Volume below the threshold => "silence"
            silence_counter += 1
        else:
            silence_counter = 0
        # If the accumulated silence exceeds SILENCE_LIMIT seconds, stop
        if silence_counter * (CHUNK / RATE) > SILENCE_LIMIT:
            break
    print("Recording finished.")

    stream.stop_stream()
    stream.close()

    # Save the audio to a WAV file
    wf = wave.open(filename, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    p.terminate()  # release PortAudio only after querying the sample size
    return filename

def transcribe(filename):
    """
    Transcribe the file using faster_whisper.
    Returns the recognized text.
    """
    segments, info = model.transcribe(filename, beam_size=8)
    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
    texts = ''
    for segment in segments:
        texts += segment.text
    return texts

def main():
    while True:
        # 1. Record a fragment of speech
        audio_path = record_once("audio.wav")
        # 2. Transcribe it
        user_text = transcribe(audio_path)
        # If recognition returns an empty string, keep listening
        if not user_text.strip():
            print("Nothing was said or nothing was recognized, continuing to listen...")
            continue

        print("The user said:", user_text)

        # 3. Append the question to the history as a new user message
        messages.append({"role": "user", "content": user_text})

        # 4. Send the ENTIRE history to Ollama
        # (ollama.chat accepts a list of messages)
        response = ollama.chat(
            'qwen2.5-coder:latest',
            messages=messages,
        )
        # 5. Extract the response text
        assistant_text = response['message']['content']
        print("Assistant replied:", assistant_text)

        # 6. Save the assistant's response to the history
        messages.append({"role": "assistant", "content": assistant_text})

        # 7. Speak the response
        engine.say(assistant_text)
        engine.runAndWait()
        # Then the loop repeats: record again, and so on.

if __name__ == "__main__":
    main()
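
If PyAudio records from the wrong microphone, you can list the available input devices and pass the desired index to p.open() via its input_device_index parameter. A minimal sketch using PyAudio's standard device-enumeration API:

import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info['maxInputChannels'] > 0:  # keep only input-capable devices
        print(i, info['name'])
p.terminate()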
Why everything stays local:

  1. Faster Whisper is a local version of Whisper: all audio processing for speech recognition happens on your computer.
  2. Ollama works completely without an internet connection if you download the required language model in advance; the response is generated locally.
  3. pyttsx3 is a text-to-speech library that does not use external APIs: everything is synthesized on your machine.
  4. PyAudio reads audio from the microphone locally and does not send it anywhere over the network.

Thus, no stage of the assistant's work sends your data to the Internet. You control which models and versions of Whisper (or other LLMs) you use.
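
For example, both models can be swapped with a one-line change (the model names below are illustrative; any Whisper size, or any model you have pulled into Ollama, will work):

# A smaller Whisper model: faster, somewhat less accurate
model = WhisperModel("small", device="cpu", compute_type="int8")

# A different local LLM (must be pulled in advance with `ollama pull`)
response = ollama.chat('llama3.2', messages=messages)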

A few ideas for improvement:

  • Model selection: try different Faster Whisper model sizes (small, medium, large) to find the balance between speed and recognition accuracy; swapping a model is a one-line change, as shown in the sketch above.
  • Noise reduction: if your environment is noisy, you can add a noise-reduction library to improve recognition quality (see the sketch after this list).
  • Alternative language models: Ollama supports not only qwen2.5-coder but also other models. Choose the one best suited to your tasks (code generation, general chat, translation, and so on).
  • Graphical interface: add a simple GUI (for example, with PyQt or tkinter) to control the assistant and display the recognized text and responses.
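
As an illustration of the noise-reduction idea, here is a minimal sketch using the third-party noisereduce package (pip install noisereduce; this particular library is a suggestion, not something the assistant above depends on). It cleans the raw int16 frames before they are written to the WAV file:

import numpy as np
import noisereduce as nr  # pip install noisereduce

def denoise(frames, rate=RATE):
    # Convert raw int16 bytes to floats, denoise, and convert back to bytes
    audio = np.frombuffer(b''.join(frames), dtype=np.int16).astype(np.float32)
    reduced = nr.reduce_noise(y=audio, sr=rate)
    return reduced.astype(np.int16).tobytes()

# In record_once(), replace wf.writeframes(b''.join(frames)) with:
# wf.writeframes(denoise(frames))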

We have created an offline voice assistant that works completely without the internet: from microphone recording and speech recognition to generating and speaking a response. This is a great way to keep personal data private and flexibly tailor the system to your needs. Now you can extend the assistant's functionality by teaching it to execute different commands and handle complex dialogues (a starting point is sketched below), all locally and with full control over your data.
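
As a starting point for command handling, you could check the transcribed text before sending it to the model. A minimal sketch (the trigger phrases are arbitrary examples):

def handle_command(user_text):
    # Returns True if the text was handled as a built-in command
    text = user_text.lower()
    if "stop listening" in text:
        engine.say("Goodbye!")
        engine.runAndWait()
        raise SystemExit
    if "clear history" in text:
        messages.clear()
        engine.say("Dialogue history cleared.")
        engine.runAndWait()
        return True
    return False

# In main(), right after transcription:
# if handle_command(user_text):
#     continue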

Experiment, add new features, and let your local assistant become an indispensable tool in your work and daily life!
