Talk: A Speech Recognition Toolkit Based on Python

Presented by Yotaro Kubo in the Scientific track 2010 on 2010/07/10, from 14:45 to 15:00, in room Dussane

SRTk is an open-source software framework for speech recognition implemented in Python [1]. Automatic speech recognition (ASR), which converts recorded speech audio signals into text transcriptions, is one of the most promising technologies for advanced human-machine interaction. Although the algorithms and heuristics used in this technology are still under active investigation and demand rapid implementation of state-of-the-art methods, current frameworks are mainly implemented in "heavyweight" languages. This is mainly due to the amount of data used in ASR research: the number of feature vectors used for model building is generally very large. This situation makes it difficult to implement ASR research software in an "easy-to-hack" way. In this framework, Python is used as a glue language over small high-performance modules written in C++ and exposed through Boost.Python. Since speech recognition systems are naturally modular, this architecture speeds up ASR methods without loss of flexibility.

This package installs several shell commands for manipulating the probabilistic models and data used in ASR, as well as several Python modules that support ASR research. Specifically, the following components are implemented to realize Python-based ASR.

  1. Continuous-density hidden Markov models (CD-HMMs) and their training methods

    In current ASR methods, CD-HMMs are widely accepted as models of speech feature sequences. Although some Python-based HMM implementations exist for modeling discrete univariate sequences, this is, to our knowledge, the first for continuous multivariate sequences. As training methods for CD-HMMs, maximum-likelihood training and maximum mutual information (MMI) training are implemented in this module. CD-HMMs are also widely used in other fields, such as gesture recognition and handwriting recognition.
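    To illustrate the core computation a CD-HMM performs, the following is a minimal NumPy sketch (not SRTk's actual API) of the forward algorithm for an HMM with diagonal-covariance Gaussian emissions, evaluated in the log domain for numerical stability:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(obs, log_pi, log_A, means, variances):
    """Log-likelihood of `obs` under a CD-HMM via the forward algorithm.

    obs:              (T, D) observation sequence
    log_pi:           (S,)   log initial-state probabilities
    log_A:            (S, S) log transition matrix
    means, variances: (S, D) per-state Gaussian parameters
    """
    T, _ = obs.shape
    S = log_pi.shape[0]
    # Emission log-probabilities for every (time, state) pair
    log_b = np.array([[log_gauss(obs[t], means[s], variances[s])
                       for s in range(S)] for t in range(T)])
    alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)
```

    Maximum-likelihood training (Baum-Welch) accumulates expected sufficient statistics from these forward (and backward) passes; this sketch shows only the likelihood evaluation.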

  2. OpenFST wrapper

    The weighted finite-state transducer (WFST), an extension of the finite-state automaton (FSA), is an important model for natural language processing as well as speech recognition. A WFST represents a transduction between symbol sequences. Since ASR can be viewed as a transduction process that converts phoneme sequences into word sequences, algorithms for WFST manipulation are useful for ASR.
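    To give a feel for the kind of operation such a wrapper exposes, here is a minimal sketch of WFST composition over the tropical semiring (weights add along a path; epsilon transitions are omitted for brevity). The dict-based representation and function name are illustrative, not OpenFST's or SRTk's API:

```python
from collections import deque

def compose(fst_a, fst_b, start_a, start_b, finals_a, finals_b):
    """Compose two WFSTs over the tropical semiring (arc weights add).

    Each FST maps state -> list of (in_sym, out_sym, weight, next_state);
    finals_* map final states to final weights. The result transduces
    A's input symbols to B's output symbols wherever A's output matches
    B's input.
    """
    start = (start_a, start_b)
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = queue.popleft()
        out = arcs.setdefault((qa, qb), [])
        for ia, oa, wa, na in fst_a.get(qa, []):
            for ib, ob, wb, nb in fst_b.get(qb, []):
                if oa == ib:  # A's output feeds B's input
                    nxt = (na, nb)
                    out.append((ia, ob, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
        if qa in finals_a and qb in finals_b:
            finals[(qa, qb)] = finals_a[qa] + finals_b[qb]
    return arcs, finals, start
```

    In ASR, repeated composition of this kind combines the pronunciation lexicon and the language model into a single search network.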

  3. Speech decoder based on beam search for WFSTs

    To obtain speech recognition results, a shortest-path algorithm over WFSTs is generally applied. A beam-search method for WFSTs and HMMs is implemented in this module.
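    The following is a minimal sketch of time-synchronous beam search over a WFST, assuming per-frame acoustic log-probabilities have already been computed; the data layout and function name are illustrative, not the decoder implemented in SRTk:

```python
import math

def beam_search(fst, start, finals, frame_logps, beam=4):
    """Time-synchronous Viterbi beam search over a WFST (a minimal sketch).

    fst:         state -> list of (in_sym, out_sym, arc_logp, next_state)
    finals:      final state -> final log-weight
    frame_logps: per-frame dicts mapping input symbol -> acoustic log-prob
    Returns the best-scoring output-symbol sequence.
    """
    # Hypotheses: state -> (path score, output sequence so far)
    hyps = {start: (0.0, ())}
    for logps in frame_logps:
        new = {}
        for state, (score, words) in hyps.items():
            for in_sym, out_sym, arc_logp, nxt in fst.get(state, []):
                if in_sym not in logps:
                    continue
                s = score + arc_logp + logps[in_sym]
                w = words + ((out_sym,) if out_sym is not None else ())
                if nxt not in new or s > new[nxt][0]:
                    new[nxt] = (s, w)  # Viterbi: keep best path per state
        # Prune: keep only the `beam` highest-scoring states
        hyps = dict(sorted(new.items(), key=lambda kv: -kv[1][0])[:beam])
    best = max(((s + finals.get(st, -math.inf), w)
                for st, (s, w) in hyps.items()),
               default=(-math.inf, ()))
    return best[1]
```

    Pruning to a fixed beam width trades exactness of the shortest path for decoding speed, which is the standard compromise in ASR decoders.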

  4. Speech signal processing tools

    Since raw samples of recorded speech signals are generally intractable to model directly, feature extraction methods, which convert sample sequences into sequences of feature vectors, are applied first. This software framework implements an extractor for the most widely used features, Mel-frequency cepstral coefficients (MFCCs).
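    A compact sketch of the standard MFCC pipeline (windowing, power spectrum, mel filterbank, log, DCT-II) is shown below. The parameter defaults are typical values for 16 kHz speech (25 ms frames, 10 ms hop), not SRTk's actual configuration, and details such as pre-emphasis and liftering are omitted:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=23, n_ceps=13,
         frame_len=400, hop=160):
    """Sketch of MFCC extraction: window -> power spectrum ->
    mel filterbank -> log -> DCT-II."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular mel filterbank on the FFT-bin axis
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)

    # Frame the signal and take windowed power spectra
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Log mel energies, then DCT-II to decorrelate the coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return logmel @ dct.T
```

    The result is one low-dimensional feature vector per 10 ms frame, which is exactly the multivariate sequence the CD-HMM module above is designed to model.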

The performance of this software framework is evaluated using the open-source speech corpus "VoxForge" and the standard academic benchmark "TIMIT". Basic ASR theory, the implementation details of this framework, and its basic usage will be presented in the session.