Browsing by Subject "Speech processing systems"
Now showing 1 - 8 of 8
Item A speech processing application for the Huberman-Hogg neural network model (Texas Tech University, 1988-12) Taylor, Valerie J
A typical word recognition system requires that several major tasks be performed; necessary components include (1) a preprocessor to extract the significant information from the speech time waveform, (2) a section which stores the training set of word models or templates and then compares an unknown input pattern with the training set, and (3) decision logic to determine the best matching word. This thesis reports on experiments that explore isolated word recognition with an artificial neural network based on the Huberman-Hogg (H-H) model. The results presented in this manuscript were developed from computer simulations of the speech recognition system, but an electro-optical H-H system is also proposed and described. The principal goal of the experimental work is to test the suitability of the ambiguity function representation in preprocessing speech data. Employing the ambiguity function for the speech signal representation was expected to provide two advantages: the input patterns to the H-H network should become less sensitive to time shifts of the total speech waveform, perhaps even making time alignment of the words unnecessary; and the ambiguity function of a signal can be obtained in real time with a coherent optical processor, as shown by Marks, Walkup, and Krile (1977), to provide two-dimensional input to an electro-optical H-H network. Since studies indicate that the H-H neural network effectively processes a variety of input functions, this network was chosen as a classifier/recognizer for the ambiguity function patterns representing speech data. Ambiguity functions for isolated words are generated from digitized voice recordings and then submitted to the H-H network for training and recognition testing.
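The shift-insensitivity claimed for the ambiguity function representation can be illustrated with a short sketch (not from the thesis; NumPy, the circular-lag convention, and the function name are my assumptions): the magnitude of the discrete ambiguity surface is unchanged when the input signal is circularly time-shifted.

```python
import numpy as np

def ambiguity_function(s):
    """Magnitude of the discrete narrowband ambiguity function of a 1-D signal.

    For each lag tau, the Doppler axis is the FFT over time of
    s[t] * conj(s[t - tau]); lags are taken circularly for simplicity.
    """
    s = np.asarray(s, dtype=complex)
    n = len(s)
    A = np.zeros((n, n), dtype=complex)
    for tau in range(n):
        shifted = np.roll(s, tau)          # shifted[t] = s[t - tau] (mod n)
        A[tau] = np.fft.fft(s * np.conj(shifted))
    return np.abs(A)
```

Under this circular convention, a circular time shift of the signal only rotates each lag product in time, which changes FFT phases but not magnitudes, so the pattern fed to a recognizer is shift-insensitive.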
Which pattern of the training set best matches the unknown pattern is a decision clearly dependent on the distance metric employed, and these experiments explore the use of several similarity measures. Following an introductory discussion, including an overview of speech processing, the radar ambiguity function, and the Huberman-Hogg neural network model, is a description of the experimental arrangement. Both the components of the system and the simulation software are treated. The next section gives the particulars of the various experimental conditions and results. It was found that the ambiguity function performed as desired, acting as a representation that allows the system to become less shift sensitive; however, the neural network processing, at least with the parameter set and decision logic employed, did not yield any increase in the recognition capabilities of the system. Several potential problem areas are identified and suggestions are made for future studies.

Item An algorithm for voice segregation (Texas Tech University, 2004-05) Akrofi, Kwaku
The ability to focus on one voice or sound of interest, the target voice, in the presence of one, several, or many interfering sounds is known as the cocktail party effect. Many hearing disabilities limit the ability of the patient to isolate the target voice. Voice segregation is the isolation of the target voice by artificial/computational means. Of the many computational automatic voice segregation (AVS) methods available, spectral subtraction (SS) is one whose simplicity and efficiency make it practical to use for a hearing aid. In SS, the spectrum of an estimate of the interfering sounds is subtracted from the input signal to obtain an estimate of the target voice. A drawback of SS is the fact that it can only effectively isolate the target voice when the interfering sounds are stationary.
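The per-frame subtraction step described above can be sketched as follows (a minimal illustration, not the thesis implementation; the spectral floor, its value, and the function names are assumptions). The noise magnitude spectrum is estimated during pauses in the target voice and subtracted bin by bin, keeping the noisy phase:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.02):
    """One frame of magnitude spectral subtraction.

    frame     : time-domain samples of the noisy frame
    noise_mag : magnitude spectrum estimated during target-voice pauses
    floor     : spectral floor (fraction of the noisy magnitude) that
                limits over-subtraction, a common cause of musical noise
    """
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    phase = np.angle(spec)                      # noisy phase is reused as-is
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

With a stationary interferer (e.g., a steady tone) the subtraction removes nearly all of it; only the small floor residual remains, which is the seed of the "musical noise" the abstract mentions.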
However, in many everyday situations, e.g., at a cafeteria or in the vicinity of ventilation fans, the interfering sounds are stationary. This thesis presents the design, implementation, and assessment of a quasi-real-time, single-channel, frame-by-frame SS-based AVS algorithm that could be used for a hearing aid. Two methods of voice activity detection (VAD) for the algorithm are assessed in this document: energy/zero-crossing-rate VAD (EZVAD) and entropy VAD (EVAD). Since SS can only make an estimate of the spectrum of the interfering sounds during breaks in the target voice, VAD is necessary to detect periods in the input signal where the target voice may be absent. A computationally simple fundamental frequency estimator (FoE) that also tracks the gradient of the fundamental frequency, fo, of each frame is also designed and tested. The function of the FoE is to facilitate voice segregation when the interfering sounds are pitched sounds. Tests on a MATLAB implementation of the design showed that SS does perform well provided the interfering sounds are stationary. A problem is the persistence of "musical noise" in the output of the system. Techniques that significantly reduce musical noise can only be implemented on non-real-time SS systems. Also, EVAD was found to be feasible only when the system is non-real-time. The FoE was able to track the target voice of an utterance that was recorded in isolation, but only under certain constraints that are not practical in everyday situations. Hence, the final design comprised just an SS algorithm and an EZVAD.
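An energy/zero-crossing-rate VAD of the kind the final design retains can be sketched per frame as follows (illustrative only; the thresholds, their combination rule, and the function name are my assumptions, not the thesis's):

```python
import numpy as np

def ezvad(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Energy / zero-crossing-rate voice activity decision for one frame.

    Flags a frame as containing the target voice when its energy is high
    (voiced speech), or when moderate energy coincides with a high
    zero-crossing rate (unvoiced fricatives). Thresholds are illustrative
    and would be tuned to the recording conditions.
    """
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)
    # fraction of adjacent sample pairs whose sign differs
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return bool(energy > energy_thresh
                or (energy > 0.1 * energy_thresh and zcr > zcr_thresh))
```

Frames flagged as speech-free are the only ones from which SS may update its noise spectrum estimate.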
The conclusion drawn by this thesis is that a simple SS-based AVS algorithm that uses EZVAD can significantly reduce near-random stationary interfering sounds, as well as interfering sounds that consist of pure tones.

Item Speaker independent real-time speech recognition system (Texas Tech University, 1998-08) Jindani, Abid M
This thesis attempts to develop a real-time speaker-independent Automatic Speech Recognition (ASR) system. The system recognizes isolated utterances from a limited vocabulary, and is small and cost-efficient enough to be incorporated into a consumer appliance. The recognition is based on zero-crossing and energy content measurements on the speech waveforms. The algorithm segments the speech waveform into ten equally spaced intervals and performs a match with the patterns in a reference template. The system was implemented on an IBM Personal Computer and achieved an error rate of 0% on a vocabulary of four words from an initial ten-word database of 16 speakers (8 male and 8 female). The system recognized unknown utterances in less than 0.3 seconds.

Item Speech data compression (Texas Tech University, 1996-08) Ho, Chien-Te
The analysis-by-synthesis method is the most useful application of the parametric representation. The necessary components for the model are derived from signal analysis procedures, while the output speech waveform is obtained from the synthesis procedure. This method, such as the Code Excited Linear Prediction (CELP) coder [1], is first implemented in the time domain. The basic approach is to model the correlation among the speech samples by using a linear time-varying filter. An excitation model can then be obtained by removing the correlation. Since the filter will not ignore the noise, the parametric representation does have problems with noisy speech data. An alternative procedure is to implement the technique in the frequency domain. This leads to a flexible method for lower-bit-rate transmission.
Furthermore, it provides a suitable way to model the filter in a noisy environment. Methods such as the harmonic vocoder and the Multiband Excitation Coder (MBE) [4] are all frequency domain techniques. Since the speech data is recovered from the parametric model, the output depends on the model parameters, which may greatly affect the quality of the speech. The objective of this thesis is to develop efficient algorithms for implementing the harmonic vocoder in the frequency domain. A reliable method is developed to realize the analysis procedure and to obtain the correct fundamental elements of the speech signal. An efficient method is proposed to synthesize the output speech signal and to improve speech quality. Also, the techniques of model refinement and enhancement are described in this thesis. In practice, the analogue speech signal is sampled at 8000 Hz, and this rate is used throughout this research. The research is concentrated on methods for speech data compression and speech quality improvement rather than on coding schemes.

Item Speech recognition in individuals with dysarthria (Texas Tech University, 2000-05) Acrey, Adrienne M.
The purpose of this study was to compare the effects of speech training on the recognition accuracy of a speech recognition system (i.e., DragonDictate) for three speakers with moderate dysarthria and three typical speakers. A pretest was administered to measure speech intelligibility and mental state. Each subject participated in training sessions with the computer that involved the repetition of 70 stimulus items. Stimulus items were selected from a word list which contained acoustic-phonetic contrasts. The results indicated superior recognition accuracy scores for typical speakers in contrast to speakers with dysarthria. Additionally, speakers with dysarthria required more sessions to achieve ceiling on recognition scores in comparison to the typical speakers.
In summary, the speakers with dysarthria were able to obtain high recognition accuracy scores after training the system.

Item Speech system for a voice-impaired person (Texas Tech University, 1999-12) Sirigineedi, Ravi Kumar Anjani
This thesis attempts to develop a speaker-dependent speech system for voice-impaired people. The system recognizes isolated utterances from a limited vocabulary, and is small and cost-efficient enough to be incorporated into a hand-held system. A 20-dimensional feature vector was generated based on zero-crossing and energy content measurements of the speech waveforms. The generated feature vectors were used to train a neural network, and the trained network was tested with known and unknown utterances. The system was implemented on an IBM Personal Computer and achieved a recognition rate of 76% on a ten-word database of 16 speakers (8 male and 8 female). A test database, which mimics a voice-impaired person's speech, was developed, and a recognition rate of 60% was observed. The system recognized utterances at an average rate of 0.15 seconds/recognition.

Item Voice input for decision support systems: the use of multiple discriminant analysis for word recognition (Texas Tech University, 1987-08) Parameswaran, Jagadeeswaran
A Decision Support System (DSS) is characterized by flexibility, ease of use, interactive capability, and the capacity to support managerial decision making in ill-structured situations. The infrastructure of a DSS has been viewed to consist of a database, a modelbase, a user interface, and perhaps a knowledgebase. Most DSS research has been directed towards the modules of database, modelbase, and knowledgebase. The work relevant to the user interface is limited. There is conclusive evidence showing that, within a problem-solving context, voice interaction is superior to other modes in terms of speed and task efficiency. Since speech recognition is an emerging field, only a few commercial systems are currently available.
About 5% of the recognizers sold so far are still in use. Two major problems are: (i) unpredictable performance in terms of recognition accuracy, and (ii) the tendency of inexpensive systems to compromise on algorithms. This study explores the possibility of a reliable voice input module for a DSS. Specifically, Multiple Discriminant Analysis (MDA) is used in modeling a speaker-trained, isolated word recognition environment. A design framework for MDA-based recognizers is proposed. It provides details of the alternatives available and guidelines for prototyping. Factors such as the training effort, the number of variables, estimation of covariance matrices, word population separations, computational requirements, ease of implementation in a DSS environment, etc., lead to the choice of a Linear Multiple Discriminant Analysis (LMDA) approach. This study compares the proposed LMDA model to a model based on Dynamic Time Warping (DTW) on performance criteria including accuracy, storage, and computational requirements. Part of the same Texas Instruments (TI) database which was used in evaluating seven popular commercial recognizers was used to compare the substitution error and rejection error. Training size and order of analysis were controlled and maintained across the LMDA and DTW methods. The results validate the previous work with respect to training size, in that performance improved with up to 4 repetitions. With respect to substitution error, the better performance of the LMDA models is statistically validated. There was no statistically significant difference with respect to rejection error. The results indicate that LMDA performance in reduced space peaks prior to reaching the full discriminant space. Inclusion of the last few discriminant functions tends to introduce distortion. It is recommended that the LMDA model be operated in reduced space. The computational requirements of the LMDA and DTW methods are compared using analysis of algorithms.
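For reference, the DTW baseline against which LMDA is compared computes an optimal-alignment distance between two feature sequences by dynamic programming; a minimal one-dimensional sketch (the local cost, the unconstrained warping path, and the function name are illustrative assumptions, not the thesis's configuration):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences,
    using absolute difference as the local cost and allowing steps
    (i-1, j), (i, j-1), and (i-1, j-1)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

The O(nm) table fill per template comparison is the recognition-time cost that the study's analysis of algorithms weighs against LMDA's fixed projections.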
Even in full discriminant space, the LMDA approach is superior to the DTW method with respect to computational requirements. For the user-trained isolated word recognition problem, the LMDA approach involves a computationally higher training cost but a reduced recognition cost. This study is limited to LMDA-based user-trained isolated word recognition systems only. The vocabulary size was also small. This research can be extended to a large DSS vocabulary with various interface modes, such as command-driven or menu-driven.
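Two of the theses listed above (Jindani's segment-matching recognizer and Sirigineedi's 20-dimensional feature vector) build features from per-segment energy and zero-crossing measurements; a minimal sketch of such a front end (the segment count, feature ordering, and function name are my assumptions):

```python
import numpy as np

def zcr_energy_features(x, n_segments=10):
    """Feature vector of per-segment energy and zero-crossing rate over
    n_segments equal intervals; with n_segments=10 this yields the kind
    of 20-dimensional vector described above."""
    x = np.asarray(x, dtype=float)
    feats = []
    for seg in np.array_split(x, n_segments):
        feats.append(np.mean(seg ** 2))                           # energy
        feats.append(np.mean(np.abs(np.diff(np.sign(seg))) > 0))  # ZCR
    return np.array(feats)
```

Low-frequency (voiced) segments show low zero-crossing rates and high energy, while fricative-like segments show the opposite, which is what makes this cheap pair of measurements usable for template matching or as neural-network input.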