Browsing by Subject "Speech synthesis"
Now showing 1 - 8 of 8
Item: A microcomputer-based system for speech recognition and synthesis (Texas Tech University, 1980-08) Yang, Ming-Yuan
Abstract not available.

Item: Effects of repeated listening experiences on the perception of synthetic speech by individuals with mental retardation (Texas Tech University, 1999-04) Lees, Kathryn Carla
This study evaluated the effects of training on the comprehension of specific words and sentences by individuals with intellectual disabilities (n=18) and a matched control group (n=10). More specifically, the effects of training on novel versus repeated words and sentences produced by the ETI Eloquence speech synthesizer were studied over three training sessions. Stimulus materials included four word lists of 20 words each and four sentence lists of 20 sentences each. One of the word and sentence lists was identified as repeated and the remaining three were identified as novel. The synthetic speech used was the ETI Eloquence (1998) adult male voice Wade. All stimuli were recorded onto a digital compact disc and were presented via a Sony Discman. To assess subject responses, 80 8.5 x 11 inch cards were developed. Each card contained a black and white line drawing of the target word or sentence presented and three foil pictures. The cards used a grid of two rows by two columns, with one black and white drawing in each cell. The subjects were asked to point to the drawing depicting the stimulus item. A pretest was administered to eliminate subjects who could not obtain perfect scores when experimental stimuli were presented via natural speech. There were a total of three experimental sessions, each separated by a period of at least 24 hours. During each session, subjects were presented with a list of novel and repeated words and sentences. The repeated word and sentence lists were presented in each of the three sessions. All experimental group subjects met the following criteria: (a) a diagnosis of mild to moderate mental retardation; (b) reliable pointing skills to serve as an expressive response modality; (c) no uncorrected visual impairment and adequate visual discrimination skills; (d) ability to identify all pictures on a picture identification task. Thresholds of 25 dB or better were obtained by all but two subjects, who demonstrated mild hearing loss at 4000 Hz. All control group subjects were required to meet the same criteria as the experimental group subjects except that they were not intellectually disabled, as determined by the TONI-2 (Brown, Sherbenou, & Johnsen, 1990). In the experimental group, a significant main effect for stimulus complexity [F(1, 136) = 42.37, p < .01] was found. A significant main effect was also noted for listening trials [F(2, 136) = 88.12, p < .01]. There was no significant effect for stimulus type (i.e., repeated versus novel) on word identification and sentence comprehension accuracy [F(1, 136) = .008, p > .01]. Similarly, in the control group, analysis revealed a significant main effect for stimulus complexity [F(1, 72) = 13.03, p < .01] and for listening trials [F(2, 72) = 45.94, p < .01]. Additionally, there was a significant two-way interaction between listening trials and complexity [F(1, 72) = 3.54, p < .05]. Because there were no empirical studies on the intelligibility of the ETI-Eloquence synthesizer (1998) with non-disabled individuals, the present study was designed to gather preliminary data on this synthesizer. Intelligibility of the ETI-Eloquence synthesizer was almost identical to that of the high-quality DECtalk synthesizer, which is used widely in research and clinical applications. Control group participants had a mean word identification accuracy score of 92% and a mean sentence comprehension accuracy score of 82% on the first training session. These results were comparable to results obtained in a large number of previous studies in which the DECtalk synthesizer was used. In conclusion, both experimental and control data revealed that perception of synthetic speech was enhanced as a result of repeated listening experiences. Synthetic speech comprehension was significantly superior (p < .05) across groups on the word identification task compared with the sentence comprehension task. Repeated stimuli were not significantly (p < .05) more intelligible than novel stimuli across groups, which indicated that both experimental and control group subjects were able to generalize their knowledge of the acoustic-phonetic properties of synthetic speech to novel stimuli.
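The analyses reported above are within-subject (repeated-measures) ANOVAs over accuracy scores. Purely as a generic illustration, the Python sketch below shows how F tests of this shape could be computed from a long-format table of per-subject scores; the file name, column names, and the use of statsmodels' AnovaRM are assumptions, not details taken from the thesis.

```python
# Illustrative sketch: within-subject ANOVA over accuracy scores of the kind
# reported above (complexity x listening trial x stimulus type). The data file
# and column names are hypothetical, not from the original study.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: one row per subject x condition cell, e.g.
# subject, complexity ("word"/"sentence"), trial (1-3), stim_type ("repeated"/"novel"), accuracy
scores = pd.read_csv("accuracy_long.csv")

model = AnovaRM(
    data=scores,
    depvar="accuracy",
    subject="subject",
    within=["complexity", "trial", "stim_type"],
    aggregate_func="mean",   # collapse repeated cells per subject if present
)
print(model.fit())            # F and p values for main effects and interactions
```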
Item: Knowledge Based Speaker Analysis Using a Massive Naturalistic Corpus: Fearless Steps Apollo-11 (2020-08) Shekar, Meena Chandra
Apollo-11 was the first manned space mission to successfully bring astronauts to the moon and return them safely. As part of NASA's goal in assessing team and mission success, all voice communications among Mission Control, the astronauts, and support staff were captured using a multi-channel analog system, which until recently had never been made available. For such time- and mission-critical naturalistic data, there is extensive and diverse speaker variability, which impacts the performance of speaker recognition and diarization technologies. Hence, analyzing and assessing speaker recognition for this dataset has the potential to contribute to improved speaker models for such corpora and to address multi-party speaker situations. In this study, a small subset of 100 hours derived from a collective 10,000 hours of the Fearless Steps Apollo-11 audio data was investigated, corresponding to three challenging phases of the mission: Lift-Off, Lunar-Landing, and Lunar-Walking. A speaker recognition assessment was performed on 140 speakers from a collective set of 183 NASA mission specialists who participated, based on sufficient training data obtained from 5 (out of 30) mission channels. Since the Apollo data consist of variable speaker turn durations per speaker, an analysis of how limited vs. sufficient training duration per speaker model affects alternate baseline systems is presented. Furthermore, observations for test duration were made by testing these trained speaker models with very short to long duration test segments. Speaker models trained on specific phases are also compared with each other to determine how stress, g-force/atmospheric pressure, etc., can impact the robustness of the models. This represents one of the first investigations of speaker recognition for massively large team-based communications involving naturalistic communication data.
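The enrollment/test protocol described above (training per-speaker models from variable amounts of speech, then scoring test segments of varying duration) can be illustrated with a minimal sketch. The embedding front end below is a deliberately crude stand-in for whatever i-vector/x-vector system a baseline would use; none of the function names or parameters come from the Fearless Steps work.

```python
# Minimal sketch of duration-controlled speaker enrollment and cosine scoring.
# extract_embedding is a toy stand-in for a real speaker-embedding front end;
# nothing below comes from the Fearless Steps Apollo-11 system itself.
import numpy as np

def extract_embedding(segment: np.ndarray, sample_rate: int = 8000, dim: int = 64) -> np.ndarray:
    """Toy embedding: average log-magnitude spectrum over 25 ms frames, truncated to dim."""
    frame = int(0.025 * sample_rate)
    n = max(1, len(segment) // frame)
    spectra = [np.abs(np.fft.rfft(segment[i * frame:(i + 1) * frame])) for i in range(n)]
    avg = np.log(np.mean(spectra, axis=0) + 1e-8)
    return avg[:dim] if avg.size >= dim else np.pad(avg, (0, dim - avg.size))

def enroll_speaker(segments: list[np.ndarray]) -> np.ndarray:
    """Average per-segment embeddings into one length-normalized speaker model."""
    model = np.mean([extract_embedding(s) for s in segments], axis=0)
    return model / np.linalg.norm(model)

def score(model: np.ndarray, test_segment: np.ndarray) -> float:
    """Cosine similarity between an enrolled speaker model and a test segment."""
    emb = extract_embedding(test_segment)
    return float(np.dot(model, emb / np.linalg.norm(emb)))
```

Truncating the enrollment segments before calling enroll_speaker, or shortening test_segment, gives a crude way to probe the limited-vs.-sufficient duration question raised above.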
Item: Perceptual learning of synthetic speech by individuals with severe mental retardation (Texas Tech University, 2002-05) Hester, Kasey Lynne
The purpose of this study was to evaluate the magnitude and type of practice effects in individuals with severe mental retardation as a result of systematic exposure to synthetic speech. This study compared the performance of a group of individuals with severe mental retardation (n=14) with a matched control group (n=14) on word identification accuracy and latency tasks. Specifically, the effects of training on novel versus repeated words produced by the DECtalk synthesizer were analyzed. Stimulus materials included 4 lists of 10 words each. These words were selected from a list of the first 50 words used by typically developing preschoolers (Nelson, 1973) and a dictionary of symbol vocabulary used by youth with severe mental retardation (Adamson, Romski, Deffebach, & Sevcik, 1992). One list was designated as repeated and the remaining three as novel. Within each list, 20% of the words were repeated to judge intra-subject reliability. The synthetic speech used was DECtalk Betty (i.e., a simulated adult female voice). A Microsoft Visual Basic program was developed to present the stimuli and the prompts, and to record responses. The experimental stimuli were presented using a laptop computer and external speakers placed approximately 12 inches in front of the subject. The experimental stimuli were presented at 75 dB SPL as determined by a sound level meter. Subjects were instructed that they would hear a series of words and that their job was to touch the picture on the computer screen depicting the stimulus item. A touch screen mounted on the computer screen, in conjunction with the Visual Basic program, automatically recorded responses. The touch screen was calibrated to ignore "miss hits" (i.e., the subject slid his hand across the screen and activated a wrong selection) by using a timed-activation direct selection strategy. The computer screen displayed one target picture, a visual representation of the synthetic word, and three unrelated foils. The position of the pictures within each experiment was randomized to avoid position effects; the order of presentation of the lists was randomized to avoid order effects; and a constant inter-stimulus interval of 10 seconds was maintained during presentation of the words within each list. All subjects had to pass a pretest in order to participate in this study. This pretest was designed to exclude subjects who were unable to obtain 100% correct scores for experimental stimuli presented via live natural speech. In the absence of perfect scores on the pretest, it would be difficult to determine whether the performance demonstrated by individuals with mental retardation was due to difficulty in processing synthetic speech or to a lack of conceptual knowledge of the stimulus items. The pre-experimental procedures were conducted at least one week prior to the beginning of the experiment. There were a total of 3 experimental sessions, each separated by a period of at least 24 hours. During each session, subjects were presented with a list of novel words and a list of repeated words. The same repeated word list was presented across all sessions, while a new novel word list was presented in each session. Subjects were instructed that they would hear a series of words preceded by a carrier phrase and that they were to point to the drawing depicting the word. Additionally, they were told to make their best guess if they were uncertain. Immediately prior to each experimental session, practice items were run to ensure that subjects were familiar with the task. The practice items were different from those used in the experimental task. Data were analyzed using a repeated-measures design.
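The original presentation software was a Microsoft Visual Basic program. Purely as an illustration of the trial logic just described (randomized picture positions, a fixed 10-second inter-stimulus interval, and automatic latency capture), here is a rough Python approximation; all names are hypothetical and console input stands in for the touch screen.

```python
# Rough, hypothetical re-sketch of the trial procedure described above (not the
# original Visual Basic program): randomized target/foil positions, a constant
# 10 s inter-stimulus interval, and per-trial response latency recording.
import random
import time

ISI_SECONDS = 10  # constant inter-stimulus interval within a list

def run_trial(target: str, foils: list[str]) -> dict:
    positions = [target] + foils
    random.shuffle(positions)              # randomize positions to avoid position effects
    # audio playback of the synthesized word would go here
    start = time.monotonic()
    choice = input(f"Touch one of {positions}: ").strip()   # stand-in for the touch screen
    latency = time.monotonic() - start
    return {"target": target, "choice": choice,
            "correct": choice == target, "latency_s": latency}

def run_session(word_list: list[tuple[str, list[str]]]) -> list[dict]:
    random.shuffle(word_list)              # randomize presentation order
    results = []
    for target, foils in word_list:
        results.append(run_trial(target, foils))
        time.sleep(ISI_SECONDS)
    return results
```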
The two dependent measures were (1) word identification accuracy and (2) word identification latency. Data for word identification accuracy and latency were analyzed using a repeated-measures (2 x 2 x 3) ANOVA in which group served as a between-subjects variable while type of task, type of stimuli, and listening sessions served as within-subject variables. Analysis revealed a significant main effect for group [F(1, 52) = 7.523, p < .05] on the word identification accuracy task, indicating that individuals with severe mental retardation had significantly lower word identification accuracy scores (mean = 80.95) than the control group (mean = 91.19). A non-significant trend toward improved word identification accuracy across sessions [F(2, 104) = 2.635, p = .0765] was noted. The most interesting finding of this study was the lack of a significant effect [F(1, 52) = 0.199, p > .05] for stimulus type (i.e., repeated vs. novel) across groups on the word identification accuracy task. The presence of a significant interaction between word identification latency and group [F(2, 104) = 8.53, p < .01] indicated that individuals with mental retardation were processing synthetic speech more quickly as a result of repeated exposure. In summary, current results indicated that perception of synthetic speech in individuals with mental retardation was enhanced (i.e., a significant decrease in latency) as a result of systematic exposure to synthetic speech. Also, the absence of a significant effect for stimulus type indicated that individuals with mental retardation generalized their knowledge of the acoustic-phonetic properties of synthetic speech to novel stimuli. These results were significant because they indicated that individuals with mental retardation became more skilled at recognizing synthetic speech with repeated exposure. This was an important finding in the context of increased use of VOCAs by individuals with significant communicative and cognitive impairments.

Item: Speech data compression (Texas Tech University, 1996-08) Ho, Chien-Te
The analysis-by-synthesis method is the most useful application of the parametric representation. The necessary components for the model are derived from signal analysis procedures, while the output speech waveform is obtained from the synthesis procedure. This method, exemplified by the Codebook Excited Linear Prediction (CELP) coder [1], was first implemented in the time domain. The basic approach is to model the correlation among the speech samples by using a linear time-varying filter. An excitation model can then be obtained by removing the correlation. Since the filter will not ignore the noise, the parametric representation does have problems with noisy speech data. An alternative procedure is to implement the technique in the frequency domain. This leads to a flexible method for lower bit-rate transmission. Furthermore, it provides a suitable way to model the filter in a noisy environment. Methods such as the harmonic vocoder and the Multiband Excitation Coder (MBE) [4] are frequency-domain techniques. Since the speech data are recovered from the parametric model, the output depends on the model parameters, which may greatly affect the quality of the speech. The objective of this thesis is to develop efficient algorithms for implementing the harmonic vocoder in the frequency domain. A reliable method is developed to realize the analysis procedure and to obtain the correct fundamental elements of the speech signal. An efficient method is proposed to synthesize the output speech signal and to improve speech quality. Also, the techniques of model refinement and enhancement are described in this thesis. In practice, the analog speech signal is sampled at 8000 Hz, and this rate is used throughout this research. The research concentrates on methods for speech data compression and speech quality improvement rather than on coding schemes.
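As a generic illustration of the frequency-domain analysis step of a harmonic vocoder, and not of the refined algorithms developed in the thesis, the sketch below estimates a frame's fundamental frequency by autocorrelation and then samples harmonic amplitudes from its windowed spectrum, assuming the 8000 Hz sampling rate noted above.

```python
# Illustrative harmonic analysis for one speech frame at 8 kHz: estimate F0 by
# autocorrelation, then read harmonic amplitudes off the frame's spectrum.
# Generic textbook-style sketch, not the thesis's algorithm.
import numpy as np

FS = 8000  # sampling rate used throughout the thesis

def estimate_f0(frame: np.ndarray, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Pick the autocorrelation peak within a plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(FS / fmax), int(FS / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return FS / lag

def harmonic_amplitudes(frame: np.ndarray, f0: float) -> np.ndarray:
    """Sample the magnitude spectrum at multiples of F0 up to the Nyquist frequency."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    harmonics = np.arange(f0, FS / 2, f0)
    bins = np.searchsorted(freqs, harmonics)
    return spectrum[np.clip(bins, 0, len(spectrum) - 1)]

# Example on a synthetic frame (32 ms = 256 samples) with a 200 Hz fundamental:
t = np.arange(256) / FS
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
amps = harmonic_amplitudes(frame, estimate_f0(frame))
```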
Item: Speech recognition system (Texas Tech University, 1996-08) Mehta, Milan G.
Automatic Speech Recognition (ASR) has progressed considerably over the past several decades, but still has not achieved the potential imagined at its very beginning. Almost all of the existing applications of ASR systems are PC based. This thesis is an attempt to develop a speech recognition system that is independent of any PC support and is small enough in size to be used in a daily-use consumer appliance. This system would recognize isolated utterances from a limited vocabulary, provide speaker independence, require less memory, and be cost-efficient compared to present ASR systems. In this system, speech recognition is performed with the help of algorithms such as Vector Quantization and Zero Crossing. Several features of a Digital Signal Processor (DSP) have been utilized to generate and execute the algorithms for recognition. The final system has been implemented on a Texas Instruments TMS320C30 DSP. The system, when implemented using the vector quantizer approach, achieved an accuracy of 94% for a vocabulary of 6 words and a recognition time of 6 seconds. The zero-crossing approach resulted in an accuracy of 89% for the same vocabulary, while the recognition time was 0.8 seconds.
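Neither DSP implementation is reproduced here, but the two techniques named above are easy to sketch in toy form: frame-level zero-crossing rate as a feature, and vector-quantization matching in which an utterance is assigned to the vocabulary word whose codebook gives the lowest average distortion. Codebook training (e.g., by LBG/k-means) is omitted, and all names below are illustrative.

```python
# Toy illustration of zero-crossing-rate features and vector-quantization
# matching against per-word codebooks. Generic sketch, not the TMS320C30 system.
import numpy as np

def zero_crossing_rate(signal: np.ndarray, frame_len: int = 160) -> np.ndarray:
    """Fraction of sign changes per frame (160 samples = 20 ms at 8 kHz)."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def vq_distortion(features: np.ndarray, codebook: np.ndarray) -> float:
    """Average distance from each feature vector to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def recognize(features: np.ndarray, codebooks: dict[str, np.ndarray]) -> str:
    """Pick the vocabulary word whose codebook yields the lowest distortion."""
    return min(codebooks, key=lambda word: vq_distortion(features, codebooks[word]))
```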
Item: Synthesizing Naturalistic and Meaningful Speech-Driven Behaviors (2017-12) Sadoughi Nourabadi, Najmeh
Nonverbal behaviors externalized through head, face, and body movements for conversational agents (CAs) play an important role in human computer interaction (HCI). Believable movements for CAs have to be meaningful and natural. Previous studies mainly relied on rule-based or speech-driven approaches. We propose to bridge the gap between these two approaches, overcoming their limitations. We build a dynamic Bayesian network (DBN) with a discrete variable to constrain the behaviors. We implement and evaluate the approach with discourse functions as constraints (e.g., questions). The model learns the characteristic behaviors associated with a given discourse class, learning the rules from the data. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model. Another problem with speech-driven models is that they require all the potential utterances of the CA to be recorded. Using existing text-to-speech (TTS) systems scales the applications of these methods by providing the flexibility of using text instead of pre-recorded speech. However, training the models with natural speech and testing them with TTS creates a mismatch affecting the performance of the system. We propose a novel strategy to address this mismatch. It starts by creating a parallel corpus with synthetic speech aligned with the original speech for which we have motion capture recordings. This parallel corpus is used to retrain the models from scratch, or to adapt the models built with natural speech. Subjective and objective evaluations show the effectiveness of this solution in reducing the mismatch. In addition to head movements, the face conveys a blend of verbal and nonverbal information, playing an important role in daily interaction. While speech articulation mostly affects the orofacial area, emotional behaviors are externalized across the entire face. Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between the movements that have to be taken into account. Using multi-task learning (MTL), we create speech-driven models that jointly capture the relationship not only between speech and facial movements, but also across facial movements. We build our models with bidirectional long short-term memory (BLSTM) units, which are shown to be very successful in modeling dependencies for sequential data (a generic sketch of this kind of multi-task model appears after the last item in this listing). Within the face, the orofacial area conveys information including speech articulation and emotions. These two factors add constraints to the facial movements, creating non-trivial integrations and interplays. The relationship between these factors should be modeled to generate more naturalistic movements for CAs. We provide deep learning speech-driven structures to integrate these factors. We use MTL, where related secondary tasks are jointly solved when synthesizing orofacial movements. In particular, we evaluate emotion and viseme recognition as secondary tasks. The approach creates orofacial movements with superior objective and subjective performance compared with baseline models. Taken collectively, this dissertation has made algorithmic advancements in sequential modeling of speech and body movements to leverage knowledge extraction from speech for nonverbal characterization over time.

Item: Word identification and sentence comprehension of synthetic speech by individuals with mental retardation (Texas Tech University, 1995-12) Hanners, Jennifer
The purpose of this study was to examine the performance of individuals with mental retardation and matched controls with two text-to-speech systems (DECtalk and Real Voice). Each subject participated in two experimental sessions designed to measure word recognition, sentence verification accuracy, and sentence response latency. A pretest was administered to exclude subjects who were unable to recognize the words or sentences when presented via natural speech. A total of 40 words was selected for the evaluation of word recognition from a list of words provided by parents of nonspeaking children. Twenty three-word sentences were constructed to measure sentence verification accuracy and latency. The results indicated that both individuals with mental retardation and nondisabled individuals performed significantly better on DECtalk synthetic speech than on Real Voice. Additionally, the performance of individuals with mental retardation was significantly poorer than that of nondisabled individuals on the sentence verification task. Across groups, subjects performed significantly better on the word identification task than on the sentence verification task. A non-significant trend toward greater response latencies was observed for individuals with mental retardation. In summary, the results of this study indicate that individuals with mental retardation have significant difficulty in identifying and comprehending synthetic speech. The results of this investigation raise several issues related to comprehension of synthetic speech by nonspeaking individuals who rely on voice output communication aids to achieve effective and efficient communication.
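Returning to the Sadoughi dissertation above (see the forward reference there): the following generic sketch shows what a speech-driven multi-task BLSTM with facial-movement regression as the primary task and viseme and emotion classification as secondary tasks might look like. Layer sizes, feature dimensions, and loss weights are invented for illustration and are not taken from the dissertation.

```python
# Generic multi-task BLSTM sketch for speech-driven facial animation: a shared
# bidirectional LSTM over acoustic features, a regression head for facial
# movements, and auxiliary viseme/emotion classification heads. All sizes and
# weights below are illustrative assumptions, not the dissertation's values.
import torch
import torch.nn as nn

class SpeechToFaceMTL(nn.Module):
    def __init__(self, n_acoustic=40, n_face=30, n_visemes=20, n_emotions=4, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_acoustic, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.face_head = nn.Linear(2 * hidden, n_face)         # primary: facial movements
        self.viseme_head = nn.Linear(2 * hidden, n_visemes)    # secondary task
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)  # secondary task

    def forward(self, acoustic):                 # acoustic: (batch, time, n_acoustic)
        h, _ = self.blstm(acoustic)
        return self.face_head(h), self.viseme_head(h), self.emotion_head(h.mean(dim=1))

def mtl_loss(face_pred, face_true, vis_pred, vis_true, emo_pred, emo_true,
             w_vis=0.3, w_emo=0.3):
    """Joint loss: regression on facial movements plus weighted secondary tasks."""
    mse = nn.functional.mse_loss(face_pred, face_true)
    ce_vis = nn.functional.cross_entropy(vis_pred.flatten(0, 1), vis_true.flatten())
    ce_emo = nn.functional.cross_entropy(emo_pred, emo_true)
    return mse + w_vis * ce_vis + w_emo * ce_emo
```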