Archives and Documentation Center
Digital Archives

Statistical and discriminative language modeling for Turkish large vocabulary continuous speech recognition

Show simple item record

dc.contributor Ph.D. Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Arısoy, Ebru.
dc.date.accessioned 2023-03-16T10:25:02Z
dc.date.available 2023-03-16T10:25:02Z
dc.date.issued 2009.
dc.identifier.other EE 2009 A75 PhD
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/13091
dc.description.abstract Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have been mostly handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and further improves the accuracy of sub-lexical units. Additionally, discriminative language models (DLMs) with linguistically and statistically motivated features are utilized. DLM outperforms the conventional approaches, partly due to the improved parameter estimates with discriminative training and partly due to integrating the complex language characteristics of Turkish into language modeling. The significance of this dissertation lies in being a comparative study of several sub-lexical units on the same LVCSR system, addressing the over-generation problem of sub-lexical units and extending sub-lexical-based generative language modeling of Turkish to discriminative language modeling. These approaches can be easily extended to other morphologically rich languages that suffer from similar problems.
dc.format.extent 30cm.
dc.publisher Thesis (Ph.D.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2009.
dc.relation Includes appendices.
dc.relation Includes appendices.
dc.subject.lcsh Automatic speech recognition.
dc.subject.lcsh Turkish language -- Morphology.
dc.title Statistical and discriminative language modeling for Turkish large vocabulary continuous speech recognition
dc.format.pages xx, 159 leaves;


Files in this item

This item appears in the following Collection(s)

Show simple item record

Search Digital Archive


Browse

My Account