Statistical and discriminative language modeling for Turkish large vocabulary continuous speech recognition

Arısoy, Ebru.

dc.contributor	Ph.D. Program in Electrical and Electronic Engineering.
dc.contributor.advisor	Saraçlar, Murat.
dc.contributor.author	Arısoy, Ebru.
dc.date.accessioned	2023-03-16T10:25:02Z
dc.date.available	2023-03-16T10:25:02Z
dc.date.issued	2009.
dc.identifier.other	EE 2009 A75 PhD
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/13091
dc.description.abstract	Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a large number of Out-of-Vocabulary (OOV) words, which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order, which leads to non-robust language model estimates. These challenges have mostly been handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which must be dealt with to achieve higher accuracy. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated, and they yield substantial improvements over word language models. Our novel approach, inspired by dynamic vocabulary adaptation, recovers most of the errors caused by over-generation and further improves the accuracy of sub-lexical units. Additionally, discriminative language models (DLMs) with linguistically and statistically motivated features are utilized. DLMs outperform the conventional approaches, partly due to the improved parameter estimates obtained with discriminative training and partly due to integrating the complex language characteristics of Turkish into language modeling. The significance of this dissertation lies in being a comparative study of several sub-lexical units on the same LVCSR system, addressing the over-generation problem of sub-lexical units, and extending sub-lexical-based generative language modeling of Turkish to discriminative language modeling. These approaches can be easily extended to other morphologically rich languages that suffer from similar problems.
dc.format.extent	30cm.
dc.publisher	Thesis (Ph.D.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2009.
dc.relation	Includes appendices.
dc.subject.lcsh	Automatic speech recognition.
dc.subject.lcsh	Turkish language -- Morphology.
dc.title	Statistical and discriminative language modeling for Turkish large vocabulary continuous speech recognition
dc.format.pages	xx, 159 leaves;