Author Topic: Seminar by Dr. Siniscalchi - Wednesday 21/12  (Read 766 times)
Sebastiano Battiato
Posts: 190
« on: 13-12-2011, 18:40:44 »

As part of the Multimedia course (Master's Degree in Computer Science), on Wednesday 21 December in Aula 4 (10.00-13.00) there will be a seminar held by Dr. Marco Siniscalchi (Università Kore - Enna), titled: Speech-to-Text: An overview

All those interested are invited to attend.
SB

Biosketch
Dr. Siniscalchi is currently an Assistant Professor at the University of Enna "Kore"; his research topics are language modeling, language identification, and bottom-up automatic speech recognition. He earned his PhD in Computer Engineering in January 2006 from the University of Palermo. His PhD dissertation focused on acoustic modeling, dealing mainly with the problem of integrating articulatory knowledge into conventional hidden Markov model based automatic speech recognition systems. From January 2004 to September 2005, he worked as a Visiting PhD Fellow at the Center of Signal and Image Processing Laboratory, under the direction of Prof. Mark A. Clements of the Electrical and Computer Engineering School at the Georgia Institute of Technology in Atlanta (USA). During this visit, he conceived the core of his PhD thesis.
From September 2005 to December 2006, he was a Post-Doctoral Fellow at the Electrical and Computer Engineering School of the Georgia Institute of Technology in Atlanta, under the guidance of Prof. Chin-Hui Lee. In that period, he worked on speech event detection using statistical methods. From 2007 to 2009, he worked as a researcher at the Department of Electronic and Telecommunication, The Norwegian University of Science and Technology, under the guidance of Prof. Torbjørn Svendsen. His goal was to develop a statistical framework for combining asynchronous and partly redundant speech information sources.
In 2002, outside of academia, he was a System Engineer at STMicroelectronics in Catania (Italy), where he designed and implemented optimization algorithms for digital image processing on VLIW processors.


Abstract
Speech-to-text is the problem of producing accurate written transcriptions of spoken utterances using computer programs. The goal of speech-to-text research is to equip computers with speech recognition engines that can recognize, in real time and with 100% accuracy, all words spoken by any person, independently of vocabulary size, noise, speaker characteristics and accent, or channel conditions. In spite of the great effort devoted to this research area by many scientists over more than four decades, word recognition accuracy greater than 90% is attained only when the task is constrained in some way. Different levels of performance can be attained depending on the task; for example, accuracy on continuous digit recognition over a microphone channel (small vocabulary, no noise) can exceed 99%. If the speech-to-text engine is trained to learn an individual speaker's voice, then much larger vocabularies are possible, although accuracy drops to somewhere between 90% and 95% for commercially available systems. For large-vocabulary speech recognition of different speakers over different channels, accuracy is no greater than 87%, and processing can take hundreds of times real time.

In this talk, the dominant technology used to tackle the speech-to-text problem, the Hidden Markov Model (HMM), will be presented. This technology recognizes speech by estimating the likelihood of each sub-word unit at contiguous, small regions of the speech signal. Each word in a vocabulary list is specified in terms of its component sub-words. The sequence of sub-words with the highest likelihood is found through a search procedure. This search is constrained to consider only sub-word sequences corresponding to allowable words in the vocabulary list, and the sub-word sequence with the highest total likelihood is identified with the word that was spoken. In addition, a statistical language model is used to incorporate both the syntactic and semantic constraints of the language and of the recognition task.
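The search procedure described above is, at its core, the Viterbi algorithm. The following is a minimal toy sketch (not the speaker's system): a two-state HMM whose states stand in for hypothetical sub-word units, with invented transition and emission probabilities, decoded over a short sequence of observed "frames".

```python
import math

# Toy HMM: states stand in for sub-word units; all probabilities are
# invented for illustration only.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3},
           "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.5},   # likelihood of each observed frame
          "s2": {"a": 0.1, "b": 0.9}}

def viterbi(obs):
    """Return the most likely state (sub-word) sequence for the observations."""
    # Log-probability of the best path ending in each state, plus backpointers.
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        new_best, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            new_best[s] = score + math.log(emit_p[s][o])
            ptr[s] = prev
        best = new_best
        back.append(ptr)
    # Trace the best path backwards from the most likely final state.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b", "b"]))
```

In a real recognizer the observations are acoustic feature vectors rather than symbols, the states are tied sub-word HMM states, and the search is further constrained by the pronunciation lexicon and the language model, but the dynamic-programming recursion is the same.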

--
__________________________________________
Prof. Sebastiano Battiato (Ph.D.)

Università di Catania
Dipartimento di Matematica ed Informatica

IPLAB@CT: www.dmi.unict.it/~iplab
Home: www.dmi.unict.it/~battiato

Tel. +39 095 7337224
Fax. +39 095 7337223 or +39 095 330094

__________________________________________