Monday, July 6th

Assessment of voice and health disorders from speech.

Prof. Ignacio Godino, UPM, Madrid

In recent years, a new area has been emerging that aims to apply and adapt the state of the art in speech technology to the processing and evaluation of what is called disordered speech. In this field, acoustic analysis stands out as a non-invasive and efficient technique for the objective assessment of voice function. It is also a complementary tool to other evaluation methods based on direct observation of the vocal folds using videoendoscopy. These techniques form the basis for the early detection and evaluation not only of voice disorders, but also of health conditions such as Parkinson's disease, Alzheimer's disease, or obstructive sleep apnea. The application of this set of techniques is not restricted to the medical domain alone: it may also be of special interest in forensic applications, in the assessment of voice quality for voice professionals such as singers, and in the evaluation of stress, fatigue, etc.

The goal of this module is to provide attendees with an overview of the state of the art in this field, with a special focus on the methods currently used for the assessment of voice pathologies. Current applications of speech technology in the biomedical field will also be reviewed.
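One of the classic acoustic measures used in objective voice function assessment is local jitter, the cycle-to-cycle variability of the pitch period. The sketch below is a minimal, illustrative implementation computed from a list of pitch period durations; the function name and the synthetic input values are assumptions for illustration, not part of any specific toolkit.

```python
# Toy sketch: local jitter, a classic acoustic measure of voice
# perturbation, computed from already-extracted pitch period durations
# (in seconds). Real systems first estimate these periods from audio.
def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    divided by the mean period (often reported as a percentage)."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_abs_diff / mean_period

# A perfectly periodic voice has zero jitter:
print(local_jitter([0.008, 0.008, 0.008]))  # 0.0
```

Elevated jitter (and related measures such as shimmer, its amplitude counterpart) is one of the cues that acoustic analysis exploits to flag disordered voices non-invasively.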

Geometrical Intuition of the Deep Neural Networks.

Prof. Enric Monte, UPC, Barcelona

This talk is about understanding the structure and principles of DNNs, with emphasis on the underlying geometrical intuition that explains why this structure can work and what ideas motivated this kind of neural network.

Tuesday, July 7th

Speaker Segmentation and Characterization.

Dr. Xavier Anguera and Dr. Jordi Luque, Telefonica Research, Barcelona

With the vast amounts of audio data constantly being collected and made available, it becomes crucially important to devise powerful algorithms and techniques to structure and analyze such recordings in order to extract useful information from them. Proper analysis of the speech signal can reveal not only what is said, but also how and by whom it is said. Characterization of the speaker(s) in an audio recording is gaining relevance within the speech community, as it enables many practical applications that were not approached until just a few years ago. Examples are the detection of a speaker's permanent physical characteristics (age, gender, height, voice likability, educational level, etc.) or their current state (mood, drunkenness, non-permanent physical state, etc.).

Speaker characterization algorithms work best when speech from a single speaker is available. When speech recordings involve multiple speakers and other noises or sounds, one first needs to separate the audio by speaker using speaker segmentation and diarization techniques. This is common in recordings of phone conversations, interviews, meetings, etc.
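A very simplified view of speaker change detection, one building block of segmentation and diarization, is to compare adjacent windows of acoustic features and flag large jumps. The sketch below uses synthetic vectors standing in for real MFCC frames; the function, window size, and threshold are illustrative assumptions, not a production algorithm (real systems use statistical criteria such as BIC or learned embeddings).

```python
import numpy as np

# Toy sketch of distance-based speaker change detection: compare the
# mean feature vector of adjacent windows and flag large jumps as
# candidate speaker-change points.
def change_points(features, win=50, threshold=1.0):
    """features: (n_frames, dim) array. Returns frame indices where
    the Euclidean distance between adjacent window means exceeds
    the threshold."""
    points = []
    for t in range(win, len(features) - win, win):
        left = features[t - win:t].mean(axis=0)
        right = features[t:t + win].mean(axis=0)
        if np.linalg.norm(left - right) > threshold:
            points.append(t)
    return points

# Two synthetic "speakers" with clearly different feature means:
rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 0.1, size=(100, 13))
spk_b = rng.normal(2.0, 0.1, size=(100, 13))
print(change_points(np.vstack([spk_a, spk_b])))  # [100]
```

Once the recording is cut at such change points, segments can be clustered by speaker (diarization) and each cluster passed on to characterization algorithms.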

In this module we will offer an overview of the state of the art in speaker segmentation and characterization, and describe in more detail how some of these techniques can be applied to call-center analytics within a large telco.

Deep Neural Networks for Speaker Recognition.

Omid Ghahabi, Universitat Politècnica de Catalunya

Speaker recognition is the process of automatically recognizing who is speaking by using the speaker-specific information contained in speech signals. Applications of speaker recognition include banking over a telephone network, telephone shopping, database access services, etc. Deep Neural Networks (DNNs) have recently opened a new research line in image, audio, and speech processing. However, relatively few attempts have been made in speaker recognition. In this lecture we first give a brief overview of speaker recognition and DNNs. Then we will discuss the application of DNNs to modeling i-vectors in both single- and multi-session speaker recognition tasks, impostor selection for DNNs, the Universal Deep Belief Network (UDBN) and DBN adaptation, and Restricted Boltzmann Machine (RBM) transformation of supervectors.
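A common baseline for comparing the i-vectors mentioned above is cosine-similarity scoring: an enrolment vector and a test vector are compared by the angle between them. The sketch below is a minimal illustration on synthetic vectors, not real i-vectors; all names and values are assumptions for the example.

```python
import numpy as np

# Minimal sketch of cosine-similarity scoring, a standard baseline
# for comparing fixed-length speaker representations such as
# i-vectors (the vectors here are synthetic stand-ins).
def cosine_score(v1, v2):
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

enroll = np.array([1.0, 0.5, -0.2])   # enrolment "i-vector"
same = np.array([0.9, 0.6, -0.1])     # test vector, same speaker
other = np.array([-1.0, 0.2, 0.8])    # test vector, different speaker
print(cosine_score(enroll, same) > cosine_score(enroll, other))  # True
```

DNN-based approaches such as those covered in the lecture can replace or complement this simple scoring with learned models of the i-vector space.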

Keywords: Deep Neural Network, Deep Belief Network, Restricted Boltzmann Machine, Speaker Recognition, RBM Supervector

Wednesday, July 8th

Statistical parametric speech synthesis: from HMM to LSTM-RNN.

Dr. Heiga Zen, Google, Cambridge, UK

This talk will present the progress of acoustic modeling in statistical parametric speech synthesis, from the conventional hidden Markov model (HMM) to the state-of-the-art long short-term memory (LSTM) recurrent neural network (RNN). Implementation details and applications of statistical parametric speech synthesis will also be covered.
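At the core of an LSTM-RNN acoustic model is the gated recurrence that lets the network carry context across frames. The sketch below implements a single LSTM step with random placeholder weights; in synthesis, the input would hold linguistic features and the hidden state would feed a predictor of acoustic features. All names and dimensions are illustrative assumptions, not a trained model.

```python
import numpy as np

# Minimal sketch of the LSTM recurrence underlying LSTM-RNN acoustic
# models. Weights are random placeholders, not a trained model.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step. Gate order in the stacked weights:
    input, forget, output, cell candidate."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])          # input gate
    f = sigmoid(z[n:2*n])       # forget gate
    o = sigmoid(z[2*n:3*n])     # output gate
    g = np.tanh(z[3*n:])        # candidate cell state
    c = f * c_prev + i * g      # updated memory cell
    h = o * np.tanh(c)          # new hidden state
    return h, c

rng = np.random.default_rng(1)
d, n = 5, 4                     # input dim, hidden dim
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(10, d)):   # run over a 10-frame sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

The memory cell `c` is what distinguishes the LSTM from the frame-independent decision trees of HMM-based synthesis: it allows the model to smooth and contextualize acoustic trajectories over time.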

Thursday, July 9th

Speech Recognition and Neural Networks.

Keynote speaker: Prof. Dr. Hermann Ney. RWTH, Aachen, Germany

The last 25 years have seen dramatic progress in statistical methods for speech and language processing, such as speech recognition, handwriting recognition, and machine translation.

Most of the key statistical concepts were originally developed for speech recognition. We will review them in this talk. Examples of such key concepts are the Bayes decision rule for minimum error rate and probabilistic approaches to acoustic modelling (e.g. hidden Markov models) and language modelling. Recently, the accuracy of speech recognition has been improved significantly by the use of artificial neural networks, such as deep multi-layer perceptrons and recurrent neural networks (including the long short-term memory extension). We will discuss these structures in detail and how they fit into the probabilistic approach.
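The Bayes decision rule mentioned above can be stated compactly: choose the word sequence that maximizes the posterior probability, which factors into the language model and the acoustic model.

```latex
% Bayes decision rule for minimum error rate: for acoustic
% observations x, choose the word sequence w maximizing the
% posterior, factored into a language model p(w) and an
% acoustic model p(x | w).
\hat{w} = \operatorname*{arg\,max}_{w} \; p(w \mid x)
        = \operatorname*{arg\,max}_{w} \; p(w)\, p(x \mid w)
```

In the classical pipeline, $p(x \mid w)$ is the HMM-based acoustic model and $p(w)$ the language model; the neural network structures discussed in the talk can be plugged into this same probabilistic framework.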

Keywords: Bayes decision rule, hidden Markov models, language models, training and training criteria, search; neural network structures, softmax, deep multilayer perceptron, recurrent neural networks.

Deep Neural Networks in Speech Recognition

José Adrián Rodríguez Fonollosa

This presentation will cover the new end-to-end speech recognition systems based on Deep Neural Networks, which transcribe audio directly into text without requiring intermediate phonetic representations. It will also include aspects such as data augmentation that are important in these architectures.
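A key ingredient of many end-to-end systems is CTC-style decoding, which turns a sequence of frame-level labels into text by merging repeats and dropping a special blank symbol, with no phonetic layer in between. The sketch below shows only this greedy collapse step on a hand-made label sequence; the blank symbol and the example input are illustrative assumptions.

```python
# Toy sketch of the greedy CTC decoding collapse used in end-to-end
# speech recognition: merge repeated labels, then drop the blank
# symbol, so frame-level outputs become a character string.
BLANK = "_"

def ctc_collapse(frame_labels):
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# 17 frames of per-frame labels collapse to a 5-character word:
print(ctc_collapse(list("__hh_e_ll__ll_oo_")))  # hello
```

Note how the blank between the two "ll" runs preserves the double letter: without it, all four l-frames would merge into a single "l".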


Red Temática en Tecnologías del Habla