Interspeech 2023: In-Person Conferences Are Back!

CogitoEngineering
6 min read · Sep 1, 2023


Interspeech 2023 in Dublin

We’re back! The convention center in Dublin was full of good humour and engaging discussions last week, as roughly 2,000 members of the speech science and technology community gathered for their annual flagship conference. This was the first fully in-person Interspeech since Covid-19 reared its ugly head. While remote and hybrid conferences have some advantages, I think the majority of attendees were grateful to put online video presentations in the rear-view mirror. Another cause for celebration: this was the first year that Interspeech adopted a double-blind review process, a long-overdue change that should help address disparities in paper acceptance.

A few general trends surfaced over the course of the conference. While previous years showed a seemingly endless appetite for making models ever bigger, this year seemed to feature more work on making models smaller and more efficient: smaller architectures, quantization and pruning techniques, and methods for getting models to run on edge devices.

It was interesting to see how the conformer (a transformer architecture variant from 2020) has become ubiquitous, even for use-cases outside of speech recognition. While the transformer and its variants have supplanted RNNs as the sequence-modeler “du jour”, RNNs seemed to be making a bit of a comeback in the form of structured state-space models (discussed more below). Self-supervised and unsupervised methods still featured heavily in the proceedings, as they have in previous years, and there seemed to be significant progress in methods to evaluate the trained representations and to probe what the models have learned.

As is customary here at Cogito, we’ve selected a few of the papers from this year’s conference to give you a flavour of the current state of speech research.

The exhibition hall in the convention center

Paper highlights

On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning

This paper from researchers at Samsung investigates ways of reducing inefficiencies in speech representations learned with self-supervised learning (SSL). Many SSL models (e.g. wav2vec 2.0) use 1D convolutions applied directly to the raw audio waveform as their initial layers. The authors identify these initial convolutional layers as one of the main contributors to the high memory requirements and long training times of these models. They investigate a collection of alternatives and find that they can reduce the training time for wav2vec 2.0 from 7 days down to 1.8 days simply by using Mel filterbanks with 2D convolutions, while keeping evaluation metrics comparable to those obtained from the raw waveform. These results make SSL models more feasible for labs that don’t have the budgets of the larger institutions.
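For the curious, here is a rough sketch of the kind of front-end swap being described: Mel filterbanks followed by a small 2D-convolution stack, in place of learned 1D convolutions over the raw waveform. The layer sizes and strides below are illustrative choices of mine (using PyTorch and torchaudio), not the paper’s exact architecture.

```python
import torch.nn as nn
import torchaudio

class MelConv2dFrontEnd(nn.Module):
    """Illustrative Mel-filterbank + 2D-conv front-end (not the paper's exact design)."""

    def __init__(self, n_mels=80, out_dim=512):
        super().__init__()
        # Log-Mel filterbanks replace the learned convolutions over raw samples.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels
        )
        # A small 2D-conv stack over (mel, time), downsampling both axes by 4x.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), out_dim)

    def forward(self, wav):                                # wav: (batch, samples)
        feats = self.melspec(wav).clamp(min=1e-10).log()   # (batch, mel, frames)
        feats = self.conv(feats.unsqueeze(1))              # (batch, 32, mel/4, frames/4)
        b, c, m, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * m)
        return self.proj(feats)                            # (batch, frames/4, out_dim)
```

The resulting feature sequence would then be fed to the transformer encoder in place of the raw-waveform features.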

Multi-Head State Space Model for Speech Recognition

This paper from researchers at Cambridge and Meta builds on the Structured State Space sequence model (S4) by introducing transformer-style attention mechanisms, and applies the result to speech recognition. S4-based models have the attractive property that they can operate as a CNN during training, but then convert to an efficient RNN at inference time. The proposed architecture, the “Stateformer”, achieves state-of-the-art performance on LibriSpeech.

Poster presentation for “Multi-Head State Space Model for Speech Recognition”
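The CNN/RNN duality that makes these models so attractive is easy to see in a toy example. The sketch below is my own illustrative construction (a tiny diagonal state-space model, not the Stateformer): stepping the recurrence sample by sample and convolving the input with the materialized kernel produce identical outputs.

```python
# Toy discrete state-space model: x_k = A x_{k-1} + B u_k, y_k = C x_k
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 16
A = np.diag(rng.uniform(0.5, 0.9, d_state))    # stable diagonal state matrix
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
u = rng.normal(size=seq_len)                   # input sequence

# Recurrent ("inference-time") view: step the hidden state one sample at a time.
x, y_rnn = np.zeros((d_state, 1)), []
for u_k in u:
    x = A @ x + B * u_k
    y_rnn.append((C @ x).item())

# Convolutional ("training-time") view: materialize the kernel K_k = C A^k B
# once, then convolve it with the whole input sequence in one shot.
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(seq_len)])
y_cnn = np.convolve(u, K)[:seq_len]

assert np.allclose(y_rnn, y_cnn)               # same outputs, two compute modes
```

S4’s contribution is a parameterization that keeps this kernel computation stable and efficient for very long sequences; the Stateformer adds the multi-head, attention-style structure on top.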

Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence

In this paper from researchers at KTH, the authors use synthesized speech to investigate the role of disfluencies in perceptions of competence, sincerity, and confidence. They found that listeners’ perceptions of confidence and competence decreased as the overall number of disfluencies increased. They also analyzed how different types of disfluencies, including filled pauses and repetitions, affected the listeners’ ratings, and found that repetitions, which are less common in spontaneous speech, had a greater impact on competence and confidence ratings. Finally, they showed that if listeners can attribute the disfluency to anxiety, its effect on competence ratings is reduced.

LanSER: Language-Model Supported Speech Emotion Recognition

This paper from researchers at KAIST and Google proposes a weak supervision method for speech emotion recognition (SER). The authors take massive speech corpora (People’s Speech and Condensed Movies) that contain no SER labels, run them through ASR (using Whisper), and then use an LLM to generate weak SER labels from the transcripts. These weak labels are used to pre-train their ResNet-based SER model, which takes acoustic features as input. After pre-training on the weak labels, the authors investigate using different amounts of hand-labeled data to fine-tune the model. The results show convincing improvements for pre-trained models over those trained without the weak labels, particularly when only a small amount of hand-labeled data is available.
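To make the weak-labeling recipe concrete, here is a hedged sketch of the labeling step: transcribe unlabeled audio with Whisper, then ask an LLM to map the transcript onto an emotion taxonomy. The query_llm function and the four-class taxonomy are placeholders of mine, not the authors’ setup, and the sketch assumes the openai-whisper package.

```python
import whisper  # assumes the openai-whisper package

EMOTIONS = ["angry", "happy", "neutral", "sad"]   # illustrative taxonomy

def weak_ser_label(audio_path: str, query_llm) -> str:
    """Produce a weak emotion label for one utterance via ASR + LLM prompting."""
    asr = whisper.load_model("base")
    transcript = asr.transcribe(audio_path)["text"].strip()
    prompt = (
        "Which emotion best describes the speaker of this utterance? "
        f"Answer with one of {EMOTIONS}.\n"
        f'Utterance: "{transcript}"\nEmotion:'
    )
    label = query_llm(prompt).strip().lower()     # query_llm is a hypothetical LLM wrapper
    return label if label in EMOTIONS else "neutral"  # fall back on parse failures

# The weak labels would then serve as pre-training targets for the SER model,
# before fine-tuning on whatever hand-labeled data is available.
```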

Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

This paper from researchers at KTH proposes a tool for the automatic evaluation of text-to-speech (TTS) systems, which is a notoriously difficult problem. Specifically, the tool assesses the naturalness of the turn-taking cues generated by TTS systems. It is built around the authors’ recently proposed Voice Activity Projection (VAP) model, a turn-taking prediction model that takes raw audio waveforms as input. The VAP model was on display in this year’s show and tell session, where it generated impressive predictions of turn-taking behaviors. To evaluate a TTS system, the VAP model predicts turn-taking behaviors from the system’s output for phrases that are naturally either turn-holding or turn-yielding. These predictions are then compared against the expected hold/yield behavior to compute various evaluation metrics. The authors provide an interesting evaluation of several commercial and open-source TTS systems.
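The scoring itself boils down to checking how often the turn-taking model’s prediction agrees with the known hold/yield status of each synthesized phrase. Here is a hedged sketch of that loop; predict_shift_prob is a hypothetical wrapper around a VAP-style model, not the authors’ code, and plain accuracy stands in for the paper’s richer set of metrics.

```python
from typing import Callable, List, Tuple

def turn_taking_accuracy(
    phrases: List[Tuple[str, bool]],              # (synthesized wav path, is_turn_yielding)
    predict_shift_prob: Callable[[str], float],   # hypothetical VAP-style wrapper
    threshold: float = 0.5,
) -> float:
    """Fraction of phrases whose predicted hold/yield matches the expected one."""
    correct = 0
    for wav_path, is_yielding in phrases:
        predicted_yield = predict_shift_prob(wav_path) > threshold
        correct += int(predicted_yield == is_yielding)
    return correct / len(phrases)
```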

Capturing Formality in Speech Across Domains and Languages

This paper from researchers at Columbia University and the University of Edinburgh investigates how well the linguistic notion of formality can be detected from acoustic and prosodic properties of speech. While the concept of formality is well studied in written language, it is less well understood in spoken language. The authors perform a correlation analysis between established textual measures of formality and proposed prosodic ones (e.g. speaking rate, pauses, jitter, shimmer). In a nutshell, it appears to be quite difficult to predict formality from their prosodic features. This was a surprising result for the researchers, who concluded that “non-lexical indicators of formality in speech may be more subtle than our initial expectations.” It’s nice to see papers like this that report an unexpected negative result.
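For readers who want to try something similar, the analysis essentially amounts to per-feature correlations between a text-derived formality score and prosodic measurements. The feature names below are illustrative, not the paper’s exact set.

```python
import numpy as np
from scipy.stats import pearsonr

def formality_correlations(utterances):
    """Correlate a text-derived formality score with prosodic features.

    `utterances` is a list of dicts, each holding a 'formality' score plus
    illustrative prosodic measurements for one utterance.
    """
    formality = np.array([u["formality"] for u in utterances])
    results = {}
    for feat in ("speaking_rate", "pause_ratio", "jitter", "shimmer"):
        values = np.array([u[feat] for u in utterances])
        r, p = pearsonr(formality, values)   # correlation coefficient and p-value
        results[feat] = (r, p)
    return results
```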

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

This paper from researchers at Google presents a model that is primarily a speech separation model, but can also do ASR and TTS! The model is trained on two main tasks: separating and transcribing each speech source, and generating speech from text. It uses two precursor models, w2v-BERT and SoundStream, to generate discrete phonetic and acoustic representations. These discrete representations are computed for the mixed (two-speaker) audio, as well as for each original unmixed speaker’s audio. A T5-based encoder-decoder sequence-to-sequence model is trained to predict the unmixed tokens, as well as the transcripts, from the input mixed tokens and masked versions of the transcripts. To generate the output waveforms, the SoundStream decoder converts the predicted SoundStream tokens back into audio. Some impressive examples of the separated speech and the generated TTS output are available here.
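If it helps to see the framing spelled out, here is a hedged sketch of how one training example might be assembled. The token-extraction functions are hypothetical stand-ins for w2v-BERT and SoundStream; the real system is a T5-style sequence-to-sequence model over these discrete tokens.

```python
def build_training_example(mixture_wav, source_wavs, transcripts,
                           phonetic_tokens, acoustic_tokens, mask_text):
    """Assemble one TokenSplit-style (inputs, targets) pair (illustrative only)."""
    inputs = {
        # Discrete tokens of the two-speaker mixture...
        "mixture_phonetic": phonetic_tokens(mixture_wav),   # w2v-BERT-style tokens
        "mixture_acoustic": acoustic_tokens(mixture_wav),   # SoundStream-style tokens
        # ...plus partially masked transcripts of each source.
        "masked_transcripts": [mask_text(t) for t in transcripts],
    }
    targets = {
        # The model predicts each source's tokens and its full transcript;
        # a SoundStream decoder turns predicted acoustic tokens back into audio.
        "source_phonetic": [phonetic_tokens(w) for w in source_wavs],
        "source_acoustic": [acoustic_tokens(w) for w in source_wavs],
        "transcripts": list(transcripts),
    }
    return inputs, targets
```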

To next year…

It was great to catch up with so many old friends and colleagues at this year’s event. We’ll hopefully see everybody again next year in Jerusalem.
