Over the last decade, deep learning has revolutionized the research fields of audio and speech signal processing, acoustic scene analysis, and music information retrieval. In these fields, deep learning methods have achieved remarkable performance across a wide range of applications and tasks, surpassing legacy approaches that rely on the separate use of signal processing operations and machine learning algorithms. Much of this success stems from the ability of deep models to learn representations of sound signals that are useful for various downstream tasks. These representations encapsulate the underlying structure or features of the sound signals, or the latent variables that describe the underlying statistics of the respective signals.
Despite this success, learning representations of audio with deep models remains challenging. For example, the diversity of acoustic noise, the multiplicity of recording devices (e.g., high-end microphones vs. smartphones), and source variability challenge machine learning methods when they are deployed in realistic environments. In audio event detection, which has recently become a very active research field, systems for the automatic detection of multiple overlapping events are still far from reaching human performance. Another major challenge is the design of robust speech processing systems. Speech enhancement technologies have improved significantly in recent years, notably thanks to deep learning methods; however, there is still a large performance gap between controlled environments and real-world situations. As a final example, in the music information retrieval field, modeling high-level semantics based on local and long-term relations in music signals remains a core challenge. More generally, self-supervised approaches that can leverage large amounts of unlabeled data are very promising for learning models that can serve as a powerful base for many applications and tasks. Thus, it is of great interest for the scientific community to find new methods for representing audio signals using hierarchical models, such as deep neural networks. This will enable novel learning methods to leverage the large amount of information that audio, speech, and music signals convey.
The aim of this session is to establish a venue where engineers, scientists, and practitioners from both academia and industry can present and discuss cutting-edge results in representation learning for audio, speech, and music signal processing. Driven by the constantly increasing popularity of audio, speech, and music representation learning, the organizing committee of this session is motivated to build, in the long term, a solid reference within the computational intelligence community for the digital audio field.
The scope of this proposed special session is representation learning, focused on audio, speech, and music. Representation learning is one of the main aspects of neural networks. The proposed special session is therefore well aligned with the scope of the IJCNN, as it focuses on a core aspect of neural networks, namely representation learning.
The topics of the special session include (but are not limited to):
You can contact us about anything related to the special session by creating an issue at the GitHub repository of this website, or by sending an email to Konstantinos Drossos using firstname [dot] lastname [at] tuni [dot] fi.
K. Drossos is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.