Representation Learning for Audio, Speech, and Music Processing

Important dates

  • Paper submission: 10th of February, 2021 (extended from the 15th of January)

  • Notification of Paper Acceptance: 10th of April, 2021 (previously the 15th of March)

  • Camera-Ready Paper Due: 25th of April, 2021 (previously the 30th of March)

  • Conference: 18th - 22nd of July, 2021


Click here to submit your paper!


Scope and Topics

In the last decade, deep learning has revolutionized the research fields of audio and speech signal processing, acoustic scene analysis, and music information retrieval. In these research fields, methods relying on deep learning have achieved remarkable performance in various applications and tasks, surpassing legacy methods that rely on the independent usage of signal processing operations and machine learning algorithms. The huge success of deep learning methods relies on their ability to learn representations from sound signals that are useful for various downstream tasks. These representations encapsulate the underlying structure or features of the sound signals, or the latent variables that describe the underlying statistics of the respective signals.

Despite this success, learning representations of audio with deep models remains challenging. For example, the diversity of acoustic noise, the multiplicity of recording devices (e.g., high-end microphones vs. smartphones), and the source variability challenge machine learning methods when they are used in realistic environments. In audio event detection, which has recently become a vigorous research field, systems for the automatic detection of multiple overlapping events are still far from reaching human performance. Another major challenge is the design of robust speech processing systems. Speech enhancement technologies have significantly improved in the past years, notably thanks to deep learning methods. However, there is still a large performance gap between controlled environments and real-world situations. As a final example, in the music information retrieval field, modeling the high-level semantics based on local and long-term relations in music signals is still a core challenge. More generally, self-supervised approaches that can leverage a large amount of unlabeled data are very promising for learning models that can serve as a powerful base for many applications and tasks. Thus, it is of great interest for the scientific community to find new methods for representing audio signals using hierarchical models, such as deep neural networks. This will enable novel learning methods to leverage the large amount of information that audio, speech, and music signals convey.

The aim of this session is to establish a venue where engineers, scientists, and practitioners from both academia and industry can present and discuss cutting-edge results in representation learning for audio, speech, and music signal processing. Driven by the constantly increasing popularity of audio, speech, and music representation learning, the organizing committee of this session aims to build, in the long term, a solid reference within the computational intelligence community for the digital audio field.

The scope of this proposed special session is representation learning, focused on audio, speech, and music. Since representation learning is a core aspect of neural networks, the special session is well aligned with the scope of IJCNN.

The topics of the special session include (but are not limited to):

  • Audio, speech, and music signal generative models and methods
  • Single and multi-channel methods for separation, enhancement, and denoising
  • Spatial analysis, modification, and synthesis for augmented and virtual reality
  • Detection, localization, and tracking of audio sources/events
  • Style transfer, voice conversion, digital effects, and personalization
  • Adversarial attacks and real/synthetic discrimination methods
  • Information retrieval and classification methods
  • Multi- and inter-modal models and methods
  • Self-supervised/metric learning methods
  • Domain adaptation, transfer learning, knowledge distillation, and K-shot approaches
  • Methods based on differentiable signal processing
  • Privacy-preserving methods
  • Interpretability and explainability in deep models for audio
  • Context and structure-aware approaches

Organizing Committee

  • Konstantinos Drossos, Audio Research Group, Tampere University, Finland

  • Xavier Favory, Music Technology Group, Universitat Pompeu Fabra, Spain

  • Paul Magron, CNRS, IRIT, Université de Toulouse, France

  • Stylianos I. Mimilakis, Fraunhofer Institute for Digital Media Technology, Germany

  • Emanuele Principi, Università Politecnica delle Marche, Italy

You can contact us about anything regarding the special session by creating an issue at the GitHub repository of this website, or by sending an email to Konstantinos Drossos using firstname [dot] lastname [at] tuni [dot] fi.

K. Drossos is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.