Over the last decade, deep learning has revolutionized the research fields of audio and speech signal processing, acoustic scene analysis, and music information retrieval. In these fields, deep learning methods have achieved remarkable performance across a wide range of applications and tasks, surpassing legacy approaches that rely on the separate use of signal processing operations and machine learning algorithms. Much of this success stems from the ability of deep models to learn representations of sound signals that are useful for various downstream tasks. These representations encapsulate the underlying structure or features of the sound signals, or the latent variables that describe the underlying statistics of the respective signals.
Despite this success, learning representations of audio with deep models remains challenging. For example, the diversity of acoustic noise, the multiplicity of recording devices (e.g., high-end microphones vs. smartphones), and source variability challenge machine learning methods when they are deployed in realistic environments. In audio event detection, which has recently become a very active research field, systems for the automatic detection of multiple overlapping events are still far from reaching human performance. Another major challenge is the design of robust speech processing systems. Speech enhancement technologies have improved significantly in recent years, notably thanks to deep learning methods; however, there is still a large performance gap between controlled environments and real-world situations. As a final example, in the music information retrieval field, modeling high-level semantics based on local and long-term relations in music signals remains a core challenge. More generally, self-supervised approaches that can leverage large amounts of unlabeled data are very promising for learning models that can serve as a powerful base for many applications and tasks. Thus, it is of great interest for the scientific community to find new methods for representing audio signals using hierarchical models, such as deep neural networks. This will enable novel learning methods to leverage the large amount of information that audio, speech, and music signals convey.
The aim of this session is to establish a venue where engineers, scientists, and practitioners from both academia and industry can present and discuss cutting-edge results in representation learning for audio, speech, and music signal processing. Driven by the constantly increasing popularity of audio, speech, and music representation learning, the organizing committee of this session is motivated to build, in the long term, a solid reference within the computational intelligence community for the digital audio field.
The scope of this proposed special session is representation learning, focused on audio, speech, and music. Representation learning is one of the main aspects of neural networks. The proposed special session is therefore well aligned with the scope of the IJCNN, as it focuses on a core aspect of neural networks, namely representation learning.
The topics of the special session include (but are not limited to):
You can contact us about anything related to the special session by creating an issue at the GitHub repository of this website, or by sending an email to Konstantinos Drossos using firstname [dot] lastname [at] tuni [dot] fi.
K. Drossos is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.