Over the last decades, the development of digitization techniques, together with the increasing capacity and decreasing cost of storage media, has led to an explosion of digital content. Huge amounts of audio, images, and video are generated daily and are mostly stored in unstructured repositories of multimedia information, much of which can be accessed through the Internet. To cope with these volumes of data, tools and standards for data search and analysis are essential. In particular, there is a need for efficient techniques that can extract semantic information directly from the content. The work presented in this book proposes methods to bridge the semantic gap in the computer vision and musical audio mining domains. We propose features that are related to human perception, along with methods relying on machine learning techniques to relate these features to concepts. Next, we present a system to model, disclose, and manage different types of metadata. Finally, we also tackle interoperability with metadata standards.