This talk describes the Xtrieve cross-modal information retrieval system, a prototype system for demonstrating cross-modal retrieval of multimedia content. Multimedia data is rapidly accumulating, and powerful computational methods for accessing specific content are desperately needed. This is a difficult problem, involving speech recognition, image understanding, and pattern recognition over a diverse range of noisy and imprecise data. General solutions to this problem are proving elusive. Cross-modal information retrieval is a new multimedia data access technology based on searching one medium in response to a query and presenting the result in an alternative, potentially heterogeneous medium. In many cases, correlations exist between media streams that can be exploited to facilitate this process and allow reliable retrieval. One example is the temporal relationship between speech audio and a transcript of that audio. If this temporal relationship can be discovered, a query into the text transcript can produce an audio result.
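The text-to-audio correlation described above can be sketched as a lookup from query words into a word-level time alignment. The following is a minimal illustration of the idea, not the Xtrieve implementation; the alignment data, function name, and phrase-matching strategy are all hypothetical assumptions:

```python
# Hypothetical word-level alignment: each transcript word mapped to its
# (start, end) time in the audio, in seconds, as a speech-to-text
# alignment process might produce.
ALIGNMENT = [
    ("cross",      0.0,  0.4),
    ("modal",      0.4,  0.9),
    ("retrieval",  0.9,  1.6),
    ("of",         1.6,  1.75),
    ("multimedia", 1.75, 2.5),
    ("content",    2.5,  3.1),
]

def audio_span_for_query(query_words, alignment=ALIGNMENT):
    """Return the (start, end) audio interval covering the first
    occurrence of the query phrase in the aligned transcript, or
    None if the phrase does not occur."""
    words = [w for w, _, _ in alignment]
    n = len(query_words)
    # Scan for the query phrase as a contiguous run of transcript words.
    for i in range(len(words) - n + 1):
        if words[i:i + n] == query_words:
            # Map the matched text span back to its audio time interval.
            return (alignment[i][1], alignment[i + n - 1][2])
    return None
```

For example, `audio_span_for_query(["modal", "retrieval"])` returns `(0.4, 1.6)`, the portion of the audio a playback interface would present as the retrieval result.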
Xtrieve is the first experimental system to specifically address cross-modal information retrieval. Xtrieve implements a general structure for cross-modal access. Specific cross-modal correlations of text to audio and text to alternative text are demonstrated. Many important issues addressed in the Xtrieve development process will be described, including retrieval granularity, temporal presentation of results, and content structure analysis. The underlying synchronization technology, multiple media stream correlation, will also be discussed. Additional applications for this new technology include language translation alignment.