Abstract:
In the information filtering (or publish/subscribe) paradigm, clients subscribe to a server with continuous queries that express their information needs, while information sources publish documents to servers. Whenever a document is published, the continuous queries satisfied by this document are found and notifications are sent to the appropriate subscribed clients. Although information filtering has been on the research agenda for about half a century, there is a striking paradox when it comes to benchmarking the performance of such systems: there is no benchmarking mechanism (in the form of a large-scale standardised test collection of continuous queries and the relevant document publications) specifically created for evaluating filtering tasks. This work aims to fill this gap by proposing a methodology for automatically creating massive continuous query datasets from available document collections. We intend to publicly release all related material (including the software accompanying the proposed methodology) to the research community after publication.