
The Mirror DBMS at TREC-9

Arjen P. de Vries
arjen@acm.org
CWI, Amsterdam, The Netherlands

Abstract

The Mirror DBMS is a prototype database system especially designed for multimedia and web retrieval. From a database perspective, this year's purpose has been to check whether we can get sufficient efficiency on the larger data set used in TREC-9. From an IR perspective, the experiments are limited to rather primitive web retrieval, teaching us that web retrieval is (un?)fortunately not just retrieving text from a different data source. We report on some limited (and disappointing) experiments in an attempt to benefit from the manually assigned data in the meta tags. We further discuss observations with respect to the effectiveness of title-only topics.

1 Introduction

The Mirror DBMS [dV99] combines content management and data management in a single system. The main advantage of such integration is the facility to combine IR with traditional data retrieval. Furthermore, IR researchers can experiment more easily with new retrieval models, using and combining various sources of information. The IR retrieval model is completely integrated in the Mirror DBMS database architecture, emphasizing efficient set-oriented query processing. The logical layer of its architecture supports a nested object algebra called Moa; the physical layer uses the Monet main-memory DBMS and its MIL query language [BK99]. Experiments performed in last year's evaluation are described in [dVH99]; its support for IR is presented in detail in [dV98] and [dVW99].

The main goal of this year's participation in TREC has been to migrate from plain text retrieval to retrieving web documents, and simultaneously to improve our algorithms to handle the significantly larger collections.

The paper is organized as follows. Section 2 details our lab environment. Section 3 interprets our results and discusses our plans for next year with the Mirror DBMS, followed by conclusions.

[Figure 1: The intermediate format produced (XML).]

2 Lab Environment

This section discusses the processing of the WT10g data collection used to produce our runs. The hardware platform running the experiments is a (dedicated) dual Pentium III 600 MHz machine, running Linux, with 1 Gb of main memory and 100 Gb of disk space.

Adapting our existing IR setup to handling Web data caused much more trouble than expected. As a side-effect of this problem, the submitted runs contained some errors, and even fixing them does not give us the best runs ever; a lot of work still remains to be done to improve our current platform. Managing a new document collection and getting it indexed has turned out to be a rather time-consuming problem. Obviously, the WT10g is 10 times larger than TREC, so we decided to treat it as a collection of 104 subcollections following the layout on the compact discs. But handling a collection of this size was not the real issue; our main problems related to the `quality' of data gathered from the web.

2.1 Parsing

After some initial naive efforts to hack a home-grown HTML parser, we bailed out and used the readily available Perl package HTML::Parser. It is pretty good at `correcting' bad HTML on-the-fly; the only real problem we bumped into was that it assumes a document to always have at least a <head> and a <body>, which fails on WTX089-B33.
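Purely for illustration, a minimal Python sketch of such an event-driven parse is given below. This is not our actual Perl code: Python's standard html.parser module stands in for HTML::Parser, the selection of ALT text and meta tags anticipates the conversion step described in the next paragraph, and the class and function names are ours.

    # Illustrative only -- a Python stand-in for the Perl HTML::Parser setup.
    # Like HTML::Parser, html.parser is event-driven and tolerates bad HTML.
    from html.parser import HTMLParser

    class WebDocParser(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.text = []        # `normal' content words
            self.alt_text = []    # ALT attributes of <img> tags
            self.meta = {}        # selected meta tags

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img" and attrs.get("alt"):
                self.alt_text.append(attrs["alt"])
            elif tag == "meta":
                name = (attrs.get("name") or "").lower()
                if name in {"keywords", "description", "classification",
                            "abstract", "author", "build"}:
                    self.meta[name] = attrs.get("content", "")

        def handle_data(self, data):
            if data.strip():
                self.text.append(data.strip())

    def parse_document(raw_html):
        parser = WebDocParser()
        parser.feed(raw_html)   # recovers from most malformed markup
        parser.close()
        return parser.text, parser.alt_text, parser.meta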
We convert the sloppy HTML documents into (rather simple) XML documents that are easier to manage in the subsequent indexing steps. We keep the `normal' content words, the content of the ALT attribute of <img> tags, as well as the following meta tags: keywords, description, classification, abstract, author, build. In this first step, we also normalize the textual data to some extent by converting to lower-case and throwing out `strange characters'; unfortunately, due to working against a very tight schedule (too tight), this included the removal of all punctuation and numeric characters (not helping topics referring to a particular year, like topics 456 and 481). An example result file is shown in Figure 1.

What affected our results severely is our assumption that HTML documents are nicely wrapped in <html> and </html> tags. Unfortunately, this first `bug' removed about half of the collection from our index.

2.2 Indexing

The second step reads the intermediate XML files and converts them into load tables for our database system. The contents of the title, body, and img tags are unioned together in `term' sets, and all other tags in `meta' sets [1]. Notice that we do not have to union these tags together; but any alternative requires an IR model that can handle fields properly, which is still beyond our skill. After loading these tables, the complete collection is represented by the following schema:

    define WT10g_docs as
        SET< SET< TUPLE< Atomic : docno,
                         SET< Atomic > : term,
                         SET< Atomic > : meta > > : subCollection >;

[1] Notice that these `sets' are really multi-sets or bags.

The Mirror DBMS supports efficient ranking using the CONTREP domain-specific Moa structures, specialized in IR. These structures now have to be created from the above representation; this indexing procedure is still implemented in a separate indexing MIL script which is directly fed into Monet, the DBMS. The script performs stopping and stemming, creates the global statistics, and creates the internally used <d_j, t_i, tf_{i,j}> tuples.

Web data is quite different from newspaper articles: the strangest terms can be found in the indexing vocabulary after stopping and stemming. Examples vary from `yippieyayeheee' and the like, to complete sentences lacking spaces between the words. After quick inspection, we decided to prune the vocabulary aggressively: all words longer than 20 characters are plainly removed from the indexing vocabulary, as well as all words containing a sequence of more than four identical characters [2].

[2] We realize that this is a rather drastic ad-hoc approach; it is not likely to survive into the codebase of next year.

2.3 Retrieval

After running the indexing script, we obtain the following schema that can be used in Moa expressions to perform ranking:

    define WT10g_index as
        SET< SET< TUPLE< Atomic : docno,
                         CONTREP : term,
                         CONTREP : meta > > : subCollection >;

The Monet database containing both the parsed web documents and its index takes 9 Gb of disk space. As in TREC-8, we use Hiemstra's LMM retrieval model (see also [Hie00] and our technical report [HdV00]). It builds a simple statistical language model for each document in the collection. The probability that a query T_1, T_2, ..., T_n of length n is generated by the language model of the document with identifier D is defined by the following equation:

    P(T_1 = t_1, \ldots, T_n = t_n \mid D = d)
        = \prod_{i=1}^{n} \left( \alpha_1 \frac{df(t_i)}{\sum_t df(t)}
                               + \alpha_2 \frac{tf(t_i, d)}{\sum_t tf(t, d)} \right)    (1)

The getBL operator (i.e., get beliefs) defined for the CONTREP structure ranks documents according to this retrieval model.
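To make equation (1) concrete, the following Python sketch scores documents the way the retrieval model prescribes. It is an illustration only, not the getBL/MIL implementation: the smoothing weights alpha1 and alpha2, the dictionary-based data layout, and the function names are assumptions.

    import math
    from collections import Counter

    def rank_documents(query_terms, docs, df, alpha1=0.15, alpha2=0.85):
        """Rank documents with the language model of equation (1).

        query_terms : stopped and stemmed query terms
        docs        : dict mapping docno -> Counter of term frequencies tf(t, d)
        df          : Counter of document frequencies df(t) over the collection
        The alpha values are illustrative, not the weights used in our runs.
        """
        sum_df = sum(df.values())
        scores = {}
        for docno, tf in docs.items():
            doc_len = sum(tf.values()) or 1       # guard against empty documents
            score = 0.0
            for t in query_terms:
                # Mixture of collection statistics and document statistics.
                p = alpha1 * df[t] / sum_df + alpha2 * tf[t] / doc_len
                score += math.log(p) if p > 0.0 else float("-inf")
            scores[docno] = score                 # log of the product in eq. (1)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

In the Mirror DBMS, of course, this computation is not performed one document at a time, but set-at-a-time over the CONTREP structures via getBL.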
We planned two series of experiments: ranking using the raw content collected in the `term' sets, and ranking using a weighted combination of the first ranking with the ranking based on the annotations extracted from the meta tags collected in the `meta' sets. These two experiments are (approximately) described by the following Moa expressions [3]:

    (1) flatten(map[ map[ TUPLE< THIS.docno,
                                 sum(getBL(THIS.term, termstat, query)) > ](THIS)
                   ](WT10g_index));

    (2) map[ TUPLE< THIS.docno, THIS.termBel + 0.1*THIS.metaBel > ](
            flatten(map[ map[ TUPLE< THIS.docno,
                                     sum(getBL(THIS.term, termstat, query)) : termBel,
                                     sum(getBL(THIS.meta, metastat, query)) : metaBel > ](THIS)
                       ]( WT10g_index )));

[3] Details like sorting and selecting the top ranked documents have been left out.

We do not expect the reader to grasp the full meaning of these queries, but only intend to give an overall impression: the inner map computes the conditional probabilities for documents in a subcollection, which are accessed with the outer map; the flattens remove nesting.

Similar to TREC-8, the query plan generated by the Moa rewriter (in MIL) has been manually edited to loop over the 50 topics, log the computed ranking for each topic, and use two additional tables, one with precomputed normalized inverse document frequencies (a materialized view), and one with the document-specific constants for normalizing the term frequencies. The flatten and the outer map, which iterate over the 104 subcollections and merge the results, were written directly in MIL as well. Unfortunately, we introduced a second (real) bug in this merging phase, which messed up our main results: we used slice(1,1000), whereas this really selects from the 2nd up to the 1001st ranked document, hence throwing out the 104 `best' documents (best according to our model).

3 Discussion

Table 1 presents a summary of the average precision (AP, as reported by trec_eval) measured on our runs. The first column has the results in which the top 100 documents are missing; the second column has the fixed results [4].

[4] The run with combined term and meta data has not been fixed, due to some strange software problems that we have not figured out yet.

    run name   description                        AP        AP (fixed)
    CWI0000    title                              0.0176    0.1814
    CWI0001    title & description                0.0174    0.1503
    CWI0002    title & description & narrative    0.0122    0.1081
    CWI0010    title (term & meta)                0.0125    -

    Table 1: Result summary.

Very surprising is the fact that using the description and/or narrative has not been helpful at all. This is completely different from our experience with evaluating TREC-6 and TREC-8 topics. Closer examination shows that the description (and sometimes the narrative) help significantly for the following topics:

    452: `do beavers live in salt water?'. Here, the description adds more general words such as `habitat';
    455: the description specifies `major league' and the narrative finally gives the desired `baseball';
    476: the description adds the scope (`television' and `movies') to the title `Jennifer Aniston';
    478: the description adds `mayor' to `Baltimore';
    498: the title does not mention `how many' and `cost', which reveal the real information need.

Precision drops however in most other cases; especially the very concise title queries 485 (`gps clock'), 497 (`orchid') and 499 (`pool cue') suffer from overspecification in the description and narrative. For example, topic 486 shows that the `casino' from the title is not weighted enough in comparison to generic terms like `Eldorado'. It warrants further investigation to see whether query term weighting would address this counter-intuitive result.
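For the combined run, expression (2) above amounts to a per-document linear combination of the two belief values. A minimal Python sketch of this combination follows; the 0.1 weight is taken from expression (2), while the dictionary layout and the handling of documents without a meta belief are assumptions.

    def combine_beliefs(term_bel, meta_bel, weight=0.1):
        """Combine per-document beliefs as in Moa expression (2).

        term_bel, meta_bel : dict mapping docno -> summed belief (termBel, metaBel)
        weight             : contribution of the meta representation (0.1 above)
        Documents without a meta belief keep their term-only score (assumption).
        """
        combined = {docno: bel + weight * meta_bel.get(docno, 0.0)
                    for docno, bel in term_bel.items()}
        return sorted(combined.items(), key=lambda item: item[1], reverse=True)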
We have not succeeded in making effective use of the information collected in the meta tags. Using only the meta tags leads to a poor average precision of 0.0033. Closer investigation of topics 451, 492, and 494, for which the meta tags do retrieve relevant documents in the top 10, shows that these documents are also ranked high using the bare document content. Thus, we can only conclude that in our current approach, the meta tags can safely be ignored. Despite this disappointing experience, we still hope that this information source may be more beneficial for processing blind relevance feedback.

Table 2 shows the results after spell-checking the topics semi-automatically using a combination of ispell and common sense, showing that this would have a minor positive effect on the title-only queries. Results for topics 463 (`Tartin'), 475 (`compostion'), and 487 (`angioplast7') improve significantly; but splitting `nativityscenes' in topic 464 causes a loss in precision, because the (Porter) stemmer reduces `nativity' to `nativ'. We conclude from the other runs with spell-checked topics that we should first address proper weighting of the title terms.

    run name   description                        AP
    CWI0000s   title                              0.1924
    CWI0001s   title & description                0.1525
    CWI0002s   title & description & narrative    0.1085

    Table 2: Results with spell-checked topics.

4 Conclusions

The honest conclusion of this year's evaluation should be that we underestimated the problem of handling Web data. Surprising is the performance of the title-only queries, which do better than queries including the description or even the narrative. It seems that the web-track topics are really different from the previous TREC topics in the ad-hoc task, for which we never weighted title terms differently from description or narrative terms.

For next year, our primary goal will be to improve the current indexing situation. The indexing process can be described declaratively using the notion of feature grammars described in [dVWAK00]. Also, we will split the indexing vocabulary into a `trusted' vocabulary (based on a proper dictionary), a collection of numeric terms and named entities, and a `trash' dictionary with rare and possibly nonsense terms. A third goal should be evident from the following quote from last year's TREC paper: `Next year, the Mirror DBMS should be ready to participate in the large WEB track.' In other words: we still intend to tackle the large web track. For our cross-lingual work in CLEF [dV00] we have to address the problems of spelling errors and named entities anyway, which is directly related to some of our TREC problems. Other ideas include working on using links and blind relevance feedback, as well as improving our understanding of how to effectively exploit the `meta' sets.

Acknowledgements

Henk Ernst Blok did a great just-in-time job of getting a late delivery of hardware prepared for running these experiments. Further thanks go to Djoerd Hiemstra and Niels Nes for their support with IR models and Monet, respectively.

References

[BK99] P.A. Boncz and M.L. Kersten. MIL primitives for querying a fragmented world. The VLDB Journal, 8(2):101-119, 1999.

[dV98] A.P. de Vries. Mirror: Multimedia query processing in extensible databases. In Proceedings of the Fourteenth Twente Workshop on Language Technology (TWLT14): Language Technology in Multimedia Information Retrieval, pages 37-48, Enschede, The Netherlands, December 1998.
[dV99] A.P. de Vries. Content and multimedia database management systems. PhD thesis, University of Twente, Enschede, The Netherlands, December 1999.

[dV00] A.P. de Vries. A poor man's approach to CLEF. In CLEF 2000: Workshop on Cross-Language Information Retrieval and Evaluation, Lisbon, Portugal, September 2000. Working Notes.

[dVH99] A.P. de Vries and D. Hiemstra. The Mirror DBMS at TREC-8. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), number 500-246 in NIST Special Publications, pages 725-734, Gaithersburg, Maryland, November 1999.

[dVW99] A.P. de Vries and A.N. Wilschut. On the integration of IR and databases. In Database Issues in Multimedia; Short Paper Proceedings, International Conference on Database Semantics (DS-8), pages 16-31, Rotorua, New Zealand, January 1999.

[dVWAK00] A.P. de Vries, M. Windhouwer, P.M.G. Apers, and M. Kersten. Information access in multimedia databases based on feature models. New Generation Computing, 18(4):323-339, August 2000.

[HdV00] D. Hiemstra and A.P. de Vries. Relating the new language models of information retrieval to the traditional retrieval models. Technical Report TR-CTIT-00-09, Centre for Telematics and Information Technology, May 2000.

[Hie00] D. Hiemstra. A probabilistic justification for using tf×idf term weighting in information retrieval. International Journal on Digital Libraries, 3(2):131-139, 2000.
