ACM CIKM'99/Tutorial Program

All tutorials will take place on November 2, 1999. Tutorial T3 will be in the morning; T2 and T4 will be in the afternoon.

T2
Information Retrieval

by Fredric C. Gey
University of California

1pm-5pm

Models in Information Retrieval, including Cross Language Retrieval

Information retrieval algorithms have emerged as the key to effective search of large collections of unstructured text such as those found on the Internet. Vector space algorithms are used by Lycos and AltaVista, while Inktomi uses a probabilistic document retrieval algorithm.

The three major theoretical models in information retrieval are Boolean/logic, vector space, and probabilistic. This tutorial will explain the unique characteristics and problems of each model and how each model has evolved along different lines. Modern variants of the basic models are explained.
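As a rough illustration of the vector space model named above, the following minimal sketch ranks documents against a query by cosine similarity over tf-idf weights. It is not taken from the tutorial materials: the function names (`idf_weights`, `tfidf_vector`, `cosine`, `rank`), the raw-count term frequency, and the log(N/df) form of IDF are assumptions chosen for brevity, and production systems use many refinements of this weighting.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms absent from the collection get weight 0."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    idf = idf_weights(docs)
    q = tfidf_vector(query, idf)
    scores = [(cosine(q, tfidf_vector(doc, idf)), i) for i, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

if __name__ == "__main__":
    # Hypothetical toy collection; each document is a list of tokens.
    docs = [
        ["cross", "language", "retrieval"],
        ["boolean", "logic", "model"],
        ["vector", "space", "model"],
    ]
    for score, index in rank(["vector", "model"], docs):
        print(index, round(score, 3))
```

A Boolean model would instead return the unranked set of documents satisfying a logical expression over terms, and a probabilistic model would order documents by an estimated probability of relevance; the sketch covers only the geometric view.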

A major application area of IR is cross-language retrieval, which combines linguistic techniques with traditional monolingual retrieval techniques. This is a burgeoning research area that deserves special attention, in particular to the techniques of machine translation, bilingual dictionaries, and corpora-based learning. I will also discuss the special challenges of Asian language retrieval (Japanese, Chinese, Indian subcontinent languages).

The attendees of this tutorial will obtain a basic understanding of the major theoretical models upon which modern text retrieval software is based. The tutorial should provide each participant with a starting point for further self-education.

Who should attend: This course is designed to provide a fast-paced yet rigorous introduction to the basic models of Information Retrieval for academic and industrial research and development computer scientists whose background lies outside the Information Retrieval area.

1/2 hour: Background and historical development: Luhn and statistical text characteristics; statistical weights and the IDF concept.
1/2 hour: Boolean set and logic models: fuzzy logic (RUBRIC/TOPIC); weighted Boolean and P-Norm (INQUERY); recent logic models.
1/2 hour: Vector space and geometric models: basic vector similarity measures; Latent Semantic Indexing. Probabilistic models: inference nets and neural network approaches; regression models.
1 hour: Cross-language retrieval: dictionary and machine-translation approaches; learning from bilingual corpora; special challenges of Asian languages.
1/2 hour: Performance measurement and analysis: recall, precision, and fallout measures; limitations to performance assessment (interjudge consistency, completeness); statistical significance tests.
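
For readers unfamiliar with the performance measures listed in the outline, the sketch below computes recall, precision, and fallout for a single query using their standard set-based definitions. The function name `evaluate` and its arguments are hypothetical, and the sketch assumes binary relevance judgments that are complete for the whole collection, which is exactly the assumption the outline flags as a limitation.

```python
def evaluate(retrieved, relevant, collection_size):
    """Set-based effectiveness measures for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    fallout   = non-relevant retrieved / all non-relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    nonrelevant_total = collection_size - len(relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
        "fallout": (len(retrieved) - hits) / nonrelevant_total
        if nonrelevant_total
        else 0.0,
    }

if __name__ == "__main__":
    # Hypothetical judgments: document ids are arbitrary integers.
    print(evaluate(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6], collection_size=10))
```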

Materials: 110 course overheads provided.

About the instructor: Fredric Gey's research specializes in probabilistic document retrieval using logistic regression techniques. He is principal investigator of NSF grant IRI 9630765, "Probabilistic Retrieval of Full-Text Document Collections Using Logistic Regression." He is co-principal investigator for the ARPA research contract "Search Support for Unfamiliar Metadata Vocabularies," July 1997-June 2000. He directs the UC Berkeley entries to the TREC conferences, and is designated as General Chairman for SIGIR99, to be held at the University of California, Berkeley during the summer of 1999. He holds a PhD in Information Science from UC Berkeley.

T3
Information Systems

by Christine Parent,
Stefano Spaccapietra,
and Esteban Zimanyi

8am-12pm

Spatio-Temporal Information Systems: A Conceptual Perspective

Despite the well-established benefits of conceptual modeling for application design, current spatio-temporal models do not cope satisfactorily with designers' requirements. In this tutorial we first identify the goals of a spatio-temporal conceptual model and then we describe the MADS model along the structural, spatial, and temporal dimensions. As the modeling concepts are orthogonal, the proposed model achieves both simplicity (as concepts are independent from each other) and expressive power (as concepts may be freely associated). The model has been implemented and can be translated to operational models of existing products. The tutorial briefly describes the architecture we defined for providing users with a set of conceptual interfaces for defining and accessing spatio-temporal information systems.

Finally, the tutorial reports on the results of an experiment that allowed us to assess the qualities of the model.

KEYWORDS: information systems, conceptual modeling, data models, spatial databases, temporal databases, database design, GIS, geographic information systems, CASE tools, practical experiments.

Outline of the Tutorial

Introduction
1. State of the Art: Review of different spatial and temporal database models, showing that they do not satisfactorily answer users' requirements.
2. The MADS Conceptual Model: General objectives pursued in the development of the MADS (Modeling of Application Data with Spatio-temporal features) model.
Modeling Classical Data
Structural features of the MADS model: object types, relationship types, attributes, generalization links, aggregation links, and associated integrity constraints.
Modeling Spatial Characteristics
1. Description of Spatiality: Description of the spatial abstract types supported by the MADS model. Generalization hierarchy of these abstract types.
2. Spatial Attributes and Object Types: Associating spatiality with information, as well as the corresponding constraints.
3. Interior and Envelope of Spatial Objects: Definition of the concepts of envelope, interior, and topological relationships for general composite objects.
4. Spatial Relationship Types: Explicitly describing spatial relationships in conceptual schemas. Two categories of spatial relationships: topological relationships and spatial aggregation.
5. Space-varying Attributes: Describing continuous fields using a first-class concept: reconciling the discrete (vector) view and the continuous (raster) view.
6. Generalization and Spatiality: Semantics of generalization when it relates spatial and non-spatial object types. Inheritance mechanisms when the spatiality of the supertype is referred to in the subtype: redefinition and refinement.
Modeling Temporal Characteristics
1. Temporal Attributes: Assigning temporality to attributes, both simple and complex: semantics and associated integrity constraints.
2. Temporal Object Types: Assigning temporality to objects: life cycle of objects. Temporal objects having temporal attributes.
3. Generalization and Temporality: Semantics of generalization when it relates temporal and non-temporal object types. Inheritance mechanisms when the temporality of the supertype is referred to in the subtype: redefinition and refinement.
4. Temporal Relationship Types: Assigning temporality to relationships. Semantics of non-temporal/temporal relationships relating non-temporal/temporal objects.
5. Modeling Temporal Dynamics: Describing inter-object dynamics where time plays an essential role. Categories of temporal relationships: transition relationship, generation relationship, timing relationship, and snapshot aggregation.
Implementation
1. Architecture of MADS Implementation: Providing users with a set of conceptual interfaces for spatio-temporal information systems: visual schema editor, visual query editor, sketch interface, cartographical interface, and front-end to the underlying GIS/DBMS.
2. The GESREAU Application: Example application in which the MADS model was used for the management of water resources of the Vaud county in Switzerland.
Conclusions and Perspectives

T4
Data Mining

by Rajeev Rastogi and
Kyuseok Shim
Bell Labs and KAIST

1pm-5pm

Recent Advances in Data Mining Algorithms on Large Databases

A large number of corporations have invested heavily in information technology to manage their businesses more effectively, and vast amounts of critical business data have been stored in database systems. The volume of this data is expected to grow considerably in the near future. Yet many organizations have been unable to collect valuable insights from the data to guide their marketing strategy, investment and management policies. One of the reasons for this is that most information is stored implicitly in the large amounts of data. Fortunately, new and sophisticated techniques being developed in the area of data mining can help companies leverage their data more effectively and extract insightful information from their data.

This tutorial describes the fundamental algorithms for data mining, many of which have been proposed in recent years. These techniques include association rules, correlation, causal relationships, clustering, outlier detection, similar time sequences, similar images, sequential patterns and classification. In addition, since we will cover technical material in some depth, the audience will get good exposure to the results in the area and to future research directions.

Who should attend: Professionals who would like an introduction to state-of-the-art data mining techniques and products for large databases.

Tutorial Outline

1. Introduction: Brief overview and discussion on data mining techniques developed for large databases.
2. Association Rules and Sequential Patterns: Presents association rules, optimized association rules, sequential patterns, correlations, and causal relationships. Also covers the idea of pushing constraints on these algorithms.
3. Bayesian Networks: Introduces the concept of Bayesian networks and their role in mining causal relationships.
4. Classification: Describes state-of-the-art classifiers for large databases. These include the PUBLIC, BOAT, RainForest, SLIQ and SPRINT algorithms. We also cover nearest neighbor and Bayesian classifiers.
5. Clustering: Covers the characteristics of traditional clustering algorithms and presents techniques developed for large databases. We cover the CURE, ROCK, CLARANS, DBSCAN, BIRCH and CLIQUE algorithms.
6. Similar Time Sequences and Similar Images: Describes existing techniques for similar time sequence and similar image retrieval. We also cover the QBIC, WBIIS and WALRUS systems developed for similar image retrieval.
7. Outlier Detection Algorithms: Presents recent techniques developed for mining outliers from large databases.
8. Other Applications and Future Research: Discusses other interesting problems and research issues in data mining.
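
To make the association rule topic in the outline concrete, here is a minimal sketch of the two standard measures attached to a rule A => B: support (the fraction of transactions containing both A and B) and confidence (the fraction of transactions containing A that also contain B). The function names and the market-basket data are hypothetical, and the sketch scans every transaction for each call, which is the cost that algorithms for large databases are designed to avoid.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / base

if __name__ == "__main__":
    # Hypothetical market-basket data; each transaction is a set of items.
    baskets = [
        frozenset({"bread", "milk"}),
        frozenset({"bread", "butter"}),
        frozenset({"bread", "milk", "butter"}),
        frozenset({"milk"}),
    ]
    print(support({"bread", "milk"}, baskets))
    print(confidence({"bread"}, {"milk"}, baskets))
```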

About the instructors: Rajeev Rastogi received the B. Tech degree in Computer Science from the Indian Institute of Technology, Bombay in 1988, and the master's and Ph.D. degrees in Computer Science from the University of Texas, Austin, in 1990 and 1993, respectively. He joined Bell Laboratories in Murray Hill, New Jersey, in 1993 and is currently a member of technical staff (MTS) in the Information Sciences Research Center.

Rajeev Rastogi is active in the field of databases and has served as a program committee member for several conferences in the area. His writings have appeared in a number of ACM and IEEE publications and other professional conferences and journals. His research interests include database systems, storage systems and knowledge discovery. His most recent research has focused on the areas of high-performance transaction systems, continuous-media storage servers, tertiary storage systems, data mining, and multidatabase transaction management.

Kyuseok Shim is currently leading the Serendip Data Mining project at Bell Laboratories. Before that, he worked for Rakesh Agrawal's Quest Data Mining project at IBM Almaden Research Center. He also worked as a summer intern for two summers at Hewlett Packard Laboratories. He received the B.S. degree in Electrical Engineering from Seoul National University, and the M.S. and Ph.D. degrees in Computer Science from the University of Maryland, College Park.

Kyuseok Shim has been working in the area of databases focusing on data mining, data warehousing, query processing and query optimization, and constraint-based database systems. He has published several research papers in prestigious database conferences and journals. He has also served as a program committee member on database and knowledge discovery conferences.