Retrieving Content and Structure
Lecturer(s): Maarten de Rijke (Informatics Institute, University of Amsterdam), Jaap Kamps (University of Amsterdam), and Maarten Marx (University of Amsterdam)
Type: Advanced Course
Section: Language and Computation
Time: 14.00-15.30 (Slot 3)
Room: EM 1.82


The web is the world's largest knowledge base.  Data on the web
resides in various formats (e.g., HTML, XML, text files, relational databases).
To accommodate all forms of data and access to it, the database
research community has introduced the "semistructured data model,"
where data is self-describing, irregular, and graph-like.  The new
model naturally captures Web data, such as HTML, XML, or other
application-specific formats.

In this course we present and explore different ways of
retrieving information from semistructured documents.  To access
such documents we must use technology from both the database and
the information retrieval disciplines.  This combination makes
"Retrieving Content and Structure" a challenging and exciting
research topic, with many open questions.  

Here is a day-by-day breakdown of the envisaged course:

Day 1. "Content Only"
- overview of document retrieval
- basic retrieval models
- language modeling for information retrieval
- evaluation; TREC (Text REtrieval Conference)
- indexing content

Day 2. "Content and Light-Weight Structure"
- HTML documents, documents annotated with metadata
- Web retrieval (anchor text, link information)
- evaluation
- indexing light-weight structure

Day 3. "Structure"
- XML documents
- XPath
- query evaluation for XML documents
- expressiveness
- complexity

Day 4. "Content and Structure 1"
- information retrieval approaches
- database approaches
- evaluation; INEX (INitiative for the Evaluation of
  XML retrieval)
- indexing content and structure

Day 5. "Content and Structure 2"
- evaluation metrics
- applications and application scenarios
- future


© ESSLLI 2005 Organising Committee 2004-12-01