Retrieving Content and Structure
Lecturer(s): Maarten de Rijke (Informatics Institute, University of Amsterdam), Jaap Kamps (University of Amsterdam), and Maarten Marx (University of Amsterdam)
Section: Language and Computation
Time: 14.00-15.30 (Slot 3)
The Web is the world's largest knowledge base. Data on the Web
resides in various formats (e.g., HTML, XML, text files, relational
databases). To accommodate all forms of data and access to it, the
database research community has introduced the "semistructured data
model," where data is self-describing, irregular, and graph-like. The
new model naturally captures Web data, such as HTML, XML, or other
application-specific formats.
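As a rough illustration of the three properties named above, the following sketch parses a tiny XML fragment with Python's standard library. The sample records, the `describe` helper, and the `cites` attribute are all hypothetical and exist only for this example. The element names carry the schema (self-describing), the two records have different fields (irregular), and the `cites` attribute points from one record to another (graph-like).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
DOC = """
<library>
  <book id="b1">
    <title>Example Book</title>
    <author>Author One</author>
    <author>Author Two</author>
  </book>
  <article id="a1" cites="b1">
    <title>Example Article</title>
    <year>2005</year>
  </article>
</library>
"""

root = ET.fromstring(DOC.strip())


def describe(elem):
    """Return the sorted set of child tag names of an element."""
    return sorted({child.tag for child in elem})


# Self-describing and irregular: each record reports its own fields,
# and the two records do not share the same set of fields.
shapes = {elem.get("id"): describe(elem) for elem in root}

# Graph-like: the (assumed) "cites" attribute refers to another record
# by id, adding an edge that is not part of the element tree.
edges = [(e.get("id"), e.get("cites")) for e in root if e.get("cites")]

print(shapes)
print(edges)
```

No fixed schema is declared anywhere: the structure is discovered from the data itself, which is the situation a relational table would not allow.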
In this course we present and explore different ways of
retrieving information from semistructured documents. To access
such documents we must use technology from both the database and
the information retrieval disciplines. This combination makes
"Retrieving Content and Structure" a challenging and exciting
research topic, with many open questions.
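A minimal sketch of what combining the two disciplines can look like, under stated assumptions: a structural constraint (database style) selects candidate elements, and a content score (information retrieval style) ranks them. The sample document and the `search` function are hypothetical, and the raw term count used here is only a stand-in for a real retrieval model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration only.
DOC = """
<article>
  <sec><title>Indexing</title><p>inverted index for XML elements</p></sec>
  <sec><title>Evaluation</title><p>retrieval evaluation with test collections</p></sec>
  <abstract>XML retrieval overview</abstract>
</article>
"""


def search(xml_text, path, query):
    """Rank the elements matching `path` by how often query terms occur in them."""
    root = ET.fromstring(xml_text.strip())
    terms = query.lower().split()
    results = []
    for elem in root.findall(path):  # structural constraint
        words = " ".join(elem.itertext()).lower().split()
        score = sum(words.count(t) for t in terms)  # naive content score
        if score > 0:
            results.append((score, elem.findtext("title", default="")))
    # Highest-scoring elements first.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


print(search(DOC, ".//sec", "xml index"))
```

The abstract also mentions "XML", but it is never scored because the path `.//sec` restricts the candidates to sections. Changing the path changes which elements compete for the same content query.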
Here is a day-by-day breakdown of the envisaged course:
Day 1. "Content Only"
- overview of document retrieval
- basic retrieval models
- language modeling for information retrieval
- evaluation; TREC (Text REtrieval Conference)
- indexing content
Day 2. "Content and Light-Weight Structure"
- HTML documents, documents annotated with metadata
- Web retrieval (anchor text, link information)
- indexing light-weight structure
Day 3. "Structure"
- XML documents
- query evaluation for XML documents
Day 4. "Content and Structure 1"
- information retrieval approaches:
- database approaches:
- evaluation; INEX (INitiative for the Evaluation of XML Retrieval)
- indexing content and structure
Day 5. "Content and Structure 2"
- evaluation metrics
- applications and application scenarios
© ESSLLI 2005 Organising Committee