REWERSE-RP-2006-085

Tamir Hassan, Robert Baumgartner:
Using Graph Matching Techniques to Wrap Data from PDF Documents.


Complete Text [.pdf, 177KB]
Poster [.pdf, 2.36MB]
In: Proceedings of 15th International World Wide Web Conference (WWW2006), Edinburgh, Scotland (23rd - 26th May 2006), 901-902, May 2006
© ACM Press

Abstract
Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
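
The idea of blocks as graph nodes and of matching a small pattern against that graph can be illustrated with a minimal sketch. It is not the authors' implementation and does not reflect Lixto's API: the Block class, the relation names "left_of" and "above", the top-left coordinate origin, and the brute-force matcher are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """Hypothetical page block: a bounding box plus its text (y grows downward)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def relation(a, b):
    """Return an assumed spatial relation from block a to block b, or None."""
    v_overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    h_overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    if v_overlap > 0 and a.x1 <= b.x0:
        return "left_of"  # same row, a ends before b starts
    if h_overlap > 0 and a.y1 <= b.y0:
        return "above"    # same column, a ends before b starts
    return None


def build_graph(blocks):
    """Relational graph as a dict: (source name, target name) -> relation label."""
    edges = {}
    for a, b in permutations(blocks, 2):
        rel = relation(a, b)
        if rel is not None:
            edges[(a.name, b.name)] = rel
    return edges


def find_matches(blocks, edges, pattern_nodes, pattern_edges):
    """Brute-force pattern matching: try every assignment of distinct blocks
    to pattern nodes and keep those satisfying all node and edge constraints."""
    names = list(pattern_nodes)
    matches = []
    for combo in permutations(blocks, len(names)):
        assign = dict(zip(names, combo))
        if not all(pattern_nodes[n](assign[n].text) for n in names):
            continue
        if all(edges.get((assign[u].name, assign[v].name)) == rel
               for u, v, rel in pattern_edges):
            matches.append({n: assign[n].text for n in names})
    return matches


# Invented example page: two label/value rows.
blocks = [
    Block("b1", 0, 0, 50, 10, "Price:"),
    Block("b2", 60, 0, 100, 10, "42.00"),
    Block("b3", 0, 20, 50, 30, "Qty:"),
    Block("b4", 60, 20, 100, 30, "3"),
]
edges = build_graph(blocks)

# Pattern: a label ending in ":" with a value block directly in the same row.
pattern_nodes = {
    "label": lambda t: t.endswith(":"),
    "value": lambda t: not t.endswith(":"),
}
pattern_edges = [("label", "value", "left_of")]

matches = find_matches(blocks, edges, pattern_nodes, pattern_edges)
for m in matches:
    print(m["label"], m["value"])
```

In this sketch each match plays the role of one wrapping instance. The exhaustive search over assignments is exponential in the pattern size and is used here only because it is short; it stands in for whatever relational matching technique the paper actually describes.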

URL:
http://rewerse.net/publications/rewerse-publications.html#REWERSE-RP-2006-085

BibTeX:

@inproceedings{REWERSE-RP-2006-085,
	author = {Tamir Hassan and Robert Baumgartner},
	title = {Using Graph Matching Techniques to Wrap Data from PDF Documents},
	booktitle = {Proceedings of 15th International World Wide Web Conference, Edinburgh, Scotland (23rd--26th May 2006)},
	year = {2006},
	pages = {901--902},
	url = {http://rewerse.net/publications/rewerse-publications.html#REWERSE-RP-2006-085}
}