Tamir Hassan, Robert Baumgartner:
Using Graph Matching Techniques to Wrap Data from PDF Documents.
Abstract
Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
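The idea of matching a pattern against a graph of page blocks can be illustrated with a minimal sketch. It is not the authors' method or Lixto's API. The Block class, the "right_of" and "below" relations, and the brute-force matcher are all assumptions made for illustration, and the abstract does not specify how blocks are segmented or which relations the graph uses.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """A text block on a page (hypothetical representation)."""
    text: str
    x: float  # left edge
    y: float  # top edge, increasing downwards
    w: float  # width
    h: float  # height


def relation(a, b):
    """Return how b lies relative to a: "right_of", "below", or None."""
    v_overlap = a.y < b.y + b.h and b.y < a.y + a.h
    h_overlap = a.x < b.x + b.w and b.x < a.x + a.w
    if v_overlap and b.x >= a.x + a.w:
        return "right_of"  # b is to the right of a, on the same line
    if h_overlap and b.y >= a.y + a.h:
        return "below"     # b is underneath a, in the same column
    return None


def build_graph(blocks):
    """Map (i, j) index pairs to the relation of block j relative to block i."""
    edges = {}
    for i, a in enumerate(blocks):
        for j, b in enumerate(blocks):
            if i != j:
                r = relation(a, b)
                if r is not None:
                    edges[(i, j)] = r
    return edges


def find_matches(blocks, edges, node_preds, pattern_edges):
    """Brute-force subgraph matching (assumed technique, small inputs only).

    node_preds: one predicate per pattern node, applied to a Block.
    pattern_edges: {(p, q): relation} required between pattern nodes p and q.
    """
    matches = []
    for cand in permutations(range(len(blocks)), len(node_preds)):
        if not all(pred(blocks[c]) for pred, c in zip(node_preds, cand)):
            continue
        if all(edges.get((cand[p], cand[q])) == r
               for (p, q), r in pattern_edges.items()):
            matches.append(cand)
    return matches


# Invented sample page: two label/value rows.
blocks = [
    Block("Price:", 10, 10, 40, 10),
    Block("42.00", 60, 10, 40, 10),
    Block("Name:", 10, 30, 40, 10),
    Block("Widget", 60, 30, 40, 10),
]
edges = build_graph(blocks)

# Pattern: a label ending in ":" with any block directly to its right.
node_preds = [lambda b: b.text.endswith(":"), lambda b: True]
pattern_edges = {(0, 1): "right_of"}

matches = find_matches(blocks, edges, node_preds, pattern_edges)
pairs = [(blocks[i].text, blocks[j].text) for i, j in matches]
print(pairs)  # [('Price:', '42.00'), ('Name:', 'Widget')]
```

Each match plays the role of one wrapping instance: a set of blocks whose contents and spatial relations fit the pattern. Exhaustive enumeration is used here only for brevity, since its cost grows quickly with the number of blocks and pattern nodes.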
URL:
http://rewerse.net/publications/rewerse-publications.html#REWERSE-RP-2006-085
@inproceedings{REWERSE-RP-2006-085,
  author    = {Tamir Hassan and Robert Baumgartner},
  title     = {Using Graph Matching Techniques to Wrap Data from PDF Documents},
  booktitle = {Proceedings of 15th International World Wide Web Conference, Edinburgh, Scotland (23rd--26th May 2006)},
  year      = {2006},
  pages     = {901--902},
  url       = {http://rewerse.net/publications/rewerse-publications.html#REWERSE-RP-2006-085}
}