Document Layout Analysis and Hierarchical Summarization for Web Pages

Tunga Güngör

A search engine is a web information retrieval system that end users rely on frequently. Users scan the information a search engine displays in response to a query, try to identify the relevant pages, and load them to find answers to their information needs. Current search engines usually extract a few lines containing the query terms from a web page and display them as a representation of the document. Such extracts make it hard for the user to judge the relevance of a page for two reasons: they are too short to convey much information, and they focus only on the query words. In this talk, we present a novel approach that displays query results in the form of summaries of the web pages. We propose a system that performs document layout analysis and learns a summarization model using a number of machine learning techniques. The summarization framework makes use of new heuristics that take the output of the layout analysis into account. Experiments on two standard datasets showed that the proposed methodology significantly outperforms traditional search engines.

