Browse wiki

An efficient web-based wrapper and annotator for tabular data
Abstract In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, data can be handled in a way similar to instances of a traditional database, which in turn can facilitate application of web data integration and various other domain specific problems. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system can automatically discover table structure by relevant pattern mining from web pages in an efficient way, and can generate regular expression for the extraction process. Moreover, the proposed system can assign intuitive column names to the columns of the extracted table by leveraging Wikipedia knowledge base for the purpose of table annotation. To improve accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm works in linear time.
Bibtextype article  +
Doi 10.1142/S0218194010004657  +
Has author Amin M.S. + , Jamil H. +
Has extra keyword Back-end database + , Colocations + , Data extraction + , Domain specific + , Extraction process + , Extraction techniques + , Human intervention + , Information extraction + , Knowledge base + , Linear time + , Regular expression + , Relevant patterns + , Structural homogeneity + , Table structure + , Tabular data + , Web data integration + , Web page + , Wikipedia + , Wrapper generation algorithms + , Data handling + , Information analysis + , Knowledge based systems + , Knowledge management + , Websites +
Has keyword Information extraction + , Missing column name annotation + , Wrapper +
Issn 0218-1940  +
Issue 2  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 215–231  +
Published in International Journal of Software Engineering and Knowledge Engineering +
Title An efficient web-based wrapper and annotator for tabular data +
Type journal article  +
Volume 20  +
Year 2010 +
Creation date 6 November 2014 16:12:14  +
Categories Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Journal articles  + , Publications without references parameter  + , Publications  +
Modification date 6 November 2014 16:12:14  +
Date 2010  +