Gosse Bouma

From WikiPapers

Gosse Bouma is an author.


Only those publications related to wikis are shown here.
Title: Term extraction from sparse, ungrammatical domain-specific documents
Keywords: Business intelligence; Natural Language Processing; Product development-customer service; Term extraction; Text mining
Published in: Expert Systems with Applications
Language: English
Date: 2013
Abstract: Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. terms with more than two words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify two-word terms. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.
Title: Classifying image galleries into a taxonomy using metadata and wikipedia
Keywords: Classification; Hierarchical classification; Image gallery
Published in: Lecture Notes in Computer Science
Language: English
Date: 2012
Abstract: This paper presents a method for the hierarchical classification of image galleries into a taxonomy. The proposed method links textual gallery metadata to Wikipedia pages and categories. Entity extraction from metadata, entity ranking, and selection of categories are based on Wikipedia and do not require labeled training data. The resulting system performs well above a random baseline, and achieves a (micro-averaged) F-score of 0.59 on the 9 top categories of the taxonomy and 0.40 when using all 57 categories.
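The micro-averaged F-score reported above pools true positives, false positives, and false negatives across all categories before computing precision and recall, so frequent categories weigh more than rare ones. A minimal sketch of that pooling (the per-category counts are made up for illustration, not taken from the paper):

```python
def micro_f_score(per_class_counts):
    """Micro-averaged F1: sum TP/FP/FN over all categories first,
    then compute a single precision, recall, and F1 from the pooled
    totals (in contrast to macro-averaging, which averages per-class F1)."""
    tp = sum(c["tp"] for c in per_class_counts)
    fp = sum(c["fp"] for c in per_class_counts)
    fn = sum(c["fn"] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for three taxonomy categories
counts = [{"tp": 8, "fp": 2, "fn": 4},
          {"tp": 5, "fp": 5, "fn": 1},
          {"tp": 2, "fp": 1, "fn": 3}]
print(round(micro_f_score(counts), 2))  # 0.65
```

Because the totals are pooled, a classifier that does well on the few large top-level categories can score markedly higher micro-F on them (0.59 here) than over the full 57-category taxonomy (0.40).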
Title: Cross-lingual Alignment and Completion of Wikipedia Templates
Language: English
Date: 2009
Title: Cross-lingual Dutch to English alignment using EuroWordNet and Dutch Wikipedia
Published in: CEUR Workshop Proceedings
Language: English
Date: 2009
Abstract: This paper describes a system for linking the thesaurus of the Netherlands Institute for Sound and Vision to English WordNet and DBpedia. We used EuroWordNet, a multilingual wordnet, and Dutch Wikipedia as intermediaries for the two alignments. EuroWordNet covers most of the subject terms in the thesaurus, but the organization of the cross-lingual links makes selection of the most appropriate English target term almost impossible. Using page titles, redirects, disambiguation pages, and anchor text harvested from Dutch Wikipedia gives reasonable performance on subject terms and geographical terms. Many person and organization names in the thesaurus could not be located in (Dutch or English) Wikipedia.
Title: Linking Dutch wikipedia categories to eurowordnet
Published in: Computational Linguistics in the Netherlands 2009 - Selected Papers from the 19th CLIN Meeting, CLIN 2009
Language: English
Date: 2009
Abstract: Wikipedia provides category information for a large number of named entities, but the category structure of Wikipedia is associative and not always suitable for linguistic applications. For this reason, a merger of Wikipedia and WordNet has been proposed. In this paper, we address the word sense disambiguation problem that needs to be solved when linking Dutch Wikipedia categories to polysemous Dutch EuroWordNet literals. We show that a method based on automatically acquired predominant word senses outperforms a method based on word overlap between Wikipedia supercategories and WordNet hypernyms. We compare the coverage of the resulting categorization with that of a corpus-based system that uses automatically acquired category labels.
Title: Question answering with Joost at CLEF 2007
Published in: Lecture Notes in Computer Science
Language: English
Date: 2008
Abstract: We describe our system for the monolingual Dutch and multilingual English-to-Dutch QA tasks. We describe the preprocessing of Wikipedia, inclusion of query expansion in IR, anaphora resolution in follow-up questions, and a question classification module for the multilingual task. Our best runs achieved 25.5% accuracy for the Dutch monolingual task, and 13.5% accuracy for the multilingual task.