| Training data|
Training data is included as keyword or extra keyword in 0 datasets, 0 tools and 27 publications.
There are no datasets for this keyword.
There are no tools for this keyword.
|Title||Author(s)||Published in||Language||Date||Abstract||R||C|
|A supervised method for lexical annotation of schema labels based on wikipedia||Sorrentino S.
|Lecture Notes in Computer Science||English||2012||Lexical annotation is the process of explicit assignment of one or more meanings to a term w.r.t. a sense inventory (e.g., a thesaurus or an ontology). We propose an automatic supervised lexical annotation method, called ALA TK (Automatic Lexical Annotation -Topic Kernel), based on the Topic Kernel function for the annotation of schema labels extracted from structured and semi-structured data sources. It exploits Wikipedia as sense inventory and as resource of training data.||0||0|
|Mining the web for points of interest||Rae A.
|SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval||English||2012||A point of interest (POI) is a focused geographic entity such as a landmark, a school, an historical building, or a business. Points of interest are the basis for most of the data supporting location-based applications. In this paper we propose to curate POIs from online sources by bootstrapping training data from Web snippets, seeded by POIs gathered from social media. This large corpus is used to train a sequential tagger to recognize mentions of POIs in text. Using Wikipedia data as the training data, we can identify POIs in free text with an accuracy that is 116% better than the state of the art POI identifier in terms of precision, and 50% better in terms of recall. We show that using Foursquare and Gowalla checkins as seeds to bootstrap training data from Web snippets, we can improve precision between 16% and 52%, and recall between 48% and 187% over the state-of-the-art. The name of a POI is not sufficient, as the POI must also be associated with a set of geographic coordinates. Our method increases the number of POIs that can be localized nearly three-fold, from 134 to 395 in a sample of 400, with a median localization accuracy of less than one kilometer.||0||0|
|Cross-domain Dutch coreference resolution||De Clercq O.
|International Conference Recent Advances in Natural Language Processing, RANLP||English||2011||This article explores the portability of a coreference resolver across a variety of eight text genres. Besides newspaper text, we also include administrative texts, autocues, texts used for external communication, instructive texts, wikipedia texts, medical texts and unedited new media texts. Three sets of experiments were conducted. First, we investigated each text genre individually, and studied the effect of larger training set sizes and including genre-specific training material. Then, we explored the predictive power of each genre for the other genres conducting cross-domain experiments. In a final step, we investigated whether excluding genres with less predictive power increases overall performance. For all experiments we use an existing Dutch mention-pair resolver and report on our experimental results using four metrics: MUC, B-cubed, CEAF and BLANC. We show that resolving out-of-domain genres works best when enough training data is included. This effect is further intensified by including a small amount of genre-specific text. As far as the cross-domain performance is concerned we see that especially genres of a very specific nature tend to have less generalization power.||0||0|
|End-to-end Relation Extraction using distant supervision from external semantic repositories||Nguyen T.-V.T.
|ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies||English||2011||In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.||0||0|
|Georeferencing Wikipedia pages using language models from Flickr||De Rouck C.
Van Laere O.
|CEUR Workshop Proceedings||English||2011||The task of assigning geographic coordinates to web resources has recently gained in popularity. In particular, several recent initiatives have focused on the use of language models for georeferencing Flickr photos, with promising results. Such techniques, however, require the availability of large numbers of spatially grounded training data. They are therefore not directly applicable for georeferencing other types of resources, such as Wikipedia pages. As an alternative, in this paper we explore the idea of using language models that are trained on Flickr photos for finding the coordinates of Wikipedia pages. Our experimental results show that the resulting method is able to outperform popular methods that are based on gazetteer look-up.||0||0|
|Harvesting domain-specific terms using wikipedia||Kim S.N.
|ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium||English||2011||We present a simple but effective method of automatically extracting domain-specific terms using Wikipedia as training data (i.e. self-supervised learning). Our first goal is to show, using human judgments, that Wikipedia categories are domain-specific and thus can replace manually annotated terms. Second, we show that identifying such terms using harvested Wikipedia categories and entities as seeds is reliable when compared to the use of dictionary terms. Our technique facilitates the construction of large semantic resources in multiple domains without requiring manually annotated training data.||0||0|
|Integrating visual classifier ensemble with term extraction for Automatic Image Annotation||Lei Y.
|Proceedings of the 2011 6th IEEE Conference on Industrial Electronics and Applications, ICIEA 2011||English||2011||Existing Automatic Image Annotation (AIA) systems are typically developed, trained and tested using high quality, manually labelled images. The tremendous manual efforts required with an untested ability to scale and tolerate noise all have an impact on existing systems' applicability to real-world data. In this paper, we propose a novel AIA system which harnesses the collective intelligence on the Web to automatically construct training data to work with an ensemble of Support Vector Machine (SVM) classifiers based on Multi-Instance Learning (MIL) and global features. An evaluation of the proposed annotation approach using an automatically constructed training set from Wikipedia demonstrates a slight improvement in annotation accuracy in comparison with two existing systems.||0||0|
|Sequential supervised learning for hypernym discovery from Wikipedia||Litz B.
|Communications in Computer and Information Science||English||2011||Hypernym discovery is an essential task for building and extending ontologies automatically. In comparison to the whole Web as a source for information extraction, online encyclopedias provide far more structuredness and reliability. In this paper we propose a novel approach that combines syntactic and lexical-semantic information to identify hypernymic relationships. We compiled semi-automatically and manually created training data and a gold standard for evaluation with the first sentences from the German version of Wikipedia. We trained a sequential supervised learner with a semantically enhanced tagset. The experiments showed that the cleanliness of the data is far more important than the amount of the same. Furthermore, it was shown that bootstrapping is a viable approach to ameliorate the results. Our approach outperformed the competitive lexico-syntactic patterns by 7% leading to an F1-measure of over .91.||0||0|
|A baseline approach for detecting sentences containing uncertainty||Sang E.T.K.||CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task||English||2010||We apply a baseline approach to the CoNLL-2010 shared task data sets on hedge detection. Weights have been assigned to cue words marked in the training data based on their occurrences in certain and uncertain sentences. New sentences received scores that correspond with those of their best scoring cue word, if present. The best acceptance scores for uncertain sentences were determined using 10-fold cross validation on the training data. This approach performed reasonably on the shared task's biological (F=82.0) and Wikipedia (F=62.8) data sets.||0||0|
|A hedgehop over a max-margin framework using hedge cues||Georgescul M.||CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task||English||2010||In this paper, we describe the experimental settings we adopted in the context of the 2010 CoNLL shared task for detecting sentences containing uncertainty. The classification results reported on are obtained using discriminative learning with features essentially incorporating lexical information. Hyper-parameters are tuned for each domain: using BioScope training data for the biomedical domain and Wikipedia training data for the Wikipedia test set. By allowing an efficient handling of combinations of large-scale input features, the discriminative approach we adopted showed highly competitive empirical results for hedge detection on the Wikipedia dataset: our system is ranked as the first with an F-score of 60.17%.||0||0|
|A lucene and maximum entropy model based hedge detection system||Long Chen
Di Eugenio B.
|CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task||English||2010||This paper describes the approach to hedge detection we developed, in order to participate in the shared task at CoNLL-2010. A supervised learning approach is employed in our implementation. Hedge cue annotations in the training data are used as the seed to build a reliable hedge cue set. Maximum Entropy (MaxEnt) model is used as the learning technique to determine uncertainty. By making use of Apache Lucene, we are able to do fuzzy string match to extract hedge cues, and to incorporate part-of-speech (POS) tags in hedge cues. Not only can our system determine the certainty of the sentence, but is also able to find all the contained hedges. Our system was ranked third on the Wikipedia dataset. In later experiments with different parameters, we further improved our results, with a 0.612 F-score on the Wikipedia dataset, and a 0.802 F-score on the biological dataset.||0||0|
|Chinese characters conversion system based on lookup table and language model||Li M.-H.
|Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing, ROCLING 2010||Chinese||2010||The character sets used in China and Taiwan are both Chinese, but they are divided into simplified and traditional Chinese characters. There is a large amount of information exchange between China and Taiwan through books and the Internet. To provide readers a convenient reading environment, character conversion between simplified and traditional Chinese is necessary. The conversion between simplified and traditional Chinese characters has two problems: one-to-many ambiguity and term usage problems. Since there are many traditional Chinese characters that have only one corresponding simplified character, when converting simplified Chinese into traditional Chinese, the system will face the one-to-many ambiguity. Also, there are many terms that have different usages between the two Chinese societies. This paper focuses on designing an extensible conversion system that can take advantage of community knowledge by accumulating lookup tables through Wikipedia to tackle the term usage problem and can integrate a language model to resolve the one-to-many ambiguity. The system can reduce the cost of proofreading of character conversion for books, e-books, or online publications. The extensible architecture makes it easy to improve the system with new training data.||1||0|
|Co-star: A co-training style algorithm for hyponymy relation acquisition from structured and unstructured text||Oh J.-H.
|Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference||English||2010||This paper proposes a co-training style algorithm called Co-STAR that acquires hyponymy relations simultaneously from structured and unstructured text. In Co-STAR, two independent processes for hyponymy relation acquisition - one handling structured text and the other handling unstructured text - collaborate by repeatedly exchanging the knowledge they acquired about hyponymy relations. Unlike conventional co-training, the two processes in Co-STAR are applied to different source texts and training data. We show the effectiveness of this algorithm through experiments on large scale hyponymy-relation acquisition from Japanese Wikipedia and Web texts. We also show that Co-STAR is robust against noisy training data.||0||0|
|On the sampling of web images for learning visual concept classifiers||Zhu S.
|CIVR 2010 - 2010 ACM International Conference on Image and Video Retrieval||English||2010||Visual concept learning often requires a large set of training images. In practice, nevertheless, acquiring noise-free training labels with sufficient positive examples is always expensive. A plausible solution for training data collection is by sampling the largely available user-tagged images from social media websites. With the general belief that the probability of correct tagging is higher than that of incorrect tagging, such a solution often sounds feasible, though it is not without challenges. First, user-tags can be subjective and, to a certain extent, ambiguous. For instance, an image tagged with "whales" may simply be a picture of an ocean museum. Learning the concept "whales" with such training samples will not be effective. Second, user-tags can be overly abbreviated. For instance, an image about the concept "wedding" may be tagged with "love" or simply the couple's names. As a result, crawling sufficient positive training examples is difficult. This paper empirically studies the impact of exploiting the tagged images towards concept learning, investigating the issue of how the quality of pseudo training images affects concept detection performance. In addition, we propose a simple approach, named semantic field, for predicting the relevance between a target concept and the tag list associated with the images. Specifically, the relevance is determined through concept-tag co-occurrence by exploring external sources such as WordNet and Wikipedia. The proposed approach is shown to be effective in selecting pseudo training examples, exhibiting better performance in concept learning than other approaches such as those based on keyword sampling and tag voting.||0||0|
|Overview of VideoCLEF 2009: New perspectives on speech-based multimedia content enrichment||Larson M.
|Lecture Notes in Computer Science||English||2010||VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the "Beeldenstorm" collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called "Finding Related Resources Across Languages," involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language "Beeldenstorm" collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.||0||0|
|Building knowledge base for Vietnamese information retrieval||Nguyen T.C.
|IiWAS2009 - The 11th International Conference on Information Integration and Web-based Applications and Services||English||2009||At present, the Vietnamese knowledge base (vnKB) is one of the most important focuses of Vietnamese researchers because of its applications in wide areas such as Information Retrieval (IR), Machine Translation (MT) etc. There have been several separate projects developing vnKB in various domains. Training is the most difficult part of building vnKB because of the quantity and quality of training data and the lack of an available Vietnamese corpus of acceptable quality. This paper introduces an approach which first extracts semantic information from Vietnamese Wikipedia (vnWK), then trains the proposed vnKB by applying the support vector machine (SVM) technique. The experimentation of the proposed approach shows that it is a potential solution because of its good results and proves that it can provide more valuable benefits when applied to our Vietnamese Semantic Information Retrieval system.||0||0|
|Finding hedges by chasing weasels: Hedge detection using Wikipedia tags and shallow linguistic features||Viola Ganter
|ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.||English||2009||We investigate the automatic detection of sentences containing linguistic hedges using corpus statistics and syntactic patterns. We take Wikipedia as an already annotated corpus using its tagged weasel words which mark sentences and phrases as non-factual. We evaluate the quality of Wikipedia as training data for hedge detection, as well as shallow linguistic features.||0||0|
|Improving classification accuracy using automatically extracted training data||Fuxman A.
|Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining||English||2009||Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and noncommercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.||0||0|
|Metadata and multilinguality in video classification||He J.
|Lecture Notes in Computer Science||English||2009||The VideoCLEF 2008 Vid2RSS task involves the assignment of thematic category labels to dual language (Dutch/English) television episode videos. The University of Amsterdam chose to focus on exploiting archival metadata and speech transcripts generated by both Dutch and English speech recognizers. A Support Vector Machine (SVM) classifier was trained on training data collected from Wikipedia. The results provide evidence that combining archival metadata with speech transcripts can improve classification performance, but that adding speech transcripts in an additional language does not yield performance gains.||0||0|
|Overview of videoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content||Larson M.
|Lecture Notes in Computer Science||English||2009||The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual language video (Dutch-language television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and ten thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS-feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, with a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks.||0||0|
|VideoCLEF 2008: ASR classification with wikipedia categories||Kürsten J.
|Lecture Notes in Computer Science||English||2009||This article describes our participation at the VideoCLEF track. We designed and implemented a prototype for the classification of the Video ASR data. Our approach was to regard the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification tasks 1 and 2. For the translation of the feeds to English (translation task) Google's AJAX language API was used. Although our experiments achieved only low precision of 10 to 15 percent, we assume those results will be useful in a combined setting with the retrieval approach that was widely used. Interestingly, we could not improve the quality of the classification by using the provided metadata.||0||0|
|WikiSense: Supersense tagging of Wikipedia named entities based WordNet||Jian Chang
|PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation||English||2009||In this paper, we introduce a minimally supervised method for learning to classify named-entity titles in a given encyclopedia into broad semantic categories in an existing ontology. Our main idea involves using overlapping entries in the encyclopedia and ontology and a small set of 30 hand-tagged parenthetic explanations to automatically generate the training data. The proposed method involves automatically recognizing whether a title is a named entity, automatically generating two sets of training data, and automatically building a classification model based on textual and non-textual features. We present WikiSense, an implementation of the proposed method for extending the named entity coverage of WordNet by sense tagging Wikipedia titles. Experimental results show WikiSense achieves accuracy of over 95% and near 80% applicability for all NE titles in Wikipedia. WikiSense cleanly produces over 1.2 million NEs tagged with broad categories, based on the lexicographers' files of WordNet, effectively extending WordNet to form a very large scale semantic category, a potentially useful resource for many natural language related tasks. © 2009 by Joseph Chang, Richard Tzong-Han Tsai, and Jason S. Chang.||0||0|
|Word sense disambiguation based on Wikipedia link structure||Angela Fogarolli||ICSC 2009 - 2009 IEEE International Conference on Semantic Computing||English||2009||In this paper an approach based on Wikipedia link structure for sense disambiguation is presented and evaluated. Wikipedia is used as a reference to obtain lexicographic relationships and in combination with statistical information extraction it is possible to deduce concepts related to the terms extracted from a corpus. In addition, since the corpus covers a representation of a part of the real world the corpus itself is used as training data for choosing the sense which best fit the corpus.||0||0|
|Augmenting wikipedia-extraction with results from the web||Fei Wu
|AAAI Workshop - Technical Report||English||2008||Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper explains and evaluates a method for improving recall by extracting from the broader Web. There are two key advances necessary to make Web supplementation effective: 1) a method to filter promising sentences from Web pages, and 2) a novel retraining technique to broaden extractor recall. Experiments show that, used in concert with shrinkage, our techniques increase recall by a factor of up to 8 while maintaining or increasing precision.||0||0|
|Learning to tag and tagging to learn: A case study on wikipedia||Peter Mika
|IEEE Intelligent Systems||English||2008||Information technology experts suggest that natural language technologies will play an important role in the Web's future. The latest Web developments, such as the huge success of Web 2.0, demonstrate annotated data's significant potential. The problem of semantically annotating Wikipedia inspires a novel method for dealing with domain and task adaptation of semantic taggers in cases where parallel text and metadata are available. One main approach to tagging for acquiring knowledge from Wikipedia involves self-training that adds automatically annotated data from the target domain to the original training data. Another key approach involves structural correspondence learning, which tries to build a shared feature representation of the data.||0||0|
|Named entity disambiguation on an ontology enriched by Wikipedia||Nguyen H.T.
|RIVF 2008 - 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies||English||2008||Currently, for named entity disambiguation, the shortage of training data is a problem. This paper presents a novel method that overcomes this problem by automatically generating an annotated corpus based on a specific ontology. Then the corpus was enriched with new and informative features extracted from Wikipedia data. Moreover, rather than pursuing rule-based methods as in literature, we employ a machine learning model to not only disambiguate but also identify named entities. In addition, our method explores in details the use of a range of features extracted from texts, a given ontology, and Wikipedia data for disambiguation. This paper also systematically analyzes impacts of the features on disambiguation accuracy by varying their combinations for representing named entities. Empirical evaluation shows that, while the ontology provides basic features of named entities, Wikipedia is a fertile source for additional features to construct accurate and robust named entity disambiguation systems.||0||0|
|PORE: Positive-only relation extraction from wikipedia text||Gang Wang
|Lecture Notes in Computer Science||English||2007||Extracting semantic relations is of great importance for the creation of the Semantic Web content. It is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern matching methods that employ information redundancy cannot work well since there is not much redundancy information in Wikipedia, compared to the Web. Multi-class classification methods are not reasonable since no classification of relation types is available in Wikipedia. In this paper, we propose PORE (Positive-Only Relation Extraction), for relation extraction from Wikipedia text. The core algorithm B-POL extends a state-of-the-art positive-only learning algorithm using bootstrapping, strong negative identification, and transductive inference to work with fewer positive training examples. We conducted experiments on several relations with different amounts of training data. The experimental results show that B-POL can work effectively given only a small amount of positive training examples and it significantly outperforms the original positive learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach for Ontology Population and can be adapted to other domains.||0||0|