Machine learning

From WikiPapers

Machine learning is included as a keyword or extra keyword in 0 datasets, 0 tools and 40 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title Author(s) Published in Language Date Abstract R C
Evaluating the helpfulness of linked entities to readers Yamada I.
Ito T.
Usami S.
Takagi S.
Hideaki Takeda
Takefuji Y.
HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media English 2014 When we encounter an interesting entity (e.g., a person's name or a geographic location) while reading text, we typically search and retrieve relevant information about it. Entity linking (EL) is the task of linking entities in a text to the corresponding entries in a knowledge base, such as Wikipedia. Recently, EL has received considerable attention. EL can be used to enhance a user's text reading experience by streamlining the process of retrieving information on entities. Several EL methods have been proposed, though they tend to extract all of the entities in a document including unnecessary ones for users. Excessive linking of entities can be distracting and degrade the user experience. In this paper, we propose a new method for evaluating the helpfulness of linking entities to users. We address this task using supervised machine-learning with a broad set of features. Experimental results show that our method significantly outperforms baseline methods by approximately 5.7%-12% F1. In addition, we propose an application, Linkify, which enables developers to integrate EL easily into their web sites. 0 0
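The paper above frames the helpfulness of a linked entity as a supervised classification problem over a broad feature set. As a rough illustration only, the sketch below uses scikit-learn with made-up placeholder features and a generic classifier; it is not the feature set or model used in the paper.
 # Minimal sketch: "is linking this entity helpful to the reader?" as binary classification.
 # Feature columns are hypothetical placeholders, not the paper's features.
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import cross_val_score
 
 # Each row: [mentions in article, relative position of first mention,
 #            prior link probability, entity name length in words]
 X = np.array([[12, 0.05, 0.80, 2],
               [ 1, 0.90, 0.02, 1],
               [ 5, 0.30, 0.45, 3],
               [ 2, 0.75, 0.10, 1]])
 y = np.array([1, 0, 1, 0])   # 1 = linking this entity helps the reader
 
 clf = RandomForestClassifier(n_estimators=100, random_state=0)
 print("F1 per fold:", cross_val_score(clf, X, y, cv=2, scoring="f1"))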
Inferring attitude in online social networks based on quadratic correlation Chao Wang
Bulatov A.A.
Lecture Notes in Computer Science English 2014 The structure of an online social network in most cases cannot be described just by links between its members. We study online social networks, in which members may have certain attitude, positive or negative, toward each other, and so the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidence provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that, after being trained on a particular network, predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidence for the guess; however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves. 0 0
Tell me more: An actionable quality model for wikipedia Morten Warncke-Wang
Dan Cosley
John Riedl
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 In this paper we address the problem of developing actionable quality models for Wikipedia, models whose features directly suggest strategies for improving the quality of a given article. We first survey the literature in order to understand the notion of article quality in the context of Wikipedia and existing approaches to automatically assess article quality. We then develop classification models with varying combinations of more or less actionable features, and find that a model that only contains clearly actionable features delivers solid performance. Lastly we discuss the implications of these results in terms of how they can help improve the quality of articles across Wikipedia. Categories and Subject Descriptors: H.5 [Information Interfaces and Presentation]: Group and Organization Interfaces - Collaborative computing, Computer-supported cooperative work, Web-based interaction. 0 0
Thai wikipedia link suggestion framework Rungsawang A.
Siangkhio S.
Surarerk A.
Manaskasemsak B.
Lecture Notes in Electrical Engineering English 2013 The paper presents a framework that exploits Thai Wikipedia articles as a knowledge source to train a machine learning classifier for link suggestion. Given an input document, important concepts in the text are automatically extracted, and the corresponding Wikipedia pages are determined and suggested as destination links for additional information. Preliminary experiments with the prototype on a test set of Thai Wikipedia articles show that this automatic link suggestion framework achieves up to 90% link suggestion accuracy. 0 0
An english-translated parallel corpus for the CJK wikipedia collections Tang L.-X.
Shlomo Geva
Andrew Trotman
Proceedings of the 17th Australasian Document Computing Symposium, ADCS 2012 English 2012 In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named the CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and for knowledge sharing in Wikipedia in many ways; for example, it could be used for experimentation in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia. 0 0
Detecting Korean hedge sentences in Wikipedia documents Kang S.-J.
Jeong J.-S.
Kang I.-S.
Lecture Notes in Computer Science English 2012 In this paper we propose automatic hedge detection methods for Korean. We select sentential contextual features adjusted for Korean, and used supervised machine-learning algorithms to train models to detect hedges in Wikipedia documents. Our SVM-based model achieved an F1-score of 90.8% for Korean. 0 0
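The hedge-detection setup described above is a sentence-level supervised classifier. A minimal stand-in sketch with scikit-learn follows, using generic English toy sentences and plain n-gram features instead of the Korean contextual features and toolchain from the paper.
 # Sketch of SVM-based hedge-sentence detection; sentences and features are
 # toy placeholders, not the Korean feature set used in the paper.
 from sklearn.pipeline import make_pipeline
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.svm import LinearSVC
 from sklearn.metrics import f1_score
 
 train_sents = ["It may possibly be related to X.",
                "X was founded in 1990.",
                "Some researchers suggest that X causes Y.",
                "X is the capital of Y."]
 train_labels = [1, 0, 1, 0]          # 1 = hedge, 0 = factual
 
 model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
 model.fit(train_sents, train_labels)
 
 test_sents = ["This might indicate a problem.", "X has 10 provinces."]
 print("F1:", f1_score([1, 0], model.predict(test_sents)))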
Detecting Wikipedia vandalism with a contributing efficiency-based approach Tang X.
Guangyou Zhou
Fu Y.
Gan L.
Yu W.
Li S.
Lecture Notes in Computer Science English 2012 The collaborative nature of the wiki model has distinguished Wikipedia as an online encyclopedia, but it also leaves the open content vulnerable to vandalism. Current vandalism detection methods relying on basic statistical language features work well for explicitly offensive edits that make massive changes. However, these techniques can be evaded by elusive vandal edits that make only a few unproductive or dishonest modifications. In this paper we propose a contributing efficiency-based approach to detect vandalism in Wikipedia and implement it with machine-learning-based classifiers that incorporate the contributing efficiency along with other language features. The results of extensive experiments show that the contributing efficiency can significantly improve the recall of machine learning-based vandalism detection algorithms. 0 0
Learning from history: Predicting reverted work at the word level in wikipedia Jeffrey Rzeszotarski
Aniket Kittur
English 2012 Wikipedia's remarkable success in aggregating millions of contributions can pose a challenge for current editors, whose hard work may be reverted unless they understand and follow established norms, policies, and decisions and avoid contentious or proscribed terms. We present a machine learning model for predicting whether a contribution will be reverted based on word level features. Unlike previous models relying on editor-level characteristics, our model can make accurate predictions based only on the words a contribution changes. A key advantage of the model is that it can provide feedback on not only whether a contribution is likely to be rejected, but also the particular words that are likely to be controversial, enabling new forms of intelligent interfaces and visualizations. We examine the performance of the model across a variety of Wikipedia articles. 0 0
Pattern for python De Smedt T.
Daelemans W.
Journal of Machine Learning Research English 2012 Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/ pattern. 0 0
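Pattern bundles its own vector space model and classifiers and targets Python 2.4+. As a functional analogue only (this is scikit-learn, not Pattern's own API), the k-NN-over-document-vectors classification the abstract describes can be sketched like this:
 # k-NN text classification over a vector space model, analogous in spirit to
 # the pattern.vector functionality described above (scikit-learn stand-in,
 # not Pattern's API).
 from sklearn.pipeline import make_pipeline
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.neighbors import KNeighborsClassifier
 
 docs = ["the cat sat on the mat", "dogs are loyal animals",
         "stock prices fell sharply", "the market rallied today"]
 labels = ["animals", "animals", "finance", "finance"]
 
 knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
 knn.fit(docs, labels)
 print(knn.predict(["my cat chased a dog"]))   # expected: ['animals']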
Towards building a global oracle: A physical mashup using artificial intelligence technology Fortuna C.
Vucnik M.
Blaz Fortuna
Kenda K.
Moraru A.
Mladenic D.
ACM International Conference Proceeding Series English 2012 In this paper, we describe Videk - a physical mashup which uses artificial intelligence technology. We make an analogy between human senses and sensors, and between the human brain and artificial intelligence technology. This analogy leads to the concept of a Global Oracle. We introduce a mashup system which automatically collects data from sensors. The data is processed and stored by SenseStream while the meta-data is fed into ResearchCyc. SenseStream indexes aggregates, performs clustering and learns rules which it then exports as RuleML. ResearchCyc performs logical inference on the meta-data and transliterates logical sentences. The GUI mashes up sensor data with SenseStream output, ResearchCyc output and other external data sources: GoogleMaps, Geonames, Wikipedia and Panoramio. 0 0
Autonomous Link Spam Detection in Purely Collaborative Environments Andrew G. West
Avantika Agrawal
Phillip Baker
Brittney Exline
Insup Lee
WikiSym English October 2011 Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open-access that defines such systems can also be utilized for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia, Wikipedia, is the basis for our analysis.

Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are latent and dwindling in quantity. To this end, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers-to-entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented (regardless of the destination).

In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this, 40+ features are codified and analyzed. These indicators are computed using "wiki" metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at 0.5% false-positives (ROC-AUC=0.97). Such performance could enable egregious link additions to be blocked automatically with low false-positive rates, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
0 0
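The operating point reported above (64% recall at a 0.5% false-positive rate, ROC-AUC = 0.97) corresponds to choosing a score threshold on the ROC curve. A generic sketch of selecting such a threshold follows, on synthetic scores rather than the paper's corpus or classifier.
 # Picking a score threshold that keeps the false-positive rate under a target
 # (e.g. 0.5%). Scores and labels are synthetic placeholders.
 import numpy as np
 from sklearn.metrics import roc_curve, roc_auc_score
 
 rng = np.random.default_rng(0)
 y_true  = rng.integers(0, 2, size=5000)                 # 1 = spam link
 y_score = y_true * 0.6 + rng.normal(0, 0.3, size=5000)  # fake classifier scores
 
 fpr, tpr, thresholds = roc_curve(y_true, y_score)
 print("ROC-AUC:", roc_auc_score(y_true, y_score))
 
 target_fpr = 0.005
 idx = np.searchsorted(fpr, target_fpr, side="right") - 1   # last point under target
 print("threshold:", thresholds[idx], "recall at that threshold:", tpr[idx])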
Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence Andrew G. West
Insup Lee
PAN-CLEF English September 2011 There is much literature on Wikipedia vandalism detection. However, this writing addresses two facets given little treatment to date. First, prior efforts emphasize zero-delay detection, classifying edits the moment they are made. If classification can be delayed (e.g., compiling offline distributions), it is possible to leverage ex post facto evidence. This work describes/evaluates several features of this type, which we find to be overwhelmingly strong vandalism indicators.

Second, English Wikipedia has been the primary test-bed for research. Yet, Wikipedia has 200+ language editions and use of localized features impairs portability. This work implements an extensive set of language-independent indicators and evaluates them using three corpora (German, English, Spanish). The work then extends to include language-specific signals. Quantifying their performance benefit, we find that such features can moderately increase classifier accuracy, but significant effort and language fluency are required to capture this utility.

Aside from these novel aspects, this effort also broadly addresses the task, implementing 65 total features. Evaluation produces 0.840 PR-AUC on the zero-delay task and 0.906 PR-AUC with ex post facto evidence (averaging languages). Performance matches the state-of-the-art (English), sets novel baselines (German, Spanish), and is validated by a first-place finish over the 2011 PAN-CLEF test set.
0 0
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features B. Thomas Adler
Luca de Alfaro
Santiago M. Mola Velasco
Paolo Rosso
Andrew G. West
Lecture Notes in Computer Science English February 2011 Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith; introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves the state-of-the-art from all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions. 0 1
Automatic assessment of document quality in web collaborative digital libraries Dalip D.H.
Goncalves M.A.
Marco Cristo
Pável Calado
Journal of Data and Information Quality English 2011 The old dream of a universal repository containing all of human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and open edition, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its quality. In this work, we explore a significant number of quality indicators and study their capability to assess the quality of articles from three Web collaborative digital libraries. Furthermore, we explore machine learning techniques to combine these quality indicators into one single assessment. Through experiments, we show that the most important quality indicators are those which are also the easiest to extract, namely, the textual features related to the structure of the article. Moreover, to the best of our knowledge, this work is the first that shows an empirical comparison between Web collaborative digital libraries regarding the task of assessing article quality. 0 0
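The core idea above, combining easily extracted textual quality indicators into a single assessment with a learned model, can be sketched as follows. The indicator columns and the learner are illustrative assumptions, not the paper's actual feature set or evaluation.
 # Combining simple textual quality indicators into one quality score.
 # Feature columns (illustrative only): words, section headings, references,
 # average sentence length.
 import numpy as np
 from sklearn.ensemble import GradientBoostingRegressor
 
 X = np.array([[4200, 18, 55, 21.0],
               [ 300,  1,  0, 34.5],
               [1500,  7, 12, 19.2],
               [ 800,  3,  2, 28.0]])
 y = np.array([0.9, 0.1, 0.6, 0.3])   # human quality judgments in [0, 1]
 
 model = GradientBoostingRegressor(random_state=0).fit(X, y)
 print(model.predict([[2000, 10, 20, 22.0]]))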
Clasificación de textos en lenguaje natural usando la wikipedia Quinteiro-Gonzalez J.M.
Martel-Jordan E.
Hernandez-Morera P.
Ligero-Fleitas J.A.
Lopez-Rodriguez A.
RISTI - Revista Iberica de Sistemas e Tecnologias de Informacao Spanish 2011 Automatic Text Classifiers are needed in environments where the amount of data to handle is so high that human classification would be ineffective. In our study, the proposed classifier takes advantage of the Wikipedia to generate the corpus defining each category. The text is then analyzed syntactically using Natural Language Processing software. The proposed classifier is highly accurate and outperforms Machine Learning trained classifiers. 0 0
Distributed tuning of machine learning algorithms using MapReduce Clusters Yasser Ganjisaffar
Debeauvais T.
Sara Javanmardi
Caruana R.
Lopes C.V.
Proceedings of the 3rd Workshop on Large Scale Data Mining: Theory and Applications, LDMTA 2011 - Held in Conjunction with ACM SIGKDD 2011 English 2011 Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce Clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters. 0 0
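The essence of the approach above is that each hyperparameter setting is an independent map task. A local process pool can stand in for a MapReduce cluster to sketch the idea; the grid, dataset and learner here are illustrative, not the boosted-tree and random-forest configurations tuned in the paper.
 # Grid-style hyperparameter tuning where each parameter setting is an
 # independent "map" task; a process pool stands in for a MapReduce cluster.
 from concurrent.futures import ProcessPoolExecutor
 from itertools import product
 from sklearn.datasets import load_digits
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import cross_val_score
 
 X, y = load_digits(return_X_y=True)
 
 def evaluate(params):
     n_estimators, max_depth = params
     clf = RandomForestClassifier(n_estimators=n_estimators,
                                  max_depth=max_depth, random_state=0)
     return params, cross_val_score(clf, X, y, cv=3).mean()
 
 if __name__ == "__main__":
     grid = list(product([50, 100], [5, 10, None]))
     with ProcessPoolExecutor() as pool:
         results = list(pool.map(evaluate, grid))   # the "map" phase
     print(max(results, key=lambda r: r[1]))        # the "reduce" phase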
Wikipedia vandalism detection Santiago M. Mola Velasco World Wide Web English 2011 0 0
A machine learning approach to link prediction for interlinked documents Kc M.
Chau R.
Hagenbuchner M.
Tsoi A.C.
Lee V.
Lecture Notes in Computer Science English 2010 This paper provides an explanation of how a recently developed machine learning approach, namely the Probability Measure Graph Self-Organizing Map (PM-GraphSOM), can be used for the generation of links between referenced or otherwise interlinked documents. This new generation of SOM models is capable of projecting generic graph structured data onto a fixed sized display space. Such a mechanism is normally used for dimension reduction, visualization, or clustering purposes. This paper shows that the PM-GraphSOM training algorithm "inadvertently" encodes relations that exist between the atomic elements in a graph. If the nodes in the graph represent documents, and the links in the graph represent the reference (or hyperlink) structure of the documents, then it is possible to obtain a set of links for a test document whose link structure is unknown. A significant finding of this paper is that the described approach is scalable in that links can be extracted in linear time. It will also be shown that the proposed approach is capable of predicting the pages which would be linked to a new document, and is capable of predicting the links to other documents from a given test document. The approach is applied to web pages from Wikipedia, a relatively large XML text database consisting of many referenced documents. 0 0
An exploration of learning to link with wikipedia: Features, methods and training collection He J.
Maarten de Rijke
Lecture Notes in Computer Science English 2010 We describe our participation in the Link-the-Wiki track at INEX 2009. We apply machine learning methods to the anchor-to-best-entry-point task and explore the impact of the following aspects of our approaches: features, learning methods as well as the collection used for training the models. We find that a learning to rank-based approach and a binary classification approach do not differ a lot. The new Wikipedia collection which is of larger size and which has more links than the collection previously used, provides better training material for learning our models. In addition, a heuristic run which combines the two intuitively most useful features outperforms machine learning based runs, which suggests that a further analysis and selection of features is necessary. 0 0
Applying wikipedia-based explicit semantic analysis for query-biased document summarization Yunqing Zhou
Zhongqi Guo
Peng Ren
Yong Yu
ICIC English 2010 0 0
Elusive vandalism detection in Wikipedia: A text stability-based approach Wu Q.
Danesh Irani
Calton Pu
Lakshmish Ramaswamy
International Conference on Information and Knowledge Management, Proceedings English 2010 The open collaborative nature of wikis encourages participation of all users, but at the same time exposes their content to vandalism. Current vandalism-detection techniques, while effective against relatively obvious vandalism edits, prove to be inadequate in detecting increasingly prevalent sophisticated (or elusive) vandal edits. We identify a number of vandal edits that can take hours, even days, to correct and propose a text stability-based approach for detecting them. Our approach is focused on the likelihood of a certain part of an article being modified by a regular edit. In addition to text stability, our machine learning-based technique also takes into account edit patterns. We evaluate the performance of our approach on a corpus comprising 15,000 manually labeled edits from the Wikipedia Vandalism PAN corpus. The experimental results show that text stability is able to improve the performance of the selected machine-learning algorithms significantly. 0 0
Identifying featured articles in Wikipedia: Writing style matters Nedim Lipka
Benno Stein
Proceedings of the 19th International Conference on World Wide Web, WWW '10 English 2010 Wikipedia provides an information quality assessment model with criteria for human peer reviewers to identify featured articles. For this classification task "Is an article featured or not?" we present a machine learning approach that exploits an article's character trigram distribution. Our approach differs from existing research in that it aims to writing style rather than evaluating meta features like the edit history. The approach is robust, straightforward to implement, and outperforms existing solutions. We underpin these claims by an experiment design where, among others, the domain transferability is analyzed. The achieved performances in terms of the F-measure for featured articles are 0.964 within a single Wikipedia domain and 0.880 in a domain transfer situation. 0 1
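The writing-style approach above reduces to classifying articles by their character trigram distribution. A toy sketch with scikit-learn follows; the texts are invented, and the paper's own classifier and corpus are not reproduced.
 # Character-trigram writing-style classifier for "featured vs. non-featured"
 # (toy texts, not actual Wikipedia articles).
 from sklearn.pipeline import make_pipeline
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.svm import LinearSVC
 
 texts = ["The river, which rises in the eastern highlands, flows north.",
          "its a cool place lol you should totally go there",
          "Construction of the bridge began in 1887 and was completed in 1890.",
          "best band EVER!!! everyone knows that"]
 labels = [1, 0, 1, 0]   # 1 = featured-like writing style
 
 clf = make_pipeline(
     TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),  # character trigrams
     LinearSVC())
 clf.fit(texts, labels)
 print(clf.predict(["The castle was restored during the 19th century."]))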
Information extraction from Wikipedia using pattern learning Mihaltz M. Acta Cybernetica English 2010 In this paper we present solutions for the crucial task of extracting structured information from massive free-text resources, such as Wikipedia, for the sake of semantic databases serving upcoming Semantic Web technologies. We demonstrate both a verb frame-based approach using deep natural language processing techniques with extraction patterns developed by human knowledge experts and machine learning methods using shallow linguistic processing. We also propose a method for learning verb frame-based extraction patterns automatically from labeled data. We show that labeled training data can be produced with only minimal human effort by utilizing existing semantic resources and the special characteristics of Wikipedia. Custom solutions for named entity recognition are also possible in this scenario. We present evaluation and comparison of the different approaches for several different relations. 0 0
Semantics-based representation model for multi-layer text classification Jiali Yun
Liping Jing
Jian Yu
Houkuan Huang
Lecture Notes in Computer Science English 2010 Text categorization is one of the most common themes in data mining and machine learning fields. Unlike structured data, unstructured text data is more complicated to be analyzed because it contains too much information, e.g., syntactic and semantic. In this paper, we propose a semantics-based model to represent text data in two levels. One level is for syntactic information and the other is for semantic information. Syntactic level represents each document as a term vector, and the component records tf-idf value of each term. The semantic level represents document with Wikipedia concepts related to terms in syntactic level. The syntactic and semantic information are efficiently combined by our proposed multi-layer classification framework. Experimental results on benchmark dataset (Reuters-21578) have shown that the proposed representation model plus proposed classification framework improves the performance of text classification by comparing with the flat text representation models (term VSM, concept VSM, term+concept VSM) plus existing classification methods. 0 0
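A minimal sketch of the two-level representation described above: a term-level tf-idf vector concatenated with a concept-level vector. The mapping from terms to Wikipedia concepts is assumed as given here, and the concept names are illustrative only.
 # Two-level document representation: term tf-idf features concatenated with a
 # toy bag of "Wikipedia concepts" per document (the concept mapping itself is
 # assumed, not implemented).
 from scipy.sparse import hstack
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.feature_extraction import DictVectorizer
 from sklearn.naive_bayes import MultinomialNB
 
 docs = ["central bank raises interest rates",
         "the striker scored twice in the final"]
 concepts = [{"Monetary_policy": 1, "Central_bank": 1},
             {"Association_football": 1, "FIFA_World_Cup": 1}]
 labels = ["economy", "sport"]
 
 term_vec, concept_vec = TfidfVectorizer(), DictVectorizer()
 X = hstack([term_vec.fit_transform(docs), concept_vec.fit_transform(concepts)])
 
 clf = MultinomialNB().fit(X, labels)
 X_new = hstack([term_vec.transform(["rates were cut by the bank"]),
                 concept_vec.transform([{"Central_bank": 1}])])
 print(clf.predict(X_new))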
Using machine learning to support continuous ontology development Ramezani M.
Witschel H.F.
Braun S.
Zacharias V.
Lecture Notes in Computer Science English 2010 This paper presents novel algorithms to support the continuous development of ontologies; i.e. the development of ontologies during their use in social semantic bookmarking, semantic wiki or other social semantic applications. Our goal is to assist users in placing a newly added concept in a concept hierarchy. The proposed algorithm is evaluated using a data set from Wikipedia and provides good quality recommendation. These results point to novel possibilities to apply machine learning technologies to support social semantic applications. 0 0
Vision of a visipedia Perona P. Proceedings of the IEEE English 2010 The web is not perfect: while text is easily searched and organized, pictures (the vast majority of the bits that one can find online) are not. In order to see how one could improve the web and make pictures first-class citizens of the web, I explore the idea of Visipedia, a visual interface for Wikipedia that is able to answer visual queries and enables experts to contribute and organize visual knowledge. Five distinct groups of humans would interact through Visipedia: users, experts, editors, visual workers, and machine vision scientists. The latter would gradually build automata able to interpret images. I explore some of the technical challenges involved in making Visipedia happen. I argue that Visipedia will likely grow organically, combining state-of-the-art machine vision with human labor. 0 0
Wiki Vandalysis - Wikipedia Vandalism Analysis Manoj Harpalani
Thanadit Phumprao
Megha Bassi
Michael Hart
Rob Johnson
CLEF English 2010 Wikipedia describes itself as the "free encyclopedia that anyone can edit". Along with the helpful volunteers who contribute by improving the articles, a great number of malicious users abuse the open nature of Wikipedia by vandalizing articles. Deterring and reverting vandalism has become one of the major challenges of Wikipedia as its size grows. Wikipedia editors fight vandalism both manually and with automated bots that use regular expressions and other simple rules to recognize malicious edits. Researchers have also proposed machine learning algorithms for vandalism detection, but these algorithms are still in their infancy and have much room for improvement. This paper presents an approach to fighting vandalism by extracting various features from the edits for machine learning classification. Our classifier uses information about the editor, the sentiment of the edit, the "quality" of the edit (i.e. spelling errors), and targeted regular expressions to capture patterns common in blatant vandalism, such as insertion of obscene words or multiple exclamations. We have successfully been able to achieve an area under the ROC curve (AUC) of 0.91 on a training set of 15000 human annotated edits and 0.887 on a random sample of 17472 edits from 317443.
0 0
Um método automático para estimativa da qualidade de enciclopédias colaborativas on-line: um estudo de caso sobre a Wikipédia Daniel Hasan Dalip Portuguese April 2009 The old dream of a universal repository containing all the human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and edition, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into one single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract in an open digital library, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with state-of-the-art solutions and show significant improvements in terms of effective quality prediction. 11 0
Argument based machine learning from examples and text Mozina M.
Claudio Giuliano
Bratko I.
Proceedings - 2009 1st Asian Conference on Intelligent Information and Database Systems, ACIIDS 2009 English 2009 We introduce a novel approach to cross-media learning based on argument based machine learning (ABML). ABML is a recent method that combines argumentation and machine learning from examples, and its main idea is to use arguments for some of the learning examples. Arguments are usually provided by a domain expert. In this paper, we present an alternative approach, where arguments used in ABML are automatically extracted from text with a technique for relation extraction. We demonstrate and evaluate the approach through a case study of learning to classify animals by using arguments automatically extracted from Wikipedia. 0 0
Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia Daniel H. Dalip
Marcos A. Gonçalves
Marco Cristo
Pável Calado
English 2009 The old dream of a universal repository containing all the human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and edition, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into one single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with state-of-the-art solutions and show significant improvements in terms of effective quality prediction. 0 3
Linking Wikipedia entries to blog feeds by machine learning Mariko Kawaba
Hiroyuki Nakasaki
Daisuke Yokomoto
Takehito Utsuro
Tomohiro Fukuhara
ACM International Conference Proceeding Series English 2009 This paper studies the issue of conceptually indexing the blogosphere through the whole hierarchy of Wikipedia entries. This paper proposes how to link Wikipedia entries to blog feeds in the Japanese blogosphere by machine learning, where about 300,000 Wikipedia entries are used for representing a hierarchy of topics. In our experimental evaluation, we achieved over 80% precision in the task. 0 0
Overview of the INEX 2008 XML mining track categorization and clustering of XML documents in a graph of documents Ludovic Denoyer
Patrick Gallinari
Lecture Notes in Computer Science English 2009 We describe here the XML Mining Track at INEX 2008. This track was launched for exploring two main ideas: first identifying key problems for mining semi-structured documents and new challenges of this emerging field and second studying and assessing the potential of machine learning techniques for dealing with generic Machine Learning (ML) tasks in the structured domain i.e. classification and clustering of semi structured documents. This year, the track focuses on the supervised classification and the unsupervised clustering of XML documents using link information. We consider a corpus of about 100,000 Wikipedia pages with the associated hyperlinks. The participants have developed models using the content information, the internal structure information of the XML documents and also the link information between documents. 0 0
Semi-automatic extraction and modeling of ontologies using wikipedia XML corpus De Silva L.
Lakshman Jayaratne
2nd International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2009 English 2009 This paper introduces WikiOnto: a system that assists in the extraction and modeling of topic ontologies in a semi-automatic manner using a preprocessed document corpus derived from Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for extracting topic ontologies in quick time and a modeling environment to refine these ontologies. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system proposes a solution to a task that is generally considered extremely cumbersome. The initial results of the prototype suggest strong potential of the system to become highly successful in ontology extraction and modeling and also inspire further research on extracting ontologies from other semi-structured document corpora as well. 0 0
The importance of manual assessment in link discovery Huang W.C.
Andrew Trotman
Shlomo Geva
Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009 English 2009 Using a ground truth extracted from the Wikipedia, and a ground truth created through manual assessment, we show that the apparent performance advantage seen in machine learning approaches to link discovery are an artifact of trivial links that are actively rejected by manual assessors. 0 0
WikiOnto: A system for semi-automatic extraction and modeling of ontologies using Wikipedia XML corpus De Silva L.
Lakshman Jayaratne
ICSC 2009 - 2009 IEEE International Conference on Semantic Computing English 2009 This paper introduces WikiOnto: a system that assists in the extraction and modeling of topic ontologies in a semi-automatic manner using a preprocessed document corpus of one of the largest knowledge bases in the world - the Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for extracting topic ontologies in quick time and a modeling environment to refine these ontologies. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system proposes a solution to a task that is generally considered extremely cumbersome. The initial results of the prototype suggest strong potential of the system to become highly successful in ontology extraction and modeling and also inspire further research on extracting ontologies from other semi-structured document corpora as well. 0 0
Automatic vandalism detection in wikipedia: Towards a machine learning approach Smets K.
Goethals B.
Verdonk B.
AAAI Workshop - Technical Report English 2008 Since the end of 2006 several autonomous bots are, or have been, running on Wikipedia to keep the encyclopedia free from vandalism and other damaging edits. These expert systems, however, are far from optimal and should be improved to relieve the human editors from the burden of manually reverting such edits. We investigate the possibility of using machine learning techniques to build an autonomous system capable of distinguishing vandalism from legitimate edits. We highlight the results of a small but important step in this direction by applying commonly known machine learning algorithms using a straightforward feature representation. Despite the promising results, this study reveals that elementary features, which are also used by the current approaches to fight vandalism, are not sufficient to build such a system. They will need to be accompanied by additional information which, among other things, incorporates the semantics of a revision. 0 3
Machine learning for semi-structured multimedia documents: Application to pornographic filtering and thematic categorization Ludovic Denoyer
Patrick Gallinari
Cognitive Technologies English 2008 We propose a generative statistical model for the classification of semi-structured multimedia documents. Its main originality is its ability to simultaneously take into account the structural and the content information present in a semi-structured document and also to cope with different types of content (text, image, etc.). We then present the results obtained on two sets of experiments: one set concerns the filtering of pornographic Web pages; the second one concerns the thematic classification of Wikipedia documents. 0 0
Named entity disambiguation on an ontology enriched by Wikipedia Nguyen H.T.
Cao T.H.
RIVF 2008 - 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies English 2008 Currently, for named entity disambiguation, the shortage of training data is a problem. This paper presents a novel method that overcomes this problem by automatically generating an annotated corpus based on a specific ontology. Then the corpus was enriched with new and informative features extracted from Wikipedia data. Moreover, rather than pursuing rule-based methods as in literature, we employ a machine learning model to not only disambiguate but also identify named entities. In addition, our method explores in details the use of a range of features extracted from texts, a given ontology, and Wikipedia data for disambiguation. This paper also systematically analyzes impacts of the features on disambiguation accuracy by varying their combinations for representing named entities. Empirical evaluation shows that, while the ontology provides basic features of named entities, Wikipedia is a fertile source for additional features to construct accurate and robust named entity disambiguation systems. 0 0
Aisles through the category forest; Utilising the Wikipedia Category System for Corpus Building in Machine Learning Rudiger Gleim
Alexander Mehler
Matthias Dehmer
Olga Pustylnikov
Webist 2007 - 3rd International Conference on Web Information Systems and Technologies, Proceedings English 2007 The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help to solve this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time-consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning. 0 0
Boosting inductive transfer for text classification using Wikipedia Somnath Banerjee Proceedings - 6th International Conference on Machine Learning and Applications, ICMLA 2007 English 2007 Inductive transfer is applying knowledge learned on one set of tasks to improve the performance of learning a new task. Inductive transfer is being applied in improving the generalization performance on a classification task using the models learned on some related tasks. In this paper, we show a method of making inductive transfer for text classification more effective using Wikipedia. We map the text documents of the different tasks to a feature space created using Wikipedia, thereby providing some background knowledge of the contents of the documents. It has been observed here that when the classifiers are built using the features generated from Wikipedia they become more effective in transferring knowledge. An evaluation on the daily classification task on the Reuters RCV1 corpus shows that our method can significantly improve the performance of inductive transfer. Our method was also able to successfully overcome a major obstacle observed in a recent work on a similar setting. 0 0