Artificial intelligence

From WikiPapers

Artificial intelligence is included as a keyword or extra keyword in 0 datasets, 0 tools and 157 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title Author(s) Published in Language Date Abstract R C
Situated Interaction in a Multilingual Spoken Information Access Framework Niklas Laxström
Kristiina Jokinen
Graham Wilcock
IWSDS 2014 English 18 January 2014 0 0
A methodology based on commonsense knowledge and ontologies for the automatic classification of legal cases Capuano N.
De Maio C.
Salerno S.
Toti D.
ACM International Conference Proceeding Series English 2014 We describe a methodology for the automatic classification of legal cases expressed in natural language, which relies on existing legal ontologies and a commonsense knowledge base. This methodology is founded on a process consisting of three phases: an enrichment of a given legal ontology by associating its terms with topics retrieved from the Wikipedia knowledge base; an extraction of relevant concepts from a given textual legal case; and a matching between the enriched ontological terms and the extracted concepts. Such a process has been successfully implemented in a corresponding tool that is part of a larger framework for self-litigation and legal support for the Italian law. 0 0
A scalable gibbs sampler for probabilistic entity linking Houlsby N.
Massimiliano Ciaramita
Lecture Notes in Computer Science English 2014 Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset. 0 0
Building distant supervised relation extractors Nunes T.
Schwabe D.
Proceedings - 2014 IEEE International Conference on Semantic Computing, ICSC 2014 English 2014 A well-known drawback in building machine learning semantic relation detectors for natural language is the lack of a large number of qualified training instances for the target relations in multiple languages. Even when good results are achieved, the datasets used by the state-of-the-art approaches are rarely published. In order to address these problems, this work presents an automatic approach to build multilingual semantic relation detectors through distant supervision, combining two of the largest resources of structured and unstructured content available on the Web, DBpedia and Wikipedia. We map the DBpedia ontology back to the Wikipedia text to extract more than 100,000 training instances for more than 90 DBpedia relations for the English and Portuguese languages without human intervention. First, we mine the Wikipedia articles to find candidate instances for relations described in the DBpedia ontology. Second, we preprocess and normalize the data, filtering out irrelevant instances. Finally, we use the normalized data to construct regularized logistic regression detectors that achieve more than 80% F-measure for both the English and Portuguese languages. In this paper, we also compare the impact of different types of features on the accuracy of the trained detector, demonstrating significant performance improvements when combining lexical, syntactic and semantic features. Both the datasets and the code used in this research are available online. 0 0
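The distant-supervision recipe outlined in this abstract (align KB triples with Wikipedia sentences, then train regularized logistic regression detectors) can be illustrated with a minimal Python sketch. This is not the authors' code: the `kb_triples` and `sentences` toy inputs, the surface-string matching, and the single multi-class classifier are simplifying assumptions made for illustration.

```python
# Hypothetical, simplified distant-supervision pipeline: sentences that
# mention both entities of a known KB triple become positive examples
# for that triple's relation; a bag-of-words logistic regression is then
# trained, following the general recipe sketched in the abstract.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the DBpedia triples and Wikipedia sentences.
kb_triples = [
    ("Lisbon", "capitalOf", "Portugal"),
    ("Brasilia", "capitalOf", "Brazil"),
    ("Douro", "flowsThrough", "Portugal"),
]
sentences = [
    "Lisbon is the capital and largest city of Portugal.",
    "Brasilia was built to serve as the capital of Brazil.",
    "The Douro flows through northern Portugal before reaching Porto.",
]

def label_sentences(sentences, kb_triples):
    """Match (subject, relation, object) triples against raw sentences."""
    examples = []
    for sent in sentences:
        for subj, rel, obj in kb_triples:
            if subj in sent and obj in sent:
                examples.append((sent, rel))
    return examples

examples = label_sentences(sentences, kb_triples)
texts, relations = zip(*examples)

# One multi-class detector here for brevity; the paper trains per-relation detectors.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(texts, relations)
print(model.predict(["Porto lies on the Douro in Portugal."]))
```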
Developing creativity competency of engineers Waychal P.K. ASEE Annual Conference and Exposition, Conference Proceedings English 2014 The complete agreement of all stakeholders on the importance of developing the creativity competency of engineering graduates motivated us to undertake this study. We chose a senior-level course in Software Testing and Quality Assurance, which offered an excellent platform for the experiment, as both testing and quality assurance activities can be executed using either routine, mechanical methods or highly creative ones. The earlier attempts reported in the literature to develop the creativity competency do not appear to be systematic, i.e. they do not follow the measurement -> action plan -> measurement cycle. The measurements, wherever done, are based on the Torrance Tests of Creative Thinking (TTCT) and the Myers-Briggs Type Indicator (MBTI). We found these tests costly and decided to search for an appropriate alternative, which led us to the Felder-Solomon Index of Learning Styles (ILS). The Sensing/Intuition dimension of the ILS, like the MBTI, originates in Carl Jung's Theory of Psychological Types. Since a number of MBTI studies have used the dimension for assessing creativity, we posited that the same ILS dimension could be used to measure the competency. We carried out a pre-ILS assessment, designed and delivered the course with a variety of activities that could potentially enhance creativity, and carried out a course-end post-ILS assessment. Although major changes would not normally be expected after a one-semester course, a hypothesis in the study was that a shift from sensing toward intuition on learning style profiles would be observed, and indeed it was. A paired t-test indicated that the pre-post change in the average sensing/intuition preference score was statistically significant (p = 0.004). While more research and direct assessment of competency is needed to be able to draw definitive conclusions about both the use of the instrument for measuring creativity and the efficacy of the course structure and contents in developing the competency, the results suggest that the approach is worth exploring. 0 0
Graph-based domain-specific semantic relatedness from Wikipedia Sajadi A. Lecture Notes in Computer Science English 2014 Human-made ontologies and lexicons are promising resources for many text mining tasks in domain-specific applications, but they do not exist for most domains. We study the suitability of Wikipedia as an alternative resource for ontologies with regard to the semantic relatedness problem. We focus on the biomedical domain because (1) high-quality manually curated ontologies are available and (2) successful graph-based methods have been proposed for semantic relatedness in this domain. Because Wikipedia is not hierarchical and links do not convey defined semantic relationships, the same methods used on lexical resources (such as WordNet) cannot be applied here straightforwardly. Our contributions are (1) demonstrating that Wikipedia-based methods outperform state-of-the-art ontology-based methods on most of the existing ontologies in the biomedical domain; (2) adapting and evaluating the effectiveness of a group of bibliometric methods of various degrees of sophistication on Wikipedia for the first time; and (3) proposing a new graph-based method that outperforms existing methods by considering some specific features of Wikipedia's structure. 0 0
Identifying the topic of queries based on domain specify ontology ChienTa D.C.
Thi T.P.
WIT Transactions on Information and Communication Technologies English 2014 In order to identify the topic of queries, a large number of past studies have relied on lexicon-syntactic and handcrafted knowledge sources in Machine Learning and Natural Language Processing (NLP). Conversely, in this paper, we introduce an application system that detects the topic of queries based on a domain-specific ontology. For this system, we focus on building this domain-specific ontology, which is composed of instances automatically extracted from available resources such as Wikipedia, WordNet, and the ACM Digital Library. The experimental evaluation with many cases of queries related to the information technology area shows that this system considerably outperforms a matching and identifying approach. 0 0
Inferring attitude in online social networks based on quadratic correlation Chao Wang
Bulatov A.A.
Lecture Notes in Computer Science English 2014 The structure of an online social network in most cases cannot be described just by links between its members. We study online social networks, in which members may have certain attitude, positive or negative, toward each other, and so the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidence provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that, after being trained on a particular network, predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidence for the guess; however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves. 0 0
Intelligent searching using delay semantic network Dvorscak S.
Machova K.
SAMI 2014 - IEEE 12th International Symposium on Applied Machine Intelligence and Informatics, Proceedings English 2014 This article introduces a different way to implement semantic search, using a semantic search agent over information obtained directly from the web. The paper describes a time-delay form of semantic network, which we have used to provide semantic search. Using the time-delay aspect inside the semantic network has a positive impact in several ways: it provides a way to represent time-dependent knowledge via the semantic network, but also a way to optimize the process of inference. All of this is realized for Wikipedia articles in the form of a search engine. The core is implemented as a massively multithreaded inference mechanism for a massive semantic network. 0 0
Learning to compute semantic relatedness using knowledge from wikipedia Zheng C.
Zhe Wang
Bie R.
Zhou M.
Lecture Notes in Computer Science English 2014 Recently, Wikipedia has become a very important resource for computing semantic relatedness (SR) between entities. Several approaches have already been proposed to compute SR based on Wikipedia. Most of the existing approaches use certain kinds of information in Wikipedia (e.g. links, categories, and texts) and compute the SR by empirically designed measures. We have observed that these approaches produce very different results for the same entity pair in some cases. Therefore, how to select appropriate features and measures to best approximate the human judgment on SR becomes a challenging problem. In this paper, we propose a supervised learning approach for computing SR between entities based on Wikipedia. Given two entities, our approach first maps entities to articles in Wikipedia; then different kinds of features of the mapped articles are extracted from Wikipedia, which are then combined with different relatedness measures to produce nine raw SR values of the entity pair. A supervised learning algorithm is proposed to learn the optimal weights of different raw SR values. The final SR is computed as the weighted average of raw SRs. Experiments on benchmark datasets show that our approach outperforms baseline methods. 0 0
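The final combination step described in this abstract (a weighted average of raw SR values with supervised-learned weights) can be sketched as below, assuming the nine raw relatedness scores and the human gold scores are already computed. Non-negative least squares is a stand-in for whichever learner the authors actually use, and the random data is purely illustrative.

```python
# Hypothetical combination step: learn weights for the raw SR values so
# that their weighted average best matches human judgments, then score
# new pairs with the learned weights.
import numpy as np
from scipy.optimize import nnls

# Toy data: 40 entity pairs, 9 raw SR values each, plus gold human scores.
rng = np.random.default_rng(0)
raw_sr = rng.random((40, 9))
gold = rng.random(40)

# Learn non-negative weights whose weighted combination approximates the
# gold scores; normalising makes the final score a true weighted mean.
weights, _ = nnls(raw_sr, gold)
weights = weights / weights.sum() if weights.sum() > 0 else weights

def combined_sr(raw_scores):
    """Weighted average of the nine raw SR values for one entity pair."""
    return float(np.dot(raw_scores, weights))

print(combined_sr(rng.random(9)))
```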
Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools Lopuszynski M.
Bolikowski L.
Communications in Computer and Information Science English 2014 In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As the first source of labels, Wikipedia is employed; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.). 0 0
Tracking topics on revision graphs of wikipedia edit history Li B.
Wu J.
Mizuho Iwaihara
Lecture Notes in Computer Science English 2014 Wikipedia is known as the largest online encyclopedia, in which articles are constantly contributed and edited by users. Past revisions of articles after edits are also accessible to the public for confirming the edit process. However, the degree of similarity between revisions is very high, making it difficult to generate summaries for these small changes from revision graphs of Wikipedia edit history. In this paper, we propose an approach to give a concise summary to a given scope of revisions, by utilizing supergrams, which are consecutive unchanged term sequences. 0 0
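The notion of a supergram (a maximal run of consecutive terms left unchanged between two revisions) can be illustrated with Python's standard difflib. This is an illustrative sketch only, not the authors' implementation; the whitespace tokenization and the minimum length are assumptions.

```python
# Illustrative extraction of "supergrams": maximal runs of consecutive
# terms that are left unchanged between two revisions of an article.
from difflib import SequenceMatcher

def supergrams(old_revision, new_revision, min_len=3):
    old_terms = old_revision.split()
    new_terms = new_revision.split()
    matcher = SequenceMatcher(a=old_terms, b=new_terms, autojunk=False)
    blocks = matcher.get_matching_blocks()          # (i, j, size) triples
    return [" ".join(old_terms[b.a:b.a + b.size])
            for b in blocks if b.size >= min_len]

old = "The quick brown fox jumps over the lazy dog near the river bank"
new = "The quick brown fox leaps over the lazy dog near the old river bank"
print(supergrams(old, new))
```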
Using linked data to mine RDF from Wikipedia's tables Munoz E.
Hogan A.
Mileo A.
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content. However, the cumulative content of these tables cannot be queried against. We thus propose methods to recover the semantics of Wikipedia tables and, in particular, to extract facts from them in the form of RDF triples. Our core method uses an existing Linked Data knowledge-base to find pre-existing relations between entities in Wikipedia tables, suggesting the same relations as holding for other entities in analogous columns on different rows. We find that such an approach extracts RDF triples from Wikipedia's tables at a raw precision of 40%. To improve the raw precision, we define a set of features for extracted triples that are tracked during the extraction phase. Using a manually labelled gold standard, we then test a variety of machine learning methods for classifying correct/incorrect triples. One such method extracts 7.9 million unique and novel RDF triples from over one million Wikipedia tables at an estimated precision of 81.5%. 0 0
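The core idea described above (a relation that already holds in a Linked Data knowledge base between entities of two columns in some rows is proposed as a candidate triple for the remaining rows) can be sketched as follows. The in-memory `known_triples` set and the entity-labelled `table` are toy stand-ins for the Linked Data knowledge base and the parsed Wikipedia tables; the feature tracking and classification steps of the paper are omitted.

```python
# Hypothetical core step: if a KB relation already links the entities of
# columns (c1, c2) in some rows of a table, suggest the same relation as
# candidate RDF triples for the other rows (with its support count).
from collections import Counter
from itertools import combinations

known_triples = {
    ("Berlin", "capitalOf", "Germany"),
    ("Paris", "capitalOf", "France"),
}

table = [  # one Wikipedia table, already reduced to entity labels
    ["Berlin", "Germany", "3,600,000"],
    ["Paris", "France", "2,100,000"],
    ["Madrid", "Spain", "3,200,000"],
]

def candidate_triples(table, known_triples):
    suggestions = []
    n_cols = len(table[0])
    for c1, c2 in combinations(range(n_cols), 2):   # left-to-right pairs only
        # Count which relations already hold between these two columns.
        relations = Counter(rel for row in table
                            for (s, rel, o) in known_triples
                            if s == row[c1] and o == row[c2])
        for rel, support in relations.items():
            for row in table:
                if (row[c1], rel, row[c2]) not in known_triples:
                    suggestions.append(((row[c1], rel, row[c2]), support))
    return suggestions

print(candidate_triples(table, known_triples))
# -> [(('Madrid', 'capitalOf', 'Spain'), 2)]
```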
What makes a good team of Wikipedia editors? A preliminary statistical analysis Bukowski L.
Jankowski-Lorek M.
Jaroszewicz S.
Sydow M.
Lecture Notes in Computer Science English 2014 The paper concerns studying the quality of teams of Wikipedia authors with a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes and its subsequent analysis and use to predict team quality. We have performed exploratory analysis using partial regression to remove the influence of attributes not related to the team itself. The analysis confirmed that the key factor significantly influencing an article's quality is the discussion between team members. The second part of the paper successfully uses machine learning models to predict good articles based on features of the teams that created them. 0 0
WikiReviz: An edit history visualization for wiki systems Wu J.
Mizuho Iwaihara
Lecture Notes in Computer Science English 2014 Wikipedia maintains a linear record of edit history with article content and meta-information for each article, which conceals precious information on how each article has evolved. This demo describes the motivation and features of WikiReviz, a visualization system for analyzing edit history in Wikipedia and other Wiki systems. From the official exported edit history of a single Wikipedia article, WikiReviz reconstructs the derivation relationships among revisions precisely and efficiently by revision graph extraction and indicates meaningful article evolution progress through edit summarization. 0 0
A bookmark recommender system based on social bookmarking services and wikipedia categories Yoshida T.
Inoue U.
SNPD 2013 - 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing English 2013 Social bookmarking services allow users to add bookmarks of web pages with freely chosen keywords as tags. Personalized recommender systems recommend new and useful bookmarks added by other users. We propose a new method to find similar users and to select relevant bookmarks in a social bookmarking service. Our method is lightweight, because it uses a small set of important tags for each user to find useful bookmarks to recommend. Our method is also powerful, because it employs the Wikipedia category database to deal with the diversity of tags among users. The evaluation using the Hatena bookmark service in Japan shows that our method significantly increases the number of relevant bookmarks recommended without a notable increase in irrelevant bookmarks. 0 0
A collaborative multi-source intelligence working environment: A systems approach Eachus P.
Short B.
Stedmon A.W.
Brown J.
Wilson M.
Lemanski L.
Lecture Notes in Computer Science English 2013 This research applies a systems approach to aid the understanding of collaborative working during intelligence analysis using a dedicated (Wiki) environment. The extent to which social interaction, and problem solving was facilitated by the use of the wiki, was investigated using an intelligence problem derived from the Vast 2010 challenge. This challenge requires "intelligence analysts" to work with a number of different intelligence sources in order to predict a possible terrorist attack. The study compared three types of collaborative working, face-to-face without a wiki, face-to-face with a wiki, and use of a wiki without face-to-face contact. The findings revealed that in terms of task performance the use of the wiki without face-to-face contact performed best and the wiki group with face-to-face contact performed worst. Measures of interpersonal and psychological satisfaction were highest in the face-to-face group not using a wiki and least in the face-to-face group using a wiki. Overall it was concluded that the use of wikis in collaborative working is best for task completion whereas face-to-face collaborative working without a wiki is best for interpersonal and psychological satisfaction. 0 0
A support framework for argumentative discussions management in the web Cabrio E.
Villata S.
Fabien Gandon
Lecture Notes in Computer Science English 2013 On the Web, wiki-like platforms allow users to provide arguments in favor of or against issues proposed by other users. The increasing content of these platforms, as well as the high number of revisions of the content through pro and con arguments, makes it difficult for community managers to understand and manage these discussions. In this paper, we propose an automatic framework to support the management of argumentative discussions in wiki-like platforms. Our framework is composed of (i) a natural language module, which automatically detects the arguments in natural language and returns the relations among them, and (ii) an argumentation module, which provides the overall view of the argumentative discussion in the form of a directed graph highlighting the accepted arguments. Experiments on the history of Wikipedia show the feasibility of our approach. 0 0
A virtual player for "Who Wants to Be a Millionaire?" based on Question Answering Molino P.
Pierpaolo Basile
Santoro C.
Pasquale Lops
De Gemmis M.
Giovanni Semeraro
Lecture Notes in Computer Science English 2013 This work presents a virtual player for the quiz game "Who Wants to Be a Millionaire?". The virtual player requires linguistic and common-sense knowledge and adopts state-of-the-art Natural Language Processing and Question Answering technologies to answer the questions. Wikipedia articles and DBpedia triples are used as knowledge sources and the answers are ranked according to several lexical, syntactic and semantic criteria. Preliminary experiments carried out on the Italian version of the board game prove that the virtual player is able to challenge human players. 0 0
Automated non-content word list generation using hLDA Krug W.
Tomlinson M.T.
FLAIRS 2013 - Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference English 2013 In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative. Copyright © 2013, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Boosting cross-lingual knowledge linking via concept annotation Zhe Wang
Jing-Woei Li
Tang J.
IJCAI International Joint Conference on Artificial Intelligence English 2013 Automatically discovering cross-lingual links (CLs) between wikis can largely enrich the cross-lingual knowledge and facilitate knowledge sharing across different languages. In most existing approaches for cross-lingual knowledge linking, the seed CLs and the inner link structures are two important factors for finding new CLs. When there are insufficient seed CLs and inner links, discovering new CLs becomes a challenging problem. In this paper, we propose an approach that boosts cross-lingual knowledge linking by concept annotation. Given a small number of seed CLs and inner links, our approach first enriches the inner links in wikis by using concept annotation method, and then predicts new CLs with a regression-based learning model. These two steps mutually reinforce each other, and are executed iteratively to find as many CLs as possible. Experimental results on the English and Chinese Wikipedia data show that the concept annotation can effectively improve the quantity and quality of predicted CLs. With 50,000 seed CLs and 30% of the original inner links in Wikipedia, our approach discovered 171,393 more CLs in four runs when using concept annotation. 0 0
Complementary information for Wikipedia by comparing multilingual articles Fujiwara Y.
Yu Suzuki
Konishi Y.
Akiyo Nadamoto
Lecture Notes in Computer Science English 2013 Many articles in Wikipedia lack information because users can create and edit the content freely. We specifically examined the multilinguality of Wikipedia and proposed a method to complement articles that lack information by comparing articles in different languages that have similar contents. However, the results contain much non-complementary information that is unrelated to the article the user is browsing. Herein, we propose an improvement of the comparison area based on the classified complementary target. 0 0
Computing semantic relatedness using Wikipedia features Hadj Taieb M.A.
Ben Aouicha M.
Ben Hamadou A.
Knowledge-Based Systems English 2013 Measuring semantic relatedness is a critical task in many domains such as psychology, biology, linguistics, cognitive science and artificial intelligence. In this paper, we propose a novel system for computing semantic relatedness between words. Recent approaches have exploited Wikipedia as a huge semantic resource that showed good performances. Therefore, we utilized the Wikipedia features (articles, categories, Wikipedia category graph and redirection) in a system combining this Wikipedia semantic information in its different components. The approach is preceded by a pre-processing step to provide, for each category pertaining to the Wikipedia category graph, a semantic description vector including the weights of stems extracted from articles assigned to the target category. Next, for each candidate word, we collect its categories set using an algorithm for categories extraction from the Wikipedia category graph. Then, we compute the semantic relatedness degree using existing vector similarity metrics (Dice, Overlap and Cosine) and a newly proposed metric that performs as well as the cosine formula. The basic system is followed by a set of modules that exploit Wikipedia features to quantify as well as possible the semantic relatedness between words. We evaluate our measure based on two tasks: comparison with human judgments using five datasets, and a specific application, "solving choice problems". The resulting system shows good performance and sometimes outperforms the ESA (Explicit Semantic Analysis) and TSA (Temporal Semantic Analysis) approaches. © 2013 Elsevier B.V. All rights reserved. 0 0
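The three standard vector similarity metrics named in the abstract (Dice, Overlap and Cosine) over weighted stem vectors can be written compactly as below. The `{stem: weight}` dictionaries are assumed stand-ins for the semantic description vectors built from Wikipedia categories, and the weighted generalizations of Dice and Overlap shown here are generic formulations rather than the paper's exact definitions.

```python
# Standard similarity measures over weighted stem vectors,
# here represented as {stem: weight} dictionaries.
import math

def cosine(u, v):
    common = set(u) & set(v)
    num = sum(u[t] * v[t] for t in common)
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def dice(u, v):
    common = set(u) & set(v)
    num = 2 * sum(min(u[t], v[t]) for t in common)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0

def overlap(u, v):
    common = set(u) & set(v)
    num = sum(min(u[t], v[t]) for t in common)
    den = min(sum(u.values()), sum(v.values()))
    return num / den if den else 0.0

a = {"music": 0.8, "instrument": 0.5, "string": 0.3}
b = {"music": 0.6, "orchestra": 0.7, "string": 0.4}
print(cosine(a, b), dice(a, b), overlap(a, b))
```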
Cross language prediction of vandalism on wikipedia using article views and revisions Tran K.-N.
Christen P.
Lecture Notes in Computer Science English 2013 Vandalism is a major issue on Wikipedia, accounting for about 2% (350,000+) of edits in the first 5 months of 2012. The majority of vandalism is caused by humans, who can leave traces of their malicious behaviour through access and edit logs. We propose detecting vandalism using a range of classifiers in a monolingual setting, and evaluate their performance when using them across languages on two data sets: the relatively unexplored hourly count of views of each Wikipedia article, and the commonly used edit history of articles. Within the same language (English and German), these classifiers achieve up to 87% precision, 87% recall, and an F1-score of 87%. Applying these classifiers across languages achieves similarly high results of up to 83% precision, recall, and F1-score. These results show that characteristic vandal traits can be learned from view and edit patterns, and that models built in one language can be applied to other languages. 0 0
Cross lingual entity linking with bilingual topic model Zhang T.
Kang Liu
Jun Zhao
IJCAI International Joint Conference on Artificial Intelligence English 2013 Cross lingual entity linking means linking an entity mention in a background source document in one language with the corresponding real world entity in a knowledge base written in the other language. The key problem is to measure the similarity score between the context of the entity mention and the document of the candidate entity. This paper presents a general framework for doing cross lingual entity linking by leveraging a large-scale bilingual knowledge base, Wikipedia. We introduce a bilingual topic model that mines bilingual topics from this knowledge base under the assumption that the same Wikipedia concept documents of two different languages share the same semantic topic distribution. The extracted topics have two types of representation, with each type corresponding to one language. Thus both the context of the entity mention and the document of the candidate entity can be represented in a space using the same semantic topics. We use these topics to do cross lingual entity linking. Experimental results show that the proposed approach obtains competitive results compared with the state-of-the-art approach. 0 0
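The final linking step described above can be sketched as a cosine ranking in the shared topic space, assuming the bilingual topic model has already produced topic distributions for the mention context and for each candidate entity document. The entity ids and distributions below are illustrative assumptions only.

```python
# Hypothetical ranking step: both the mention context (language A) and the
# candidate KB documents (language B) live in the same topic space, so a
# simple cosine over topic distributions ranks the candidates.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def link(mention_topics, candidates):
    """candidates: {entity_id: topic distribution (numpy array)}."""
    return max(candidates, key=lambda e: cosine(mention_topics, candidates[e]))

mention_topics = np.array([0.7, 0.1, 0.2])       # inferred for the mention
candidates = {
    "Q90": np.array([0.65, 0.15, 0.20]),         # e.g. Paris (city)
    "Q167646": np.array([0.05, 0.80, 0.15]),     # e.g. Paris (mythology)
}
print(link(mention_topics, candidates))
```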
Distant supervision learning of DBPedia relations Zajac M.
Przepiorkowski A.
Lecture Notes in Computer Science English 2013 This paper presents DBPediaExtender, an information extraction system that aims at extending an existing ontology of geographical entities by extracting information from text. The system uses distant supervision learning - the training data is constructed on the basis of matches between values from infoboxes (taken from the Polish DBPedia) and Wikipedia articles. For every relevant relation, a sentence classifier and a value extractor are trained; the sentence classifier selects sentences expressing a given relation and the value extractor extracts values from selected sentences. The results of manual evaluation for several selected relations are reported. 0 0
Encoding local correspondence in topic models Mehdi R.E.
Mohamed Q.
Mustapha A.
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2013 Exploiting label correlations is a challenging and crucial problem, especially in the multi-label learning context. Label correlations are not necessarily shared by all instances and generally have a local definition. This paper introduces LOC-LDA, a latent variable model that addresses the problem of modeling annotated data by locally exploiting correlations between annotations. In particular, we explicitly represent local dependencies to define the correspondence between specific objects, i.e. regions of images and their annotations. We conducted experiments on a collection of pictures provided by the Wikipedia 'Picture of the day' website, and evaluated our model on the task of 'automatic image annotation'. The results validate the effectiveness of our approach. 0 0
English nominal compound detection with Wikipedia-based methods Nagy T. I.
Veronika Vincze
Lecture Notes in Computer Science English 2013 Nominal compounds (NCs) are lexical units that consist of two or more elements that exist on their own, function as a noun and have a special added meaning. Here, we present the results of our experiments on how the growth of Wikipedia improved the performance of our dictionary labeling methods for detecting NCs. We also investigated how the size of an automatically generated silver standard corpus can affect the performance of our machine learning-based method. The results we obtained demonstrate that the bigger the dataset, the better the performance will be. 0 0
MDL-based models for transliteration generation Nouri J.
Pivovarova L.
Yangarber R.
Lecture Notes in Computer Science English 2013 This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models are an extension of our work on automatic discovery of patterns of etymological sound change, based on the Minimum Description Length Principle. The models for pairwise alignment are extended with algorithms for prediction that produce transliterated names. We present results on 13 parallel corpora for 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% on word-level accuracy and up to 99% on symbol-level F-score. We discuss the results from several perspectives, and analyze how corpus size, the language pair, the type of names (persons, locations), and noise in the data affect the performance. 0 0
Querying multilingual DBpedia with QAKiS Cabrio E.
Cojan J.
Fabien Gandon
Hallili A.
Lecture Notes in Computer Science English 2013 We present an extension of QAKiS, a system for open domain Question Answering over linked data, that allows querying of DBpedia multilingual chapters. Such chapters can contain different information with respect to the English version, e.g. they provide more specificity on certain topics, or fill information gaps. QAKiS exploits the alignment between properties carried out by DBpedia contributors as a mapping from Wikipedia terms to a common ontology, in order to use information coming from DBpedia multilingual chapters, thereby broadening its coverage. For the demo, the English, French and German DBpedia chapters are the RDF data sets to be queried using a natural language interface. 0 0
Recompilation of broadcast videos based on real-world scenarios Ichiro Ide Lecture Notes in Computer Science English 2013 In order to effectively make use of videos stored in a broadcast video archive, we have been working on their recompilation. To realize this, we take an approach that considers the videos in the archive as video materials and recompiles them by treating various kinds of social media information as "scenarios". In this paper, we introduce our work in the news, sports, and cooking domains, which makes use of Wikipedia articles, demoscopic polls, Twitter tweets, and cooking recipes to recompile video clips from the corresponding TV shows. 0 0
Talking topically to artificial dialog partners: Emulating humanlike topic awareness in a virtual agent Alexa Breuing
Ipke Wachsmuth
Communications in Computer and Information Science English 2013 During dialog, humans are able to track ongoing topics, to detect topical shifts, to refer to topics via labels, and to decide on the appropriateness of potential dialog topics. As a result, they interactionally produce coherent sequences of spoken utterances assigning a thematic structure to the whole conversation. Accordingly, an artificial agent that is intended to engage in natural and sophisticated human-agent dialogs should be endowed with similar conversational abilities. This paper presents how to enable topically coherent conversations between humans and interactive systems by emulating humanlike topic awareness in the virtual agent Max. Therefore, we firstly realized automatic topic detection and tracking on the basis of contextual knowledge provided by Wikipedia and secondly adapted the agent's conversational behavior by means of the gained topic information. As a result, we contribute to improve human-agent dialogs by enabling topical talk between human and artificial interlocutors. This paper is a revised and extended version of [1]. 0 0
The category structure in wikipedia: To analyze and know its quality using k-core decomposition Wang Q.
Xiaolong Wang
Zheng Chen
Lecture Notes in Computer Science English 2013 Wikipedia is a famous and free encyclopedia. A network based on its category structure is built and then analyzed from various aspects, such as the connectivity distribution and the evolution of the overall topology. As an innovative point of our paper, a model based on k-core decomposition is used to analyze the evolution of the overall topology and to test the quality (that is, the error and attack tolerance) of the structure when nodes are removed. A model based on the removal of edges is used for comparison. Our results offer useful insights into the growth and quality of the category structure, and into how to better organize it. 0 0
A graph-based summarization system at QA@INEX track 2011 Laureano-Cruces A.L.
Ramirez-Rodriguez J.
Lecture Notes in Computer Science English 2012 In this paper we use REG, a graph-based system, to study a fundamental problem of Natural Language Processing: the automatic summarization of documents. The algorithm models a document as a graph, to obtain weighted sentences. We applied this approach to the INEX@QA 2011 task (question-answering). From the queries, we extracted the title and some key or related words according to two people, in order to recover 50 documents from the English Wikipedia. Using this strategy, REG obtained good results with the automatic evaluation system FRESA. 0 0
A hybrid method based on WordNet and Wikipedia for computing semantic relatedness between texts Malekzadeh R.
Bagherzadeh J.
Noroozi A.
AISP 2012 - 16th CSI International Symposium on Artificial Intelligence and Signal Processing English 2012 In this article we present a new method for computing semantic relatedness between texts. For this purpose we use a two-phase approach. The first phase involves modeling document sentences as a matrix to compute semantic relatedness between sentences. In the second phase, we compare text relatedness by using the relations of their sentences. Since semantic relations between words must be searched for in a lexical semantic knowledge source, selecting a suitable source is very important, so that accurate results are produced with a correct selection. In this work, we attempt to capture the semantic relatedness between texts with greater accuracy. For this purpose, we use a collection of two well-known knowledge bases, namely WordNet and Wikipedia, which provide a more complete data source for calculating semantic relatedness with higher accuracy. We evaluate our approach by comparison with other existing techniques (on Lee datasets). 0 0
A learning-based framework to utilize E-HowNet ontology and Wikipedia sources to generate multiple-choice factual questions Chu M.-H.
Chen W.-Y.
Lin S.-D.
Proceedings - 2012 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2012 English 2012 This paper proposes a framework that automatically generates multiple-choice questions. Unlike most other similar works that focus on generating questions for English proficiency tests, this paper provides a framework to generate factual questions in Chinese. We have decomposed this problem into several sub-tasks: a) the identification of sentences that contain factual knowledge, b) the identification of the query term from each factual sentence, and c) the generation of distractors. Learning-based approaches are applied to address the first two problems. We then propose a way to generate distractors by using E-HowNet ontology database and Wikipedia sources. The system was evaluated through user study and test theory, and achieved a satisfaction rate of up to 70.6%. 0 0
An efficient voice enabled web content retrieval system for limited vocabulary Bharath Ram G.R.
Jayakumaur R.
Narayan R.
Shahina A.
Khan A.N.
Communications in Computer and Information Science English 2012 Retrieval of relevant information is becoming increasingly difficult owing to the presence of an ocean of information on the World Wide Web. Users in need of quick access to specific information are subjected to a series of web redirections before finally arriving at the page that contains the required information. In this paper, an optimal voice-based web content retrieval system is proposed that makes use of an open source speech recognition engine to deal with voice inputs. The proposed system performs a quicker retrieval of relevant content from Wikipedia and instantly presents the textual information along with the related image to the user. This search is faster than the conventional web content retrieval technique. The current system is built with a limited vocabulary but can be extended to support a larger vocabulary. Additionally, the system is also scalable to retrieve content from a few other sources of information apart from Wikipedia. 0 0
Combining AceWiki with a CAPTCHA system for collaborative knowledge acquisition Nalepa G.J.
Adrian W.T.
Szymon Bobek
Maslanka P.
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2012 Formalized knowledge representation methods allow to build useful and semantically enriched knowledge bases which can be shared and reasoned upon. Unfortunately, knowledge acquisition for such formalized systems is often a time-consuming and tedious task. The process requires a domain expert to provide terminological knowledge, a knowledge engineer capable of modeling knowledge in a given formalism, and also a great amount of instance data to populate the knowledge base. We propose a CAPTCHA-like system called AceCAPTCHA in which users are asked questions in a controlled natural language. The questions are generated automatically based on a terminology stored in a knowledge base of the system, and the answers provided by users serve as instance data to populate it. The implementation uses AceWiki semantic wiki and a reasoning engine written in Prolog. 0 0
DAnIEL: Language independent character-based news surveillance Lejeune G.
Brixtel R.
Antoine Doucet
Lucas N.
Lecture Notes in Computer Science English 2012 This study aims at developing a news surveillance system able to address multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on Epidemic Surveillance. This task necessitates worldwide coverage of news in order to detect new events as quickly as possible, anywhere, whatever the language it is first reported in. In this study, text-genre is used rather than sentence analysis. The news-genre properties allow us to assess the thematic relevance of news, filtered with the help of a specialised lexicon that is automatically collected on Wikipedia. Afterwards, a more detailed analysis of text specific properties is applied to relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the interest of this multilingual approach with light resources. DAnIEL achieves an F1-measure score around 85%. Two issues are addressed: the first is morphology rich languages, e.g. Greek, Polish and Russian as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture that is constrained by the lack of numerous components in many languages. 0 0
Evaluating reranking methods based on link co-occurrence and category in Wikipedia Takiguchi Y.
Kurakado K.
Oishi T.
Koshimura M.
Fujita H.
Hasegawa R.
ICAART 2012 - Proceedings of the 4th International Conference on Agents and Artificial Intelligence English 2012 We often use search engines in order to find appropriate documents on the Web. However, it is often the case that we cannot find desired information easily by giving a single query. In this paper, we present a method to extract related words for the query by using the various features of Wikipedia and rank learning. We aim at developing a system to assist the user in retrieving Web pages by reranking search results. 0 0
Exploration and visualization of administrator network in wikipedia Yousaf J.
Jing-Woei Li
Haisu Zhang
Hou L.
Lecture Notes in Computer Science English 2012 Wikipedia has become one of the most widely used knowledge systems on the Web. It contains resources and information of different qualities contributed by different sets of authors. A special group of authors named administrators plays an important role for content quality in Wikipedia. Understanding the behaviors of administrators in Wikipedia can facilitate the management of the Wikipedia system, and empower some applications such as article recommendation and expert administrator finding for given articles. This paper addresses the work of the exploration and visualization of the administrator network in Wikipedia. The administrator network is first constructed by using the co-editing relationship, and six characteristics are proposed to describe the behaviors of administrators in Wikipedia from different perspectives. Quantified calculation of these characteristics is then put forward by using social network analysis techniques. A topic model is used to relate the content of Wikipedia to the interest diversity of administrators. Based on the MediaWiki history records from January 2010 to January 2011, we develop an administrator exploration prototype system which can rank the selected characteristics for administrators and can be used as a decision support system. Furthermore, some meaningful observations show that the administrator network is a healthy small-world community, and a strong centralization of the network around some hubs/stars indicates a considerable nucleus of very active administrators that seem to be omnipresent. The ranking of these top administrators is found to be consistent with the number of barnstars awarded to them. 0 0
Extracting difference information from multilingual wikipedia Fujiwara Y.
Yu Suzuki
Konishi Y.
Akiyo Nadamoto
Lecture Notes in Computer Science English 2012 Wikipedia articles for a particular topic are written in many languages. When we select two articles which are about a single topic but written in different languages, the contents of these two articles are expected to be identical because of Wikipedia policy. However, these contents actually differ, especially for topics related to culture. In this paper, we propose a system to extract the differences between the Wikipedia information shown for Japan and that of other countries. An important technical problem is how to extract comparison target articles from Wikipedia. Wikipedia articles are written in different languages, each with its own linguistic structure. For example, "Cricket" is an important part of English culture, but the Japanese Wikipedia article related to cricket is too simple. Actually, it is only a single page. In contrast, the English version is substantial. It includes multiple pages. For that reason, we must consider which articles can be reasonably compared. Subsequently, we extract comparison target articles of Wikipedia based on a link graph and article structure. We implement our proposed method and confirm the accuracy of the difference extraction methods. 0 0
Extracting knowledge from web search engine results Kanavos A.
Theodoridis E.
Tsakalidis A.
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2012 Nowadays, people frequently use search engines in order to find the information they need on the web. However, web search engines usually return web page references in a global ranking, making it difficult for users to browse the different topics captured in the result set and thus to quickly find the desired web pages. There is a need for special computational systems that discover knowledge in these web search results, providing the user with the possibility to browse the different topics contained in a given result set. In this paper, we focus on the problem of determining different thematic groups in the web search engine results that existing web search engines provide. We propose a novel system that exploits a set of reformulation strategies so as to help users obtain results more relevant to their desired query. It additionally tries to discover different topic groups among the result set, according to the various meanings of the provided query. The proposed method utilizes a number of semantic annotation techniques using Knowledge Bases, like WordNet and Wikipedia, in order to perceive the different senses of each query term. Finally, the method annotates the extracted topics using information derived from the clusters and presents them to the end user. 0 0
Extraction of bilingual cognates from Wikipedia Gamallo P.
Garcia M.
Lecture Notes in Computer Science English 2012 In this article, we propose a method to extract translation equivalents with similar spelling from comparable corpora. The method was applied to Wikipedia to extract a large number of Portuguese-Spanish bilingual terminological pairs that were not found in existing dictionaries. The resulting bilingual lexicon consists of more than 27,000 new pairs of lemmas and multiwords, with about 92% accuracy. 0 0
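A hedged sketch of the spelling-similarity filter implied by this abstract: candidate Portuguese-Spanish pairs from comparable contexts are kept when their normalised string similarity exceeds a threshold. The candidate list, the similarity measure (difflib's ratio) and the 0.7 threshold are illustrative assumptions, not the authors' choices.

```python
# Illustrative cognate filter: keep translation candidates whose spelling
# is sufficiently similar under a normalised edit-distance-like measure.
from difflib import SequenceMatcher

def spelling_similarity(w1, w2):
    return SequenceMatcher(a=w1.lower(), b=w2.lower()).ratio()

candidate_pairs = [          # toy Portuguese / Spanish candidates
    ("investigação", "investigación"),
    ("universidade", "universidad"),
    ("cão", "perro"),        # a translation, but not a cognate by spelling
]

cognates = [(pt, es) for pt, es in candidate_pairs
            if spelling_similarity(pt, es) >= 0.7]
print(cognates)
```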
Governance of open content creation: A conceptualization and analysis of control and guiding mechanisms in the open content domain Schroeder A.
Christian Wagner
Journal of the American Society for Information Science and Technology English 2012 The open content creation process has proven itself to be a powerful and influential way of developing text-based content, as demonstrated by the success of Wikipedia and related sites. Distributed individuals independently edit, revise, or refine content, thereby creating knowledge artifacts of considerable breadth and quality. Our study explores the mechanisms that control and guide the content creation process and develops an understanding of open content governance. The repertory grid method is employed to systematically capture the experiences of individuals involved in the open content creation process and to determine the relative importance of the diverse control and guiding mechanisms. Our findings illustrate the important control and guiding mechanisms and highlight the multifaceted nature of open content governance. A range of governance mechanisms is discussed with regard to the varied levels of formality, the different loci of authority, and the diverse interaction environments involved. Limitations and opportunities for future research are provided. 0 0
Harnessing Wikipedia semantics for computing contextual relatedness Jabeen S.
Gao X.
Andreae P.
Lecture Notes in Computer Science English 2012 This paper proposes a new method of automatically measuring semantic relatedness by exploiting Wikipedia as an external knowledge source. The main contribution of our research is to propose a relatedness measure based on Wikipedia senses and hyperlink structure for computing contextual relatedness of any two terms. We have evaluated the effectiveness of our approach using three datasets and have shown that our approach competes well with other well known existing methods. 0 0
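The abstract does not give the exact measure, so the sketch below shows one common way to use Wikipedia's hyperlink structure for relatedness, a Milne-Witten-style overlap of inlink sets; it illustrates the general idea rather than the measure proposed in the paper, and the inlink sets are toy data.

```python
# Milne-Witten-style relatedness from shared inlinks (one common way to use
# Wikipedia's hyperlink structure; shown for illustration only).
import math

def link_relatedness(inlinks_a, inlinks_b, n_articles):
    """inlinks_*: sets of article ids linking to senses A and B."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    # Normalised link distance, converted to a relatedness score in [0, 1].
    dist = (math.log(big) - math.log(len(common))) / \
           (math.log(n_articles) - math.log(small))
    return max(0.0, 1.0 - dist)

a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7, 8}
print(link_relatedness(a, b, n_articles=6_000_000))
```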
Heuristics- and statistics-based wikification Nguyen H.T.
Cao T.H.
Nguyen T.T.
Vo-Thi T.-L.
Lecture Notes in Computer Science English 2012 With the wide usage of Wikipedia in research and applications, disambiguation of concepts and entities to Wikipedia is an essential component in natural language processing. This paper addresses the task of identifying and linking specific words or phrases in a text to their referents described by Wikipedia articles. In this work, we propose a method that combines some heuristics with a statistical model for disambiguation. The method exploits disambiguated entities to disambiguate the others in an incremental process. Experiments are conducted to evaluate and show the advantages of the proposed method. 0 0
Incident and problem management using a Semantic Wiki-enabled ITSM platform Frank Kleiner
Andreas Abecker
Mauritczat M.
ICAART 2012 - Proceedings of the 4th International Conference on Agents and Artificial Intelligence English 2012 IT Service Management (ITSM) is concerned with providing IT services to customers. In order to improve the provision of services, ITSM frameworks (e.g., ITIL) mandate the storage of all IT-relevant information in a central Configuration Management System (CMS). This paper describes our Semantic Incident and Problem Analyzer, which builds on a Semantic Wiki-based Configuration Management System. The Semantic Incident and Problem Analyzer assists IT-support personnel in tracking down the causes of incidents and problems in complex IT landscapes. It covers two use cases: (1) by analyzing the similarities between two or more system configurations with problems, it suggests possible locations of the problem; (2) by analyzing changes over time of a component with a problem, possible configuration changes are reported which might have led to the problem. 0 0
Knowledge pattern extraction and their usage in exploratory search Nuzzolese A.G. Lecture Notes in Computer Science English 2012 Knowledge interaction in Web context is a challenging problem. For instance, it requires to deal with complex structures able to filter knowledge by drawing a meaningful context boundary around data. We assume that these complex structures can be formalized as Knowledge Patterns (KPs), aka frames. This Ph.D. work is aimed at developing methods for extracting KPs from the Web and at applying KPs to exploratory search tasks. We want to extract KPs by analyzing the structure of Web links from rich resources, such as Wikipedia. 0 0
Leave or stay: The departure dynamics of wikipedia editors Dell Zhang
Karl Prior
Mark Levene
Mao R.
Van Liere D.
Lecture Notes in Computer Science English 2012 In this paper, we investigate how Wikipedia editors leave the community, i.e., become inactive, from the following three aspects: (1) how long Wikipedia editors will stay active in editing; (2) which Wikipedia editors are likely to leave; and (3) what reasons would make Wikipedia editors leave. The statistical models built on Wikipedia edit log datasets provide insights about the sustainable growth of Wikipedia. 0 0
Let's talk topically with artificial agents!: Providing agents with humanlike topic awareness in everyday dialog situations Alexa Breuing
Ipke Wachsmuth
ICAART 2012 - Proceedings of the 4th International Conference on Agents and Artificial Intelligence English 2012 Spoken interactions between humans are characterized by coherent sequences of utterances assigning a thematic structure to the whole conversation. Such coherence and the success of a meaningful and flexible dialog are based on the cognitive ability to be aware of the ongoing conversational topic. This paper presents how to enable such topically coherent conversations between humans and interactive systems by emulating humanlike topic awareness in artificial agents. Therefore, we firstly automated human topic awareness on the basis of preprocessed Wikipedia knowledge and secondly transferred such computer-based awareness to a virtual agent. As a result, we contribute to improve human-agent dialogs by enabling topical talk between human and artificial conversation partners. 0 0
Link prediction in a bipartite network using Wikipedia revision information Chang Y.-J.
Kao H.-Y.
Proceedings - 2012 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2012 English 2012 We consider the problem of link prediction in the bipartite network of Wikipedia. Bipartite networks form an important class of social networks, and many unipartite networks, such as co-authorship networks, can be reinterpreted as bipartite networks when edges are modeled as vertices. While bipartite graphs are a special case of general graphs, common link prediction functions cannot predict edge occurrence in a bipartite graph without specialization. In this paper, we formulate an undirected bipartite graph using the revision history information in Wikipedia. We adapt the topological features to the bipartite graph of Wikipedia, and apply a supervised learning approach to our link prediction formulation of the problem. We also compare the performance of the link prediction model with different features. 0 0
Link prediction on evolving data using tensor-based common neighbor Cui H. Proceedings - 2012 5th International Symposium on Computational Intelligence and Design, ISCID 2012 English 2012 Recently there has been increasing interest in researching links between objects in complex networks, which can be helpful in many data mining tasks. One of the fundamental research problems about links between objects is link prediction. Many link prediction algorithms have been proposed and perform quite well; however, most of those algorithms only concern network structure in terms of traditional graph theory, which lacks information about the evolving network. In this paper we propose a novel tensor-based prediction method, which is designed in two steps: first, track time-dependent network snapshots in adjacency matrices, which form a multi-way tensor, using an exponential smoothing method; second, apply the Common Neighbor algorithm to compute the degree of similarity for each pair of nodes. This algorithm is quite different from the other tensor-based algorithms also mentioned in this paper. In order to estimate the accuracy of our link prediction algorithm, we employ various popular datasets of social networks and information platforms, such as the Facebook and Wikipedia networks. The results show that our link prediction algorithm performs better than the other tensor-based algorithms mentioned in this paper. 0 0
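The two-step method summarized above can be sketched under simplifying assumptions: small dense adjacency matrices stand in for the snapshot tensor, the smoothing constant is arbitrary, and the Common Neighbor score is taken on the smoothed weighted graph. This is an illustration of the general technique, not the paper's implementation.

```python
# Hypothetical two-step predictor: (1) collapse the snapshot "tensor" into
# one weighted adjacency matrix by exponential smoothing, (2) score node
# pairs with a weighted common-neighbour count.
import numpy as np

def smooth_snapshots(snapshots, alpha=0.5):
    """snapshots: list of (n x n) adjacency matrices, oldest first."""
    smoothed = snapshots[0].astype(float)
    for snap in snapshots[1:]:
        smoothed = alpha * snap + (1.0 - alpha) * smoothed
    return smoothed

def common_neighbor_scores(adj):
    """Weighted common-neighbour score for every node pair (A @ A)."""
    scores = adj @ adj
    np.fill_diagonal(scores, 0.0)   # ignore self-links
    return scores

snapshots = [
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
    np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]),
]
adj = smooth_snapshots(snapshots, alpha=0.6)
print(common_neighbor_scores(adj))
```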
Malleable finding aids Anderson S.R.
Allen R.B.
Lecture Notes in Computer Science English 2012 We show a prototype implementation of a Wiki-based Malleable Finding Aid that provides features to support user engagement and we discuss the contribution of individual features such as graphical representations, a table of contents, interactive sorting of entries, and the possibility for user tagging. Finally, we explore the implications of Malleable Finding Aids for collections which are richly inter-linked and which support a fully social Archival Commons. 0 0
Name-ethnicity classification and ethnicity-sensitive name matching Treeratpituk P.
Giles C.L.
Proceedings of the National Conference on Artificial Intelligence English 2012 Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and also as key information for linking documents from multiple sources. Matching personal names can be challenging due to variations in spelling and various formatting of names. While many approximated name matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural. Many ethnicities have their own unique naming systems and identifiable characteristics. In this paper we explore such relationships between ethnicities and personal names to improve the name matching performance. First, we propose a name-ethnicity classifier based on the multinomial logistic regression. Our model can effectively identify name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, to within 85% accuracy. Next, we propose a novel alignment-based name matching algorithm, based on Smith-Waterman algorithm and logistic regression. Different name matching models are then trained for different name-ethnicity groups. Our preliminary experimental result on DBLP's disambiguated author dataset yields a performance of 99% precision and 89% recall. Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
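The first component, the name-ethnicity classifier, can be sketched as a multinomial logistic regression as stated in the abstract; the character n-gram features and the tiny toy training set are assumptions standing in for the Wikipedia-derived name data used in the paper.

```python
# Hypothetical name-ethnicity classifier: character n-grams of the full
# name feed a multinomial logistic regression, following the general setup
# described in the abstract (the feature choice here is an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

names = ["Giovanni Rossi", "Luca Bianchi", "Akira Tanaka", "Yuki Sato",
         "Sean O'Brien", "Patrick Murphy"]
labels = ["ITA", "ITA", "JPN", "JPN", "IRL", "IRL"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000),   # default lbfgs solver handles multi-class
)
clf.fit(names, labels)
print(clf.predict(["Marco Ferrari", "Hiroshi Yamamoto"]))
```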
Query expansion powered by wikipedia hyperlinks Bruce C.
Gao X.
Andreae P.
Jabeen S.
Lecture Notes in Computer Science English 2012 This research introduces a new query expansion method that uses Wikipedia and its hyperlink structure to find related terms for reformulating a query. Queries are first understood better by splitting into query aspects. Further understanding is gained through measuring how well each aspect is represented in the original search results. Poorly represented aspects are found to be an excellent source of query improvement. Our main contribution is the way of using Wikipedia to identify aspects and underrepresented aspects, and to weight the expansion terms. Results have shown that our approach improves the original query and search results, and outperforms two existing query expansion methods. 0 0
REWOrD: Semantic relatedness in the web of data Pirro G. Proceedings of the National Conference on Artificial Intelligence English 2012 This paper presents REWOrD, an approach to compute semantic relatedness between entities in the Web of Data representing real-world concepts. REWOrD exploits the graph nature of RDF data and the SPARQL query language to access this data. Through simple queries, REWOrD constructs weighted vectors keeping the informativeness of RDF predicates used to make statements about the entities being compared. The most informative path is also considered to further refine informativeness. Relatedness is then computed by the cosine of the weighted vectors. Unlike previous approaches based on Wikipedia, REWOrD does not require any preprocessing or custom data transformation. Indeed, it can leverage any RDF knowledge base as a source of background knowledge. We evaluated REWOrD in different settings by using a new dataset of real-world entities and investigated its flexibility. Compared to related work on classical datasets, REWOrD obtains comparable results while, on one side, it avoids the burden of preprocessing and data transformation and, on the other side, it provides more flexibility and applicability in a broad range of domains. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
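A small sketch of the final relatedness step as described in the abstract, with invented predicate informativeness weights standing in for the values REWOrD would obtain via SPARQL; only the cosine of the weighted vectors is taken from the abstract.
<syntaxhighlight lang="python">
# Hedged sketch (hypothetical predicate weights): relatedness as the cosine of
# two vectors whose components weight the informativeness of RDF predicates
# used in statements about each entity.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up informativeness weights per predicate for two DBpedia entities.
vec_entity_a = {"dbo:birthPlace": 0.8, "dbo:party": 0.6, "dbo:almaMater": 0.4}
vec_entity_b = {"dbo:birthPlace": 0.7, "dbo:party": 0.6, "dbo:office": 0.5}
print(cosine(vec_entity_a, vec_entity_b))
</syntaxhighlight>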
Round-trip semantics with Sztakipedia and DBpedia Spotlight Heder M.
Mendes P.N.
WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web Companion English 2012 We describe a tool kit to support a knowledge-enhancement cycle on the Web. In the first step, structured data extracted from Wikipedia is used to construct automatic content enhancement engines. Those engines can be used to interconnect knowledge in structured and unstructured information sources on the Web, including Wikipedia itself. Sztakipedia-toolbar is a MediaWiki user script which brings DBpedia Spotlight and other kinds of machine intelligence into the Wiki editor interface to provide enhancement suggestions to the user. The suggestions offered by the tool focus on complementing knowledge and increasing the availability of structured data on Wikipedia. This will, in turn, increase the available information for the content enhancement engines themselves, completing a virtuous cycle of knowledge enhancement. A 90-second screencast introduces the system on YouTube: http://www.youtube.com/watch?v=8VW0TrvXpl4. For those who are interested in more details there is another 4-minute video: http://www.youtube.com/watch?v=cLqe-DOqKCM. Copyright is held by the International World Wide Web Conference Committee (IW3C2). 0 0
Search for minority information from wikipedia based on similarity of majority information Hattori Y.
Akiyo Nadamoto
Lecture Notes in Computer Science English 2012 In this research, we propose a method for searching for minority information, which is less acknowledged and less popular, on the internet. We propose two methods to extract minority information: one calculates the relevance of content, and the other is based on analogy expression. In this paper, we propose such a minority search system. We consider it necessary to search for minority information in which a user is interested. Using our proposed system, the user inputs a query which represents their interest in majority information. The system then searches for minority information that is similar to the given majority information. Consequently, users can obtain new information that they did not know and can discover new knowledge and new interests. 0 0
Supporting content curation communities: The case of the Encyclopedia of Life Rotman D.
Procita K.
Hansen D.
Sims Parr C.
Justin Preece
Journal of the American Society for Information Science and Technology English 2012 This article explores the opportunities and challenges of creating and sustaining large-scale "content curation communities" through an in-depth case study of the Encyclopedia of Life (EOL). Content curation communities are large-scale crowdsourcing endeavors that aim to curate existing content into a single repository, making these communities different from content creation communities such as Wikipedia. In this article, we define content curation communities and provide examples of this increasingly important genre. We then follow by presenting EOL, a compelling example of a content curation community, and describe a case study of EOL based on analysis of interviews, online discussions, and survey data. Our findings are characterized into two broad categories: information integration and social integration. Information integration challenges at EOL include the need to (a) accommodate and validate multiple sources and (b) integrate traditional peer-reviewed sources with user-generated, non-peer-reviewed content. Social integration challenges at EOL include the need to (a) establish the credibility of open-access resources within the scientific community and (b) facilitate collaboration between experts and novices. After identifying the challenges, we discuss the potential strategies EOL and other content curation communities can use to address them, and provide technical, content, and social design recommendations for overcoming them. 0 0
The Inclusivity of Wikipedia and the Drawing of Expert Boundaries: An Examination of Talk Pages and Reference Lists Brendan Luyt Journal of the American Society for Information Science and Technology English 2012 Wikipedia is frequently viewed as an inclusive medium. But inclusivity within this online encyclopedia is not a simple matter of just allowing anyone to contribute. In its quest for legitimacy as an encyclopedia, Wikipedia relies on outsiders to judge claims championed by rival editors. In choosing these experts, Wikipedians define the boundaries of acceptable comment on any given subject. Inclusivity then becomes a matter of how the boundaries of expertise are drawn. In this article I examine the nature of these boundaries and the implications they have for inclusivity and credibility as revealed through the talk pages produced and sources used by a particular subset of Wikipedia's creators: those involved in writing articles on the topic of Philippine history. 0 0
Towards building a global oracle: A physical mashup using artificial intelligence technology Fortuna C.
Vucnik M.
Blaz Fortuna
Kenda K.
Moraru A.
Mladenic D.
ACM International Conference Proceeding Series English 2012 In this paper, we describe Videk - a physical mashup which uses artificial intelligence technology. We make an analogy between human senses and sensors, and between the human brain and artificial intelligence technology respectively. This analogy leads to the concept of a Global Oracle. We introduce a mashup system which automatically collects data from sensors. The data is processed and stored by SenseStream while the meta-data is fed into ResearchCyc. SenseStream indexes aggregates, performs clustering, and learns rules, which it then exports as RuleML. ResearchCyc performs logical inference on the meta-data and transliterates logical sentences. The GUI mashes up sensor data with SenseStream output, ResearchCyc output and other external data sources: GoogleMaps, Geonames, Wikipedia and Panoramio. 0 0
Using Wikipedia and conceptual graph structures to generate questions for academic writing support Liu M.
Calvo R.A.
Aditomo A.
Pizzato L.A.
IEEE Transactions on Learning Technologies English 2012 In this paper, we present a novel approach for semiautomatic question generation to support academic writing. Our system first extracts key phrases from students' literature review papers. Each key phrase is matched with a Wikipedia article and classified into one of five abstract concept categories: Research Field, Technology, System, Term, and Other. Using the content of the matched Wikipedia article, the system then constructs a conceptual graph structure representation for each key phrase, and the questions are then generated based on the structure. To evaluate the quality of the computer-generated questions, we conducted a version of the Bystander Turing test, which involved 20 research students who had written literature reviews for an IT methods course. The pedagogical value of the generated questions was evaluated using a semiautomated process. The results indicate that the students had difficulty distinguishing between computer-generated and supervisor-generated questions. Computer-generated questions were also rated as being as pedagogically useful as supervisor-generated questions, and more useful than generic questions. The findings also suggest that the computer-generated questions were more useful for the first-year students than for second or third-year students. 0 0
Vulnerapedia: Security knowledge management with an ontology Blanco F.J.
Fernandez-Villamor J.I.
Iglesias C.A.
ICAART 2012 - Proceedings of the 4th International Conference on Agents and Artificial Intelligence English 2012 Ontological engineering enables efficient management of security data, generating security knowledge. We use a stepwise methodology, defining a main ontology for the web application security domain. Next, extraction and integration processes translate unstructured data into quality security knowledge. We then check that the ontology can support the management processes involved. A social tool is implemented to wrap the knowledge in an accessible way; it opens up the security knowledge to encourage people to collaboratively use and extend it. 0 0
Wiki as Ontology for knowledge discovery on WWW Yin L.
Wang J.
Huang Y.
Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012 English 2012 Due to the increasing amount of data available online, the World Wide Web has become one of the most valuable resources for information retrieval and knowledge discovery. Web mining technologies (usually divided into content mining, structure mining and usage mining) are the right solutions for knowledge discovery on the WWW. In practice the work depends on two essential issues: one is the knowledge itself, which means analyzing what information is required; the other is how a machine can capture that requirement well, which means realizing a feasible method for computation and for complex semantic measurement. This paper discusses three aspects of knowledge we define: content, structure and priors. This means that knowledge discovery on the WWW should consider content features, structure relations and priors from background knowledge simultaneously. A practice of wiki as ontology is also proposed in this paper. The multi-user writing system offers an opportunity as a large corpus, and we applied linked data to construct a dynamic semantic network, which can be used in short-text computation such as query expansion. With swarm intelligence in mind, the key issues and lessons are given in this paper; linked data such as wikis will provide chances and challenges for computability in the future. 0 0
A resource-based method for named entity extraction and classification Gamallo P.
Garcia M.
Lecture Notes in Computer Science English 2011 We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted by making use of semi-structured information from Wikipedia, namely infoboxes and category trees. Language-independent heuristics are used to disambiguate and classify entities that have already been identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winning system in the CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres. 0 0
A statistical approach for automatic keyphrase extraction Abulaish M.
Jahiruddin
Dey L.
Proceedings of the 5th Indian International Conference on Artificial Intelligence, IICAI 2011 English 2011 Due to the availability of voluminous textual data on the World Wide Web and in textual databases, automatic keyphrase extraction has gained increasing popularity in the recent past as a way to summarize and characterize text documents. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to mine keyphrases automatically. However, the non-availability of annotated corpora for training such systems is the main hindrance to their success. In this paper, we propose the design of an automatic keyphrase extraction system which uses NLP and statistical approaches to mine keyphrases from unstructured text documents. The efficacy of the proposed system is established over texts crawled from the Wikipedia server. On evaluation we found that the proposed method outperforms KEA, which uses the naïve Bayes classification technique for keyphrase extraction. 0 0
A study of category expansion for related entity finding Jinghua Zhang
Qu Y.
Proceedings - 2011 4th International Symposium on Computational Intelligence and Design, ISCID 2011 English 2011 Entities are important information carriers in Web pages. Searchers often want a ranked list of relevant entities directly rather than a list of documents, so research on related entity finding (REF) is very meaningful. In this paper we investigate the most important task of REF: entity ranking. To address the issue of wrong entity types in entity ranking - some retrieved entities do not belong to the target entity type - we make use of category expansion. We use Wikipedia and DBpedia as data sources in the experiment. The experiments show that category expansion based on the original type achieves better recall and precision. 0 0
Automatic semantic web annotation of named entities Charton E.
Marie-Pierre Gagnon
Ozell B.
Lecture Notes in Computer Science English 2011 This paper describes a method to perform automated semantic annotation of named entities contained in large corpora. The semantic annotation is made in the context of the Semantic Web. The method is based on an algorithm that compares the set of words that appear before and after the named entity with the content of Wikipedia articles, and identifies the most relevant one by means of a similarity measure. It then uses the link that exists between the selected Wikipedia entry and the corresponding RDF description in the Linked Data project to establish a connection between the named entity and some URI in the Semantic Web. We present our system, discuss its architecture, and describe an algorithm dedicated to ontological disambiguation of named entities contained in large-scale corpora. We evaluate the algorithm, and present our results. 0 0
Bootstrapping multilingual relation discovery using English wikipedia and wikimedia-induced entity extraction Schome P.
Tim Allison
Chris Giannella
Craig Pfeifer
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2011 Relation extraction has been a subject of significant study over the past decade. Most relation extractors have been developed by combining the training of complex computational systems on large volumes of annotations with extensive rule-writing by language experts. Moreover, many relation extractors are reliant on other non-trivial NLP technologies which themselves are developed through significant human efforts, such as entity tagging, parsing, etc. Due to the high cost of creating and assembling the required resources, relation extractors have typically been developed for only high-resourced languages. In this paper, we describe a near-zero-cost methodology to build relation extractors for significantly distinct non-English languages using only freely available Wikipedia and other web documents, and some knowledge of English. We apply our method to build alma-mater, birthplace, father, occupation, and spouse relation extractors in Greek, Spanish, Russian, and Chinese. We conduct evaluations of induced relations at the file level - the most refined we have seen in the literature. 0 0
Constraint optimization approach to context based word selection Matsuno J.
Toru Ishida
IJCAI International Joint Conference on Artificial Intelligence English 2011 Consistent word selection in machine translation is currently realized by resolving word sense ambiguity through the context of a single sentence or neighboring sentences. However, consistent word selection over the whole article has yet to be achieved. Consistency over the whole article is extremely important when applying machine translation to collectively developed documents like Wikipedia. In this paper, we propose to consider constraints between words in the whole article based on their semantic relatedness and contextual distance. The proposed method is successfully implemented in both statistical and rule-based translators. We evaluate those systems by translating 100 articles in the English Wikipedia into Japanese. The results show that the ratio of appropriate word selection for common nouns increased to around 75% with our method, while it was around 55% without our method. 0 0
Creating and Exploiting a Hybrid Knowledge Base for Linked Data Zareen Syed
Tim Finin
Communications in Computer and Information Science English 2011 Twenty years ago Tim Berners-Lee proposed a distributed hypertext system based on standard Internet protocols. The Web that resulted fundamentally changed the ways we share information and services, both on the public Internet and within organizations. That original proposal contained the seeds of another effort that has not yet fully blossomed: a Semantic Web designed to enable computer programs to share and understand structured and semi-structured information easily. We will review the evolution of the idea and technologies to realize a Web of Data and describe how we are exploiting them to enhance information retrieval and information extraction. A key resource in our work is Wikitology, a hybrid knowledge base of structured and unstructured information extracted from Wikipedia. 0 0
Disambiguation and filtering methods in using web knowledge for coreference resolution Uryupina O.
Poesio M.
Claudio Giuliano
Kateryna Tymoshenko
Proceedings of the 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24 English 2011 We investigate two publicly available web knowledge bases, Wikipedia and Yago, in an attempt to leverage semantic information and increase the performance level of a state-of-the-art coreference resolution (CR) engine. We extract semantic compatibility and aliasing information from Wikipedia and Yago, and incorporate it into a CR system. We show that using such knowledge with no disambiguation and filtering does not bring any improvement over the baseline, mirroring the previous findings (Ponzetto and Poesio 2009). We propose, therefore, a number of solutions to reduce the amount of noise coming from web resources: using disambiguation tools for Wikipedia, pruning Yago to eliminate the most generic categories and imposing additional constraints on affected mentions. Our evaluation experiments on the ACE-02 corpus show that the knowledge, extracted from Wikipedia and Yago, improves our system's performance by 2-3 percentage points. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Evaluating reranking methods using wikipedia features Kurakado K.
Oishi T.
Hasegawa R.
Fujita H.
Koshimura M.
ICAART 2011 - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence English 2011 Many people these days access vast numbers of documents on the Web with the help of search engines such as Google. However, even when using a search engine, it is often the case that we cannot find the desired information easily. In this paper, we extract related words for the search query by analyzing link information and category structure. We aim to assist the user in retrieving web pages by reranking search results. 0 0
Finding paths connecting two proper nouns using an ant colony algorithm Hakande D. ECTA 2011 FCTA 2011 - Proceedings of the International Conference on Evolutionary Computation Theory and Applications and International Conference on Fuzzy Computation Theory and Applications English 2011 Collaborative systems available on the Web allow millions of users to share information through a growing collection of tools and platforms such as wiki platforms, blogs, and shared forums. With abundant information resources on the Internet such as Wikipedia or Freebase, we study the connections between two proper nouns. This is a challenging search problem, as information on the Internet is undoubtedly large and full of irrelevant content. In this project, we first parse and mine the entire Freebase database in order to extract the relevant information about proper nouns. We then apply the Ant Colony Optimization method to find the path that connects two proper nouns. 0 0
Geodesic distances for web document clustering Tekir S.
Mansmann F.
Keim D.
IEEE SSCI 2011: Symposium Series on Computational Intelligence - CIDM 2011: 2011 IEEE Symposium on Computational Intelligence and Data Mining English 2011 While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article. 0 0
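The exact formula is not given in the abstract, so the sketch below is only one plausible reading of the approach: take the angle implied by cosine similarity as an arc on a sphere and scale it by a curvature value estimated from clustering coefficients in the link graph. The graph, the document vectors, and the pairwise curvature heuristic are all invented for illustration.
<syntaxhighlight lang="python">
# Hedged sketch only; not the paper's formula. Combines cosine distance of
# document vectors with a curvature estimate from clustering coefficients.
import numpy as np
import networkx as nx

def geodesic_distance(doc_u, doc_v, curvature):
    cos_sim = np.dot(doc_u, doc_v) / (np.linalg.norm(doc_u) * np.linalg.norm(doc_v))
    angle = np.arccos(np.clip(cos_sim, -1.0, 1.0))  # cosine distance as an arc
    return angle / np.sqrt(curvature)               # hypothetical curvature scaling

g = nx.karate_club_graph()                # stand-in for a document hyperlink graph
cc = nx.clustering(g)                     # clustering coefficient per node
curv = max(0.5 * (cc[0] + cc[1]), 1e-3)   # heuristic pairwise curvature (assumed)
print(geodesic_distance(np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 1.0]), curv))
</syntaxhighlight>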
How to reason by HeaRT in a semantic knowledge-based Wiki Adrian W.T.
Szymon Bobek
Nalepa G.J.
Krzysztof Kaczor
Krzysztof Kluza
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2011 Semantic wikis constitute an increasingly popular class of systems for collaborative knowledge engineering. We developed Loki, a semantic wiki that uses a logic-based knowledge representation. It is compatible with the semantic annotations mechanism as well as Semantic Web languages. We integrated the system with a rule engine called HeaRT that supports inference with production rules. Several modes for modularized rule bases, suitable for the distributed rule bases present in a wiki, are considered. Embedding the rule engine enables strong reasoning and allows production rules to be run over semantic knowledge bases. In the paper, we demonstrate the system concepts and functionality using an illustrative example. 0 0
Integrating artificial intelligence solutions into interfaces of online knowledge production Heder M. ICIC Express Letters English 2011 The current interfaces of online knowledge production systems are not optimal for the creation of high-quality knowledge units. This study investigates possible methods for the integration of AI solutions into those web interfaces where users produce knowledge, e.g., Wikipedia, forums and blogs. A requirement survey was conducted in order to predict which solutions the users would most likely accept out of the many possible choices. We focused on the reading and editing preferences of Wikipedia users, Wikipedia being the biggest knowledge production and sharing framework. We found that many functions can be easily implemented into the knowledge production interface if we simply integrate well-known and available AI solutions. The results of our survey show that right now the need for basic, but well-implemented and integrated AI functions is greater than the need for cutting-edge, complex AI modules. It can be concluded that even if it is advisable to constantly improve the underlying algorithms and methods of the system, much more emphasis should be given to the interface design of currently available AI solutions. 0 0
Leveraging wikipedia characteristics for search and candidate generation in question answering Chu-Carroll J.
Fan J.
Proceedings of the National Conference on Artificial Intelligence English 2011 Most existing Question Answering (QA) systems adopt a type-and-generate approach to candidate generation that relies on a pre-defined domain ontology. This paper describes a type independent search and candidate generation paradigm for QA that leverages Wikipedia characteristics. This approach is particularly useful for adapting QA systems to domains where reliable answer type identification and type-based answer extraction are not available. We present a three-pronged search approach motivated by relations an answer-justifying title-oriented document may have with the question/answer pair. We further show how Wikipedia metadata such as anchor texts and redirects can be utilized to effectively extract candidate answers from search results without a type ontology. Our experimental results show that our strategies obtained high binary recall in both search and candidate generation on TREC questions, a domain that has mature answer type extraction technology, as well as on Jeopardy! questions, a domain without such technology. Our high-recall search and candidate generation approach has also led to high over-all QA performance in Watson, our end-to-end system. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
LogAnswer in question answering forums Pelzer B.
Glockner I.
Dong T.
ICAART 2011 - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence English 2011 LogAnswer is a question answering (QA) system for the German language. By providing concise answers to questions of the user, LogAnswer provides more natural access to document collections than conventional search engines do. QA forums provide online venues where human users can ask each other questions and give answers. We describe an ongoing adaptation of LogAnswer to QA forums, aiming at creating a virtual forum user who can respond intelligently and efficiently to human questions. This serves not only as a more accurate evaluation method for our system, but also as a real-world use case for automated QA. The basic idea is that the QA system can relieve the human experts from answering routine questions, e.g. questions with a known answer in the forum, or questions that can be answered from Wikipedia. As a result, the users can focus on those questions that really demand human judgement or expertise. In order not to spam users, the QA system needs a good self-assessment of its answer quality. Existing QA techniques, however, are not sufficiently precision-oriented. The need to provide justified answers thus fosters research into logic-oriented QA and novel methods for answer validation. 0 0
Loki-presentation of logic-based semantic wiki Adrian W.T.
Nalepa G.J.
CEUR Workshop Proceedings English 2011 TOOL PRESENTATION: The paper presents a semantic wiki, called Loki, with strong logical knowledge representation using rules. The system uses a coherent logic-based representation for semantic annotation of the content and for implementing reasoning procedures. The representation uses the logic programming paradigm and the Prolog programming language. The proposed architecture allows for rule-based reasoning in the wiki. It also provides a compatibility layer with the popular Semantic MediaWiki platform, directly parsing its annotations. 0 0
Semantic relationship discovery with wikipedia structure Bu F.
Hao Y.
Zhu X.
IJCAI International Joint Conference on Artificial Intelligence English 2011 Thanks to the idea of social collaboration, Wikipedia has accumulated a vast amount of semi-structured knowledge in which the link structure reflects, to some extent, human cognition of semantic relationships. In this paper, we propose a novel method, RCRank, to jointly compute concept-concept relatedness and concept-category relatedness, based on the assumption that information carried in concept-concept links and concept-category links can mutually reinforce each other. Different from previous work, RCRank can not only find semantically related concepts but also interpret their relations by categories. Experimental results on concept recommendation and relation interpretation show that our method substantially outperforms classical methods. 0 0
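A toy sketch of the mutual-reinforcement assumption only, not the RCRank algorithm itself: scores for concepts (relative to a seed concept) and scores for categories update each other through invented link and membership matrices.
<syntaxhighlight lang="python">
# Hedged sketch of mutual reinforcement between concept and category scores;
# the matrices and the update rule are illustrative assumptions.
import numpy as np

def mutual_reinforce(link, membership, seed, iters=20):
    """link: concept-concept adjacency; membership: concept-category matrix."""
    concept = np.zeros(link.shape[0])
    concept[seed] = 1.0
    for _ in range(iters):
        category = membership.T @ concept                 # categories of related concepts
        category /= category.sum() or 1.0
        concept = link @ concept + membership @ category  # concepts reinforced by both
        concept /= concept.sum() or 1.0
    return concept, category

link = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # toy concept links
member = np.array([[1, 0], [1, 0], [0, 1]], float)         # toy category memberships
print(mutual_reinforce(link, member, seed=0))
</syntaxhighlight>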
Sentiment analysis of news titles: The role of entities and a new affective lexicon Loureiro D.
Marreiros G.
Neves J.
Lecture Notes in Computer Science English 2011 The growth of content on the web has been followed by increasing interest in opinion mining. This field of research relies on accurate recognition of emotion from textual data. There has been much research in sentiment analysis lately, but it always focuses on the same elements. Sentiment analysis traditionally depends on linguistic corpora, or common sense knowledge bases, to provide extra dimensions of information to the text being analyzed. Previous research has not yet explored a fully automatic method to evaluate how events associated with certain entities may impact each individual's sentiment perception. This project presents a method to assign valence ratings to entities, using information from their Wikipedia page, and considering user preferences gathered from the user's Facebook profile. Furthermore, a new affective lexicon is compiled entirely from existing corpora, without any intervention from the coders. 0 0
Topic mining based on graph local clustering Garza Villarreal S.E.
Brena R.F.
Lecture Notes in Computer Science English 2011 This paper introduces an approach for discovering thematically related document groups (a topic mining task) in massive document collections with the aid of graph local clustering. This can be achieved by viewing a document collection as a directed graph where vertices represent documents and arcs represent connections among these (e.g. hyperlinks). Because a document is likely to have more connections to documents of the same theme, we have assumed that topics have the structure of a graph cluster, i.e. a group of vertices with more arcs to the inside of the group and fewer arcs to the outside of it. So, topics could be discovered by clustering the document graph; we use a local approach to cope with scalability. We also extract properties (keywords and most representative documents) from clusters to provide a summary of the topic. This approach was tested over the Wikipedia collection and we observed that the resulting clusters in fact correspond to topics, which shows that topic mining can be treated as a graph clustering problem. 0 0
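A tiny sketch of the cluster assumption stated in the abstract above, using a made-up directed graph: a candidate topic group should have more arcs inside the group than arcs leaving it.
<syntaxhighlight lang="python">
# Hedged sketch: measure the internal-to-external arc ratio of a candidate
# document group; the graph and group are invented for illustration.
import networkx as nx

def internal_external_ratio(graph, group):
    group = set(group)
    internal = sum(1 for u, v in graph.edges(group) if u in group and v in group)
    external = sum(1 for u, v in graph.edges(group) if (u in group) != (v in group))
    return internal / external if external else float("inf")

g = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("a", "x")])
print(internal_external_ratio(g, {"a", "b", "c"}))  # > 1 suggests a cluster-like group
</syntaxhighlight>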
Using a lexical dictionary and a folksonomy to automatically construct domain ontologies Macias-Galindo D.
Wong W.
Cavedon L.
Thangarajah J.
Lecture Notes in Computer Science English 2011 We present and evaluate MKBUILD, a tool for creating domain-specific ontologies. These ontologies, which we call Modular Knowledge Bases (MKBs), contain concepts and associations imported from existing large-scale knowledge resources, in particular WordNet and Wikipedia. The combination of WordNet's human-crafted taxonomy and Wikipedia's semantic associations between articles produces a highly connected resource. Our MKBs are used by a conversational agent operating in a small computational environment. We constructed several domains with our technique, and then conducted an evaluation by asking human subjects to rate the domain-relevance of the concepts included in each MKB on a 3-point scale. The proposed methodology achieved precision values between 71% and 88% and recall between 37% and 95% in the evaluation, depending on how the middle-score judgements are interpreted. The results are encouraging considering the cross-domain nature of the construction process and the difficulty of representing concepts as opposed to terms. 0 0
VisualWikiCurator: A corporate wiki plugin Nicholas Kong
Ben H.
Gregorio Convertino
Chi E.H.
Conference on Human Factors in Computing Systems - Proceedings English 2011 Knowledge workers who maintain corporate wikis face high costs for organizing and updating content on wikis. This problem leads to low adoption rates and compromises the utility of such tools in organizations. We describe a system that seeks to reduce the interaction costs of updating and organizing wiki pages by combining human and machine intelligence. We then present preliminary results of an ongoing lab-based evaluation of the tool with knowledge workers. 0 0
Wiki tool for adaptive, accesibility, usability and colaborative hypermedia courses: Mediawikicourse De Castro C.
Garcia E.
Ramirez J.M.
Buron F.J.
Sainz B.
Sanchez R.
Robles R.M.
Torres J.C.
Bell J.
Alcantud F.
Lecture Notes in Computer Science English 2011 Recently published social protection and dependence reports reaffirm that the elderly, the disabled, or those in situations of dependency objectively benefit from continuing to live at home with assistance from their direct family. Currently in Spain - amongst the elderly, or people in a situation of dependency - 8 out of every 10 people stay at home. The end result is that direct family relations have the responsibility of performing 76% of the tasks of the daily routine where aid is needed. Associations for people with disabilities, however, report not only a lack of adequate aid services, but a lack of direct-family assistance as well. It is necessary, therefore, for an "evolution" or overhaul amongst the social and health service provision systems. The elderly, people in situations of dependency, or people with disabilities should be provided with enough resources and aids to allow them to decide their own future. 0 0
Wikisimple: Automatic simplification of wikipedia articles Woodsend K.
Lapata M.
Proceedings of the National Conference on Artificial Intelligence English 2011 Text simplification aims to rewrite text into simpler versions and thus make information accessible to a broader audience (e.g., non-native speakers, children, and individuals with language impairments). In this paper, we propose a model that simplifies documents automatically while selecting their most important content and rewriting them in a simpler style. We learn content selection rules from same-topic Wikipedia articles written in the main encyclopedia and its Simple English variant. We also use the revision histories of Simple Wikipedia articles to learn a quasi-synchronous grammar of simplification rewrite rules. Based on an integer linear programming formulation, we develop a joint model where preferences based on content and style are optimized simultaneously. Experiments on simplifying main Wikipedia articles show that our method significantly reduces the reading difficulty, while still capturing the important content. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
YAGO-QA: Answering questions by structured knowledge queries Adolphs P.
Martin Theobald
Schafer U.
Uszkoreit H.
Gerhard Weikum
Proceedings - 5th IEEE International Conference on Semantic Computing, ICSC 2011 English 2011 We present a natural-language question-answering system that gives access to the accumulated knowledge of one of the largest community projects on the Web - Wikipedia - via an automatically acquired structured knowledge base. Key to building such a system is to establish mappings from natural language expressions to semantic representations. We propose to acquire these mappings by data-driven methods -corpus harvesting and paraphrasing - and present a preliminary empirical study that demonstrates the viability of our method. 0 0
A semi-automatic method for domain ontology extraction from Portuguese language Wikipedia's categories Xavier C.C.
De Lima V.L.S.
Lecture Notes in Computer Science English 2010 The increasing need for ontologies and the difficulties of manual construction give place to initiatives proposing methods for automatic and semi-automatic ontology learning. In this work we present a semi-automatic method for domain ontologies extraction from Wikipedia's categories. In order to validate the method, we have conducted a case study in which we implemented a prototype generating a Tourism ontology. The results are evaluated against a manually built Golden Standard reporting 79.51% Precision and 91.95% Recall, comparable to those found in the literature for other languages. 0 0
Aligning WordNet synsets and wikipedia articles Fernando S.
Stevenson M.
AAAI Workshop - Technical Report English 2010 This paper examines the problem of finding articles in Wikipedia to match noun synsets in WordNet. The motivation is that these articles enrich the synsets with much more information than is already present in WordNet. Two methods are used. The first is title matching, following redirects and disambiguation links. The second is information retrieval over the set of articles. The methods are evaluated over a random sample set of 200 noun synsets which were manually annotated. With 10 candidate articles retrieved for each noun synset, the methods achieve recall of 93%. The manually annotated data set and the automatically generated candidate article sets are available online for research purposes. Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
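A sketch of the first method (title matching) under stated assumptions: the Wikipedia title index below is a made-up dictionary in which redirects are assumed to have been folded in beforehand, and WordNet access uses NLTK, which requires its WordNet data to be installed.
<syntaxhighlight lang="python">
# Hedged sketch: match a noun synset's lemmas against a (hypothetical)
# Wikipedia title index; disambiguation pages and IR fallback are omitted.
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

# Hypothetical title index: lower-cased article/redirect title -> canonical title.
title_index = {"jaguar": "Jaguar", "panthera onca": "Jaguar", "jaguar cars": "Jaguar Cars"}

def match_synset(synset, index):
    for lemma in synset.lemma_names():
        key = lemma.replace("_", " ").lower()
        if key in index:
            return index[key]
    return None

print(match_synset(wn.synset("jaguar.n.01"), title_index))
</syntaxhighlight>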
Application of social software in college education Pan Q. Proceedings - 2010 International Conference on Artificial Intelligence and Education, ICAIE 2010 English 2010 Social software is a new development in the process of network socialization; it integrates learners and software features, provides good support for learning, and makes learning and the transformation of knowledge complement each other. This article describes the concept of social software and its classification, and expounds on its application in high school education, in the hope that learners can effectively use social software to achieve optimal learning outcomes. 0 0
Approaches for automatically enriching wikipedia Zareen Syed
Tim Finin
AAAI Workshop - Technical Report English 2010 We have been exploring the use of Web-derived knowledge bases through the development of Wikitology - a hybrid knowledge base of structured and unstructured information extracted from Wikipedia augmented by RDF data from DBpedia and other Linked Open Data resources. In this paper, we describe approaches that aid in enriching Wikipedia and thus the resources that derive from Wikipedia such as the Wikitology knowledge base, DBpedia, Freebase and Powerset. Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
C-Link: Concept linkage in knowledge repositories Cowling P.
Remde S.
Hartley P.
Stewart W.
Stock-Brooks J.
Woolley T.
AAAI Spring Symposium - Technical Report English 2010 When searching a knowledge repository such as Wikipedia or the Internet, the user doesn't always know what they are looking for. Indeed, it is often the case that a user wishes to find information about a concept that was completely unknown to them prior to the search. In this paper we describe C-Link, which provides the user with a method for searching for unknown concepts which lie between two known concepts. C-Link does this by modeling the knowledge repository as a weighted, directed graph where nodes are concepts and arc weights give the degree of "relatedness" between concepts. An experimental study was undertaken with 59 participants to investigate the performance of C-Link compared to standard search approaches. Statistical analysis of the results shows great potential for C-Link as a search tool. 0 0
Computing semantic relatedness between named entities using Wikipedia Hongyan Liu
Yirong Chen
Proceedings - International Conference on Artificial Intelligence and Computational Intelligence, AICI 2010 English 2010 In this paper the authors suggest a novel approach that uses Wikipedia to measure the semantic relatedness between Chinese named entities, such as names of persons, books, software, etc. The relatedness is measured through articles in Wikipedia that are related to the named entities. The authors select a set of "definition words", which are hyperlinks from these articles, and then compute the relatedness between two named entities as the relatedness between two sets of definition words. The authors propose two ways to measure the relatedness between two definition words: by Wiki-articles related to the words or by categories of the words. The proposed approaches are compared with several other baseline models through experiments. The experiments show that this method renders satisfactory results. 0 0
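A minimal sketch of the idea of comparing two entities through sets of "definition words": the word sets here are invented, and a simple Jaccard overlap is used as a stand-in for the article-based and category-based measures the authors propose.
<syntaxhighlight lang="python">
# Hedged sketch: relatedness of two named entities via overlap of their
# (hypothetical) definition-word sets drawn from Wikipedia hyperlinks.
def definition_word_relatedness(words_a, words_b):
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

defs_entity_a = {"writer", "novelist", "Shanghai", "literature"}
defs_entity_b = {"writer", "novelist", "literature", "journalist"}
print(definition_word_relatedness(defs_entity_a, defs_entity_b))
</syntaxhighlight>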
Creating and exploiting a Web of semantic data Tim Finin
Zareen Syed
ICAART 2010 - 2nd International Conference on Agents and Artificial Intelligence, Proceedings English 2010 Twenty years ago Tim Berners-Lee proposed a distributed hypertext system based on standard Internet protocols. The Web that resulted fundamentally changed the ways we share information and services, both on the public Internet and within organizations. That original proposal contained the seeds of another effort that has not yet fully blossomed: a Semantic Web designed to enable computer programs to share and understand structured and semi-structured information easily. We will review the evolution of the idea and technologies to realize a Web of Data and describe how we are exploiting them to enhance information retrieval and information extraction. A key resource in our work is Wikitology, a hybrid knowledge base of structured and unstructured information extracted from Wikipedia. 0 0
Crowdsourcing, open innovation and collective intelligence in the scientific method: A research agenda and operational framework Buecheler T.
Sieg J.H.
Fuchslin R.M.
Pfeifer R.
Artificial Life XII: Proceedings of the 12th International Conference on the Synthesis and Simulation of Living Systems, ALIFE 2010 English 2010 The lonely researcher trying to crack a problem in her office still plays an important role in fundamental research. However, a vast exchange, often with participants from different fields is taking place in modern research activities and projects. In the "Research Value Chain" (a simplified depiction of the Scientific Method as a process used for the analyses in this paper), interactions between researchers and other individuals (intentional or not) within or outside their respective institutions can be regarded as occurrences of Collective Intelligence. "Crowdsourcing" (Howe 2006) is a special case of such Collective Intelligence. It leverages the wisdom of crowds (Surowiecki 2004) and is already changing the way groups of people produce knowledge, generate ideas and make them actionable. A very famous example of a Crowdsourcing outcome is the distributed encyclopedia "Wikipedia". Published research agendas are asking how techniques addressing "the crowd" can be applied to non-profit environments, namely universities, and fundamental research in general. This paper discusses how the non-profit "Research Value Chain" can potentially benefit from Crowdsourcing. Further, a research agenda is proposed that investigates a) the applicability of Crowdsourcing to fundamental science and b) the impact of distributed agent principles from Artificial Intelligence research on the robustness of Crowdsourcing. Insights and methods from different research fields will be combined, such as complex networks, spatially embedded interacting agents or swarms and dynamic networks. Although the ideas in this paper essentially outline a research agenda, preliminary data from two pilot studies show that non-scientists can support scientific projects with high quality contributions. Intrinsic motivators (such as "fun") are present, which suggests individuals are not (only) contributing to such projects with a view to large monetary rewards. 1 0
Keyword extraction and headline generation using novel word features Xu S.
Yang S.
Lau F.C.M.
Proceedings of the National Conference on Artificial Intelligence English 2010 We introduce several novel word features for keyword extraction and headline generation. These new word features are derived according to the background knowledge of a document as supplied by Wikipedia. Given a document, to acquire its background knowledge from Wikipedia, we first generate a query for searching the Wikipedia corpus based on the key facts present in the document. We then use the query to find articles in the Wikipedia corpus that are closely related to the contents of the document. With the Wikipedia search result article set, we extract the inlink, outlink, category and infobox information in each article to derive a set of novel word features which reflect the document's background knowledge. These newly introduced word features offer valuable indications on individual words' importance in the input document. They serve as nice complements to the traditional word features derivable from explicit information of a document. In addition, we also introduce a word-document fitness feature to characterize the influence of a document's genre on the keyword extraction and headline generation process. We study the effectiveness of these novel word features for keyword extraction and headline generation by experiments and have obtained very encouraging results. Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
Learning from the web: Extracting general world knowledge from noisy text Gordon J.
Van Durme B.
Schubert L.K.
AAAI Workshop - Technical Report English 2010 The quality and nature of knowledge that can be found by an automated knowledge-extraction system depends on its inputs. For systems that learn by reading text, the Web offers a breadth of topics and currency, but it also presents the problems of dealing with casual, unedited writing, non-textual inputs, and the mingling of languages. The results of extraction using the KNEXT system on two Web corpora - Wikipedia and a collection of weblog entries - indicate that, with automatic filtering of the output, even ungrammatical writing on arbitrary topics can yield an extensive knowledge base, which human judges find to be of good quality, with propositions receiving an average score across both corpora of 2.34 (where the range is 1 to 5 and lower is better) versus 3.00 for unfiltered output from the same sources. Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Modeling events with cascades of Poisson processes Simma A.
Jordan M.I.
Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, UAI 2010 English 2010 We present a probabilistic model of events in continuous time in which each event triggers a Poisson process of successor events. The ensemble of observed events is thereby modeled as a superposition of Poisson processes. Efficient inference is feasible under this model with an EM algorithm. Moreover, the EM algorithm can be implemented as a distributed algorithm, permitting the model to be applied to very large datasets. We apply these techniques to the modeling of Twitter messages and the revision history of Wikipedia. 0 0
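A small generative sketch of the model's core idea, with made-up parameters: every event spawns a Poisson-distributed number of successor events at exponentially distributed offsets, so the observed stream is a superposition of Poisson processes; the paper's EM inference is not shown.
<syntaxhighlight lang="python">
# Hedged simulation sketch of cascading Poisson processes (toy parameters).
import numpy as np

def simulate_cascade(base_times, branch_rate=0.8, decay=1.0, horizon=10.0, rng=None):
    """Each event triggers Poisson(branch_rate) successor events whose time
    offsets are exponentially distributed with rate `decay`."""
    rng = rng or np.random.default_rng(0)
    events, queue = [], list(base_times)
    while queue:
        t = queue.pop()
        events.append(t)
        for _ in range(rng.poisson(branch_rate)):
            child = t + rng.exponential(1.0 / decay)
            if child < horizon:
                queue.append(child)
    return sorted(events)

print(simulate_cascade([0.5, 2.0]))
</syntaxhighlight>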
Multiclass-multilabel classification with more classes than examples Dekel O.
Shamir O.
Journal of Machine Learning Research English 2010 We discuss multiclass-multilabel classification problems in which the set of classes is extremely large. Most existing multiclass-multilabel learning algorithms expect to observe a reasonably large sample from each class, and fail if they receive only a handful of examples per class. We propose and analyze the following two-stage approach: first use an arbitrary (perhaps heuristic) classification algorithm to construct an initial classifier, then apply a simple but principled method to augment this classifier by removing harmful labels from its output. A careful theoretical analysis allows us to justify our approach under some reasonable conditions (such as label sparsity and power-law distribution of class frequencies), even when the training set does not provide a statistically accurate representation of most classes. Surprisingly, our theoretical analysis continues to hold even when the number of classes exceeds the sample size. We demonstrate the merits of our approach on the ambitious task of categorizing the entire web using the 1.5 million categories defined on Wikipedia. Copyright 2010 by the authors. 0 0
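A toy sketch of the two-stage idea under stated assumptions: the base classifier's output and the per-label precision estimates are invented, and simple thresholding stands in for the paper's principled label-removal method.
<syntaxhighlight lang="python">
# Hedged sketch: stage two removes labels the base classifier tends to get
# wrong, keeping only labels whose estimated precision clears a threshold.
def prune_labels(predicted, label_precision, threshold=0.4):
    return [lab for lab in predicted if label_precision.get(lab, 0.0) >= threshold]

base_prediction = ["Physics", "Obscure-category-17", "Astronomy"]   # toy output
precision_estimates = {"Physics": 0.8, "Astronomy": 0.6, "Obscure-category-17": 0.1}
print(prune_labels(base_prediction, precision_estimates))
</syntaxhighlight>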
Place in perspective: Extracting online information about points of interest Alves A.O.
Pereira F.C.
Fernando Rodrigues
Oliveirinha J.
Lecture Notes in Computer Science English 2010 During the last few years, the amount of online descriptive information about places has reached reasonable dimensions for many cities in the world. Being such information mostly in Natural Language text, Information Extraction techniques are needed for obtaining the meaning of places that underlies these massive amounts of commonsense and user made sources. In this article, we show how we automatically label places using Information Extraction techniques applied to online resources such as Wikipedia, Yellow Pages and Yahoo!. 0 0
Related word extraction from wikipedia for web retrieval assistance Hori K.
Oishi T.
Mine T.
Hasegawa R.
Fujita H.
Koshimura M.
ICAART 2010 - 2nd International Conference on Agents and Artificial Intelligence, Proceedings English 2010 This paper proposes a web retrieval system with extended queries generated from the contents of Wikipedia. By using the extended queries, we aim to assist users in retrieving Web pages and acquiring knowledge. To extract extended query items, we make extensive use of hyperlinks in Wikipedia in addition to the related word extraction algorithm. We evaluated the system through experimental use by several examinees and questionnaires given to them. Experimental results show that our system works well for users' retrieval and knowledge acquisition. 0 0
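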
Scalable semantic annotation of text using lexical and Web resources Zavitsanos E.
Tsatsaronis G.
Varlamis I.
Paliouras G.
Lecture Notes in Computer Science English 2010 In this paper we are dealing with the task of adding domain-specific semantic tags to a document, based solely on the domain ontology and generic lexical and Web resources. In this manner, we avoid the need for trained domain-specific lexical resources, which hinder the scalability of semantic annotation. More specifically, the proposed method maps the content of the document to concepts of the ontology, using the WordNet lexicon and Wikipedia. The method comprises a novel combination of measures of semantic relatedness and word sense disambiguation techniques to identify the most related ontology concepts for the document. We test the method on two case studies: (a) a set of summaries, accompanying environmental news videos, (b) a set of medical abstracts. The results in both cases show that the proposed method achieves reasonable performance, thus pointing to a promising path for scalable semantic annotation of documents. 0 0
Standard Operating Procedures: Collaborative development and distributed use Wickler G.
Potter S.
ISCRAM 2010 - 7th International Conference on Information Systems for Crisis Response and Management: Defining Crisis Management 3.0, Proceedings English 2010 This paper describes a system that supports the distributed development and deployment of Standard Operating Procedures. The system is based on popular, open-source wiki software for the SOP development, and the I-X task-centric agent framework for deployment. A preliminary evaluation using an SOP for virtual collaboration is described and shows the potential of the approach. 0 0
The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis Lintean M.
Moldovan C.
Rus V.
McNamara D.
Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23 English 2010 In this paper, we investigate the impact of several local and global weighting schemes on Latent Semantic Analysis' (LSA) ability to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of 3 local and 3 global weighting schemes across 3 different standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary, inverted document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts, term-frequency local weighting combined with either IDF or binary global weighting works best. For paragraph-level texts, log-type local weighting combined with binary global weighting works best. We also found that global weights have a greater impact on sentence-level similarity, as the local weight is undermined by the small size of such texts. Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
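For concreteness, the sketch below computes two of the weighting combinations compared in the study on a tiny invented term-document count matrix: log-type local weighting with entropy global weighting, and binary local weighting with IDF global weighting.
<syntaxhighlight lang="python">
# Hedged sketch: two local/global weighting combinations on a toy matrix.
import numpy as np

counts = np.array([[2, 0, 1],    # term-document count matrix (terms x docs)
                   [1, 1, 0],
                   [0, 3, 1]], dtype=float)

def log_entropy(td):
    p = td / np.maximum(td.sum(axis=1, keepdims=True), 1e-12)
    n_docs = td.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)   # entropy weight per term
    return np.log1p(td) * global_w[:, None]               # log local * entropy global

def binary_idf(td):
    df = (td > 0).sum(axis=1)
    idf = np.log(td.shape[1] / np.maximum(df, 1))          # IDF global weight per term
    return (td > 0).astype(float) * idf[:, None]           # binary local * IDF global

print(log_entropy(counts))
print(binary_idf(counts))
</syntaxhighlight>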
Towards automatic classification of wikipedia content Szymanski J. Lecture Notes in Computer Science English 2010 Wikipedia - the Free Encyclopedia - encounters the problem of properly classifying new articles every day. The process of assigning articles to categories is performed manually and is a time-consuming task. It requires knowledge about the Wikipedia structure that is beyond typical editor competence, which leads to human-caused mistakes: omitted or wrong assignments of articles to categories. This article presents the application of an SVM classifier for automatic classification of documents from The Free Encyclopedia. The classifier has been tested with two text representations: inter-document connections (hyperlinks) and word content. The results of the experiments, evaluated on hand-crafted data, show that the Wikipedia classification process can be partially automated. The proposed approach can be used to build a decision support system which suggests to editors the best categories that fit new content entered into Wikipedia. 0 0
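An illustrative sketch of the word-content variant only, with invented toy articles and categories; the article's hyperlink-based representation and its hand-crafted evaluation data are not reproduced here.
<syntaxhighlight lang="python">
# Hedged sketch: an SVM over word-content features for article categorization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["neural networks learn representations",
         "the treaty ended the war in 1918",
         "gradient descent minimizes a loss function",
         "the empire expanded during the middle ages"]
categories = ["Computer science", "History", "Computer science", "History"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, categories)
print(clf.predict(["backpropagation updates network weights"]))
</syntaxhighlight>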
Using encyclopaedic knowledge for query classification Richard Khoury Proceedings of the 2010 International Conference on Artificial Intelligence, ICAI 2010 English 2010 Identifying the intended topic that underlies a user's query can benefit a large range of applications, from search engines to question-answering systems. However, query classification remains a difficult challenge due to the variety of queries a user can ask, the wide range of topics users can ask about, and the limited amount of information that can be mined from the query. In this paper, we develop a new query classification system that accounts for these three challenges. Our system relies on encyclopaedic knowledge to understand the user's query and fill in the gaps of missing information. Specifically, we use the freely-available online encyclopaedia Wikipedia as a natural-language knowledge base, and exploit Wikipedia's structure to infer the correct classification of any user query. 0 0
Wikipedia missing link discovery: A comparative study Sunercan O.
Birturk A.
AAAI Spring Symposium - Technical Report English 2010 In this paper, we describe our work on discovering missing links in Wikipedia articles. This task is important for both readers and authors of Wikipedia. The readers will benefit from the increased article quality with better navigation support. On the other hand, the system can be employed to support the authors during editing. This study combines the strengths of different approaches previously applied for the task, and adds its own techniques to reach satisfactory results. Because of the subjectivity in the nature of the task, automatic evaluation is hard to apply. Comparing approaches seems to be the best method to evaluate new techniques, and we offer a semi-automated method for evaluation of the results. The recall is calculated automatically using existing links in Wikipedia. The precision is calculated according to manual evaluations by human assessors. Comparative results for different techniques are presented, showing the success of our improvements. We employ the Turkish Wikipedia, which we are the first to study for this task, to examine whether a small instance is scalable enough for such purposes. 0 0
"Language Is the Skin of My Thought": Integrating Wikipedia and AI to Support a Guillotine Player Pasquale Lops
Pierpaolo Basile
Marco Gemmis
Giovanni Semeraro
Lecture Notes in Computer Science English 2009 This paper describes OTTHO (On the Tip of my THOught), a system designed for solving a language game, called Guillotine, which demands knowledge covering a broad range of topics, such as movies, politics, literature, history, proverbs, and popular culture. The rule of the game is simple: the player observes five words, generally unrelated to each other, and in one minute she has to provide a sixth word, semantically connected to the others. The system exploits several knowledge sources, such as a dictionary, a set of proverbs, and Wikipedia to realize a knowledge infusion process. The paper describes the process of modeling these sources and the reasoning mechanism to find the solution of the game. The main motivation for designing an artificial player for Guillotine is the challenge of providing the machine with the cultural and linguistic background knowledge which makes it similar to a human being, with the ability of interpreting natural language documents and reasoning on their content. Experiments carried out showed promising results. Our feeling is that the presented approach has a great potential for other more practical applications besides solving a language game. 0 0
A large margin approach to anaphora resolution for neuroscience knowledge discovery Burak Ozyurt I. Proceedings of the 22nd International Florida Artificial Intelligence Research Society Conference, FLAIRS-22 English 2009 A discriminative large margin classifier based approach to anaphora resolution for neuroscience abstracts is presented. The system employs both syntactic and semantic features. A support vector machine based word sense disambiguation method combining evidence from three methods that use WordNet and Wikipedia is also introduced and used for semantic features. The support vector machine anaphora resolution classifier with probabilistic outputs achieved almost four-fold improvement in accuracy over the baseline method. Copyright © 2009, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
A new financial investment management method based on knowledge management Yu Q. ISCID 2009 - 2009 International Symposium on Computational Intelligence and Design English 2009 There are many methodologies and theories developed for financial investment analysis. Nevertheless, financial analysts tend to adopt their proprietary models and systems to carry out financial investment analysis in practice. To advance both theories and practices in the financial investment domain, a knowledge management (KM) service is highly desirable to enable analysts, academics, and public investors to share their investment knowledge. This paper illustrates the design and development of a wiki-based investment knowledge management service which supports moderated sharing of structured and unstructured investment knowledge to facilitate investment decision making for both financial analysts and the general public. Our initial usability study shows that the proposed wiki-based investment knowledge management service is promising. 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of the exponential increase in the amount of multilingual information on the web. Wikipedia is a prime example of such a multilingual source of information on any conceivable topic in the world, edited by its readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler extracts the title of each page along with the titles of the corresponding pages in the other target languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
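The same interlanguage titles can nowadays also be harvested through the public MediaWiki API rather than by crawling page HTML. The sketch below illustrates that alternative route only; the endpoint parameters and the language list are my assumptions, not the authors' crawler.
<syntaxhighlight lang="python">
# Sketch: collect interlanguage titles for one English Wikipedia page via the
# MediaWiki "langlinks" API (assumed endpoint behaviour, shown for illustration).
import requests

def interlanguage_titles(title, languages=("de", "fr", "pl", "bg", "el", "zh")):
    params = {
        "action": "query", "format": "json", "prop": "langlinks",
        "titles": title, "lllimit": "500",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php",
                        params=params, timeout=10).json()
    page = next(iter(resp["query"]["pages"].values()))
    links = {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}
    # One tuple of translations, keyed by language code (None if missing).
    return {"en": title, **{lang: links.get(lang) for lang in languages}}

print(interlanguage_titles("Artificial intelligence"))
</syntaxhighlight>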
Clustering hyperlinks for topic extraction: An exploratory analysis Villarreal S.E.G.
Elizalde L.M.
Viveros A.C.
8th Mexican International Conference on Artificial Intelligence - Proceedings of the Special Session, MICAI 2009 English 2009 In a Web of increasing size and complexity, a key issue is automatic document organization, which includes topic extraction in collections. Since we consider topics as document clusters with semantic properties, we are concerned with exploring suitable clustering techniques for their identification in hyperlinked environments (where we only regard structural information). For this purpose, three algorithms (PDDP, k-means, and graph local clustering) were executed over a document subset of an increasingly popular corpus: Wikipedia. Results were evaluated with unsupervised metrics (cosine similarity, semantic relatedness, Jaccard index) and suggest that promising results can be produced for this particular domain. 0 0
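A minimal sketch of the structural representation used here, with each article reduced to a binary vector of its outgoing links, and of one clustering/evaluation combination (k-means plus the Jaccard index) is given below; the adjacency data is invented for illustration.
<syntaxhighlight lang="python">
# Toy sketch: cluster articles by their outgoing links (structure only).
import numpy as np
from sklearn.cluster import KMeans

pages = ["Cat", "Dog", "Physics", "Chemistry"]
# Row i = binary vector of which link targets page i points to.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)
print(dict(zip(pages, labels)))

def jaccard(a, b):
    """Unsupervised check: overlap of the two pages' link sets."""
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / len(sa | sb)

print(jaccard(A[0], A[1]), jaccard(A[0], A[2]))  # same cluster vs. different
</syntaxhighlight>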
Collaborative summarization: When collaborative filtering meets document summarization Qu Y.
Qingcai Chen
PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation English 2009 We propose a new way of generating a personalized single-document summary by combining two complementary methods: collaborative filtering for tag recommendation and graph-based affinity propagation. The proposed method, named Collaborative Summarization, consists of two steps iteratively repeated until convergence. In the first step, the possible tags of one user on a new document are predicted using collaborative filtering, which is based on the tagging histories of all users. The predicted tags of the new document are supposed to represent both the key idea of the document itself and the specific content of interest to that particular user. In the second step, the predicted tags are used to guide a graph-based affinity propagation algorithm to generate a personalized summary. The generated summary is in turn used to fine-tune the prediction of tags in the first step. The most intriguing advantage of collaborative summarization is that it harvests human intelligence, in the form of existing tag annotations of webpages such as delicious.com bookmark tags, to tackle a complex NLP task which is very difficult for artificial intelligence alone. Experiments on summarization of Wikipedia documents based on delicious.com bookmark tags show the potential of this method. 0 0
Explicit versus latent concept models for cross-language information retrieval Philipp Cimiano
Schultz A.
Sizov S.
Sorg P.
Staab S.
IJCAI International Joint Conference on Artificial Intelligence English 2009 The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-words based models. Many approaches aim at concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, to latent topics derived from the data itself - as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) - to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research. 0 0
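The contrast drawn in this abstract is between explicit and latent concept spaces. The toy sketch below illustrates only the explicit (ESA-style) side under strong simplifications: three stand-in "Wikipedia articles" serve as concepts, and two texts are compared by the cosine of their concept-association vectors. It is not the authors' cross-language system.
<syntaxhighlight lang="python">
# Illustrative ESA-style representation over a toy set of "concept articles".
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

concept_articles = {                      # stand-ins for Wikipedia articles
    "Astronomy": "planets stars telescope orbit galaxy",
    "Cooking":   "recipe oven ingredients bake flavour",
    "Finance":   "stocks market investment interest bank",
}
vec = TfidfVectorizer()
C = vec.fit_transform(list(concept_articles.values())).toarray()

def esa_vector(text):
    """Map a text to a vector of similarities with every explicit concept."""
    t = vec.transform([text]).toarray()[0]
    return C @ t

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q = esa_vector("which bank offers the best interest on investment")
d = esa_vector("market news and investment advice")
print(cosine(q, d))   # high: both map mostly onto the Finance concept
</syntaxhighlight>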
FolksoViz: A semantic relation-based folksonomy visualization using the Wikipedia corpus Kangpyo Lee
Hyeoncheol Kim
Hyopil Shin
Kim H.-J.
10th ACIS Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD 2009, In conjunction with IWEA 2009 and WEACR 2009 English 2009 Tagging is one of the most popular services in Web 2.0 and folksonomy is a representation of collaborative tagging. Tag cloud has been the one and only visualization of the folksonomy. The tag cloud, however, provides no information about the relations between tags. In this paper, targeting del.icio.us tag data, we propose a technique, FolksoViz, for automatically deriving semantic relations between tags and for visualizing the tags and their relations. In order to find the equivalence, subsumption, and similarity relations, we apply various rules and models based on the Wikipedia corpus. The derived relations are visualized effectively. The experiment shows that the FolksoViz manages to find the correct semantic relations with high accuracy. 0 0
Greedy algorithms for sequential sensing decisions Hajishirzi H.
Shirazi A.
Choi J.
Amir E.
IJCAI International Joint Conference on Artificial Intelligence English 2009 In many real-world situations we are charged with detecting change as soon as possible. Important examples include detecting medical conditions, detecting security breaches, and updating caches of distributed databases. In those situations, sensing can be expensive, but it is also important to detect change in a timely manner. In this paper we present tractable greedy algorithms and prove that they solve this decision problem either optimally or approximate the optimal solution in many cases. Our problem model is a POMDP that includes a cost for sensing, a cost for delayed detection, a reward for successful detection, and no-cost partial observations. Making optimal decisions is difficult in general. We show that our tractable greedy approach finds optimal policies for sensing both a single variable and multiple correlated variables. Further, we provide approximations for the optimal solution to multiple hidden or observed variables per step. Our algorithms outperform previous algorithms in experiments over simulated data and live Wikipedia WWW pages. 0 0
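The sensing problem can be caricatured with a myopic value-of-information rule: maintain a belief that the change has already occurred and pay for a sensing action only when the expected penalty of waiting exceeds the sensing cost. The sketch below is my own simplification under assumed costs and likelihoods, not the greedy algorithms analyzed in the paper.
<syntaxhighlight lang="python">
# Deliberately simplified greedy sensing rule (illustration only).
def greedy_sensing(observations, hazard=0.05, sense_cost=0.2, delay_cost=1.0):
    """Yield the time steps at which the greedy policy chooses to sense."""
    belief = 0.0            # probability that the change has already happened
    for t, obs in enumerate(observations):
        belief += (1.0 - belief) * hazard          # change may occur at any step
        if obs is not None:                        # no-cost partial observation
            belief = min(0.99, belief + 0.3) if obs else belief * 0.5
        # Myopic value-of-information test: expected one-step cost of waiting
        # versus the fixed cost of sensing right now.
        if belief * delay_cost >= sense_cost:
            yield t
            belief = 0.0                           # sensing resolves the state

# Example run with sparse free observations (None means nothing observed).
print(list(greedy_sensing([None, None, True, None, True, None, None, True])))
</syntaxhighlight>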
Large-scale taxonomy mapping for restructuring and integrating Wikipedia Ponzetto S.P.
Roberto Navigli
IJCAI International Joint Conference on Artificial Intelligence English 2009 We present a knowledge-rich methodology for disambiguating Wikipedia categories with WordNet synsets and using this semantic information to restructure a taxonomy automatically generated from the Wikipedia system of categories. We evaluate against a manual gold standard and show that both category disambiguation and taxonomy restructuring perform with high accuracy. Besides, we assess these methods on automatically generated datasets and show that we are able to effectively enrich WordNet with a large number of instances from Wikipedia. Our approach produces an integrated resource, thus bringing together the fine-grained classification of instances in Wikipedia and a well-structured top-level taxonomy from WordNet. 0 0
Modeling clinical protocols using semantic mediawiki: The case of the oncocure project Eccher C.
Ferro A.
Seyfang A.
Marco Rospocher
Silvia Miksch
Lecture Notes in Computer Science English 2009 A computerized Decision Support Systems (DSS) can improve the adherence of the clinicians to clinical guidelines and protocols. The building of a prescriptive DSS based on breast cancer treatment protocols and its integration with a legacy Electronic Patient Record is the aim of the Oncocure project. An important task of this project is the encoding of the protocols in computer-executable form - a task that requires the collaboration of physicians and computer scientists in a distributed environment. In this paper, we describe our project and how semantic wiki technology was used for the encoding task. Semantic wiki technology features great flexibility, allowing to mix unstructured information and semantic annotations, and to automatically generate the final model with minimal adaptation cost. These features render semantic wikis natural candidates for small to medium scale modeling tasks, where the adaptation and training effort of bigger systems cannot be justified. This approach is not constrained to a specific protocol modeling language, but can be used as a collaborative tool for other languages. When implemented, our DSS is expected to reduce the cost of care while improving the adherence to the guideline and the quality of the documentation. 0 0
Question classification - A semantic approach using wordnet and wikipedia Ray S.K.
Sandesh Singh
Joshi B.P.
Proceedings of the 4th Indian International Conference on Artificial Intelligence, IICAI 2009 English 2009 Question Answering Systems provide answers to users' questions in succinct form; the question classification module of a Question Answering System plays a very important role in pinpointing the exact answer to the question. In the literature, incorrect question classification has been cited as one of the major causes of poor performance of Question Answering Systems, which emphasizes the importance of designing the question classification module well. In this paper, we propose a question classification method that combines the powerful semantic features of WordNet and the vast knowledge repository of Wikipedia to describe informative terms explicitly. We trained our method on a standard set of 5500 questions (by UIUC) and then tested it on 5 TREC question collections and compared our results. Our system's average question classification accuracy is 89.55%, in comparison with 80.2% by Zhang and Lee [17], 84.2% by Li and Roth [7], and 89.2% by Huang [6]. The question classification accuracy suggests the effectiveness of the method, which is promising in the field of open-domain question classification. Copyright 0 0
Reference resolution challenges for intelligent agents: The need for knowledge McShane M. IEEE Intelligent Systems English 2009 The difficult cases of reference in natural language processing require intelligent agents that can reason about language and machine-tractable knowledge. The knowledge-lean model relies on various statistical techniques that are trained over a manually defined collection, typically using a small number of features such as morphological agreement, the text distance between the entity and the potential coreferent, and various other features that do not require text understanding. The incorporation of some semantic features drawn from Wikipedia and WordNet improves reference resolution for some referring expressions. One promoter of knowledge-lean corpus-based methods was the Message Understanding Conference (MUC) reference resolution task, for which sponsors provided annotated corpora for the training and evaluation of the competing systems. The two requirements for the reference annotation strategy were the need for greater than 95 percent interannotator agreement and the ability to annotate quickly and cheaply. 0 0
Weblogs as a source for extracting general world knowledge Gordon J.
Van Durme B.
Schubert L.
K-CAP'09 - Proceedings of the 5th International Conference on Knowledge Capture English 2009 Knowledge extraction (KE) efforts have often used corpora of heavily edited writing and sources written to provide the desired knowledge (e.g., newspapers or textbooks). However, the proliferation of diverse, up-to-date, unedited writing on the Web, especially in weblogs, offers new challenges for KE tools. We describe our efforts to extract general knowledge implicit in this noisy data and examine whether such sources can be an adequate substitute for resources like Wikipedia. 0 0
Wikispeedia: An online game for inferring semantic distances between concepts Robert West
Joelle Pineau
Doina Precup
IJCAI International Joint Conference on Artificial Intelligence English 2009 Computing the semantic distance between real-world concepts is crucial for many intelligent applications. We present a novel method that leverages data from 'Wikispeedia', an online game played on Wikipedia; players have to reach an article from another, unrelated article, only by clicking links in the articles encountered. In order to automatically infer semantic distances between everyday concepts, our method effectively extracts the common sense displayed by humans during play, and is thus more desirable, from a cognitive point of view, than purely corpus-based methods. We show that our method significantly outperforms Latent Semantic Analysis in a psychometric evaluation of the quality of learned semantic distances. 0 0
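One crude way to turn finished game paths into distance estimates, shown here purely for illustration and not necessarily the measure derived in the paper, is to average the number of clicks separating two articles over all paths in which both appear.
<syntaxhighlight lang="python">
# Toy sketch with invented Wikispeedia-style paths.
from collections import defaultdict
from itertools import combinations

paths = [
    ["Lion", "Africa", "Nile", "Egypt", "Pyramid"],
    ["Lion", "Cat", "Pet", "Dog"],
    ["Egypt", "Nile", "Africa", "Lion"],
]

totals = defaultdict(lambda: [0, 0])      # (a, b) -> [sum of click gaps, count]
for path in paths:
    for (i, a), (j, b) in combinations(enumerate(path), 2):
        key = tuple(sorted((a, b)))
        totals[key][0] += j - i
        totals[key][1] += 1

def estimated_distance(a, b):
    s, n = totals[tuple(sorted((a, b)))]
    return s / n if n else float("inf")

print(estimated_distance("Lion", "Africa"))   # few clicks apart on average
print(estimated_distance("Lion", "Pyramid"))  # farther apart in the paths
</syntaxhighlight>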
A bush encroachment decision support system's metamorphosis Winschiers-Theophilus H.
Fendler J.
Stanley C.
Joubert D.
Zimmermann I.
Mukumbira S.
Proceedings of the 20th Australasian Conference on Computer-Human Interaction: Designing for Habitus and Habitat, OZCHI'08 English 2008 Since the inception of our bush-encroachment decision support system, we have gone through many cycles of adaptations while striving towards what we believed to be a usable system. A fundamental difference between community-based users and individualistic users necessitates a change in the design and evaluation methods, as well as a community agreement on the concepts and values guiding the design. In this paper we share the lessons learned along the story depicting the metamorphosis of a bush encroachment decision support system in Southern African rangelands. Beyond participating in the design and evaluation of the system, community members establish the community-grounded values that determine the system's quality concepts, such as usability. 0 0
An effective, low-cost measure of semantic relatedness obtained from wikipedia links Milne D.
Witten I.H.
AAAI Workshop - Technical Report English 2008 This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter. Copyright 0 1
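The measure compares the sets of articles that link to the two concepts of interest. A commonly cited normalized-distance formulation in this spirit is sketched below with invented link sets and an assumed Wikipedia size; consult the paper for the exact definition.
<syntaxhighlight lang="python">
# Sketch of a link-overlap relatedness measure (toy in-link data).
import math

W = 1_000_000                       # assumed total number of Wikipedia articles

# Toy sets of articles that link *to* each concept.
inlinks = {
    "Cat":   {"Pet", "Mammal", "Felidae", "Animal"},
    "Dog":   {"Pet", "Mammal", "Canidae", "Animal"},
    "Opera": {"Music", "Italy", "Singing"},
}

def relatedness(a, b):
    A, B = inlinks[a], inlinks[b]
    common = A & B
    if not common:
        return 0.0
    # Normalized distance over in-link set sizes, turned into a similarity.
    distance = (math.log(max(len(A), len(B))) - math.log(len(common))) / \
               (math.log(W) - math.log(min(len(A), len(B))))
    return max(0.0, 1.0 - distance)

print(relatedness("Cat", "Dog"))    # many shared in-links -> highly related
print(relatedness("Cat", "Opera"))  # no shared in-links -> 0
</syntaxhighlight>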
Applying the logic of multiple-valued argumentation to social web: SNS and wikipedia Shusuke Kuribara
Safia Abbas
Hajime Sawamura
Lecture Notes in Computer Science English 2008 The Logic of Multiple-Valued Argumentation (LMA) is an argumentation framework that allows for argument-based reasoning about uncertain issues under uncertain knowledge. In this paper, we describe its applications to Social Web: SNS and Wikipedia. They are said to be the most influential social Web applications to the present and future information society. For SNS, we present an agent that judges the registration approval for Mymixi in mixi in terms of LMA. For Wikipedia, we focus on the deletion problem of Wikipedia and present agents that argue about the issue on whether contributed articles should be deleted or not, analyzing arguments proposed for deletion in terms of LMA. These attempts reveal that LMA can deal with not only potential applications but also practical ones such as extensive and contemporary applications. 0 0
Augmenting wikipedia-extraction with results from the web Fei Wu
Raphael Hoffmann
Weld D.S.
AAAI Workshop - Technical Report English 2008 Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper explains and evaluates a method for improving recall by extracting from the broader Web. There are two key advances necessary to make Web supplementation effective: 1) a method to filter promising sentences from Web pages, and 2) a novel retraining technique to broaden extractor recall. Experiments show that, used in concert with shrinkage, our techniques increase recall by a factor of up to 8 while maintaining or increasing precision. Copyright 0 0
Automatic thesaurus generation using co-occurrence Brussee R.
Wartena C.
Belgian/Netherlands Artificial Intelligence Conference English 2008 This paper proposes a characterization of useful thesaurus terms by the informativity of co-occurrence with that term. Given a corpus of documents, informativity is formalized as the information gain of the weighted average term distribution of all documents containing that term. While the resulting algorithm for thesaurus generation is unsupervised, we find that high-informativity terms correspond to large and coherent subsets of documents. We evaluate our method on a set of Dutch Wikipedia articles by comparing high-informativity terms with keywords for the Wikipedia category of the articles. 0 0
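The informativity score described in this abstract can be prototyped as a KL divergence between the average word distribution of the documents containing a candidate term and the background distribution of the whole corpus. The sketch below uses a four-sentence toy corpus and ad hoc smoothing, both of which are my assumptions.
<syntaxhighlight lang="python">
# Toy sketch: score candidate thesaurus terms by "informativity".
import math
from collections import Counter

docs = ["the solar panels convert sunlight into electricity",
        "the wind turbines generate electricity from wind",
        "the recipe needs flour sugar and butter",
        "bake the cake with flour and sugar"]

def distribution(texts):
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

background = distribution(docs)

def informativity(term, smoothing=1e-9):
    containing = [d for d in docs if term in d.split()]
    if not containing:
        return 0.0
    p = distribution(containing)
    # KL(p || background): how far the term's documents deviate from average.
    return max(0.0, sum(pw * math.log(pw / (background.get(w, 0) + smoothing))
                        for w, pw in p.items()))

for term in ["electricity", "flour", "the"]:
    print(term, round(informativity(term), 3))  # content terms score highest
</syntaxhighlight>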
Automatic vandalism detection in wikipedia: Towards a machine learning approach Smets K.
Goethals B.
Verdonk B.
AAAI Workshop - Technical Report English 2008 Since the end of 2006, several autonomous bots are, or have been, running on Wikipedia to keep the encyclopedia free from vandalism and other damaging edits. These expert systems, however, are far from optimal and should be improved to relieve the human editors from the burden of manually reverting such edits. We investigate the possibility of using machine learning techniques to build an autonomous system capable of distinguishing vandalism from legitimate edits. We highlight the results of a small but important step in this direction by applying commonly known machine learning algorithms using a straightforward feature representation. Despite the promising results, this study reveals that elementary features, which are also used by the current approaches to fight vandalism, are not sufficient to build such a system. They will need to be accompanied by additional information which, among other things, incorporates the semantics of a revision. Copyright 0 3
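In the spirit of the straightforward feature representation mentioned above, a toy end-to-end sketch might extract a few shallow edit features and feed them to an off-the-shelf classifier. The edits, features, and model choice below are illustrative assumptions, not the study's setup.
<syntaxhighlight lang="python">
# Minimal sketch: shallow edit features plus an off-the-shelf classifier.
from sklearn.linear_model import LogisticRegression

def edit_features(old_text, new_text, anonymous):
    added = max(0, len(new_text) - len(old_text))                # chars added
    caps = sum(c.isupper() for c in new_text) / max(1, len(new_text))
    return [added, caps, int(anonymous)]

edits = [  # (old revision, new revision, anonymous editor?, vandalism label)
    ("The cat is a small mammal.", "The cat is a small domestic mammal.", False, 0),
    ("The cat is a small mammal.", "THE CAT SUCKS!!!", True, 1),
    ("Paris is the capital of France.", "Paris is the capital city of France.", False, 0),
    ("Paris is the capital of France.", "PARIS LOL LOL LOL", True, 1),
]
X = [edit_features(o, n, a) for o, n, a, _ in edits]
y = [label for *_, label in edits]

model = LogisticRegression().fit(X, y)
print(model.predict([edit_features("Water boils at 100 C.",
                                   "WATER IS FAKE!!!", True)]))
</syntaxhighlight>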
Concept-based feature generation and selection for information retrieval Egozi O.
Evgeniy Gabrilovich
Shaul Markovitch
Proceedings of the National Conference on Artificial Intelligence English 2008 Traditional information retrieval systems use query words to identify relevant documents. In difficult retrieval tasks, however, one needs access to a wealth of background knowledge. We present a method that uses Wikipedia-based feature generation to improve retrieval performance. Intuitively, we expect that using extensive world knowledge is likely to improve recall but may adversely affect precision. High quality feature selection is necessary to maintain high precision, but here we do not have the labeled training data for evaluating features, that we have in supervised learning. We present a new feature selection method that is inspired by pseudo-relevance feedback. We use the top-ranked and bottom-ranked documents retrieved by the bag-of-words method as representative sets of relevant and non-relevant documents. The generated features are then evaluated and filtered on the basis of these sets. Experiments on TREC data confirm the superior performance of our method compared to the previous state of the art. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
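The feature-selection idea can be sketched compactly: score each generated concept feature by how much more strongly it appears in the top-ranked documents of an initial keyword retrieval than in the bottom-ranked ones, and keep the highest-scoring concepts. The data and scoring function below are illustrative, not the authors' implementation.
<syntaxhighlight lang="python">
# Toy sketch of pseudo-relevance-feedback style concept feature selection.
def select_features(concept_vectors, top_ids, bottom_ids, keep=2):
    """concept_vectors: doc id -> {concept: weight}; return the best concepts."""
    concepts = {c for v in concept_vectors.values() for c in v}
    scores = {}
    for c in concepts:
        top = sum(concept_vectors[d].get(c, 0.0) for d in top_ids) / len(top_ids)
        bottom = sum(concept_vectors[d].get(c, 0.0) for d in bottom_ids) / len(bottom_ids)
        scores[c] = top - bottom      # reward concepts typical of the top set
    return sorted(scores, key=scores.get, reverse=True)[:keep]

concept_vectors = {
    1: {"Machine learning": 0.9, "Sport": 0.1},
    2: {"Machine learning": 0.7, "Statistics": 0.4},
    3: {"Sport": 0.8, "Football": 0.6},
    4: {"Sport": 0.7, "Olympics": 0.5},
}
print(select_features(concept_vectors, top_ids=[1, 2], bottom_ids=[3, 4]))
</syntaxhighlight>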
Decoding Wikipedia categories for knowledge acquisition Vivi Nastase
Michael Strube
Proceedings of the National Conference on Artificial Intelligence English 2008 This paper presents an approach to acquire knowledge from Wikipedia categories and the category network. Many Wikipedia categories have complex names which reflect human classification and organizing instances, and thus encode knowledge about class attributes, taxonomic and other semantic relations. We decode the names and refer back to the network to induce relations between concepts in Wikipedia represented through pages or categories. The category structure allows us to propagate a relation detected between constituents of a category name to numerous concept links. The results of the process are evaluated against ResearchCyc and a subset also by human judges. The results support the idea that Wikipedia category names are a rich source of useful and accurate knowledge. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
EFS: Expert finding system based on wikipedia link pattern analysis Yang K.-H.
Chen C.-Y.
Lee H.-M.
Ho J.-M.
Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics English 2008 Building an expert finding system is very important for many applications, especially in the academic environment. Previous work uses e-mails or web pages as a corpus to analyze the expertise of each expert. In this paper, we present an Expert Finding System, abbreviated as EFS, that builds experts' profiles from their journal publications. For a given proposal, the EFS first looks up the Wikipedia web site to get related link information, and then lists and ranks all associated experts using that information. In our experiments, we use a real-world dataset which comprises 882 people and 13,654 papers categorized into 9 expertise domains. Our experimental results show that the EFS works well on several expertise domains such as "Artificial Intelligence" and "Image & Pattern Recognition". 0 0
Employing a domain specific ontology to perform semantic search Morneau M.
Mineau G.W.
Lecture Notes in Computer Science English 2008 Increasing the relevancy of Web search results has been a major concern in research over the last years. Boolean search, metadata, natural language based processing and various other techniques have been applied to improve the quality of search results sent to a user. Ontology-based methods were proposed to refine the information extraction process but they have not yet achieved wide adoption by search engines. This is mainly due to the fact that the ontology building process is time consuming. An all inclusive ontology for the entire World Wide Web might be difficult if not impossible to construct, but a specific domain ontology can be automatically built using statistical and machine learning techniques, as done with our tool: SeseiOnto. In this paper, we describe how we adapted the SeseiOnto software to perform Web search on the Wikipedia page on climate change. SeseiOnto, by using conceptual graphs to represent natural language and an ontology to extract links between concepts, manages to properly answer natural language queries about climate change. Our tests show that SeseiOnto has the potential to be used in domain specific Web search as well as in corporate intranets. 0 0
Enriching the crosslingual link structure of wikipedia - A classification-based approach Sorg P.
Philipp Cimiano
AAAI Workshop - Technical Report English 2008 The crosslingual link structure of Wikipedia represents a valuable resource which can be exploited for crosslingual natural language processing applications. However, this requires that it has a reasonable coverage and is furthermore accurate. For the specific language pair German/English that we consider in our experiments, we show that roughly 50% of the articles are linked from German to English and only 14% from English to German. These figures clearly corroborate the need for an approach to automatically induce new cross-language links, especially in the light of such a dynamically growing resource such as Wikipedia. In this paper we present a classification-based approach with the goal of inferring new cross-language links. Our experiments show that this approach has a recall of 70% with a precision of 94% for the task of learning cross-language links on a test dataset. 0 0
Importance of semantic representation: Dataless classification Chang M.-W.
Lev Ratinov
Dan Roth
Srikumar V.
Proceedings of the National Conference on Artificial Intelligence English 2008 Traditionally, text categorization has been studied as the problem of training of a classifier using labeled data. However, people can categorize documents into named categories without any explicit training because we know the meaning of category names. In this paper, we introduce Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data. Like humans, a dataless classifier interprets a string of words as a set of semantic concepts. We propose a model for dataless classification and show that the label name alone is often sufficient to induce classifiers. Using Wikipedia as our source of world knowledge, we get 85.29% accuracy on tasks from the 20 Newsgroup dataset and 88.62% accuracy on tasks from a Yahoo! Answers dataset without any labeled or unlabeled data from the datasets. With unlabeled data, we can further improve the results and show quite competitive performance to a supervised learning algorithm that uses 100 labeled examples. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
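A minimal sketch of the dataless idea follows, with a toy concept space standing in for Wikipedia (it is not the authors' ESA setup): both the label name and the document are mapped into the same concept space and matched by cosine similarity, with no labeled examples involved.
<syntaxhighlight lang="python">
# Sketch of dataless classification over a toy concept space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

concepts = ["baseball pitcher bat inning home run",        # stand-in articles
            "computer graphics rendering pixels gpu",
            "senate election vote policy government"]
vec = TfidfVectorizer().fit(concepts)
C = vec.transform(concepts).toarray()

def concept_vector(text):
    return C @ vec.transform([text]).toarray()[0]

def classify(document, label_names):
    d = concept_vector(document)
    scores = {}
    for label in label_names:
        l = concept_vector(label)
        denom = np.linalg.norm(d) * np.linalg.norm(l) + 1e-12
        scores[label] = float(d @ l) / denom   # cosine in the concept space
    return max(scores, key=scores.get)

print(classify("the senators debated the new voting policy",
               ["sports", "computer graphics", "politics government"]))
</syntaxhighlight>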
Improving interaction with virtual globes through spatial thinking: Helping users ask "Why?" Schoming J.
Raubal M.
Marsh M.
Brent Hecht
Antonio Kruger
Michael Rohs
International Conference on Intelligent User Interfaces, Proceedings IUI English 2008 Virtual globes have progressed from little-known technology to broadly popular software in a mere few years. We investigated this phenomenon through a survey and discovered that, while virtual globes are en vogue, their use is restricted to a small set of tasks so simple that they do not involve any spatial thinking. Spatial thinking requires that users ask "what is where" and "why"; the most common virtual globe tasks only include the "what". Based on the results of this survey, we have developed a multi-touch virtual globe derived from an adapted virtual globe paradigm designed to widen the potential uses of the technology by helping its users to inquire about both the "what is where" and "why" of spatial distribution. We do not seek to provide users with full GIS (geographic information system) functionality, but rather we aim to facilitate the asking and answering of simple "why" questions about general topics that appeal to a wide virtual globe user base. Copyright 2008 ACM. 0 0
Integrating cyc and wikipedia: Folksonomy meets rigorously defined common-sense Olena Medelyan
Cathy Legg
AAAI Workshop - Technical Report English 2008 Integration of ontologies begins with establishing mappings between their concept entries. We map categories from the largest manually-built ontology, Cyc, onto Wikipedia articles describing corresponding concepts. Our method draws both on Wikipedia's rich but chaotic hyperlink structure and Cyc's carefully defined taxonomic and common-sense knowledge. On 9,333 manual alignments by one person, we achieve an F-measure of 90%; on 100 alignments by six human subjects the average agreement of the method with the subject is close to their agreement with each other. We cover 62.8% of Cyc categories relating to common-sense knowledge and discuss what further information might be added to Cyc given this substantial new alignment. Copyright 0 0
Knowledge supervised text classification with no labeled documents Zhang C.
Xue G.-R.
Yiqin Yu
Lecture Notes in Computer Science English 2008 In traditional text classification approaches, the semantic meanings of the classes are described by the labeled documents. Since labeling documents is often time consuming and expensive, it is a promising idea to ask users to provide some keywords to depict the classes instead of labeling any documents. However, short lists of keywords may not contain enough information and may therefore lead to an unreliable classifier. Fortunately, there is a large amount of public data easily available in web directories, such as ODP, Wikipedia, etc. We are interested in exploring the enormous crowd intelligence contained in such public data to enhance text classification. In this paper, we propose a novel text classification framework called "Knowledge Supervised Learning" (KSL), which utilizes the knowledge in keywords and the crowd intelligence to learn the classifier without any labeled documents. We design a two-stage risk minimization (TSRM) approach for the KSL problem. It can optimize the expected prediction risk and build a high-quality classifier. Empirical results verify our claim: our algorithm can achieve above 0.9 Micro-F1 on average, which is much better than the baselines and even comparable to an SVM classifier supervised by labeled documents. 0 0
Learning to predict the quality of contributions to wikipedia Druck G.
Miklau G.
McCallum A.
AAAI Workshop - Technical Report English 2008 Although some have argued that Wikipedia's open edit policy is one of the primary reasons for its success, it also raises concerns about quality - vandalism, bias, and errors can be problems. Despite these challenges, Wikipedia articles are often (perhaps surprisingly) of high quality, which many attribute to both the dedicated Wikipedia community and "good Samaritan" users. As Wikipedia continues to grow, however, it becomes more difficult for these users to keep up with the increasing number of articles and edits. This motivates the development of tools to assist users in creating and maintaining quality. In this paper, we propose metrics that quantify the quality of contributions to Wikipedia through implicit feedback from the community. We then learn discriminative probabilistic models that predict the quality of a new edit using features of the changes made, the author of the edit, and the article being edited. Through estimating parameters for these models, we also gain an understanding of factors that influence quality. We advocate using edit quality predictions and information gleaned from model analysis not to place restrictions on editing, but to instead alert users to potential quality problems, and to facilitate the development of additional incentives for contributors. We evaluate the edit quality prediction models on the Spanish Wikipedia. Experiments demonstrate that the models perform better when given access to content-based features of the edit, rather than only features of contributing user. This suggests that a user-based solution to the Wikipedia quality problem may not be sufficient. Copyright 0 2
Method for building sentence-aligned corpus from wikipedia Yasuda K.
Eiichiro Sumita
AAAI Workshop - Technical Report English 2008 We propose the framework of a Machine Translation (MT) bootstrapping method by using multilingual Wikipedia articles. This novel method can simultaneously generate a statistical machine translation (SMT) and a sentence-aligned corpus. In this study, we perform two types of experiments. The aim of the first type of experiments is to verify the sentence alignment performance by comparing the proposed method with a conventional sentence alignment approach. For the first type of experiments, we use JENAAD, which is a sentence-aligned corpus built by the conventional sentence alignment method. The second type of experiments uses actual English and Japanese Wikipedia articles for sentence alignment. The result of the first type of experiments shows that the performance of the proposed method is comparable to that of the conventional sentence alignment method. Additionally, the second type of experiments shows that we can obtain the English translation of 10% of Japanese sentences while maintaining high alignment quality (rank-A ratio of over 0.8). Copyright 0 1
Named entity disambiguation on an ontology enriched by Wikipedia Nguyen H.T.
Cao T.H.
RIVF 2008 - 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies English 2008 Currently, for named entity disambiguation, the shortage of training data is a problem. This paper presents a novel method that overcomes this problem by automatically generating an annotated corpus based on a specific ontology. Then the corpus was enriched with new and informative features extracted from Wikipedia data. Moreover, rather than pursuing rule-based methods as in literature, we employ a machine learning model to not only disambiguate but also identify named entities. In addition, our method explores in details the use of a range of features extracted from texts, a given ontology, and Wikipedia data for disambiguation. This paper also systematically analyzes impacts of the features on disambiguation accuracy by varying their combinations for representing named entities. Empirical evaluation shows that, while the ontology provides basic features of named entities, Wikipedia is a fertile source for additional features to construct accurate and robust named entity disambiguation systems. 0 0
Object image retrieval by exploiting online knowledge resources Gang Wang
Forsyth D.
26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR English 2008 We describe a method to retrieve images found on web pages with specified object class labels, using an analysis of text around the image and of image appearance. Our method determines whether an object is both described in text and appears in a image using a discriminative image model and a generative text model. Our models are learnt by exploiting established online knowledge resources (Wikipedia pages for text; Flickr and Caltech data sets for image). These resources provide rich text and object appearance information. We describe results on two data sets. The first is Berg's collection of ten animal categories; on this data set, we outperform previous approaches [7, 33]. We have also collected five more categories. Experimental results show the effectiveness of our approach on this new data set. 0 0
Okinet: Automatic extraction of a medical ontology from wikipedia Pedro V.C.
Niculescu R.S.
Lita L.V.
AAAI Workshop - Technical Report English 2008 The medical domain provides a fertile ground for the application of ontological knowledge. Ontologies are an essential part of many approaches to medical text processing, understanding and reasoning. However, the creation of large, high quality medical ontologies is not trivial, requiring the analysis of domain sources, background knowledge, as well as obtaining consensus among experts. Current methods are labor intensive, prone to generate inconsistencies, and often require expert knowledge. Fortunately, semi structured information repositories, like Wikipedia, provide a valuable resource from which to mine structured information. In this paper we propose a novel framework for automatically creating medical ontologies from semi-structured data. As part of this framework, we present a Directional Feedback Edge Labeling (DFEL) algorithm. We successfully demonstrate the effectiveness of the DFEL algorithm on the task of labeling the relations of Okinet, a Wikipedia based medical ontology. Current results demonstrate the high performance, utility, and flexibility of our approach. We conclude by describing ROSE, an application that combines Okinet with other medical ontologies. 0 0
On visualizing heterogeneous semantic networks from multiple data sources Maureen
Aixin Sun
Lim E.-P.
Anwitaman Datta
Kuiyu Chang
Lecture Notes in Computer Science English 2008 In this paper, we focus on the visualization of heterogeneous semantic networks obtained from multiple data sources. A semantic network comprising a set of entities and relationships is often used for representing knowledge derived from textual data or database records. Although the semantic networks created for the same domain at different data sources may cover a similar set of entities, these networks could also be very different because of naming conventions, coverage, view points, and other reasons. Since digital libraries often contain data from multiple sources, we propose a visualization tool to integrate and analyze the differences among multiple social networks. Through a case study on two terrorism-related semantic networks derived from Wikipedia and Terrorism Knowledge Base (TKB) respectively, the effectiveness of our proposed visualization tool is demonstrated. 0 0
Powerset's natural language wikipedia search engine Converse T.
Kaplan R.M.
Pell B.
Prevost S.
Thione L.
Walters C.
AAAI Workshop - Technical Report English 2008 This demonstration shows the capabilities and features of Powerset's natural language search engine as applied to the English Wikipedia. Powerset has assembled scalable document retrieval technology to construct a semantic index of the World Wide Web. In order to develop and test our technology, we have released a search product (at http://www.powerset.com) that incorporates all the information from the English Wikipedia. The product also integrates community-edited content from Metaweb's Freebase database of structured information. Users may query the index using keywords, natural language questions or phrases. Retrieval latency is comparable to standard key-word based consumer search engines. Powerset semantic indexing is based on the XLE, Natural Language Processing technology licensed from the Palo Alto Research Center (PARC). During both indexing and querying, we apply our deep natural language analysis methods to extract semantic "facts" - relations and semantic connections between words and concepts - from all the sentences in Wikipedia. At query time, advanced search-engineering technology makes these facts available for retrieval by matching them against facts or partial facts extracted from the query. In this demonstration, we show how retrieved information is presented as conventional search results with links to relevant Wikipedia pages. We also demonstrate how the distilled semantic relations are organized in a browsing format that shows relevant subject/relation/object triples related to the user's query. This makes it easy both to find other relevant pages and to use our Search-Within-The-Page feature to localize additional semantic searches to the text of the selected target page. Together these features summarize the facts on a page and allow navigation directly to information of interest to individual users. Looking ahead beyond continuous improvements to core search and scaling to much larger collections of content, Powerset's automatic extraction of semantic facts can be used to create and extend knowledge resources including lexicons, ontologies, and entity profiles. Our system is already deployed as a consumer-search web service, but we also plan to develop an API that will enable programmatic access to our structured representation of text. 0 0
Text categorization with knowledge transfer from heterogeneous data sources Gupta R.
Lev Ratinov
Proceedings of the National Conference on Artificial Intelligence English 2008 Multi-category classification of short dialogues is a common task performed by humans. When assigning a question to an expert, a customer service operator tries to classify the customer query into one of N different classes for which experts are available. Similarly, questions on the web (for example questions at Yahoo Answers) can be automatically forwarded to a restricted group of people with a specific expertise. Typical questions are short and assume background world knowledge for correct classification. With an exponentially increasing amount of knowledge available, with distinct properties (labeled vs unlabeled, structured vs unstructured), no single knowledge-transfer algorithm such as transfer learning, multi-task learning or self-taught learning can be applied universally. In this work we show that bag-of-words classifiers perform poorly on noisy short conversational text snippets. We present an algorithm for leveraging heterogeneous data sources and algorithms, with significant improvements over any single algorithm, rivaling human performance. Using different algorithms for each knowledge source, we use mutual information to aggressively prune features. With heterogeneous data sources including Wikipedia, the Open Directory Project (ODP), and Yahoo Answers, we show 89.4% and 96.8% correct classification on the Google Answers corpus and the Switchboard corpus using only 200 features per class. This reflects a huge improvement over bag-of-words approaches and a 48-65% error reduction over the previously published state of the art (Gabrilovich et al. 2006). Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
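The aggressive mutual-information pruning mentioned above can be illustrated with a few toy snippets: compute the mutual information between each word's presence and the class label, then keep only the top-scoring words. The corpus and the number of kept features below are assumptions for illustration.
<syntaxhighlight lang="python">
# Toy sketch of mutual-information feature pruning for short text snippets.
import math
from collections import Counter

snippets = [("my computer won't boot after the update", "tech"),
            ("how do I reinstall the operating system", "tech"),
            ("what is a good recipe for pancakes", "cooking"),
            ("how long should I bake the pancakes", "cooking")]

def mutual_information(word, data):
    n = len(data)
    joint = Counter()
    for text, label in data:
        joint[(word in text.split(), label)] += 1
    mi = 0.0
    for (has_w, label), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (hw, _), v in joint.items() if hw == has_w) / n
        p_y = sum(v for (_, lb), v in joint.items() if lb == label) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

vocab = {w for text, _ in snippets for w in text.split()}
top = sorted(vocab, key=lambda w: mutual_information(w, snippets), reverse=True)
print(top[:5])   # class-indicative words such as "pancakes" score highest
</syntaxhighlight>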
The fast and the numerous - Combining machine and community intelligence for semantic annotation Sebastian Blohm
Krotzsch M.
Philipp Cimiano
AAAI Workshop - Technical Report English 2008 Starting from the observation that certain communities have incentive mechanisms in place to create large amounts of unstructured content, we propose in this paper an original model which we expect to lead to the large number of annotations required to semantically enrich Web content at a large scale. The novelty of our model lies in the combination of two key ingredients: the effort that online communities are making to create content and the capability of machines to detect regular patterns in user annotation to suggest new annotations. Provided that the creation of semantic content is made easy enough and incentives are in place, we can assume that these communities will be willing to provide annotations. However, as human resources are clearly limited, we aim at integrating algorithmic support into our model to bootstrap on existing annotations and learn patterns to be used for suggesting new annotations. As the automatically extracted information needs to be validated, our model presents the extracted knowledge to the user in the form of questions, thus allowing for the validation of the information. In this paper, we describe the requirements on our model, its concrete implementation based on Semantic MediaWiki and an information extraction system and discuss lessons learned from practical experience with real users. These experiences allow us to conclude that our model is a promising approach towards leveraging semantic annotation. Copyright 0 0
Using wikipedia links to construct word segmentation corpora Gabay D.
Ziv B.E.
Elhadad M.
AAAI Workshop - Technical Report English 2008 Tagged corpora are essential for evaluating and training natural language processing tools. The cost of constructing large enough manually tagged corpora is high, even when the annotation level is shallow. This article describes a simple method to automatically create a partially tagged corpus, using Wikipedia hyperlinks. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. We used our method to construct a corpus of Modern Hebrew (which we have made available at http://www.cs.bgu.ac.il/-nlpproj). The method can also be applied to other languages where word segmentation is difficult to determine, such as East and South-East Asian languages. Copyright 0 0
Using wiktionary for computing semantic relatedness Torsten Zesch
Muller C.
Iryna Gurevych
Proceedings of the National Conference on Artificial Intelligence English 2008 We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of different concept representations like Wiktionary pseudo glosses, the first paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo glosses. We show that: (i) Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and (ii) the concept vector based approach yields the best results on all datasets in both evaluations. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 1
Wikis: 'From each according to his knowledge' O'Leary D.E. Computer English 2008 Wikis offer tremendous potential to capture knowledge from large groups of people, making tacit, hidden content explicit and widely available. They also efficiently connect those with information to those seeking it. 0 0
A dynamic voting wiki model Carolynne White
Linda Plotnick
Murray Turoff
Hiltz S.R.
Association for Information Systems - 13th Americas Conference on Information Systems, AMCIS 2007: Reaching New Heights English 2007 Defining a problem and understanding it syntactically as well as semantically enhances the decision process because the written agenda and solutions are understood on a token level. Consensus in groups can be challenging in present web-based environments given the dynamics of types of interactions and needs. Larger virtual communities are beginning to use wiki-based decision support systems for time-critical interactions where the quality of the information is high and a near real-time feedback system is necessary. Understanding the meaning of the problem and group consensus can be improved by exploiting a voting-enhanced wiki structure implemented in select parts of the decision-making process. A decision support model integrating a wiki structure and a social decision support system (voting) is presented. Findings from a pilot study describe differences in idea generation between groups. Other issues requiring further research are identified. 0 0
Boosting inductive transfer for text classification using Wikipedia Somnath Banerjee Proceedings - 6th International Conference on Machine Learning and Applications, ICMLA 2007 English 2007 Inductive transfer is applying knowledge learned on one set of tasks to improve the performance of learning a new task. Inductive transfer is being applied in improving the generalization performance on a classification task using the models learned on some related tasks. In this paper, we show a method of making inductive transfer for text classification more effective using Wikipedia. We map the text documents of the different tasks to a feature space created using Wikipedia, thereby providing some background knowledge of the contents of the documents. It has been observed here that when the classifiers are built using the features generated from Wikipedia they become more effective in transferring knowledge. An evaluation on the daily classification task on the Reuters RCV1 corpus shows that our method can significantly improve the performance of inductive transfer. Our method was also able to successfully overcome a major obstacle observed in a recent work on a similar setting. 0 0
Supporting navigation in large document corpora of online communities: A case-study on the categorization of the English Wikipedia Gal V.
Tikk D.
Biro G.
8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics, CINTI 2007 English 2007 This paper describes an anaphora resolution system that identifies and resolves anaphors in Hungarian texts. The system works on syntactically parsed texts. It employs anaphora identifying techniques in conjunction with syntax and morphology based resolution techniques. The most novel aspect of the system lies in its anaphora resolution technique that combines syntactic, morphologic and discourse information. 0 0
Breaking the knowledge acquisition bottleneck through conversational knowledge management Christian Wagner Information Resources Management Journal English 2006 Much of today's organizational knowledge still exists outside of formal information repositories and often only in people's heads. While organizations are eager to capture this knowledge, existing acquisition methods are not up to the task. Neither traditional artificial intelligence-based approaches nor more recent, less-structured knowledge management techniques have overcome the knowledge acquisition challenges. This article investigates knowledge acquisition bottlenecks and proposes the use of collaborative, conversational knowledge management to remove them. The article demonstrates the opportunity for more effective knowledge acquisition through the application of the principles of Bazaar style, open-source development. The article introduces wikis as software that enables this type of knowledge acquisition. It empirically analyzes the Wikipedia to produce evidence for the feasibility and effectiveness of the proposed approach. Copyright 0 1
Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge Evgeniy Gabrilovich
Shaul Markovitch
Proceedings of the National Conference on Artificial Intelligence English 2006 When humans approach the task of text categorization, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. On the other hand, state-of-the-art information retrieval systems are quite brittle - they traditionally represent documents as bags of words, and are restricted to learning from individual word occurrences in the (necessarily limited) training set. For instance, given the sentence "Wal-Mart supply chain goes real time", how can a text categorization system know that Wal-Mart manages its stock with RFID technology? And having read that "Ciprofloxacin belongs to the quinolones group", how on earth can a machine know that the drug mentioned is an antibiotic produced by Bayer? In this paper we present algorithms that can do just that. We propose to enrich document representation through automatic use of a vast compendium of human knowledge - an encyclopedia. We apply machine learning techniques to Wikipedia, the largest encyclopedia to date, which surpasses in scope many conventional encyclopedias and provides a cornucopia of world knowledge. Each Wikipedia article represents a concept, and documents to be categorized are represented in the rich feature space of words and relevant Wikipedia concepts. Empirical results confirm that this knowledge-intensive representation brings text categorization to a qualitatively new level of performance across a diverse collection of datasets. Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved. 0 1