
Linguistics is included as keyword or extra keyword in 0 datasets, 0 tools and 113 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Arabic text categorization based on arabic wikipedia Yahya A.
Salhi A.
ACM Transactions on Asian Language Information Processing English 2014 This article describes an algorithm for categorizing Arabic text, relying on highly categorized corpus-based datasets obtained from the Arabic Wikipedia by using manual and automated processes to build and customize categories. The categorization algorithm was built by adopting a simple categorization idea then moving forward to more complex ones. We applied tests and filtration criteria to reach the best and most efficient results that our algorithm can achieve. The categorization depends on the statistical relations between the input (test) text and the reference (training) data supported by well-defined Wikipedia-based categories. Our algorithm supports two levels for categorizing Arabic text; categories are grouped into a hierarchy of main categories and subcategories. This introduces a challenge due to the correlation between certain subcategories and overlap between main categories. We argue that our algorithm achieved good performance compared to other methods reported in the literature. 0 0
Extracting Ontologies from Arabic Wikipedia: A Linguistic Approach Al-Rajebah N.I.
Al-Khalifa H.S.
Arabian Journal for Science and Engineering English 2014 As one of the important aspects of semantic web, building ontological models became a driving demand for developing a variety of semantic web applications. Through the years, much research was conducted to investigate the process of generating ontologies automatically from semi-structured knowledge sources such as Wikipedia. Different ontology building techniques were investigated, e.g., NLP tools and pattern matching, infoboxes and structured knowledge sources (Cyc and WordNet). Looking at the results of previous approaches we can see that the vast majority of employed techniques did not consider the linguistic aspect of Wikipedia. In this article, we present our solution to extract ontologies from Wikipedia using a linguistic approach based on the semantic field theory introduced by Jost Trier. Linguistic ontologies are significant in many applications for both linguists and Web researchers. We applied the proposed approach on the Arabic version of Wikipedia. The semantic relations were extracted from infoboxes, hyperlinks within infoboxes and list of categories that articles belong to. Our system successfully extracted approximately (760,000) triples from the Arabic Wikipedia. We conducted three experiments to evaluate the system output, namely: Validation Test, Crowd Evaluation and Domain Experts' evaluation. The system output achieved an average precision of 65 %. 0 0
Multilinguals and wikipedia editing Hale S.A. WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 This article analyzes one month of edits to Wikipedia in order to examine the role of users editing multiple language editions (referred to as multilingual users). Such multilingual users may serve an important function in diffusing information across different language editions of the encyclopedia, and prior work has suggested this could reduce the level of self-focus bias in each edition. This study finds multilingual users are much more active than their single-edition (monolingual) counterparts. They are found in all language editions, but smaller-sized editions with fewer users have a higher percentage of multilingual users than larger-sized editions. About a quarter of multilingual users always edit the same articles in multiple languages, while just over 40% of multilingual users edit different articles in different languages. When non-English users do edit a second language edition, that edition is most frequently English. Nonetheless, several regional and linguistic cross-editing patterns are also present. 0 0
Okinawa in Japanese and English Wikipedia Hale S.A. Conference on Human Factors in Computing Systems - Proceedings English 2014 This research analyzes edits by foreign-language users in Wikipedia articles about Okinawa, Japan, in the Japanese and English editions of the encyclopedia. Okinawa, home to both English and Japanese speaking users, provides a good case to look at content differences and cross-language editing in a small geographic area on Wikipedia. Consistent with prior work, this research finds large differences in the representations of Okinawa in the content of the two editions. The number of users crossing the language boundary to edit both editions is also extremely small. When users do edit in a non-primary language, they most frequently edit articles that have cross-language (interwiki) links, articles that are edited more by other users, and articles that have more images. Finally, the possible value of edits from foreign-language users and design possibilities to motivate wider contributions from foreign-language users are discussed. 0 0
A comparison of named entity recognition tools applied to biographical texts Atdag S.
Labatut V.
2013 2nd International Conference on Systems and Computer Science, ICSCS 2013 English 2013 Named entity recognition (NER) is a popular domain of natural language processing. For this reason, many tools exist to perform this task. Amongst other points, they differ in the processing method they rely upon, the entity types they can detect, the nature of the text they can handle, and their input/output formats. This makes it difficult for a user to select an appropriate NER tool for a specific situation. In this article, we try to answer this question in the context of biographic texts. For this matter, we first constitute a new corpus by annotating 247 Wikipedia articles. We then select 4 publicly available, well known and free for research NER tools for comparison: Stanford NER, Illinois NET, OpenCalais NER WS and Alias-i LingPipe. We apply them to our corpus, assess their performances and compare them. When considering overall performances, a clear hierarchy emerges: Stanford has the best results, followed by LingPipe, Illinois and OpenCalais. However, a more detailed evaluation performed relative to entity types and article categories highlights the fact that their performances are diversely influenced by those factors. This complementarity opens an interesting perspective regarding the combination of these individual tools in order to improve performance. 0 0
A linguistic consensus model for Web 2.0 communities Alonso S.
Perez I.J.
Cabrerizo F.J.
Herrera-Viedma E.
Applied Soft Computing Journal English 2013 Web 2.0 communities are a quite recent phenomenon which involves large numbers of users and where communication between members is carried out in real time. Despite those good characteristics, there is still a need to develop tools to help users reach decisions with a high level of consensus in those new virtual environments. In this contribution a new consensus reaching model is presented which uses linguistic preferences and is designed to minimize the main problems that this kind of organization presents (low and intermittent participation rates, difficulty of establishing trust relations and so on) while incorporating the benefits that a Web 2.0 community offers (rich and diverse knowledge due to a large number of users, real-time communication, etc.). The model includes some delegation and feedback mechanisms to improve the speed of the process and its convergence towards a consensus solution. Its possible application to some of the decision making processes that are carried out in Wikipedia is also shown. © 2012 Elsevier B.V. All rights reserved. 0 0
Automated non-content word list generation using hLDA Krug W.
Tomlinson M.T.
FLAIRS 2013 - Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference English 2013 In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative. Copyright © 2013, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Automatic extraction of Polish language errors from text edition history Grundkiewicz R. Lecture Notes in Computer Science English 2013 There are no large error corpora for a number of languages, despite the fact that they have multiple applications in natural language processing. The main reason underlying this situation is the high cost of manual corpus creation. In this paper we present methods for the automatic extraction of various kinds of errors, such as spelling, typographical, grammatical, syntactic, semantic, and stylistic ones, from text edition histories. By applying these methods to Wikipedia's article revision history, we created a large and publicly available corpus of naturally-occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the detected error categories in our corpus. 0 0
Automatically building templates for entity summary construction Li P.
Yafang Wang
Jian Jiang
Information Processing and Management English 2013 In this paper, we propose a novel approach to automatic generation of summary templates from given collections of summary articles. We first develop an entity-aspect LDA model to simultaneously cluster both sentences and words into aspects. We then apply frequent subtree pattern mining on the dependency parse trees of the clustered and labeled sentences to discover sentence patterns that well represent the aspects. Finally, we use the generated templates to construct summaries for new entities. Key features of our method include automatic grouping of semantically related sentence patterns and automatic identification of template slots that need to be filled in. Also, we implement a new sentence compression algorithm which uses dependency trees instead of parse trees. We apply our method on five Wikipedia entity categories and compare our method with three baseline methods. Both quantitative evaluation based on human judgment and qualitative comparison demonstrate the effectiveness and advantages of our method. © 2012 Elsevier Ltd. All rights reserved. 0 0
Computing semantic relatedness using word frequency and layout information of wikipedia Chan P.
Hijikata Y.
Nishida S.
Proceedings of the ACM Symposium on Applied Computing English 2013 Computing the semantic relatedness between two words or phrases is an important problem for fields such as information retrieval and natural language processing. One state-of-the-art approach to solve the problem is Explicit Semantic Analysis (ESA). ESA uses the word frequency in Wikipedia articles to estimate the relevance, so the relevance of words with low frequency cannot always be well estimated. To improve the relevance estimate of the low frequency words, we use not only word frequency but also layout information in Wikipedia articles. Empirical evaluation shows that on the low frequency words, our method achieves a better estimate of semantic relatedness than ESA. Copyright 2013 ACM. 0 0
Semantic relatedness estimation using the layout information of wikipedia articles Chan P.
Hijikata Y.
Kuramochi T.
Nishida S.
International Journal of Cognitive Informatics and Natural Intelligence English 2013 Computing the semantic relatedness between two words or phrases is an important problem in fields such as information retrieval and natural language processing. Explicit Semantic Analysis (ESA), a state-of-the-art approach to solve the problem, uses word frequency to estimate relevance. Therefore, the relevance of words with low frequency cannot always be well estimated. To improve the relevance estimate of low-frequency words and concepts, the authors apply regression to word frequency, its location in an article, and its text style to calculate the relevance. The relevance value is subsequently used to compute semantic relatedness. Empirical evaluation shows that, for low-frequency words, the authors' method achieves a better estimate of semantic relatedness than ESA. Furthermore, when all words of the dataset are considered, the combination of the authors' proposed method and the conventional approach outperforms the conventional approach alone. 0 0
Wikipedia as an SMT training corpus Tufis D.
Ion R.
Dumitrescu S.D.
Stefanescu D.
International Conference Recent Advances in Natural Language Processing, RANLP English 2013 This article reports on mass experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on large bilingual corpora of similar sentence pairs extracted from the entire dumps of Wikipedia as of June 2012. Our experiments and comparison with similar work show that adding indiscriminately more data to a training corpus is not necessarily a good thing in SMT. 0 0
Annotation of adversarial and collegial social actions in discourse Bracewell D.B.
Tomlinson M.T.
Brunson M.
Plymale J.
Bracewell J.
Boerger D.
LAW 2012 - 6th Linguistic Annotation Workshop, In Conjunction with ACL 2012 - Proceedings English 2012 We posit that determining the social goals and intentions of dialogue participants is crucial for understanding discourse taking place on social media. In particular, we examine the social goals of being collegial and being adversarial. Through our early experimentation, we found that speech and dialogue acts are not able to capture the complexities and nuances of the social intentions of discourse participants. Therefore, we introduce a set of 9 social acts specifically designed to capture intentions related to being collegial and being adversarial. Social acts are pragmatic speech acts that signal a dialogue participant's social intentions. We annotate social acts in discourses communicated in English and Chinese taken from Wikipedia talk pages, public forums, and chat transcripts. Our results show that social acts can be reliably understood by annotators with a good level of inter-rater agreement. 0 0
Cognitive linguistics as the underlying framework for semantic annotation Pipitone A.
Pirrone R.
Proceedings - IEEE 6th International Conference on Semantic Computing, ICSC 2012 English 2012 In recent years many attempts have been made to design suitable sets of rules aimed at extracting the semantic meaning from plain text, and to achieve annotation, but very few approaches make extensive use of grammars. Current systems are mainly focused on extracting the semantic role of the entities described in the text. This approach has limitations: in such applications the semantic role is conceived merely as the meaning of the involved entities without considering their context. As an example, current semantic annotators often specify a date entity without any annotation regarding the kind of the date itself, i.e. a birth date, a book publication date, and so on. Moreover, these systems use ontologies that have been developed specifically for the system's purposes and have reduced portability. Extensive use of both linguistic resources and semantic representations of the domain is needed in this scenario: the semantic representation of the domain addresses the semantic interpretation of the context, while NLP tools can help to solve some linguistic problems related to the semantic annotation, such as synonymy, ambiguities, and co-references. A novel framework inspired by Cognitive Linguistics theories is proposed in this work that is aimed at facing the problem outlined above. In particular, our work is based on Construction Grammar (CxG). CxG defines a "construction" as a form-meaning couple. We use RDF triples in the domain ontology as the "semantic seeds" to build constructions. A suitable set of rules based on linguistic typology has been designed to infer semantics and syntax from the semantic seed, while combining them as the poles of constructions. A hierarchy of rules to infer syntactic patterns for either single words or sentences using WordNet and FrameNet has been designed to overcome the limitations when expressing the syntactic poles using solely the terms stated in the ontology. As a consequence, semantic annotation of plain text is achieved by computing all possible syntactic forms for the same meaning during the analysis of document corpora. The proposed framework has been applied to the semantic annotation of Wikipedia pages; the result is a system for automatic generation of Semantic Web wiki contents from standard Wikipedia pages, leading to a possible solution to the big challenge of making existing wiki sources semantic wikis. 0 0
Exploiting Wikipedia for cross-lingual and multilingual information retrieval Sorg P.
Philipp Cimiano
Data and Knowledge Engineering English 2012 In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA) which indexes documents with respect to explicit interlingual concepts. These concepts are considered as interlingual and universal and in our case correspond either to Wikipedia articles or categories. Each concept is associated to a text signature in each language which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept which is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-Of-Words model to a Bag-Of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR settings, CL-ESA benefits from a certain level of abstraction in the sense that using categories instead of articles as in the original ESA model delivers better results. © 2012 Elsevier B.V. All rights reserved. 0 0
Impact of platform design on cross-language information exchange Hale S. Conference on Human Factors in Computing Systems - Proceedings English 2012 This paper describes two case studies examining the impact of platform design on cross-language communications. The sharing of off-site hyperlinks between language editions of Wikipedia and between users on Twitter with different languages in their user descriptions are analyzed and compared in the context of the 2011 Tohoku earthquake and tsunami in Japan. The paper finds that a greater number of links are shared across languages on Twitter, while a higher percentage of links are shared between Wikipedia articles. The higher percentage of links being shared on Wikipedia is attributed to the persistence of links and the ability for users to link articles on the same topic together across languages. 0 0
LensingWikipedia: Parsing text for the interactive visualization of human history Vadlapudi R.
Siahbani M.
Sarkar A.
Dill J.
IEEE Conference on Visual Analytics Science and Technology 2012, VAST 2012 - Proceedings English 2012 Extracting information from text is challenging. Most current practices treat text as a bag of words or word clusters, ignoring valuable linguistic information. Leveraging this linguistic information, we propose a novel approach to visualize textual information. The novelty lies in using state-of-the-art Natural Language Processing (NLP) tools to automatically annotate text which provides a basis for new and powerful interactive visualizations. Using NLP tools, we built a web-based interactive visual browser for human history articles from Wikipedia. 0 0
Manypedia: Comparing language points of view of Wikipedia communities Paolo Massa
Federico Scrinzi
WikiSym 2012 English 2012 The 4 million articles of the English Wikipedia have been written in a collaborative fashion by more than 16 million volunteer editors. On each article, the community of editors strive to reach a neutral point of view, representing all significant views fairly, proportionately, and without biases. However, beside the English one, there are more than 280 editions of Wikipedia in different languages and their relatively isolated communities of editors are not forced by the platform to discuss and negotiate their points of view. So the empirical question is: do communities on different language Wikipedias develop their own diverse Linguistic Points of View (LPOV)? To answer this question we created and released as open source Manypedia, a web tool whose aim is to facilitate cross-cultural analysis of Wikipedia language communities by providing an easy way to compare automatically translated versions of their different representations of the same topic. 0 0
Predicate-argument structure-based textual entailment recognition system exploiting wide-coverage lexical knowledge Shibata T.
Kurohashi S.
ACM Transactions on Asian Language Information Processing English 2012 This article proposes a predicate-argument structure based Textual Entailment Recognition system exploiting wide-coverage lexical knowledge. Different from conventional machine learning approaches where several features obtained from linguistic analysis and resources are utilized, our proposed method regards a predicate-argument structure as a basic unit and performs the matching/alignment between a text and a hypothesis. In matching between predicate-arguments, wide-coverage relations between words/phrases, such as synonym and is-a, are utilized, which are automatically acquired from a dictionary, a Web corpus, and Wikipedia. © 2012 ACM. 0 0
Properties of language networks in Japanese Wikipedia Sato H.
Kubo M.
Namatame A.
6th International Conference on Soft Computing and Intelligent Systems, and 13th International Symposium on Advanced Intelligence Systems, SCIS/ISIS 2012 English 2012 Linguistic activity is a highly complicated phenomenon produced by the human brain. When the topic being written or spoken about becomes difficult, the sentences and articles produced become more complex. Traditional analysis of linguistic activity was based on the frequency of words in use. Recently, analysis based on the relations between word usages has been attracting attention. These relations can be represented by networks called 'language networks.' Many findings from research on complex networks can be applied to this area. In this study, we investigate co-occurrence networks built from Wikipedia articles. Several network indices are used to classify the co-occurrence networks. We found that co-occurrence networks built from similar categories show similarities in terms of these indices. 0 0
Agreement: How to reach it? defining language features leading to agreement in dialogue Zidrasco T.
Bobicev V.
Shiramatsu S.
Ozono T.
Shintani T.
International Conference Recent Advances in Natural Language Processing, RANLP English 2011 Consensus is the desired result in many argumentative discourses such as negotiations, public debates, and goal-oriented forums. However, due to the fact that usually people are poor arguers, a support of argumentation is necessary. Web-2 provides means for the online discussions which have their characteristic features. In our paper we study the features of discourse which lead to agreement. We use an argumentative corpus of Wikipedia discussions in order to investigate the influence of discourse structure and language on the final agreement. The corpus had been annotated with rhetorical relations and rhetorical structures leading to successful and unsuccessful discussions were analyzed. We also investigated language patterns extracted from the corpus in order to discover which ones are indicators of the following agreement. The results of our study can be used in system designing, whose purpose is to assist on-line interlocutors in consensus building. 0 0
Evaluating various linguistic features on semantic relation extraction Garcia M.
Gamallo P.
International Conference Recent Advances in Natural Language Processing, RANLP English 2011 Machine learning approaches for Information Extraction use different types of features to acquire semantically related terms from free text. These features may contain several kinds of linguistic knowledge: from orthographic or lexical to more complex features, like PoS tags or syntactic dependencies. In this paper we select four main types of linguistic features and evaluate their performance in a systematic way. Although the combination of some types of features allows us to improve the f-score of the extraction, we observed that by adjusting the positive and negative ratio of the training examples, we can build high quality classifiers with just a single type of linguistic feature, based on generic lexico-syntactic patterns. Experiments were performed on the Portuguese version of Wikipedia. 0 0
Improving query expansion for image retrieval via saliency and picturability Leong C.W.
Hassan S.
Ruiz M.E.
Rada Mihalcea
Lecture Notes in Computer Science English 2011 In this paper, we present a Wikipedia-based approach to query expansion for the task of image retrieval, by combining salient encyclopaedic concepts with the picturability of words. Our model generates the expanded query terms in a definite two-stage process instead of multiple iterative passes, requires no manual feedback, and is completely unsupervised. Preliminary results show that our proposed model is effective in a comparative study on the ImageCLEF 2010 Wikipedia dataset. 0 0
Measuring comparability of multilingual corpora extracted from wikipedia Otero P.G.
Lopez I.G.
CEUR Workshop Proceedings English 2011 Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia. 0 0
Multiword expressions and named entities in the Wiki50 corpus Veronika Vincze
Nagy T. I.
Berend G.
International Conference Recent Advances in Natural Language Processing, RANLP English 2011 Multiword expressions (MWEs) and named entities (NEs) exhibit unique and idiosyncratic features, thus, they often pose a problem to NLP systems. In order to facilitate their identification we developed the first corpus of Wikipedia articles in which several types of multiword expressions and named entities are manually annotated at the same time. The corpus can be used for training or testing MWE-detectors or NER systems, which we illustrate with experiments and it also makes it possible to investigate the co-occurrences of different types of MWEs and NEs within the same domain. 0 0
Ontology-based feature extraction Vicient C.
Sanchez D.
Moreno A.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 Knowledge-based data mining and classification algorithms require systems that are able to extract textual attributes contained in raw text documents, and map them to structured knowledge sources (e.g. ontologies) so that they can be semantically analyzed. The system presented in this paper performs this task in an automatic way, relying on a predefined ontology which states the concepts on which the posterior data analysis will be focused. As features, our system focuses on extracting relevant Named Entities from textual resources describing a particular entity. Those are evaluated by means of linguistic and Web-based co-occurrence analyses to map them to ontological concepts, thereby discovering relevant features of the object. The system has been preliminarily tested with tourist destinations and Wikipedia textual resources, showing promising results. 0 0
Query and tag translation for Chinese-Korean cross-language social media retrieval Wang Y.-C.
Chen J.-T.
Tsai R.T.-H.
Hsu W.-L.
Proceedings of the 2011 IEEE International Conference on Information Reuse and Integration, IRI 2011 English 2011 Collaborative tagging has been widely adopted by social media websites to allow users to describe content with metadata tags. Tagging can greatly improve search results. We propose a cross-language social media retrieval system (CLSMR) to help users retrieve foreign-language tagged media content. We construct a Chinese to Korean CLSMR system that translates Chinese queries into Korean, retrieves content, and then translates the Korean tags in the search results back into Chinese. Our system translates NEs using a dictionary of bilingual NE pairs from Wikipedia and a pattern-based software translator which learns regular NE patterns from the web. The top-10 precision of YouTube retrieved results for our system was 0.39875. The K-C NE tag translation accuracy for the top-10 YouTube results was 77.6%, which shows that our translation method is fairly effective for named entities. A questionnaire given to users showed that automatically translated tags were considered as informative as a human-written summary. With our proposed CLSMR system, Chinese users can retrieve online Korean media files and get a basic understanding of their content with no knowledge of the Korean language. 0 0
A cocktail approach to the VideoCLEF'09 linking task Raaijmakers S.
Versloot C.
De Wit J.
Lecture Notes in Computer Science English 2010 In this paper, we describe the TNO approach to the Finding Related Resources or linking task of VideoCLEF09. Our system consists of a weighted combination of off-the-shelf and proprietary modules, including the Wikipedia Miner toolkit of the University of Waikato. Using this cocktail of largely off-the-shelf technology allows for setting a baseline for future approaches to this task. 0 0
A framework for automatic semantic annotation of Wikipedia articles Pipitone A.
Pirrone R.
SWAP 2010 - 6th Workshop on Semantic Web Applications and Perspectives English 2010 Semantic wikis represent a novelty in the field of semantic technologies. Nowadays, there are many important "non-semantic" wiki sources, such as the Wikipedia encyclopedia. A big challenge is to make existing wiki sources semantic wikis. In this way, a new generation of applications can be designed to browse, search, and reuse wiki contents, while reducing loss of data. The core of this problem is the extraction of semantic sense and the annotation from text. In this paper a hierarchical framework for automatic semantic annotation of plain text is presented that has been applied to the use of Wikipedia pages as information source. The strategy is based on disambiguation of plain text using both domain ontology and linguistic pattern matching methods. The main steps are: TOC extraction from the original page, content annotation for each section linguistic rules, and semantic wiki generation. The complete framework is outlined and an application scenario is presented. 0 0
An N-gram-and-wikipedia joint approach to natural language identification Yang X.
Liang W.
2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings English 2010 Natural Language Identification is the process of detecting and determining in which language or languages a given piece of text is written. As one of the key steps in Computational Linguistics/Natural Language Processing (NLP) tasks, such as Machine Translation, Multi-lingual Information Retrieval and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well studied sub-fields in the whole NLP field. However, various problems remain far from resolved in this field. Current noncomputational approaches require researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand large-scale training set for each to-be-identified language. Apparently, drawbacks for both are that, few computer scientists are equipped with sufficient knowledge in Linguistics, and the size of the training set may get endlessly larger in pursuit of higher accuracy and the ability to process more languages. Also, faced with multi-lingual documents on the Internet, neither approach can render satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification. It exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments have demonstrated that satisfactory results can be rendered by this approach, especially with multi-lingual documents. 0 0
Combining text/image in WikipediaMM task 2009 Moulin C.
Barat C.
Lemaitre C.
Gery M.
Ducottet C.
Largeron C.
Lecture Notes in Computer Science English 2010 This paper reports our multimedia information retrieval experiments carried out for the ImageCLEF Wikipedia task 2009. We extend our previous multimedia model defined as a vector of textual and visual information based on a bag of words approach [6]. We extract additional textual information from the original Wikipedia articles and we compute several image descriptors (local colour and texture features). We show that combining linearly textual and visual information significantly improves the results. 0 0
Construction of a domain ontological structure from Wikipedia Xavier C.C.
De Lima V.L.S.
STIL 2009 - 2009 7th Brazilian Symposium in Information and Human Language Technology Portuguese 2010 Data extraction from Wikipedia for ontologies construction, enrichment and population is an emerging research field. This paper describes a study on automatic extraction of an ontological structure containing hyponymy and location relations from Wikipedia's Tourism category in Portuguese, illustrated with an experiment, and evaluation of its results. 0 0
Document expansion for text-based image retrieval at CLEF 2009 Min J.
Wilkins P.
Johannes Leveling
Jones G.J.F.
Lecture Notes in Computer Science English 2010 In this paper, we describe and analyze our participation in the WikipediaMM task at CLEF 2009. Our main efforts concern the expansion of the image metadata from the Wikipedia abstracts collection - DBpedia. In our experiments, we use the Okapi feedback algorithm for document expansion. Compared with our text retrieval baseline, our best document expansion RUN improves MAP by 17.89%. As one of our conclusions, document expansion from external resource can play an effective factor in the image metadata retrieval task. 0 0
Evaluating cross-language explicit semantic analysis and cross querying Maik Anderka
Nedim Lipka
Benno Stein
Lecture Notes in Computer Science English 2010 This paper describes our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track. The task is to retrieve items from various multilingual collections of library catalog records, which are relevant to a user's query. Two different strategies are employed: (i) the Cross-Language Explicit Semantic Analysis, CL-ESA, where the library catalog records and the queries are represented in a multilingual concept space that is spanned by aligned Wikipedia articles, and, (ii) a Cross Querying approach, where a query is translated into all target languages using Google Translate and where the obtained rankings are combined. The evaluation shows that both strategies outperform the monolingual baseline and achieve comparable results. Furthermore, inspired by the Generalized Vector Space Model we present a formal definition and an alternative interpretation of the CL-ESA model. This interpretation is interesting for real-world retrieval applications since it reveals how the computational effort for CL-ESA can be shifted from the query phase to a preprocessing phase. 0 0
Exploring the semantics behind a collection to improve automated image annotation Llorente A.
Motta E.
Stefan Ruger
Lecture Notes in Computer Science English 2010 The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Best results correspond to approaches based on statistical correlation as they do not depend on a prior disambiguation phase like WordNet and Wikipedia. Further work needs to be done to assess whether proper disambiguation schemas might improve their performance. 0 0
Extracting conceptual relations from Persian resources Fadaei H.
Shamsfard M.
ITNG2010 - 7th International Conference on Information Technology: New Generations English 2010 In this paper we present a relation extraction system which uses a combination of pattern based, structure based and statistical approaches. This system uses raw texts and Wikipedia articles to learn conceptual relations. Wikipedia structures are rich source of information in relation extraction and are well used in this system. A set of patterns are extracted for Persian language and are used to learn both taxonomic and non-taxonomic relations. This system is one of the few relation extraction systems designed for Persian language and is the first system among them which uses Wikipedia structures in the process of relation learning. 0 0
GOSPL: Grounding ontologies with social processes and natural language Debruyne C.
Reul Q.
Meersman R.
ITNG2010 - 7th International Conference on Information Technology: New Generations English 2010 In this paper, we present the GOSPL application that supports communities during the ontology engineering process by exploiting Social Web technologies and natural language. The resulting knowledge can then be transformed into RDF(S). 0 0
GikiCLEF topics and Wikipedia articles: Did they blend? Nuno Cardoso Lecture Notes in Computer Science English 2010 This paper presents a post-hoc analysis on how the Wikipedia collections fared in providing answers and justifications to GikiCLEF topics. Based on all solutions found by all GikiCLEF participant systems, this paper measures how self-sufficient the particular Wikipedia collections were to provide answers and justifications for the topics, in order to better understand the recall limit that a GikiCLEF system specialised in one single language has. 0 0
GikiCLEF: Expectations and lessons learned Diana Santos
Cabral L.M.
Lecture Notes in Computer Science English 2010 This overview paper is devoted to a critical assessment of GikiCLEF 2009, an evaluation contest specifically designed to expose and investigate cultural and linguistic issues in Wikipedia search, with eight participant systems and 17 runs. After providing a maximally short but self contained overview of the GikiCLEF task and participation, we present the open source SIGA system, and discuss, for each of the main guiding ideas, the resulting successes or shortcomings, concluding with further work and still unanswered questions. 0 0
Identifying geographical entities in users' queries Adrian Iftene Lecture Notes in Computer Science English 2010 In 2009 we built a system in order to compete in the LAGI task (Log Analysis and Geographic Query Identification). The system uses an external resource built into GATE in combination with Wikipedia and Tumba in order to identify geographical entities in user's queries. The results obtained with and without Wikipedia resources are comparable. The main advantage of only using GATE resources is the improved run time. In the process of system evaluation we have identified the main problem of our approach: the system has insufficient external resources for the recognition of geographic entities. 0 0
Methods for classifying videos by subject and detecting narrative peak points Dobrila T.-A.
Diaconasu M.-C.
Lungu I.-D.
Adrian Iftene
Lecture Notes in Computer Science English 2010 2009 marked UAIC's first participation at the VideoCLEF evaluation campaign. Our group built two separate systems for the "Subject Classification" and "Affect Detection" tasks. For the first task we created two resources starting from Wikipedia pages and pages identified with Google and used two tools for classification: Lucene and Weka. For the second task we extracted the audio component from a given video file, using FFmpeg. After that, we computed the average amplitude for each word from the transcript, by applying the Fast Fourier Transform algorithm in order to analyze the sound. A brief description of our systems' components is given in this paper. 0 0
Multimodal image retrieval over a large database Myoupo D.
Adrian Popescu
Le Borgne H.
Moellic P.-A.
Lecture Notes in Computer Science English 2010 We introduce a new multimodal retrieval technique which combines query reformulation and visual image reranking in order to deal with results sparsity and imprecision, respectively. Textual queries are reformulated using Wikipedia knowledge and results are then reordered using a k-NN based reranking method. We compare textual and multimodal retrieval and show that introducing visual reranking results in a significant improvement of performance. 0 0
Named entity disambiguation for german news articles Lommatzsch A.
Ploch D.
De Luca E.W.
Albayrak S.
LWA 2010 - Lernen, Wissen und Adaptivitat - Learning, Knowledge, and Adaptivity, Workshop Proceedings English 2010 Named entity disambiguation has become an important research area providing the basis for improving search engine precision and for enabling semantic search. Current approaches for the named entity disambiguation are usually based on exploiting structured semantic and lingual resources (e.g. WordNet, DBpedia). Unfortunately, each of these resources cover independently from each other insufficient information for the task of named entity disambiguation. On the one hand WordNet comprises a relative small number of named entities while on the other hand DBpedia provides only little context for named entities. Our approach is based on the use of multi-lingual Wikipedia data. We show how the combination of multi-lingual resources can be used for named entity disambiguation. Based on a German and an English document corpus, we evaluate various similarity measures and algorithms for extracting data for named entity disambiguation. We show that the intelligent filtering of context data and the combination of multilingual information provides high quality named entity disambiguation results. 0 0
Overview of ResPubliQA 2009: Question answering evaluation over European legislation Penas A.
Forner P.
Sutcliffe R.
Rodrigo A.
Forascu C.
Iñaki Alegria
Giampiccolo D.
Moreau N.
Osenova P.
Lecture Notes in Computer Science English 2010 This paper describes the first round of ResPubliQA, a Question Answering (QA) evaluation task over European legislation, proposed at the Cross Language Evaluation Forum (CLEF) 2009. The exercise consists of extracting a relevant paragraph of text that satisfies completely the information need expressed by a natural language question. The general goals of this exercise are (i) to study if the current QA technologies tuned for newswire collections and Wikipedia can be adapted to a new domain (law in this case); (ii) to move to a more realistic scenario, considering people close to law as users, and paragraphs as system output; (iii) to compare current QA technologies with pure Information Retrieval (IR) approaches; and (iv) to introduce in QA systems the Answer Validation technologies developed in the past three years. The paper describes the task in more detail, presenting the different types of questions, the methodology for the creation of the test sets and the new evaluation measure, and analyzing the results obtained by systems and the more successful approaches. Eleven groups participated with 28 runs. In addition, we evaluated 16 baseline runs (2 per language) based only in pure IR approach, for comparison purposes. Considering accuracy, scores were generally higher than in previous QA campaigns. 0 0
Overview of VideoCLEF 2009: New perspectives on speech-based multimedia content enrichment Larson M.
Newman E.
Jones G.J.F.
Lecture Notes in Computer Science English 2010 VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the "Beeldenstorm" collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called "Finding Related Resources Across Languages," involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language "Beeldenstorm" collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names. 0 0
Overview of the WikipediaMM task at ImageCLEF 2009 Tsikrika T.
Kludas J.
Lecture Notes in Computer Science English 2010 ImageCLEF's wikipediaMM task provides a testbed for the system-oriented evaluation of multimedia information retrieval from a collection of Wikipedia images. The aim is to investigate retrieval approaches in the context of a large and heterogeneous collection of images (similar to those encountered on the Web) that are searched for by users with diverse information needs. This paper presents an overview of the resources, topics, and assessments of the wikipediaMM task at ImageCLEF 2009, summarises the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. 0 0
Recursive question decomposition for answering complex geographic questions Sven Hartrumpf
Johannes Leveling
Lecture Notes in Computer Science English 2010 This paper describes the GIRSA-WP system and the experiments performed for GikiCLEF 2009, the geographic information retrieval task in the question answering track at CLEF 2009. Three runs were submitted. The first one contained only results from the InSicht QA system; it showed high precision, but low recall. The combination with results from the GIR system GIRSA increased recall considerably, but reduced precision. The second run used a standard IR query, while the third run combined such queries with a Boolean query with selected keywords. The evaluation showed that the third run achieved significantly higher mean average precision (MAP) than the second run. In both cases, integrating GIR methods and QA methods was successful in combining their strengths (high precision of deep QA, high recall of GIR), resulting in the third-best performance of automatic runs in GikiCLEF. The overall performance still leaves room for improvements. For example, the multilingual approach is too simple. All processing is done in only one Wikipedia (the German one); results for the nine other languages are collected by following the translation links in Wikipedia. 0 0
Rich ontology extraction and wikipedia expansion using language resources Schonberg C.
Pree H.
Freitag B.
Lecture Notes in Computer Science English 2010 Existing social collaboration projects contain a host of conceptual knowledge, but are often only sparsely structured and hardly machine-accessible. Using the well known Wikipedia as a showcase, we propose new and improved techniques for extracting ontology data from the wiki category structure. Applications like information extraction, data classification, or consistency checking require ontologies of very high quality and with a high number of relationships. We improve upon existing approaches by finding a host of additional relevant relationships between ontology classes, leveraging multi-lingual relations between categories and semantic relations between terms. 0 0
Semantic QA for encyclopaedic questions: EQUAL in GikiCLEF Iustin Dornescu Lecture Notes in Computer Science English 2010 This paper presents a new question answering (QA) approach and a prototype system, EQUAL, which relies on structural information from Wikipedia to answer open-list questions. The system achieved the highest score amongst the participants in the GikiCLEF 2009 task. Unlike the standard textual QA approach, EQUAL does not rely on identifying the answer within a text snippet by using keyword retrieval. Instead, it explores the Wikipedia page graph, extracting and aggregating information from multiple documents and enforcing semantic constraints. The challenges for such an approach and an error analysis are also discussed. 0 0
Text-based requirements preprocessing using nature language processing techniques Hejie Chen
He K.
Peng Liang
Li R.
2010 International Conference on Computer Design and Applications, ICCDA 2010 English 2010 In a distributed environment, non-technical stakeholders are required to write down requirement statements by themselves. Nature language is the first choice for them. In order to alleviate the burden of reading free-text requirement documents by requirements engineers, we extract goals and relevant stakeholders from requirement statements automatically by a computer-assisted way. In this paper, requirements are divided into system level requirements and instance level requirements. Methods are proposed to solve two types of requirements by analyzing the characteristics of requirement expressions, and combining techniques of nature language processing with semantic web. Semantic-enhanced segment and domain sentence pattern are two novel techniques utilized in our methods. Our approach accelerates goal extraction from text-based requirements and alleviates the burden of requirements engineers significantly. 0 0
The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context Brent Hecht
Darren Gergle
Conference on Human Factors in Computing Systems - Proceedings English 2010 This study explores language's fragmenting effect on user-generated content by examining the diversity of knowledge representations across 25 different Wikipedia language editions. This diversity is measured at two levels: the concepts that are included in each edition and the ways in which these concepts are described. We demonstrate that the diversity present is greater than has been presumed in the literature and has a significant influence on applications that use Wikipedia as a source of world knowledge. We close by explicating how knowledge diversity can be beneficially leveraged to create "culturally-aware applications" and "hyperlingual applications". 0 2
visKWQL, a visual renderer for a semantic web query language Hartl A.
Weiand K.
Bry F.
Proceedings of the 19th International Conference on World Wide Web, WWW '10 English 2010 KiWi is a semantic Wiki that combines the Wiki philosophy of collaborative content creation with the methods of the Semantic Web in order to enable effective knowledge management. Querying a Wiki must be simple enough for beginning users, yet powerful enough to accommodate experienced users. To this end, the keyword-based KiWi query language (KWQL) supports queries ranging from simple lists of keywords to expressive rules for selecting and reshaping Wiki (meta-)data. In this demo, we showcase visKWQL, a visual interface for the KWQL language aimed at supporting users in the query construction process. visKWQL and its editor are described, and their functionality is illustrated using example queries. visKWQL's editor provides guidance throughout the query construction process through hints, warnings and highlighting of syntactic errors. The editor enables round-tripping between the twin languages KWQL and visKWQL, meaning that users can switch freely between the textual and visual form when constructing or editing a query. It is implemented using HTML, JavaScript, and CSS, and can thus be used in (almost) any web browser without any additional software. 0 0
When to cross over? Cross-language linking using Wikipedia for VideoCLEF 2009 Gyarmati A.
Jones G.J.F.
Lecture Notes in Computer Science English 2010 We describe Dublin City University (DCU)'s participation in the VideoCLEF 2009 Linking Task. Two approaches were implemented using the Lemur information retrieval toolkit. Both approaches first extracted a search query from the transcriptions of the Dutch TV broadcasts. One method first performed search on a Dutch Wikipedia archive, then followed links to corresponding pages in the English Wikipedia. The other method first translated the extracted query using machine translation and then searched the English Wikipedia collection directly. We found that using the original Dutch transcription query for searching the Dutch Wikipedia yielded better results. 0 0
Where in the Wikipedia is that answer? The XLDB at the GikiCLEF 2009 task Nuno Cardoso
Batista D.
Lopez-Pellicer F.J.
Silva M.J.
Lecture Notes in Computer Science English 2010 We developed a new semantic question analyser for a custom prototype assembled for participating in GikiCLEF 2009, which processes grounded concepts derived from terms, and uses information extracted from knowledge bases to derive answers. We also evaluated a newly developed named-entity recognition module, based in Conditional Random Fields, and a new world geo-ontology, derived from Wikipedia, which is used in the geographic reasoning process. 0 0
WikiPics: Multilingual image search based on wiki-mining Daniel Kinzler WikiSym 2010 English 2010 This demonstration introduces WikiPics, a language-independent image search engine for Wikimedia Commons. Based on the multilingual thesaurus provided by WikiWord, WikiPics allows users to search and navigate Wikimedia Commons in their preferred language, even though images on Commons are annotated in English nearly exclusively. 0 0
Zawilinski: A library for studying grammar in wiktionary Zachary Kurmas WikiSym 2010 English 2010 We present Zawilinski, a Java library that supports the extraction and analysis of grammatical data in Wiktionary. Zawilinski can efficiently (1) filter Wiktionary for content pertaining to a specified language, and (2) extract a word's inflections from its Wiktionary entry. We have thus far used Zawilinski to (1) measure the correctness of the inflections for a subset of the Polish words in the English Wiktionary and to (2) show that this grammatical data is very stable. (Only 131 out of 4748 Polish words have had their inflection data corrected.) We also explain Zawilinski's key features and discuss how it can be used to simplify the development of additional grammar-based analyses. 0 2
Augmenting Wiki system for collaborative EFL reading by digital pen annotations Chang C.-K. Proceedings - 2009 International Symposium on Ubiquitous Virtual Reality, ISUVR 2009 English 2009 Wikis are very useful for collaborative learning because of their sharing and flexible nature. Many learning activities can use Wiki to facilitate the processes, such as online glossaries, project reports, and dictionaries. Some EFL (English as a Foreign Language) instructors have paid attention to the popularity of Wiki. Although Wikis are very simple and intuitive for users with information literacy, Wikis need a computing environment for each learner to edit Web pages. Generally, an instructor can only conduct a Wiki-based learning activity in a computer classroom. Although mobile learning devices (such as PDAs) for every learner can provide ubiquitous computing environment for a Wiki-based learning activity, this paper suggests another inexpensive way by integrating digital pen with Wiki. Consequently, a learner can annotate an EFL reading with his/her mother tongue by digital pen. After everyone finishes reading, all annotations can be collected into a Wiki system for instruction. Thus, an augmenting Wiki structure is constructed. Finally, learners' satisfactions about annotating in the prototype system are reported in this paper. 0 0
Automatic acquisition of attributes for ontology construction Gaoying Cui
Lu Q.
Li W.
Yirong Chen
Lecture Notes in Computer Science English 2009 An ontology can be seen as an organized structure of concepts according to their relations. A concept is associated with a set of attributes that themselves are also concepts in the ontology. Consequently, ontology construction is the acquisition of concepts and their associated attributes through relations. Manual ontology construction is time-consuming and difficult to maintain. Corpus-based ontology construction methods must be able to distinguish concepts themselves from concept instances. In this paper, a novel and simple method is proposed for automatically identifying concept attributes through the use of Wikipedia as the corpus. The built-in Template:Infobox in Wiki is used to acquire concept attributes and identify semantic types of the attributes. Two simple induction rules are applied to improve the performance. Experimental results show precisions of 92.5% for attribute acquisition and 80% for attribute type identification. This is a very promising result for automatic ontology construction. 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of exponential increase in the size of multilingual information on the web. Wikipedia is a prime example of such multilingual source of information on any conceivable topic in the world, which is edited by the readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler takes out the title along with the titles of the corresponding pages in other targeted languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
Building a semantic virtual museum: From wiki to semantic wiki using named entity recognition Alain Plantec
Vincent Ribaud
Vasudeva Varma
Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA English 2009 In this paper, we describe an approach for creating semantic wiki pages from regular wiki pages, in the domain of scientific museums, using information extraction methods in general and named entity recognition in particular. We make use of a domain specific ontology called CIDOC-CRM as a base structure for representing and processing knowledge. We have described major components of the proposed approach and a three-step process involving name entity recognition, identifying domain classes using the ontology and establishing the properties for the entities in order to generate semantic wiki pages. Our initial evaluation of the prototype shows promising results in terms of enhanced efficiency and time and cost benefits. 0 0
China physiome project: A comprehensive framework for anatomical and physiological databases from the China digital human and the visible rat Han D.
Qiaoling Liu
Luo Q.
Proceedings of the IEEE English 2009 The connection study between biological structure and function, as well as between anatomical data and mechanical or physiological models, has been of increasing significance with the rapid advancement in experimental physiology and computational physiology. The China Physiome Project (CPP) is dedicated in optimization of the connection exploration based on standardization and integration of the structural datasets and their derivatives of cryosectional images with various standards, collaboration mechanisms, and online services. The CPP framework hereby incorporates the three-dimensional anatomical models of human and rat anatomy, the finite-element models of whole-body human skeleton, and the multiparticle radiological dosimetry data of both the human and rat computational phantoms. The ontology of CPP was defined using MeSH and, with its all standardized models description implemented by M3L, a multiscale modeling language based on XML. Provided services based on Wiki concept include collaboration research, modeling version control, data sharing, online analysis of M3L documents. As a sample case, a multiscale model for human heart modeling, in which familial hypertrophic cardiomyopathy was studied according to the structure-function relations from genetic level to organ level, is integrated into the framework and given for demonstration of the functionality of multiscale physiological modeling based on CPP. 0 0
Conceptual image retrieval over a large scale database Adrian Popescu
Le Borgne H.
Moellic P.-A.
Lecture Notes in Computer Science English 2009 Image retrieval in large-scale databases is currently based on a textual chains matching procedure. However, this approach requires an accurate annotation of images, which is not the case on the Web. To tackle this issue, we propose a reformulation method that reduces the influence of noisy image annotations. We extract a ranked list of related concepts for terms in the query from WordNet and Wikipedia, and use them to expand the initial query. Then some visual concepts are used to re-rank the results for queries containing, explicitly or implicitly, visual cues. First evaluations on a diversified corpus of 150000 images were convincing since the proposed system was ranked 4th and 2nd at the WikipediaMM task of the ImageCLEF 2008 campaign [1]. 0 0
Crosslanguage Retrieval Based on Wikipedia Statistics Andreas Juffinger
Roman Kern
Michael Granitzer
Lecture Notes in Computer Science English 2009 In this paper we present the methodology, implementations and evaluation results of the crosslanguage retrieval system we have developed for the Robust WSD Task at CLEF 2008. Our system is based on query preprocessing for translation and homogenisation of queries. The presented preprocessing of queries includes two stages: first, a query translation step based on term statistics of co-occurring articles in Wikipedia; second, different disjunct query composition techniques to search in the CLEF corpus. We apply the same preprocessing steps for the monolingual as well as the crosslingual task, thereby treating these tasks fairly and in a similar way. The evaluation revealed that the similar processing comes at nearly no cost for monolingual retrieval but enables us to do crosslanguage retrieval and also a feasible comparison of our system performance on these two tasks. 0 0
GikiP at geoCLEF 2008: Joining GIR and QA forces for querying wikipedia Diana Santos
Nuno Cardoso
Paula Carvalho
Iustin Dornescu
Sven Hartrumpf
Johannes Leveling
Yvonne Skalban
Lecture Notes in Computer Science English 2009 This paper reports on the GikiP pilot that took place in 2008 in GeoCLEF. This pilot task requires a combination of methods from geographical information retrieval and question answering to answer queries to the Wikipedia. We start by the task description, providing details on topic choice and evaluation measures. Then we offer a brief motivation from several perspectives, and we present results in detail. A comparison of participants' approaches is then presented, and the paper concludes with improvements for the next edition. 0 0
Knowledge infusion into content-based recommender systems Giovanni Semeraro
Pasquale Lops
Pierpaolo Basile
De Gemmis M.
RecSys'09 - Proceedings of the 3rd ACM Conference on Recommender Systems English 2009 Content-based recommender systems try to recommend items similar to those a given user has liked in the past. The basic process consists of matching up the attributes of a user profile, in which preferences and interests are stored, with the attributes of a content object (item). Common-sense and domain-specific knowledge may be useful to give some meaning to the content of items, thus helping to generate more informative features than "plain" attributes. The process of learning user profiles could also benefit from the infusion of exogenous knowledge or open source knowledge, with respect to the classical use of endogenous knowledge (extracted from the items themselves). The main contribution of this paper is a proposal for knowledge infusion into content-based recommender systems, which suggests a novel view of this type of systems, mostly oriented to content interpretation by way of the infused knowledge. The idea is to provide the system with the "linguistic" and "cultural" background knowledge that hopefully allows a more accurate content analysis than classic approaches based on words. A set of knowledge sources is modeled to create a memory of linguistic competencies and of more specific world "facts", that can be exploited to reason about content as well as to support the user profiling and recommendation processes. The modeled knowledge sources include a dictionary, Wikipedia, and content generated by users (i.e. tags provided on items), while the core of the reasoning component is a spreading activation algorithm. Copyright 2009 ACM. 0 0
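The abstract above names a spreading activation algorithm as the core of the reasoning component but gives no details. The following is a minimal sketch of generic spreading activation over a weighted graph, not the authors' implementation: the function name `spread_activation`, the `decay`, `threshold` and `max_steps` parameters, and the toy graph are all assumptions made for illustration.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed nodes along weighted edges.

    graph: dict mapping node -> {neighbor: edge weight}
    seeds: dict mapping node -> initial activation
    All parameters are illustrative assumptions, not values from the paper.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            if value < threshold:
                continue  # nodes that are too weakly activated do not fire
            for neighbor, weight in graph.get(node, {}).items():
                passed = value * weight * decay
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation


# Toy knowledge graph (hypothetical data).
toy_graph = {
    "apple": {"fruit": 1.0, "computer": 0.5},
    "pie": {"fruit": 1.0},
    "fruit": {"banana": 1.0},
}
activation = spread_activation(toy_graph, {"apple": 1.0, "pie": 1.0})
```

In this toy run, "fruit" collects activation from both seeds and so ends up more active than "computer", which is the kind of effect a knowledge-infused recommender could use to surface concepts shared by several clues.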
Language-model-based ranking for queries on RDF-graphs Elbassuoni S.
Maya Ramanath
Ralf Schenkel
Sydow M.
Gerhard Weikum
International Conference on Information and Knowledge Management, Proceedings English 2009 The success of knowledge-sharing communities like Wikipedia and the advances in automatic information extraction from textual and Web sources have made it possible to build large "knowledge repositories" such as DBpedia, Freebase, and YAGO. These collections can be viewed as graphs of entities and relationships (ER graphs) and can be represented as a set of subject-property-object (SPO) triples in the Semantic-Web data model RDF. Queries can be expressed in the W3C-endorsed SPARQL language or by similarly designed graph-pattern search. However, exact-match query semantics often fall short of satisfying the users' needs by returning too many or too few results. Therefore, IR-style ranking models are crucially needed. In this paper, we propose a language-model-based approach to ranking the results of exact, relaxed and keyword-augmented graph pattern queries over RDF graphs such as ER graphs. Our method estimates a query model and a set of result-graph models and ranks results based on their Kullback-Leibler divergence with respect to the query model. We demonstrate the effectiveness of our ranking model by a comprehensive user study. Copyright 2009 ACM. 0 0
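The ranking idea in the abstract above (score each result by the Kullback-Leibler divergence of its model with respect to the query model) can be illustrated with a minimal sketch. It assumes the models are plain term-probability dictionaries and uses a crude additive `epsilon` for smoothing; the paper's actual estimation of query and result-graph models is not described here, and the names `kl_divergence` and `rank_results` are hypothetical.

```python
import math


def kl_divergence(query_model, result_model, epsilon=1e-9):
    """KL(query || result), summed over terms with non-zero query probability.

    epsilon is a crude smoothing constant (an assumption) that avoids
    division by zero when the result model lacks a query term.
    """
    total = 0.0
    for term, p_query in query_model.items():
        if p_query <= 0.0:
            continue
        p_result = result_model.get(term, 0.0) + epsilon
        total += p_query * math.log(p_query / p_result)
    return total


def rank_results(query_model, result_models):
    """Return result ids ordered from lowest to highest divergence."""
    return sorted(
        result_models,
        key=lambda rid: kl_divergence(query_model, result_models[rid]),
    )


# Toy models (hypothetical data).
query = {"einstein": 0.5, "physics": 0.5}
results = {
    "r1": {"einstein": 0.5, "physics": 0.5},
    "r2": {"einstein": 0.9, "physics": 0.1},
    "r3": {"einstein": 0.1, "chemistry": 0.9},
}
ranking = rank_results(query, results)
```

A lower divergence means the result's model is closer to the query model, so results are sorted in ascending order of the score.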
Large-scale cross-media retrieval of wikipediaMM images with textual and visual query expansion Zhou Z.
Tian Y.
Yanyan Li
Huang T.
Gao W.
Lecture Notes in Computer Science English 2009 In this paper, we present our approaches for the WikipediaMM task at ImageCLEF 2008. We first experimented with a text-based image retrieval approach with query expansion, where the extension terms were automatically selected from a knowledge base that was semi-automatically constructed from Wikipedia. Encouragingly, the experimental results rank in the first place among all submitted runs. We also implemented a content-based image retrieval approach with query-dependent visual concept detection. Then cross-media retrieval was successfully carried out by independently applying the two meta-search tools and then combining the results through a weighted summation of scores. Though not submitted, this approach outperforms our text-based and content-based approaches remarkably. 0 0
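Combining two retrieval runs "through a weighted summation of scores", as the abstract above puts it, can be sketched as follows. The min-max normalization, the `text_weight` value and the function names are assumptions for illustration; the abstract does not state how scores were normalized or weighted.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; an assumed preprocessing step."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lowest) / (highest - lowest) for doc, s in scores.items()}


def fuse(text_scores, visual_scores, text_weight=0.7):
    """Rank documents by a weighted sum of normalized text and visual scores."""
    text_n = min_max_normalize(text_scores)
    visual_n = min_max_normalize(visual_scores)
    fused = {
        doc: text_weight * text_n.get(doc, 0.0)
        + (1.0 - text_weight) * visual_n.get(doc, 0.0)
        for doc in set(text_n) | set(visual_n)
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Toy scores from a text-based and a content-based run (hypothetical data).
text_run = {"img1": 2.0, "img2": 1.0, "img3": 0.0}
visual_run = {"img2": 0.9, "img3": 0.1, "img4": 0.5}
fused_ranking = fuse(text_run, visual_run)
```

Documents missing from one run simply contribute zero for that modality, which is one of several reasonable conventions.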
Learning better transliterations Pasternack J.
Dan Roth
International Conference on Information and Knowledge Management, Proceedings English 2009 We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability (e.g. P(wash ⇒ ваш) = 0.95). To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2 0 0
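The "productions" idea in the abstract above can be made concrete with a small dynamic program that sums, over all joint segmentations, the product of production probabilities for a (source, target) pair, and then uses that score for discovery. This is only a sketch of the scoring step under stated assumptions: the production table is hand-written toy data with made-up target symbols, substrings on both sides are assumed non-empty and at most `max_len` long, and EM training is not shown.

```python
def pair_probability(source, target, productions, max_len=3):
    """Sum over all joint segmentations of the product of production probabilities.

    table[i][j] holds the probability that source[:i] transliterates to target[:j].
    productions maps (source substring, target substring) -> probability.
    """
    n, m = len(source), len(target)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total = 0.0
            for k in range(max(0, i - max_len), i):
                for l in range(max(0, j - max_len), j):
                    p = productions.get((source[k:i], target[l:j]), 0.0)
                    if p:
                        total += table[k][l] * p
            table[i][j] = total
    return table[n][m]


def discover(source, candidates, productions):
    """Pick the candidate with the highest pair probability (the discovery task)."""
    return max(candidates, key=lambda c: pair_probability(source, c, productions))


# Hypothetical production table; uppercase letters stand in for a target script.
toy_productions = {
    ("w", "V"): 0.6,
    ("a", "A"): 0.95,
    ("sh", "X"): 0.9,
    ("s", "S"): 0.8,
    ("h", "H"): 0.7,
}
best = discover("wash", ["VASH", "VAX"], toy_productions)
```

Here "VAX" wins because the single production "sh" to "X" is more probable than the two separate productions for "s" and "h" combined.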
Metadata and multilinguality in video classification He J.
Xiaodan Zhang
Weerkamp W.
Larson M.
Lecture Notes in Computer Science English 2009 The VideoCLEF 2008 Vid2RSS task involves the assignment of thematic category labels to dual language (Dutch/English) television episode videos. The University of Amsterdam chose to focus on exploiting archival metadata and speech transcripts generated by both Dutch and English speech recognizers. A Support Vector Machine (SVM) classifier was trained on training data collected from Wikipedia. The results provide evidence that combining archival metadata with speech transcripts can improve classification performance, but that adding speech transcripts in an additional language does not yield performance gains. 0 0
Mining cross-lingual/cross-cultural differences in concerns and opinions in blogs Hiroyuki Nakasaki
Mariko Kawaba
Takehito Utsuro
Tomohiro Fukuhara
Lecture Notes in Computer Science English 2009 The goal of this paper is to cross-lingually analyze multilingual blogs collected with a topic keyword. The framework of collecting multilingual blogs with a topic keyword is designed as the blog feed retrieval procedure. Multilingual queries for retrieving blog feeds are created from Wikipedia entries. Finally, we cross-lingually and cross-culturally compare less well known facts and opinions that are closely related to a given topic. Preliminary evaluation results support the effectiveness of the proposed framework. 0 0
Modeling clinical protocols using semantic mediawiki: The case of the oncocure project Eccher C.
Ferro A.
Seyfang A.
Marco Rospocher
Silvia Miksch
Lecture Notes in Computer Science English 2009 A computerized Decision Support System (DSS) can improve the adherence of clinicians to clinical guidelines and protocols. The building of a prescriptive DSS based on breast cancer treatment protocols and its integration with a legacy Electronic Patient Record is the aim of the Oncocure project. An important task of this project is the encoding of the protocols in computer-executable form - a task that requires the collaboration of physicians and computer scientists in a distributed environment. In this paper, we describe our project and how semantic wiki technology was used for the encoding task. Semantic wiki technology features great flexibility, allowing users to mix unstructured information and semantic annotations, and to automatically generate the final model with minimal adaptation cost. These features render semantic wikis natural candidates for small to medium scale modeling tasks, where the adaptation and training effort of bigger systems cannot be justified. This approach is not constrained to a specific protocol modeling language, but can be used as a collaborative tool for other languages. When implemented, our DSS is expected to reduce the cost of care while improving the adherence to the guideline and the quality of the documentation. 0 0
OTTHO: On the tip of my THOught Pierpaolo Basile
De Gemmis M.
Pasquale Lops
Giovanni Semeraro
Lecture Notes in Computer Science English 2009 This paper describes OTTHO (On the Tip of my THOught), a system designed for solving a language game called Guillotine. The rule of the game is simple: the player observes five words, generally unrelated to each other, and in one minute she has to provide a sixth word, semantically connected to the others. The system exploits several knowledge sources, such as a dictionary, a set of proverbs, and Wikipedia to realize a knowledge infusion process. The main motivation for designing an artificial player for Guillotine is the challenge of providing the machine with the cultural and linguistic background knowledge which makes it similar to a human being, with the ability of interpreting natural language documents and reasoning on their content. Our feeling is that the approach presented in this work has a great potential for other more practical applications besides solving a language game. 0 0
Overview of the WikipediaMM task at ImageCLEF 2008 Tsikrika T.
Kludas J.
Lecture Notes in Computer Science English 2009 The wikipediaMM task provides a testbed for the system-oriented evaluation of ad-hoc retrieval from a large collection of Wikipedia images. It became a part of the ImageCLEF evaluation campaign in 2008 with the aim of investigating the use of visual and textual sources in combination for improving the retrieval performance. This paper presents an overview of the task's resources, topics, assessments, participants' approaches, and main results. 0 0
Overview of the clef 2008 multilingual question answering track Forner P.
Penas A.
Eneko Agirre
Iñaki Alegria
Forascu C.
Moreau N.
Osenova P.
Prokopidis P.
Rocha P.
Sacaleanu B.
Sutcliffe R.
Tjong Kim Sang E.
Lecture Notes in Computer Science English 2009 The QA campaign at CLEF 2008 [1] was mainly the same as that proposed last year. The results and the analyses reported by last year's participants suggested that the changes introduced in the previous campaign had led to a drop in systems' performance. So for this year's competition it has been decided to practically replicate last year's exercise. Following last year's experience some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions from a cluster contained co-references between one of them and the others. Moreover, as last year, the systems were given the possibility to search for answers in Wikipedia as a document corpus besides the usual newswire collection. In addition to the main task, three additional exercises were offered, namely the Answer Validation Exercise (AVE), the Question Answering on Speech Transcriptions (QAST), which continued last year's successful pilots, together with the new Word Sense Disambiguation for Question Answering (QA-WSD). As a general remark, it must be said that the main task still proved to be very challenging for participating systems. As a kind of shallow comparison with last year's results the best overall accuracy dropped significantly from 42% to 19% in the multi-lingual subtasks, but increased a little in the monolingual sub-tasks, going from 54% to 63%. 0 0
Overview of videoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content Larson M.
Newman E.
Jones G.J.F.
Lecture Notes in Computer Science English 2009 The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual language video (Dutch-language television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and ten thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS-feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, with a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks. 0 0
Parallel annotation and population: A cross-language experience Sarrafzadeh B.
Shamsfard M.
Proceedings - 2009 International Conference on Computer Engineering and Technology, ICCET 2009 English 2009 In recent years automatic Ontology Population (OP) from texts has emerged as a new field of application for knowledge acquisition techniques. In OP, instances of an ontology's classes are extracted from text and added under the ontology concepts. On the other hand, semantic annotation, which is a key task in moving toward the semantic web, tries to tag instance data in a text with their corresponding ontology classes; so ontology population usually accompanies the generation of semantic annotations. In this paper we introduce a cross-lingual population/annotation system called POPTA which annotates Persian texts according to an English lexicalized ontology and populates the English ontology according to the input Persian texts. It exploits a hybrid approach, a combination of statistical and pattern-based methods as well as techniques founded on the web and search engines and a novel method of resolving translation ambiguities. POPTA also uses Wikipedia as a vast natural language encyclopedia to extract new instances to populate the input ontology. 0 0
Query expansion for effective geographic information retrieval Pu Q.
He D.
Li Q.
Lecture Notes in Computer Science English 2009 We developed two methods for the monolingual GeoCLEF 2008 task. The GCEC method aims to test the effectiveness of our online geographic coordinates extraction and clustering algorithm, and the WIKIGEO method aims to examine the usefulness of the geographic coordinates information in Wikipedia for identifying geo-locations. We proposed a measure of topic distance to evaluate these two methods. The experimental results show that: 1) our online geographic coordinates extraction and clustering algorithm is useful for the type of locations that do not have clear corresponding coordinates; 2) the expansion based on the geo-locations generated by GCEC is effective in improving geographic retrieval; 3) Wikipedia can help in finding the coordinates for many geo-locations, but its usage for query expansion still needs further study; 4) query expansion based on the title only obtained better results than that based on the title and narrative parts, even though the latter contain more related geographic information. Further study is needed for this part. 0 0
Research summary: Intelligent Natural language processing techniques and tools Paolucci A. Lecture Notes in Computer Science English 2009 My research path started with my master thesis (supervisor Prof. Stefania Costantini) about a neurobiologically-inspired proposal in the field of natural language processing. In more detail, we proposed the "Semantic Enhanced DCGs" (for short SE-DCGs) extension to the well-known DCGs to allow for parallel syntactic and semantic analysis, and to generate a semantically-based description of the sentence at hand. The analysis carried out through SE-DCGs was called "syntactic-semantic fully informed analysis", and it was designed to be as close as possible (at least in principle) to the results in the context of neuroscience that I had reviewed and studied. As a proof of concept, I implemented a prototype of a semantic search engine, the Mnemosine system. Mnemosine is able to interact with a user in natural language and to provide contextual answers at different levels of detail. Mnemosine has been applied to a practical case study, i.e., to the WikiPedia Web pages. A brief overview of this work was presented during CICL 08 [1]. 0 0
Terabytes of tobler: Evaluating the first law in a massive, domain-neutral representation of world knowledge Brent Hecht
Moxley E.
Lecture Notes in Computer Science English 2009 The First Law of Geography states, "everything is related to everything else, but near things are more related than distant things." Despite the fact that it is to a large degree what makes "spatial special," the law has never been empirically evaluated on a large, domain-neutral representation of world knowledge. We address the gap in the literature about this critical idea by statistically examining the multitude of entities and relations between entities present across 22 different language editions of Wikipedia. We find that, at least according to the myriad authors of Wikipedia, the First Law is true to an overwhelming extent regardless of language-defined cultural domain. 0 0
Trdlo, an open source tool for building transducing dictionary Grac M. Lecture Notes in Computer Science English 2009 This paper describes the development of an open-source tool named Trdlo. Trdlo was developed as part of our effort to build a machine translation system between very close languages. These languages usually do not have available pre-processed linguistic resources or dictionaries suitable for computer processing. Bilingual dictionaries have a big impact on the quality of translation. The methods described in this paper attempt to extend existing dictionaries with inferable translation pairs. Our approach requires only 'cheap' resources: a list of lemmata for each language and rules for inferring words from one language to another. It is also possible to use other resources like annotated corpora or Wikipedia. Results show that this approach greatly improves the effectiveness of building a Czech-Slovak dictionary. 0 0
Using AliQAn in monolingual QA@CLEF 2008 Roger S.
Vila K.
Antonio Ferrandez
Pardino M.
Gomez J.M.
Puchol-Blasco M.
Peral J.
Lecture Notes in Computer Science English 2009 This paper describes the participation of the system AliQAn in the CLEF 2008 Spanish monolingual QA task. This time, the main goals of the current version of AliQAn were to deal with topic-related questions and to decrease the number of inexact answers. We have also explored the use of the Wikipedia corpora, which have posed some new challenges for the QA task. 0 0
Using answer retrieval patterns to answer portuguese questions Costa L.F. Lecture Notes in Computer Science English 2009 Esfinge is a general domain Portuguese question answering system which has been participating at QA@CLEF since 2004. It uses the information available in the "official" document collections used in QA@CLEF (newspaper text and Wikipedia) and information from the Web as an additional resource when searching for answers. As regards external tools, Esfinge uses a syntactic analyzer, a morphological analyzer and a named entity recognizer. This year an alternative approach to retrieve answers was tested: whereas in previous years, search patterns were used to retrieve relevant documents, this year a new type of search pattern was also used to extract the answers themselves. We also evaluated the second and third best answers returned by Esfinge. This evaluation showed that when Esfinge answers a question correctly, it usually does so with its first answer. Furthermore, the experiments revealed that the answer retrieval patterns created for this participation improve the results, but only for definition questions. 0 0
VideoCLEF 2008: ASR classification with wikipedia categories Kürsten J.
Richter D.
Eibl M.
Lecture Notes in Computer Science English 2009 This article describes our participation at the VideoCLEF track. We designed and implemented a prototype for the classification of the Video ASR data. Our approach was to regard the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification task 1 and 2. For the translation of the feeds to English (translation task) Google's AJAX language API was used. Although our experiments achieved only a low precision of 10 to 15 percent, we assume those results will be useful in a combined setting with the retrieval approach that was widely used. Interestingly, we could not improve the quality of the classification by using the provided metadata. 0 0
Wanderlust: Extracting semantic relations from natural language text using dependency grammar patterns Akbik A.
Bross J.
CEUR Workshop Proceedings English 2009 A great share of applications in modern information technology can benefit from large coverage, machine accessible knowledge bases. However, the bigger part of today's knowledge is provided in the form of unstructured data, mostly plain text. As an initial step to exploit such data, we present Wanderlust, an algorithm that automatically extracts semantic relations from natural language text. The procedure uses deep linguistic patterns that are defined over the dependency grammar of sentences. Due to its linguistic nature, the method performs in an unsupervised fashion and is not restricted to any specific type of semantic relation. The applicability of the proposed approach is examined in a case study, in which it is put to the task of generating a semantic wiki from the English Wikipedia corpus. We present an exhaustive discussion about the insights obtained from this particular case study including considerations about the generality of the approach. 0 0
Whither scheme? 21st century approaches to scheme in CS1 Brown R.
Davis J.
Rebelsky S.A.
Harvey B.
SIGCSE'09 - Proceedings of the 40th ACM Technical Symposium on Computer Science Education English 2009 Since the decline of Pascal as a "standard" introductory language in the late 1970's and early 1980's, faculty members have adopted (and, often, discarded) a variety of languages for the introductory course: C, C++, Java, Modula-2, Ada, Python, Ruby, and more. Different approaches and different opinions have led to a number of "language wars" in the SIGCSE community, wars that we hope to avoid in this panel. Throughout this period, Scheme has had a constant audience. A wide variety of schools, from small liberal arts colleges to major research universities, have adopted and stuck with Scheme. Many begin with Structure and Interpretation of Computer Programs (SICP) [1], although a wide variety of approaches have evolved since then. To its adopters, Scheme has many strengths, including a simple syntax, a small language definition, the ability to consider multiple paradigms, and the power of higher-order programming. The Scheme community remains strong, in part, because of DrScheme [3], an open-source development environment appropriate for novices. Although DrScheme was developed in the context of How to Design Programs [2] and the TeachScheme project (which provides its own 21st century approach to Scheme in CS1), DrScheme is used in a wide variety of contexts. More than twenty years have passed since the publication of SICP. In those twenty years, the face of computing has changed significantly. When modern students think of computing, they think of things like Google, Wikis, graphics, games, and more. Is there still a role for Scheme in this new world of computing? In this panel, we consider some ways in which introductory courses currently use Scheme, strategies that preserve the strengths of Scheme while incorporating "new computing". 0 0
Whither scheme?: 21st century approaches to scheme in CS1 Brown R.
Davis J.
Rebelsky S.A.
Harvey B.
SIGCSE Bulletin Inroads English 2009 Since the decline of Pascal as a standard introductory language in the late 1970's and early 1980's, faculty members have adopted (and, often, discarded) a variety of languages for the introductory course: C, C++, Java, Modula-2, Ada, Python, Ruby, and more. Different approaches and different opinions have led to a number of language wars in the SIGCSE community, wars that we hope to avoid in this panel. Throughout this period, Scheme has had a constant audience. A wide variety of schools, from small liberal arts colleges to major research universities, have adopted and stuck with Scheme. Many begin with Structure and Interpretation of Computer Programs (SICP) [1], although a wide variety of approaches have evolved since then. To its adopters, Scheme has many strengths, including a simple syntax, a small language definition, the ability to consider multiple paradigms, and the power of higher-order programming. The Scheme community remains strong, in part, because of DrScheme [3], an open-source development environment appropriate for novices. Although DrScheme was developed in the context of How to Design Programs [2] and the TeachScheme project (which provides its own 21st century approach to Scheme in CS1), DrScheme is used in a wide variety of contexts. More than twenty years have passed since the publication of SICP. In those twenty years, the face of computing has changed significantly. When modern students think of computing, they think of things like Google, Wikis, graphics, games, and more. Is there still a role for Scheme in this new world of computing? In this panel, we consider some ways in which introductory courses currently use Scheme, strategies that preserve the strengths of Scheme while incorporating new computing. 0 0
WikiTranslate: Query translation for cross-lingual information retrieval using only wikipedia Dong Nguyen
Arnold Overwijk
Claudia Hauff
Trieschnigg D.R.B.
Djoerd Hiemstra
Franciska De Jong
Lecture Notes in Computer Science English 2009 This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline. 0 0
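The pipeline summarized in the abstract above (map the query to Wikipedia concepts, then use the target-language counterparts of those concepts as the translated query) can be sketched in a few lines. This is a simplified illustration, not the WikiTranslate system: the greedy longest-match lookup stands in for whatever concept mapping the authors used, and the link table, function names and example query are hypothetical toy data.

```python
def map_to_concepts(query, links):
    """Greedily match the longest token spans that are known article titles."""
    tokens = query.lower().split()
    concepts = []
    i = 0
    while i < len(tokens):
        for length in range(len(tokens) - i, 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in links:
                concepts.append(candidate)
                i += length
                break
        else:
            i += 1  # no title starts at this token; skip it
    return concepts


def translate_query(query, links):
    """Build the target-language query from cross-language links of matched concepts."""
    return [links[concept] for concept in map_to_concepts(query, links)]


# Toy table of Dutch article titles and their English counterparts (assumed data).
toy_links = {
    "tweede wereldoorlog": "World War II",
    "nederland": "Netherlands",
}
translated = translate_query("Nederland tijdens de tweede wereldoorlog", toy_links)
```

Tokens that match no title are dropped in this sketch; a fuller system would need a fallback for them.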
WordVenture - Cooperative WordNet editor: Architecture for lexical semantic acquisition Szymanski J. KEOD 2009 - 1st International Conference on Knowledge Engineering and Ontology Development, Proceedings English 2009 This article presents an architecture for acquiring lexical semantics in a collaborative approach paradigm. The system provides functionality for editing semantic networks in a wikipedia-like style. The core of the system is a user-friendly interface based on interactive graph navigation. It is used for presenting the semantic network and simultaneously provides editing functionality. 0 0
A collaborative multilingual database project on aymara implemented in peru and bolivia Beck H.
Legg S.
Hardman M.J.
Lord G.
Llanque-Chana J.
Lowe E.
American Society of Agricultural and Biological Engineers Annual International Meeting 2008, ASABE 2008 English 2008 A web-based collaborative environment including on-line authoring tools that is managed by a central database was developed in collaboration with several countries including Peru, Bolivia, and the United States. The application involved developing a linguistics database and eLearning environment for documenting, preserving, and promoting language training for Aymara, a language indigenous to Peru and Bolivia. The database, an ontology management system called Lyra, incorporates all elements of the language (dialogues, phrase patterns, phrases, words, and morphemes) as well as cultural multimedia resources (images and sound recordings). The organization of the database enables a high level of integration among language elements and cultural resources. Authoring tools are used by experts in the Aymara language to build the linguistic database. These tools are accessible on-line as part of the collaborative environment using standard web browsers incorporating the Java plug-in. The eLearning student interface is a web-based program written in Flash. The Flash program automatically interprets and formats data objects retrieved from the database in XML format. The student interface is presented in Spanish and English. A web service architecture is used to publish the database on-line so that it can be accessed and utilized by other application programs in a variety of formats. 0 0
A lexical approach for Spanish question answering Tellez A.
Juarez A.
Hernandez G.
Denicia C.
Villatoro E.
Montes M.
Villasenor L.
Lecture Notes in Computer Science English 2008 This paper discusses our system's results at the Spanish Question Answering task of CLEF 2007. Our system is centered on a fully data-driven approach that combines information retrieval and machine learning techniques. It mainly relies on the use of lexical information and avoids any complex language processing procedure. Evaluation results indicate that this approach is very effective for answering definition questions from Wikipedia. In contrast, they also reveal that it is very difficult to answer factoid questions from this resource solely based on the use of lexical overlaps and redundancy. 0 0
Collaborative end-user development on handheld devices Ahmadi N.
Repenning A.
Ioannidou A.
Proceedings - 2008 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2008 English 2008 Web 2.0 has enabled end users to collaborate through their own developed artifacts, moving on from text (e.g., Wikipedia, Blogs) to images (e.g., Flickr) and movies (e.g., YouTube), changing end-user's role from consumer to producer. But still there is no support for collaboration through interactive end-user developed artifacts, especially for emerging handheld devices, which are the next collaborative platform. Featuring fast always-on networks, Web browsers that are as powerful as their desktop counterparts, and innovative user interfaces, the newest generation of handheld devices can run highly interactive content as Web applications. We have created Ristretto Mobile, a Web-compliant framework for running end-user developed applications on handheld devices. The Web-based Ristretto Mobile includes compiler and runtime components to turn end-user applications into Web applications that can run on compatible handheld devices, including the Apple iPhone and Nokia N800. Our paper reports on the technological and cognitive challenges in creating interactive content that runs efficiently and is user accessible on handheld devices. 0 0
Combining multiple resources to build reliable wordnets Fiser D.
Sagot B.
Lecture Notes in Computer Science English 2008 This paper compares automatically generated sets of synonyms in French and Slovene wordnets with respect to the resources used in the construction process. Polysemous words were disambiguated via a five-language word-alignment of the SEERA.NET parallel corpus, a subcorpus of the JRC Acquis. The extracted multilingual lexicon was disambiguated with the existing wordnets for these languages. On the other hand, a bilingual approach sufficed to acquire equivalents for monosemous words. Bilingual lexicons were extracted from different resources, including Wikipedia, Wiktionary and EUROVOC thesaurus. A representative sample of the generated synsets was evaluated against the gold standards. 0 0
Combining wikipedia and newswire texts for question answering in spanish De Pablo-Sanchez C.
Martinez-Fernandez J.L.
Gonzalez-Ledesma A.
Samy D.
Martinez P.
Moreno-Sandoval A.
Al-Jumaily H.
Lecture Notes in Computer Science English 2008 This paper describes the adaptations of the MIRACLE group QA system in order to participate in the Spanish monolingual question answering task at QA@CLEF 2007. A system, initially developed for the EFE collection, was reused for Wikipedia. Answers from both collections were combined using temporal information extracted from questions and collections. Reusing the EFE subsystem has proven not feasible, and questions with answers only in Wikipedia have obtained low accuracy. Besides, a co-reference module based on heuristics was introduced for processing topic-related questions. This module achieves good coverage in different situations but it is hindered by the moderate accuracy of the base system and the chaining of incorrect answers. 0 0
Conflictual consensus in the chinese version of Wikipedia Liao H.-T. International Symposium on Technology and Society, Proceedings English 2008 The paper examines how the Chinese version of Wikipedia (CW) has recently developed to accommodate the diverse regional differences of its contributors. Although contributors are all users of the Chinese language, the orthographic, linguistic, regional and political differences among them do exist. Thus, CW has to attend to the different needs of users from four regions of origin (Mainland, Hong Kong/Macau, Taiwan, and Singapore/Malaysia). The paper shows how a technological polity is built, with an aim to accommodate regional diversity, by importing Wikipedia governance principles, implementing user-generated character conversion, and establishing the "Avoid Region-Centric Policy". It has been observed that although the orthographic and lexical differences have been preserved and respected, the offline political and ideological differences seem to threaten its potential growth, especially when compared to its self-censored copycat Baidu Baike. This paper then suggests it is neither the internal conflicts nor the external competition that matters most to CW, but rather the evolution of its polity. 0 0
Coreference resolution for questions and answer merging by validation Sven Hartrumpf
Glockner I.
Johannes Leveling
Lecture Notes in Computer Science English 2008 For its fourth participation at QA@CLEF, the German question answering (QA) system InSicht was improved for CLEF 2007 in the following main areas: questions containing pronominal or nominal anaphors are treated by a coreference resolver; the shallow QA methods are improved; and a specialized module is added for answer merging. Results showed a performance drop compared to last year mainly due to problems in handling the newly added Wikipedia corpus. However, dialog treatment by coreference resolution delivered very accurate results so that follow-up questions can be handled similarly to isolated questions. 0 0
Cross-language retrieval with wikipedia Schonhofen P.
Benczur A.
Biro I.
Csalogany K.
Lecture Notes in Computer Science English 2008 We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia based technique with a method based on bigram statistics of pairs formed by translations of different source language terms. 0 0
Dublin City University at CLEF 2007: Cross-language speech retrieval experiments YanChun Zhang
Jones G.J.F.
Zhang K.
Lecture Notes in Computer Science English 2008 The Dublin City University participation in the CLEF 2007 CL-SR English task concentrated primarily on issues of topic translation. Our retrieval system used the BM25F model and pseudo relevance feedback. Topics were translated into English using the Yahoo! BabelFish free online service combined with domain-specific translation lexicons gathered automatically from Wikipedia. We explored alternative topic translation methods using these resources. Our results indicate that extending machine translation tools using automatically generated domain-specific translation lexicons can provide improved CLIR effectiveness for this task. 0 0
Employing a domain specific ontology to perform semantic search Morneau M.
Mineau G.W.
Lecture Notes in Computer Science English 2008 Increasing the relevancy of Web search results has been a major concern in research over the last years. Boolean search, metadata, natural language based processing and various other techniques have been applied to improve the quality of search results sent to a user. Ontology-based methods were proposed to refine the information extraction process but they have not yet achieved wide adoption by search engines. This is mainly due to the fact that the ontology building process is time consuming. An all inclusive ontology for the entire World Wide Web might be difficult if not impossible to construct, but a specific domain ontology can be automatically built using statistical and machine learning techniques, as done with our tool: SeseiOnto. In this paper, we describe how we adapted the SeseiOnto software to perform Web search on the Wikipedia page on climate change. SeseiOnto, by using conceptual graphs to represent natural language and an ontology to extract links between concepts, manages to properly answer natural language queries about climate change. Our tests show that SeseiOnto has the potential to be used in domain specific Web search as well as in corporate intranets. 0 0
Enriching multilingual language resources by discovering missing cross-language links in Wikipedia Oh J.-H.
Daisuke Kawahara
Kiyotaka Uchimoto
Jun'ichi Kazama
Kentaro Torisawa
Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008 English 2008 We present a novel method for discovering missing cross-language links between English and Japanese Wikipedia articles. We collect candidates of missing cross-language links - a pair of English and Japanese Wikipedia articles, which could be connected by cross-language links. Then we select the correct cross-language links among the candidates by using a classifier trained with various types of features. Our method has three desirable characteristics for discovering missing links. First, our method can discover cross-language links with high accuracy (92% precision with 78% recall rates). Second, the features used in a classifier are language-independent. Third, without relying on any external knowledge, we generate the features based on resources automatically obtained from Wikipedia. In this work, we discover approximately 10^5 missing cross-language links from Wikipedia, which are almost two-thirds as many as the existing cross-language links in Wikipedia. 0 1
KANSHIN: A cross-lingual concern analysis system using multilingual blog articles Tomohiro Fukuhara
Arai Y.
Hidetaka Masuda
Kimura A.
Yoshinaka T.
Takehito Utsuro
Hiroshi Nakagawa
Proceedings - 2008 International Workshop on Information-Explosion and Next Generation Search, INGS 2008 English 2008 An architecture of cross-lingual concern analysis (CLCA) using multilingual blog articles, and its prototype system are described. As various people who are living in various countries use the Web, cross-lingual information retrieval (CLIR) plays an important role in the next generation search. In this paper, we propose a CLCA as one of CLIR applications for facilitating users to find concerns of people across languages. We propose a layer architecture of CLCA, and its prototype system called KANSHIN. The system collects Japanese, Chinese, Korean, and English blog articles, and analyzes concerns across languages. Users can find concerns from several viewpoints such as temporal, geographical, and a network of blog sites. The system also facilitates users to browse multilingual keywords using Wikipedia, and the system facilitates users to find spam blogs. An overview of the CLCA architecture and the system are described. 0 0
Lexical and semantic resources for NLP: From words to meanings Gentile A.L.
Pierpaolo Basile
Iaquinta L.
Giovanni Semeraro
Lecture Notes in Computer Science English 2008 A user expresses her information need through words with a precise meaning, but from the machine point of view this meaning does not come with the word. A further step is needful to automatically associate it to the words. Techniques that process human language are required and also linguistic and semantic knowledge, stored within distinct and heterogeneous resources, which play an important role during all Natural Language Processing (NLP) steps. Resources management is a challenging problem, together with the correct association between URIs coming from the resources and meanings of the words. This work presents a service that, given a lexeme (an abstract unit of morphological analysis in linguistics, which roughly corresponds to a set of words that are different forms of the same word), returns all syntactic and semantic information collected from a list of lexical and semantic resources. The proposed strategy consists in merging data with origin from stable resources, such as WordNet, with data collected dynamically from evolving sources, such as the Web or Wikipedia. That strategy is implemented in a wrapper to a set of popular linguistic resources that provides a single point of access to them, in a transparent way to the user, to accomplish the computational linguistic problem of getting a rich set of linguistic and semantic annotations in a compact way. 0 0
MIRACLE progress in monolingual information retrieval at Ad-Hoc CLEF 2007 Gonzalez-Cristobal J.-C.
Goni-Menoyo J.M.
Villena-Roman J.
Lana-Serrano S.
Lecture Notes in Computer Science English 2008 This paper presents the MIRACLE team's 2007 approach to the Ad-Hoc Information Retrieval track. The main work carried out for this campaign has been around monolingual experiments, in the standard and in the robust tracks. The most important contributions have been the general introduction of automatic named-entities extraction and the use of Wikipedia resources. For the 2007 campaign, runs were submitted for the following languages and tracks: a) Monolingual: Bulgarian, Hungarian, and Czech. b) Robust monolingual: French, English and Portuguese. 0 0
NAGA: Harvesting, searching and ranking knowledge Gjergji Kasneci
Suchanek F.M.
Ifrim G.
Elbassuoni S.
Maya Ramanath
Gerhard Weikum
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2008 The presence of encyclopedic Web sources, such as Wikipedia, the Internet Movie Database (IMDB), World Factbook, etc. calls for new querying techniques that are simple and yet more expressive than those provided by standard keyword-based search engines. Searching for explicit knowledge needs to consider inherent semantic structures involving entities and relationships. In this demonstration proposal, we describe a semantic search system named NAGA. NAGA operates on a knowledge graph, which contains millions of entities and relationships derived from various encyclopedic Web sources, such as the ones above. NAGA's graph-based query language is geared towards expressing queries with additional semantic information. Its scoring model is based on the principles of generative language models, and formalizes several desiderata such as confidence, informativeness and compactness of answers. We propose a demonstration of NAGA which will allow users to browse the knowledge base through a user interface, enter queries in NAGA's query language and tune the ranking parameters to test various ranking aspects. 0 0
Named entity normalization in user generated content Jijkoun V.
Khalid M.A.
Marx M.
Maarten de Rijke
Proceedings of SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data, AND'08 English 2008 Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task is one that has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements. Copyright 2008 ACM. 0 0
Overview of the CLEF 2007 multilingual question answering track Giampiccolo D.
Forner P.
Herrera J.
Penas A.
Ayache C.
Forascu C.
Jijkoun V.
Osenova P.
Rocha P.
Sacaleanu B.
Sutcliffe R.
Lecture Notes in Computer Science English 2008 The fifth QA campaign at CLEF [1], having its first edition in 2003, offered not only a main task but an Answer Validation Exercise (AVE) [2], which continued last year's pilot, and a new pilot: the Question Answering on Speech Transcripts (QAST) [3, 15]. The main task was characterized by the focus on cross-linguality, while covering as many European languages as possible. As novelty, some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions from a cluster possibly contain co-references between one of them and the others. Finally, the need for searching answers in web formats was satisfied by introducing Wikipedia as document corpus. The results and the analyses reported by the participants suggest that the introduction of Wikipedia and the topic related questions led to a drop in systems' performance. 0 0
Priberam's question answering system in QA@CLEF 2007 Amaral C.
Cassan A.
Figueira H.
Martins A.
Mendes A.
Mendes P.
Pinto C.
Vidal D.
Lecture Notes in Computer Science English 2008 This paper accounts for Priberam's participation in the monolingual question answering (QA) track of CLEF 2007. In previous participations, Priberam's QA system obtained encouraging results both in monolingual and cross-language tasks. This year we endowed the system with syntactical processing, in order to capture the syntactic structure of the question. The main goal was to obtain a more tuned question categorisation and consequently a more precise answer extraction. Besides this, we provided our system with the ability to handle topic-related questions and to use encyclopaedic sources like Wikipedia. The paper provides a description of the improvements made in the system, followed by the discussion of the results obtained in Portuguese and Spanish monolingual runs. 0 0
Question answering with joost at CLEF 2007 Gosse Bouma
Kloosterman G.
Mur J.
Van Noord G.
Van Der Plas L.
Tiedemann J.
Lecture Notes in Computer Science English 2008 We describe our system for the monolingual Dutch and multilingual English to Dutch QA tasks. We describe the preprocessing of Wikipedia, inclusion of query expansion in IR, anaphora resolution in follow-up questions, and a question classification module for the multilingual task. Our best runs achieved 25.5% accuracy for the Dutch monolingual task, and 13.5% accuracy for the multilingual task. 0 0
Simultaneous multilingual search for translingual information retrieval Parton K.
McKeown K.R.
Allan J.
Henestroza E.
International Conference on Information and Knowledge Management, Proceedings English 2008 We consider the problem of translingual information retrieval, where monolingual searchers issue queries in a different language than the document language(s) and the results must be returned in the language they know, the query language. We present a framework for translingual IR that integrates document translation and query translation into the retrieval model. The corpus is represented as an aligned, jointly indexed "pseudo-parallel" corpus, where each document contains the text of the document along with its translation into the query language. The queries are formulated as multilingual structured queries, where each query term and its translations into the document language(s) are treated as synonym sets. This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary-based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone. Simultaneous multilingual search also has other important features suited to translingual search, since it can provide an indication of poor document translation when a match with the source document is found. We show how close integration of CLIR and SMT allows us to improve result translation in addition to IR results. Copyright 2008 ACM. 0 0
The university of amsterdam's question answering system at QA@CLEF 2007 Jijkoun V.
Hofmann K.
Ahn D.
Khalid M.A.
Van Rantwijk J.
Maarten de Rijke
Tjong Kim Sang E.
Lecture Notes in Computer Science English 2008 We describe a new version of our question answering system, which was applied to the questions of the 2007 CLEF Question Answering Dutch monolingual task. This year, we made three major modifications to the system: (1) we added the contents of Wikipedia to the document collection and the answer tables; (2) we completely rewrote the module interface code in Java; and (3) we included a new table stream which returned answer candidates based on information which was learned from question-answer pairs. Unfortunately, the changes did not lead to improved performance. Unsolved technical problems at the time of the deadline have led to missing justifications for a large number of answers in our submission. Our single run obtained an accuracy of only 8% with an additional 12% of unsupported answers (compared to 21% in last year's task). 0 0
What Happened to Esfinge in 2007? Cabral L.M.
Costa L.F.
Diana Santos
Lecture Notes in Computer Science English 2008 Esfinge is a general domain Portuguese question answering system which uses the information available on the Web as an additional resource when searching for answers. Other external resources and tools used are a broad coverage parser, a morphological analyser, a named entity recognizer and a Web-based database of word co-occurrences. In this fourth participation in CLEF, in addition to the new challenges posed by the organization (topics and anaphors in questions and the use of Wikipedia to search and support answers), we experimented with a multiple question and multiple answer approach in QA. 0 0
Wikipedia mining for huge scale Japanese association thesaurus construction Kotaro Nakayama
Masahiro Ito
Takahiro Hara
Shojiro Nishio
Proceedings - International Conference on Advanced Information Networking and Applications, AINA English 2008 Wikipedia, a huge scale Web-based dictionary, is an impressive corpus for knowledge extraction. We already proved that Wikipedia can be used for constructing an English association thesaurus and our link structure mining method is significantly effective for this aim. However, we want to find out how we can apply this method to other languages and what the requirements, differences and characteristics are. Nowadays, Wikipedia supports more than 250 languages such as English, German, French, Polish and Japanese. Among Asian languages, the Japanese Wikipedia is the largest corpus in Wikipedia. In this research, therefore, we analyzed all Japanese articles in Wikipedia and constructed a huge scale Japanese association thesaurus. After constructing the thesaurus, we realized that it shows several impressive characteristics depending on language and culture. 0 0
The top-ten wikipedias: A quantitative analysis using wikixray Felipe Ortega
Gonzalez-Barahona J.M.
Gregorio Robles
ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings English 2007 In a few years, Wikipedia has become one of the information systems with more public (both producers and consumers) of the Internet. Its system and information architecture is relatively simple, but has proven to be capable of supporting the largest and more diverse community of collaborative authorship worldwide. In this paper, we analyze in detail this community, and the contents it is producing. Using a quantitative methodology based on the analysis of the public Wikipedia databases, we describe the main characteristics of the 10 largest language editions, and the authors that work in them. The methodology (which is almost completely automated) is generic enough to be used on the rest of the editions, providing a convenient framework to develop a complete quantitative analysis of the Wikipedia. Among other parameters, we study the evolution of the number of contributions and articles, their size, and the differences in contributions by different authors, inferring some relationships between contribution patterns and content. These relationships reflect (and in part, explain) the evolution of the different language editions so far, as well as their future trends. 0 0
From Wikipedia to semantic relationships: A semi-automated annotation approach Maria Ruiz-Casado
Enrique Alfonseca
Pablo Castells
CEUR Workshop Proceedings English 2006 In this paper, an experiment is presented for the automatic annotation of several semantic relationships in the Wikipedia, a collaborative on-line encyclopedia. The procedure is based on a methodology for the automatic discovery and generalisation of lexical patterns that allows the recognition of relationships among concepts. This methodology requires as information source any written, general-domain corpora and applies natural language processing techniques to extract the relationships from the textual corpora. It has been tested with eight different relations from the Wikipedia corpus. 0 0