Cross-language


Cross-language is included as a keyword or extra keyword in 0 datasets, 0 tools and 26 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title Author(s) Published in Language Date Abstract R C
Chinese and Korean cross-lingual issue news detection based on translation knowledge of Wikipedia Zhao S.
Tsolmon B.
Lee K.-S.
Lee Y.-S.
Lecture Notes in Electrical Engineering English 2014 Extracting cross-lingual issue news and analyzing the news content is an important and challenging task. The core of cross-lingual research is the process of translation. In this paper, we focus on extracting cross-lingual issue news from Chinese and Korean Twitter data. We propose a translation knowledge method based on Wikipedia concepts as well as the Chinese and Korean cross-lingual inter-Wikipedia link relations. The relevance relations are extracted from the categories and page titles of Wikipedia. The evaluation achieved an average precision of 83% for the top 10 extracted issue news. The results indicate that our method is effective for cross-lingual issue news detection. 0 0
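
The translation knowledge described above rests on Wikipedia's interlanguage links. A minimal sketch of that lookup, with hypothetical langlinks entries standing in for data extracted from a real Wikipedia dump:

```python
# Build a Chinese-Korean lexicon from Wikipedia interlanguage links.
# The `langlinks` entries below are hypothetical placeholders; real
# ones would come from a Wikipedia langlinks dump.
def build_bilingual_lexicon(langlinks, src="zh", tgt="ko"):
    """Map source-language titles to target-language titles."""
    return {e[src]: e[tgt] for e in langlinks if src in e and tgt in e}

langlinks = [
    {"zh": "首尔", "ko": "서울", "en": "Seoul"},
    {"zh": "地震", "ko": "지진", "en": "Earthquake"},
]
lexicon = build_bilingual_lexicon(langlinks)

def translate_terms(terms, lexicon):
    """Translate terms covered by Wikipedia; keep the rest as-is."""
    return [lexicon.get(t, t) for t in terms]

print(translate_terms(["首尔", "地震", "未知词"], lexicon))
```
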
Multilinguals and Wikipedia editing Hale S.A. WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 This article analyzes one month of edits to Wikipedia in order to examine the role of users editing multiple language editions (referred to as multilingual users). Such multilingual users may serve an important function in diffusing information across different language editions of the encyclopedia, and prior work has suggested this could reduce the level of self-focus bias in each edition. This study finds multilingual users are much more active than their single-edition (monolingual) counterparts. They are found in all language editions, but smaller editions with fewer users have a higher percentage of multilingual users than larger editions. About a quarter of multilingual users always edit the same articles in multiple languages, while just over 40% of multilingual users edit different articles in different languages. When non-English users do edit a second language edition, that edition is most frequently English. Nonetheless, several regional and linguistic cross-editing patterns are also present. 0 0
Okinawa in Japanese and English Wikipedia Hale S.A. Conference on Human Factors in Computing Systems - Proceedings English 2014 This research analyzes edits by foreign-language users in Wikipedia articles about Okinawa, Japan, in the Japanese and English editions of the encyclopedia. Okinawa, home to both English- and Japanese-speaking users, provides a good case for examining content differences and cross-language editing in a small geographic area on Wikipedia. Consistent with prior work, this research finds large differences in the representations of Okinawa in the content of the two editions. The number of users crossing the language boundary to edit both editions is also extremely small. When users do edit in a non-primary language, they most frequently edit articles that have cross-language (interwiki) links, articles that are edited more by other users, and articles that have more images. Finally, the possible value of edits from foreign-language users and design possibilities to motivate wider contributions from foreign-language users are discussed. 0 0
Boosting cross-lingual knowledge linking via concept annotation Zhe Wang
Jing-Woei Li
Tang J.
IJCAI International Joint Conference on Artificial Intelligence English 2013 Automatically discovering cross-lingual links (CLs) between wikis can largely enrich the cross-lingual knowledge and facilitate knowledge sharing across different languages. In most existing approaches for cross-lingual knowledge linking, the seed CLs and the inner link structures are two important factors for finding new CLs. When there are insufficient seed CLs and inner links, discovering new CLs becomes a challenging problem. In this paper, we propose an approach that boosts cross-lingual knowledge linking by concept annotation. Given a small number of seed CLs and inner links, our approach first enriches the inner links in wikis by using concept annotation method, and then predicts new CLs with a regression-based learning model. These two steps mutually reinforce each other, and are executed iteratively to find as many CLs as possible. Experimental results on the English and Chinese Wikipedia data show that the concept annotation can effectively improve the quantity and quality of predicted CLs. With 50,000 seed CLs and 30% of the original inner links in Wikipedia, our approach discovered 171,393 more CLs in four runs when using concept annotation. 0 0
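
The abstract's two mutually reinforcing steps can be pictured as a loop: annotate concepts to enrich inner links, train a model on the seed cross-lingual links (CLs), and feed confident predictions back as new seeds. A schematic sketch with synthetic data (the paper uses a regression-based model; the logistic regression and toy feature generator below are stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def candidate_features(n, n_inner_links):
    # Toy stand-in: more inner links -> less noisy features, which
    # is the effect concept annotation has in the paper.
    noise = 1.0 / (1 + n_inner_links)
    y = rng.integers(0, 2, n)                  # hidden true CL labels
    X = y[:, None] + noise * rng.normal(size=(n, 3))
    return X, y

n_inner_links = 10
seed_X, seed_y = candidate_features(200, n_inner_links)

for it in range(4):
    model = LogisticRegression().fit(seed_X, seed_y)
    cand_X, _ = candidate_features(500, n_inner_links)
    conf = model.predict_proba(cand_X)[:, 1]
    keep = (conf > 0.9) | (conf < 0.1)         # confident predictions only
    seed_X = np.vstack([seed_X, cand_X[keep]])
    seed_y = np.concatenate([seed_y, (conf[keep] > 0.5).astype(int)])
    n_inner_links += int((conf > 0.9).sum())   # new CLs enrich inner links
    print(f"iteration {it}: accepted {int(keep.sum())} confident pairs")
```
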
Cross lingual entity linking with bilingual topic model Zhang T.
Kang Liu
Jun Zhao
IJCAI International Joint Conference on Artificial Intelligence English 2013 Cross-lingual entity linking means linking an entity mention in a background source document in one language with the corresponding real-world entity in a knowledge base written in the other language. The key problem is to measure the similarity score between the context of the entity mention and the document of the candidate entity. This paper presents a general framework for cross-lingual entity linking that leverages a large-scale bilingual knowledge base, Wikipedia. We introduce a bilingual topic model that mines bilingual topics from this knowledge base, under the assumption that the documents of the same Wikipedia concept in two different languages share the same semantic topic distribution. The extracted topics have two types of representation, with each type corresponding to one language. Thus both the context of the entity mention and the document of the candidate entity can be represented in a space using the same semantic topics. We use these topics to do cross-lingual entity linking. Experimental results show that the proposed approach obtains competitive results compared with the state-of-the-art approach. 0 0
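
Once mention contexts and candidate entity documents live in the same bilingual topic space, linking reduces to a similarity ranking. A minimal sketch with made-up topic distributions (the paper learns these with its bilingual topic model over aligned Wikipedia concept documents):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Topic distribution of an English mention context (hypothetical)...
mention_ctx = np.array([0.70, 0.20, 0.10])
# ...and of two Chinese candidate entity documents in the KB.
candidates = {
    "实体A": np.array([0.65, 0.25, 0.10]),
    "实体B": np.array([0.05, 0.15, 0.80]),
}
best = max(candidates, key=lambda e: cosine(mention_ctx, candidates[e]))
print(best)  # the candidate whose topic mix best matches the context
```
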
Searching for Translated Plagiarism with the Help of Desktop Grids Pataki M.
Marosi A.C.
Journal of Grid Computing English 2013 Translated or cross-lingual plagiarism is defined as the translation of someone else's work or words without marking it as such or without giving credit to the original author. Cross-lingual plagiarism is not new, but only in recent years, with the rapid development of natural language processing, have the first algorithms appeared that tackle the difficult task of detecting it. Most of these algorithms utilize machine translation to compare texts written in different languages. We propose a different method, which can effectively detect translations between language pairs where machine translation still produces low-quality results. Our new algorithm is based on information retrieval (IR) and a dictionary-based similarity metric. The preprocessing of the candidate documents for IR is computationally intensive, but easily parallelizable, and we propose a desktop grid solution for this task. As the application is time-sensitive and desktop grid peers are unreliable, a resubmission mechanism assures that all jobs of a batch finish within a reasonable time without dramatically increasing the load on the whole system. 0 0
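
One simple variant of a dictionary-based cross-lingual similarity metric of the kind the abstract mentions: expand each source-language term through a bilingual dictionary and count how many have a translation in the suspicious document. The dictionary and texts below are toy placeholders, not the paper's actual resources:

```python
def dict_similarity(src_terms, tgt_terms, bilingual_dict):
    """Fraction of source terms with at least one dictionary
    translation present in the target document."""
    tgt = set(tgt_terms)
    hits = sum(
        1 for t in src_terms
        if any(tr in tgt for tr in bilingual_dict.get(t, ()))
    )
    return hits / max(len(src_terms), 1)

bilingual_dict = {"ház": ["house", "home"], "kutya": ["dog"]}
print(dict_similarity(["ház", "kutya", "fa"],
                      ["the", "dog", "left", "home"],
                      bilingual_dict))  # 2/3: "fa" has no entry
```
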
BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network Roberto Navigli
Ponzetto S.P.
Artificial Intelligence English 2012 We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, machine translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of-the-art results on three different SemEval evaluation tasks. 0 0
Building a large scale knowledge base from Chinese Wiki Encyclopedia Zhe Wang
Jing-Woei Li
Pan J.Z.
Lecture Notes in Computer Science English 2012 DBpedia has proved to be a successful structured knowledge base, and large-scale Semantic Web data has been built using DBpedia as the central interlinking hub of the Web of Data in English. In Chinese, however, owing to the heavy imbalance in size between the English and Chinese Wikipedia (the latter no more than one tenth of the former), little Chinese linked data has been published and linked to DBpedia, which hinders structured knowledge sharing both within Chinese resources and across languages. This paper aims at building a large-scale Chinese structured knowledge base from Hudong, one of the largest Chinese wiki encyclopedia websites. An upper-level ontology schema in Chinese is first learned based on the category system and infobox information in Hudong. In total, 19,542 concepts are inferred and organized in a hierarchy of up to 20 levels. 2,381 properties with domain and range information are learned from the attributes in the Hudong infoboxes. Then, 802,593 instances are extracted and described using the concepts and properties in the learned ontology. These extracted instances cover a wide range of things, including persons, organizations, places and so on. Among all the instances, 62,679 are linked to identical instances in DBpedia. Moreover, the paper provides an RDF dump and SPARQL access to the established Chinese knowledge base. The general upper-level ontology and wide coverage make the knowledge base a valuable Chinese semantic resource. It can be used not only in Chinese linked-data building, the fundamental work for constructing a multilingual knowledge base across heterogeneous resources of different languages, but can also facilitate many applications of large-scale knowledge bases such as question answering and semantic search. 0 0
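
The instance-description step above amounts to turning infobox attribute-value pairs into RDF triples. A minimal sketch using rdflib, with a placeholder namespace and a hypothetical infobox (the paper's actual schema and URIs differ):

```python
from rdflib import Graph, Literal, Namespace

# Placeholder namespace; not the knowledge base's real URI scheme.
HUDONG = Namespace("http://example.org/hudong/")

def infobox_to_rdf(title, infobox, graph):
    """Emit one triple per infobox attribute of the given page."""
    subject = HUDONG[title]
    for prop, value in infobox.items():
        graph.add((subject, HUDONG[prop], Literal(value)))
    return graph

g = infobox_to_rdf("北京", {"人口": "2154万", "面积": "16410平方千米"}, Graph())
print(g.serialize(format="turtle"))
```
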
CoSyne: Synchronizing multilingual wiki content Bronner A.
Matteo Negri
Yashar Mehdad
Angela Fahrni
Christof Monz
WikiSym 2012 English 2012 CoSyne is a content synchronization system for assisting users and organizations involved in the maintenance of multilingual wikis. The system allows users to explore the diversity of multilingual content using a monolingual view. It provides suggestions for content modification based on additional or more specific information found in other language versions, and enables seamless integration of automatically translated sentences while giving users the flexibility to edit, correct and control eventual changes to the wiki page. To support these tasks, CoSyne employs state-of-the-art machine translation and natural language processing techniques. 0 0
Cross-lingual knowledge linking across wiki knowledge bases Zhe Wang
Jing-Woei Li
Tang J.
WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web English 2012 Wikipedia has become one of the largest knowledge bases on the Web, attracting 513 million page views per day in January 2012. However, one critical issue for Wikipedia is that articles in different languages are very unbalanced. For example, the number of English Wikipedia articles has reached 3.8 million, while the number of Chinese articles is still less than half a million, and there are only 217 thousand cross-lingual links between articles of the two languages. On the other hand, there are more than 3.9 million Chinese wiki articles on Baidu Baike and Hudong.com, two popular encyclopedias in Chinese. One important question is how to link the knowledge entries distributed across different knowledge bases. This would immensely enrich the information in online knowledge bases and benefit many applications. In this paper, we study the problem of cross-lingual knowledge linking and present a linkage factor graph model. Features are defined according to some interesting observations. Experiments on the Wikipedia data set show that our approach can achieve a high precision of 85.8% with a recall of 88.1%. The approach found 202,141 new cross-lingual links between English Wikipedia and Baidu Baike. 0 0
English-to-traditional Chinese cross-lingual link discovery in articles with Wikipedia corpus Chen L.-P.
Shih Y.-L.
Chen C.-T.
Ku T.
Hsieh W.-T.
Chiu H.-S.
Yang R.-D.
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012 English 2012 In this paper, we design a processing flow to produce linked data in articles, providing anchor-based additional information for terms and related terms in different languages (English to Chinese). Wikipedia has become a very important corpus and knowledge bank. Although Wikipedia describes itself as not a dictionary or encyclopedia, it has high potential value in applications and data-mining research. Link discovery is a useful IR application, based on data mining and NLP algorithms, and has been used in several fields. According to the results of our experiment, this method improves the results. 0 0
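
Anchor-based link discovery of the kind described above can be reduced to two lookups: known English titles in the text, and their Chinese counterparts via interlanguage links. A minimal sketch with placeholder data (real systems match against the full title inventory and disambiguate):

```python
def discover_links(text, en_to_zh):
    """Find known English titles in the text (longest first) and
    map each to its Chinese counterpart."""
    found = []
    for title in sorted(en_to_zh, key=len, reverse=True):
        if title.lower() in text.lower():
            found.append((title, en_to_zh[title]))
    return found

# Toy interlanguage-link table (placeholders).
en_to_zh = {"Machine translation": "机器翻译", "Taiwan": "臺灣"}
text = "Machine translation research in Taiwan has grown quickly."
print(discover_links(text, en_to_zh))
```
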
Exploiting Wikipedia for cross-lingual and multilingual information retrieval Sorg P.
Philipp Cimiano
Data and Knowledge Engineering English 2012 In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA) which indexes documents with respect to explicit interlingual concepts. These concepts are considered interlingual and universal and in our case correspond either to Wikipedia articles or categories. Each concept is associated with a text signature in each language which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept, which is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-of-Words model to a Bag-of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR setting, CL-ESA benefits from a certain level of abstraction in the sense that using categories instead of articles as in the original ESA model delivers better results. 0 0
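
The core ESA mapping behind CL-ESA is compact enough to sketch: a document's concept vector holds its association strength with each Wikipedia concept, and two languages become comparable because the concept dimensions are aligned via interlanguage links. A toy sketch with three aligned concepts and invented text signatures:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One text signature per aligned concept and language (placeholders).
concepts_en = ["music song melody", "football match goal", "economy market trade"]
concepts_de = ["Musik Lied Melodie", "Fußball Spiel Tor", "Wirtschaft Markt Handel"]

vec_en = TfidfVectorizer().fit(concepts_en)
vec_de = TfidfVectorizer().fit(concepts_de)
C_en = vec_en.transform(concepts_en)      # concept-by-term matrices
C_de = vec_de.transform(concepts_de)

def esa(doc, vec, C):
    """Project a document into the shared concept space."""
    return (C @ vec.transform([doc]).T).T   # 1 x n_concepts

d_en = esa("a new song and a catchy melody", vec_en, C_en)
d_de = esa("ein neues Lied mit Melodie", vec_de, C_de)
print(cosine_similarity(d_en, d_de))        # cross-lingual relatedness score
```
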
Editing knowledge resources: The wiki way Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
International Conference on Information and Knowledge Management, Proceedings English 2011 The creation, customization, and maintenance of knowledge resources are essential for fostering the full deployment of Language Technologies. The definition and refinement of knowledge resources are time- and resource-consuming activities. In this paper we explore how the Wiki paradigm for online collaborative content editing can be exploited to gather massive social contributions from common Web users in editing knowledge resources. We discuss the Wikyoto Knowledge Editor, also called Wikyoto. Wikyoto is a collaborative Web environment that enables users with no knowledge engineering background to edit the multilingual network of knowledge resources exploited by KYOTO, a cross-lingual text mining system developed in the context of the KYOTO European Project. 0 0
English-to-Korean cross-lingual link detection for Wikipedia Marigomen R.
Kang I.-S.
Communications in Computer and Information Science English 2011 In this paper, we introduce a method for automatically discovering possible links between documents in different languages. We utilize the large collection of articles in Wikipedia as our resource for keyword extraction, word sense disambiguation, and the creation of a bilingual dictionary. Using this set of methods, given an English input document, the system automatically determines important words or phrases within the context and links them to corresponding Wikipedia articles in other languages. In this system we use the Korean Wikipedia corpus as the linking target. 0 0
Knowledge transfer across multilingual corpora via latent topics De Smet W.
Tang J.
Moens M.-F.
Lecture Notes in Computer Science English 2011 This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available. 0 0
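
The transfer step in this abstract is direct: a classifier trained on topic proportions of labelled documents in one language applies unchanged to another language, because both are expressed over the same latent topics. A schematic sketch with synthetic topic vectors standing in for the bilingual topic model's output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def topic_vectors(labels, shift):
    """Synthetic per-document topic proportions for two classes."""
    centers = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
    return centers[labels] + 0.05 * rng.normal(size=(len(labels), 3)) + shift

y_src = rng.integers(0, 2, 300)            # labelled source language
X_src = topic_vectors(y_src, shift=0.0)
y_tgt = rng.integers(0, 2, 300)            # target language, no training labels
X_tgt = topic_vectors(y_tgt, shift=0.02)   # same topic space, slight drift

clf = LogisticRegression().fit(X_src, y_src)
print("cross-lingual accuracy:", (clf.predict(X_tgt) == y_tgt).mean())
```
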
Learning from partially annotated sequences Fernandes E.R.
Brefeld U.
Lecture Notes in Computer Science English 2011 We study sequential prediction models in cases where only fragments of the sequences are annotated with the ground-truth. The task does not match the standard semi-supervised setting and is highly relevant in areas such as natural language processing, where completely labeled instances are expensive and require editorial data. We propose to generalize the semi-supervised setting and devise a simple transductive loss-augmented perceptron to learn from inexpensive partially annotated sequences that could for instance be provided by laymen, the wisdom of the crowd, or even automatically. Experiments on mono- and cross-lingual named entity recognition tasks with automatically generated partially annotated sentences from Wikipedia demonstrate the effectiveness of the proposed approach. Our results show that learning from partially labeled data is never worse than standard supervised and semi-supervised approaches trained on data with the same ratio of labeled and unlabeled tokens. 0 0
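
The key mechanic of learning from partial annotation can be shown with an ordinary token-level perceptron: update only at positions whose gold label is known and skip the rest. This is a deliberate simplification of the paper's transductive loss-augmented structured perceptron:

```python
def train_partial_perceptron(sequences, labels, classes, epochs=5):
    w = {c: {} for c in classes}              # per-class feature weights
    def score(c, feats):
        return sum(w[c].get(f, 0.0) for f in feats)
    for _ in range(epochs):
        for seq, gold in zip(sequences, labels):
            for tok, y in zip(seq, gold):
                if y is None:                 # unannotated token: skip
                    continue
                feats = [f"w={tok.lower()}", f"cap={tok[0].isupper()}"]
                pred = max(classes, key=lambda c: score(c, feats))
                if pred != y:                 # standard perceptron update
                    for f in feats:
                        w[y][f] = w[y].get(f, 0.0) + 1.0
                        w[pred][f] = w[pred].get(f, 0.0) - 1.0
    return w

# Partially annotated toy sequences (None = no gold label).
seqs = [["Obama", "visited", "Berlin"], ["Merkel", "spoke"]]
labs = [["PER", None, "LOC"], ["PER", "O"]]
weights = train_partial_perceptron(seqs, labs, ["PER", "LOC", "O"])
```
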
No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity Ture F.
Elsayed T.
Lin J.
SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2011 This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints. 0 0
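
The LSH family that fits cosine similarity is signed random projection: vectors on the same side of a set of random hyperplanes share a hash bucket, and only bucket-mates are verified exactly. A minimal sketch on synthetic vectors (the paper's system additionally shards this with MapReduce):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
docs = rng.normal(size=(1000, 64))        # stand-in document vectors
planes = rng.normal(size=(16, 64))        # 16 random hyperplanes

# 16-bit signature: which side of each hyperplane a vector falls on.
bits = (docs @ planes.T) > 0
buckets = defaultdict(list)
for i, row in enumerate(np.packbits(bits, axis=1)):
    buckets[row.tobytes()].append(i)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Only pairs sharing a bucket are candidates; verify them exactly.
pairs = [(i, j) for b in buckets.values() if len(b) > 1
         for i in b for j in b if i < j]
similar = [p for p in pairs if cosine(docs[p[0]], docs[p[1]]) > 0.5]
print(len(pairs), "candidate pairs,", len(similar), "verified")
```
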
Ranking multilingual documents using minimal language dependent resources Santosh G.S.K.
Kiran Kumar N.
Vasudeva Varma
Lecture Notes in Computer Science English 2011 This paper proposes an approach for extracting simple and effective features that enhance multilingual document ranking (MLDR). There is limited prior research on capturing the concept of multilingual document similarity in determining the ranking of documents, and the available literature relies heavily on language-specific tools, making the approaches hard to reimplement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (a bilingual dictionary). No language-specific tools are used, making this approach extensible to other languages. We used the datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for its 2010 ad hoc cross-lingual document retrieval task on Indian languages. Experiments have been performed with different ranking algorithms and their results are compared. The results showcase the effectiveness of the considered features in enhancing multilingual document ranking. 0 0
Cross-lingual analysis of concerns and reports on crimes in blogs Hiroyuki Nakasaki
Abe Y.
Takehito Utsuro
Kawada Y.
Tomohiro Fukuhara
Kando N.
Yoshioka M.
Hiroshi Nakagawa
Yoji Kiyota
Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering English 2010 Among the domains and topics whose issues are frequently argued in the blogosphere, crime is one of the most seriously discussed by various kinds of bloggers. Such information on crimes in blogs is especially valuable for outsiders from abroad who are not familiar with the cultures and crimes of foreign countries. This paper proposes a framework for cross-lingually analyzing people's concerns, reports, and experiences on crimes in their own blogs. In the retrieval of blog feeds/posts, we take two approaches, focusing on various types of bloggers such as experts in the crime domain and victims of criminal acts. 0 0
Translingual document representations from discriminative projections Platt J.C.
Yih W.-T.
Toutanova K.
EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2010 Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best. 0 0
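
OPCA can be sketched as a generalized eigenproblem: keep directions with high overall variance (signal) while collapsing the difference between comparable document pairs (noise). A synthetic-data sketch in that spirit; the paper's exact covariance definitions may differ:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 20))               # documents, language A
Y = X + 0.1 * rng.normal(size=(500, 20))     # comparable documents, language B

C_signal = np.cov(np.vstack([X, Y]).T)       # overall variance to preserve
C_noise = np.cov((X - Y).T) + 1e-6 * np.eye(20)  # paired differences to collapse

# Solve C_signal v = lambda * C_noise v; eigh returns ascending
# eigenvalues, so the last columns span the best projection.
vals, vecs = eigh(C_signal, C_noise)
projection = vecs[:, -5:]                    # keep 5 strongest directions
print((X @ projection).shape)                # documents in translingual space
```
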
Cross-lingual Dutch to English alignment using EuroWordNet and Dutch Wikipedia Gosse Bouma CEUR Workshop Proceedings English 2009 This paper describes a system for linking the thesaurus of the Netherlands Institute for Sound and Vision to English WordNet and DBpedia. We used EuroWordNet, a multilingual wordnet, and Dutch Wikipedia as intermediaries for the two alignments. EuroWordNet covers most of the subject terms in the thesaurus, but the organization of the cross-lingual links makes selection of the most appropriate English target term almost impossible. Using page titles, redirects, disambiguation pages, and anchor text harvested from Dutch Wikipedia gives reasonable performance on subject terms and geographical terms. Many person and organization names in the thesaurus could not be located in (Dutch or English) Wikipedia. 0 0
Cross-lingual semantic relatedness using encyclopedic knowledge Hassan S.
Rada Mihalcea
EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 English 2009 In this paper, we address the task of crosslingual semantic relatedness. We introduce a method that relies on the information extracted from Wikipedia, by exploiting the interlanguage links available between Wikipedia versions in multiple languages. Through experiments performed on several language pairs, we show that the method performs well, with a performance comparable to monolingual measures of relatedness. 0 0
Crosslanguage Retrieval Based on Wikipedia Statistics Andreas Juffinger
Roman Kern
Michael Granitzer
Lecture Notes in Computer Science English 2009 In this paper we present the methodology, implementations and evaluation results of the cross-language retrieval system we developed for the Robust WSD Task at CLEF 2008. Our system is based on query preprocessing for translation and homogenisation of queries. The preprocessing of queries includes two stages: first, a query translation step based on term statistics of co-occurring articles in Wikipedia; second, different disjunct query composition techniques to search in the CLEF corpus. We apply the same preprocessing steps for the monolingual as well as the cross-lingual task, thereby acting fairly and in a similar way across these tasks. The evaluation revealed that the similar processing comes at nearly no cost for monolingual retrieval but enables us to do cross-language retrieval and also allows a feasible comparison of our system's performance on these two tasks. 0 0
Cross-lingual blog analysis based on multilingual blog distillation from multilingual Wikipedia entries Mariko Kawaba
Hiroyuki Nakasaki
Takehito Utsuro
Tomohiro Fukuhara
ICWSM 2008 - Proceedings of the 2nd International Conference on Weblogs and Social Media English 2008 The goal of this paper is to cross-lingually analyze multilingual blogs collected with a topic keyword. The framework for collecting multilingual blogs with a topic keyword is designed as a blog distillation (feed search) procedure. Multilingual queries for retrieving blog feeds are created from Wikipedia entries. Finally, we cross-lingually and cross-culturally compare less well-known facts and opinions that are closely related to a given topic. Preliminary evaluation results support the effectiveness of the proposed framework. 0 0
Enriching the crosslingual link structure of wikipedia - A classification-based approach Sorg P.
Philipp Cimiano
AAAI Workshop - Technical Report English 2008 The crosslingual link structure of Wikipedia represents a valuable resource which can be exploited for crosslingual natural language processing applications. However, this requires that it has reasonable coverage and is furthermore accurate. For the specific language pair German/English that we consider in our experiments, we show that roughly 50% of the articles are linked from German to English and only 14% from English to German. These figures clearly corroborate the need for an approach to automatically induce new cross-language links, especially in light of such a dynamically growing resource as Wikipedia. In this paper we present a classification-based approach with the goal of inferring new cross-language links. Our experiments show that this approach has a recall of 70% with a precision of 94% for the task of learning cross-language links on a test dataset. 0 0
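
The classification framing above is straightforward to sketch: represent each candidate German-English article pair with a few features and train a binary link/no-link classifier. The features and data below are synthetic placeholders for the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000
y = rng.integers(0, 2, n)                  # 1 = pair should be cross-linked
# Toy pair features, e.g. title similarity, already-linked shared
# neighbours, category overlap.
X = y[:, None] * np.array([0.6, 0.5, 0.4]) + rng.normal(scale=0.4, size=(n, 3))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", round(precision_score(y_te, pred), 3),
      "recall:", round(recall_score(y_te, pred), 3))
```
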
GikiP: Evaluating geographical answers from Wikipedia Diana Santos
Nuno Cardoso
International Conference on Information and Knowledge Management, Proceedings English 2008 This paper describes GikiP, a pilot task that took place in 2008 at CLEF. We present the motivation behind GikiP and the use of Wikipedia as the evaluation collection, detail the task, and list new ideas for its continuation. 0 0