From WikiPapers

Translation is included as keyword or extra keyword in 0 datasets, 0 tools and 47 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Acquisition des traductions de requêtes à partir de wikipédia pour la recherche d'information translingue Chakour H.
Sadat F.
Vision 2020: Sustainable Growth, Economic Development, and Global Competitiveness - Proceedings of the 23rd International Business Information Management Association Conference, IBIMA 2014 French 2014 The multilingual encyclopedia Wikipedia has become a very useful resource for the construction and enrichment of linguistic resources, such as dictionaries and ontologies. In this study, we are interested in exploiting Wikipedia for query translation in Cross-Language Information Retrieval. An application is completed for the Arabic-English pair of languages. All possible translation candidates are extracted from the titles of Wikipedia articles based on the inter-links between Arabic and English, which is considered direct translation. Furthermore, other links such as Arabic to French and French to English are exploited for a transitive translation. A slight stemming and segmentation of the query into multiple tokens can be made if no translation can be found for the entire query. Assessments of monolingual and cross-lingual systems were conducted using three weighting schemes of the Lucene search engine (default, Tf-Idf and BM25). In addition, the performance of this translation method was compared with that of Google Translate and MyMemory. 0 0
Cross-language and cross-encyclopedia article linking using mixed-language topic model and hypernym translation Wang Y.-C.
Wu C.-K.
Tsai R.T.-H.
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Creating cross-language article links among different online encyclopedias is now an important task in the unification of multilingual knowledge bases. In this paper, we propose a cross-language article linking method using a mixed-language topic model and hypernym translation features based on an SVM model to link English Wikipedia and Chinese Baidu Baike, the most widely used Wiki-like encyclopedia in China. To evaluate our approach, we compile a data set from the top 500 Baidu Baike articles and their corresponding English Wiki articles. The evaluation results show that our approach achieves 80.95% in MRR and 87.46% in recall. Our method does not heavily depend on linguistic characteristics and can be easily extended to generate crosslanguage article links among different online encyclopedias in other languages. 0 0
Could someone please translate this? - Activity analysis of wikipedia article translation by non-experts Ari Hautasaari English 2013 Wikipedia translation activities aim to improve the quality of the multilingual Wikipedia through article translation. We performed an activity analysis of the translation work done by individual English to Chinese non-expert translators, who translated linguistically complex Wikipedia articles in a laboratory setting. From the analysis, which was based on Activity Theory, and which examined both information search and translation activities, we derived three translation strategies that were used to inform the design of a support system for human translation activities in Wikipedia. 0 0
Managing information disparity in multilingual document collections Kevin Duh
Yeung C.-M.A.
Iwata T.
Masaaki Nagata
ACM Transactions on Speech and Language Processing English 2013 Information disparity is a major challenge with multilingual document collections. When documents are dynamically updated in a distributed fashion, information content among different language editions may gradually diverge. We propose a framework for assisting human editors to manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude our system is effective in a variety of scenarios. 0 0
Searching for Translated Plagiarism with the Help of Desktop Grids Pataki M.
Marosi A.C.
Journal of Grid Computing English 2013 Translated or cross-lingual plagiarism is defined as the translation of someone else's work or words without marking it as such or without giving credit to the original author. Cross-lingual plagiarism is not new, but only in recent years, with the rapid development of natural language processing, have the first algorithms appeared that tackle the difficult task of detecting it. Most of these algorithms utilize machine translation to compare texts written in different languages. We propose a different method, which can effectively detect translations between language pairs where machine translation still produces low-quality results. Our new algorithm presented in this paper is based on information retrieval (IR) and a dictionary-based similarity metric. The preprocessing of the candidate documents for the IR is computationally intensive, but easily parallelizable. We propose a desktop Grid solution for this task. As the application is time sensitive and the desktop Grid peers are unreliable, a resubmission mechanism is used which assures that all jobs of a batch finish within a reasonable time period without dramatically increasing the load on the whole system. 0 0
TransWiki: Supporting translation teaching Biuk-Aghai R.P.
Hari Venkatesan
Lecture Notes in Computer Science English 2013 Web-based learning systems have become common in recent years and wikis, websites whose pages anyone can edit, have enabled online collaborative text production. When applied to education, wikis have the potential to facilitate collaborative learning. We have developed a customized wiki system which we have used at our university in teaching translation in collaborative student groups. We report on the design and implementation of our wiki system and an evaluation of its use. 0 0
Cross-lingual knowledge discovery: Chinese-to-English article linking in wikipedia Tang L.-X.
Andrew Trotman
Shlomo Geva
Xu Y.
Lecture Notes in Computer Science English 2012 In this paper we examine automated Chinese to English link discovery in Wikipedia and the effects of Chinese segmentation and Chinese to English translation on the hyperlink recommendation. Our experimental results show that the implemented link discovery framework can effectively recommend Chinese-to-English cross-lingual links. The techniques described here can assist bi-lingual users where a particular topic is not covered in Chinese, is not equally covered in both languages, or is biased in one language; as well as for language learning. 0 0
Exploiting a web-based encyclopedia as a knowledge base for the extraction of multilingual terminology Sadat F. Lecture Notes in Computer Science English 2012 Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach to extract terms and their translations from different types of Wikipedia link information and data. The next step will be using linguistic-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations using the combined statistics-based and linguistic-based approaches were applied on different pairs of languages including Japanese, French and English. These evaluations showed a real open improvement and a good quality of the extracted term candidates for building or enriching multilingual ontologies, dictionaries or feeding a cross-language information retrieval system with the related expansion terms of the source query. 0 0
In search of the ur-Wikipedia: Universality, similarity, and translation in the Wikipedia inter-language link network Morten Warncke-Wang
Anuradha Uduwage
Zhenhua Dong
John Riedl
WikiSym 2012 English 2012 Wikipedia has become one of the primary encyclopaedic information repositories on the World Wide Web. It started in 2001 with a single edition in the English language and has since expanded to more than 20 million articles in 283 languages. Criss-crossing between the Wikipedias is an inter-language link network, connecting the articles of one edition of Wikipedia to another. We describe characteristics of articles covered by nearly all Wikipedias and those covered by only a single language edition, we use the network to understand how we can judge the similarity between Wikipedias based on concept coverage, and we investigate the flow of translation between a selection of the larger Wikipedias. Our findings indicate that the relationships between Wikipedia editions follow Tobler's first law of geography: similarity decreases with increasing distance. The number of articles in a Wikipedia edition is found to be the strongest predictor of similarity, while language similarity also appears to have an influence. The English Wikipedia edition is by far the primary source of translations. We discuss the impact of these results for Wikipedia as well as user-generated content communities in general. 0 0
Supporting multilingual discussion for collaborative translation Noriyuki Ishida
Lin D.
Toshiyuki Takasaki
Toru Ishida
Proceedings of the 2012 International Conference on Collaboration Technologies and Systems, CTS 2012 English 2012 In recent years, collaborative translation has become more and more important for translation volunteers to share knowledge among different languages, among which Wikipedia translation activity is a typical example. During the collaborative translation processes, users with different mother tongues always conduct frequent discussions about certain words or expressions to understand the content of original article and to decide the correct translation. To support such kind of multilingual discussions, we propose an approach to embedding a service-oriented multilingual infrastructure with discussion functions in collaborative translation systems, where discussions can be automatically translated into different languages with machine translators, dictionaries, and so on. Moreover, we propose a Meta Translation Algorithm to adapt the features of discussions for collaborative translation, where discussion articles always consist of expressions in different languages. Further, we implement the proposed approach on LiquidThreads, a BBS on Wikipedia, and apply it for multilingual discussion for Wikipedia translation to verify the effectiveness of this research. 0 0
Accessing dynamic web page in users language Sharma M.K.
Saha P.K.
Sarcar S.
Ghosh S.
Samanta D.
TechSym 2011 - Proceedings of the 2011 IEEE Students' Technology Symposium English 2011 In recent years, there is a rapid advancement in Information and Communication Technology (ICT). However, the explosive growth of ICT and its many applications in education, health, agriculture etc. are confined to a limited number of privileged people who have both language and digital literacy. At present the repositories in Internet are mainly in English, as a consequence users unfamiliar to English are not able to get benefits from Internet. Although many enterprises like Google have addressed this problem by providing translation engines but they have their own limitations. One major limitation is that translation engines fail to translate the dynamic content of the web pages which are written in English in web server database. We address the problem in this work and propose a user friendly interface mechanism through which a user can interact to any web services in Internet. We illustrate the access of Indian Railway Passenger Reservation System and interaction with Wikipedia English Website signifying the efficacy of the proposed mechanism as two case studies. 0 0
Analysis on multilingual discussion for Wikipedia translation Linsi Xia
Naomi Yamashita
Toru Ishida
Proceedings - 2011 2nd International Conference on Culture and Computing, Culture and Computing 2011 English 2011 In current Wikipedia translation activities, most translation tasks are performed by bilingual speakers who have high language skills and specialized knowledge of the articles. Unfortunately, compared to the large amount of Wikipedia articles, the number of such qualified translators is very small. Thus the success of Wikipedia translation activities hinges on the contributions from non-bilingual speakers. In this paper, we report on a study investigating the effects of introducing a machine translation mediated BBS that enables monolinguals to collaboratively translate Wikipedia articles using their mother tongues. From our experiment using this system, we found out that users made high use of the system and communicated actively across different languages. Furthermore, most of such multilingual discussions seemed to be successful in transferring knowledge between different languages. Such success appeared to be made possible by a distinctive communication pattern which emerged as the users tried to avoid misunderstandings from machine translation errors. These findings suggest that there is a fair chance of non-bilingual speakers being capable of effectively contributing to Wikipedia translation activities with the assistance of machine translation. 0 0
Calculating Wikipedia article similarity using machine translation evaluation metrics Maike Erdmann
Andrew Finch
Kotaro Nakayama
Eiichiro Sumita
Takahiro Hara
Shojiro Nishio
Proceedings - 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2011 English 2011 Calculating the similarity of Wikipedia articles in different languages is helpful for bilingual dictionary construction and various other research areas. However, standard methods for document similarity calculation are usually very simple. Therefore, we describe an approach of translating one Wikipedia article into the language of the other article, and then calculating article similarity with standard machine translation evaluation metrics. An experiment revealed that our approach is effective for identifying Wikipedia articles in different languages that are covering the same concept. 0 0
CoSyne: A framework for multilingual content synchronization of wikis Christof Monz
Vivi Nastase
Matteo Negri
Angela Fahrni
Yashar Mehdad
Michael Strube
WikiSym 2011 Conference Proceedings - 7th Annual International Symposium on Wikis and Open Collaboration English 2011 Wikis allow a large base of contributors easy access to shared content, and freedom in editing it. One of the side-effects of this freedom was the emergence of parallel and independently evolving versions in a variety of languages, reflecting the multilingual background of the pool of contributors. For the Wiki to properly represent the user-added content, this should be fully available in all its languages. Working on parallel Wikis in several European languages, we investigate the possibility to "synchronize" different language versions of the same document, by: i) pinpointing topically related pieces of information in the different languages, ii) identifying information that is missing or less detailed in one of the two versions, iii) translating this in the appropriate language, iv) inserting it in the appropriate place. Progress along such directions will allow users to share more easily content across language boundaries. 0 0
Cross-language information retrieval with latent topic models trained on a comparable corpus Vulic I.
De Smet W.
Moens M.-F.
Lecture Notes in Computer Science English 2011 In this paper we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation, which can be used as a translation resource in many different multilingual settings as comparable corpora are available for many language pairs. The probabilistic interlingual representation is incorporated in a statistical language model for information retrieval. Experiments performed on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show the competitive performance of our approach compared to cross-language retrieval methods that rely on pre-existing translation dictionaries that are hand-built or constructed based on parallel corpora. 0 0
Discussion about translation in Wikipedia Ari Hautasaari
Toru Ishida
Proceedings - 2011 2nd International Conference on Culture and Computing, Culture and Computing 2011 English 2011 Discussion pages in individual Wikipedia articles are a channel for communication and collaboration between Wikipedia contributors. Although discussion pages contribute to a large portion of the online encyclopedia, there have been relatively few in-depth studies conducted on the type of communication and collaboration in the multilingual Wikipedia, especially regarding translation activities. This paper reports the results on an analysis of discussion about translation in the Finnish, French and Japanese Wikipedias. The analysis results highlight the main problems in Wikipedia translation requiring interaction with the community. Unlike reported in previous works, community interaction in Wikipedia translation focuses on solving problems in source referencing, proper nouns and transliteration in articles, rather than mechanical translation of words and sentences. Based on these findings we propose future directions for supporting translation activities in Wikipedia. 0 0
Extracting the multilingual terminology from a web-based encyclopedia Fatiha S. Proceedings - International Conference on Research Challenges in Information Science English 2011 Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach to extract terms and their translations from different types of Wikipedia link information and data. The next step will be using linguistic-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations using the combined statistics-based and linguistic-based approaches were applied on different pairs of languages including Japanese, French and English. These evaluations showed a real open improvement and a good quality of the extracted term candidates for building or enriching multilingual ontologies, dictionaries or feeding a cross-language information retrieval system with the related expansion terms of the source query. 0 0
Hybrid and interactive domain-specific translation for multilingual access to digital libraries Jones G.J.F.
Fuller M.
Newman E.
YanChun Zhang
Lecture Notes in Computer Science English 2011 Accurate high-coverage translation is a vital component of reliable cross language information retrieval (CLIR) systems. This is particularly true for retrieval from archives such as Digital Libraries which are often specific to certain domains. While general machine translation (MT) has been shown to be effective for CLIR tasks in laboratory information retrieval evaluation tasks, it is generally not well suited to specialized situations where domain-specific translations are required. We demonstrate that effective query translation in the domain of cultural heritage (CH) can be achieved using a hybrid translation method which augments a standard MT system with domain-specific phrase dictionaries automatically mined from Wikipedia. We further describe the use of these components in a domain-specific interactive query translation service. The interactive system selects the hybrid translation by default, with other possible translations being offered to the user interactively to enable them to select alternative or additional translation(s). The objective of this interactive service is to provide user control of translation while maximising translation accuracy and minimizing the translation effort of the user. Experiments using our hybrid translation system with sample query logs from users of CH websites demonstrate a large improvement in the accuracy of domain-specific phrase detection and translation. 0 0
Identifying word translations from comparable corpora using latent topic models Vulic I.
De Smet W.
Moens M.-F.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported. 0 0
No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity Ture F.
Elsayed T.
Lin J.
SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2011 This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints. 0 0
Query and tag translation for Chinese-Korean cross-language social media retrieval Wang Y.-C.
Chen J.-T.
Tsai R.T.-H.
Hsu W.-L.
Proceedings of the 2011 IEEE International Conference on Information Reuse and Integration, IRI 2011 English 2011 Collaborative tagging has been widely adopted by social media websites to allow users to describe content with metadata tags. Tagging can greatly improve search results. We propose a cross-language social media retrieval system (CLSMR) to help users retrieve foreign-language tagged media content. We construct a Chinese to Korean CLSMR system that translates Chinese queries into Korean, retrieves content, and then translates the Korean tags in the search results back into Chinese. Our system translates NEs using a dictionary of bilingual NE pairs from Wikipedia and a pattern-based software translator which learns regular NE patterns from the web. The top-10 precision of YouTube retrieved results for our system was 0.39875. The K-C NE tag translation accuracy for the top-10 YouTube results was 77.6%, which shows that our translation method is fairly effective for named entities. A questionnaire given to users showed that automatically translated tags were considered as informative as a human-written summary. With our proposed CLSMR system, Chinese users can retrieve online Korean media files and get a basic understanding of their content with no knowledge of the Korean language. 0 0
Supporting multilingual discussion for Wikipedia translation Noriyuki Ishida
Toshiyuki Takasaki
Masanobu Ishimatsu
Toru Ishida
Proceedings - 2011 2nd International Conference on Culture and Computing, Culture and Computing 2011 English 2011 Wikipedia has become a useful source of content on the Web. However, the number of articles differs greatly from language to language. Some people try to increase the number of articles through translation, which requires discussion about the translation itself because an article contains specific words or phrases. They can use machine translation to participate in the discussion in their own language, which leads to some problems. In this paper, we present the "Meta Translation" algorithm, which keeps designated segments untranslated and adds a description to them. 0 0
The translation mining of the out of Vocabulary based on Wikipedia Sun C.
Hong Y.
Ge Y.
Yao J.
Qinghua Zhu
Jisuanji Yanjiu yu Fazhan/Computer Research and Development Chinese 2011 Query translation is one of the key factors that affect the performance of cross-language information retrieval (CLIR). In the process of querying, mining out-of-vocabulary (OOV) terms is important for improving CLIR. Out-of-vocabulary terms are words or phrases that cannot be found in the dictionary. In this paper, according to Wikipedia data structure and language features, we divide the translation environment into a target-existence environment and a target-deficit environment. Given the difficulty of translation mining in the target-deficit environment, we adopt frequency change information and adjacency information to extract candidate units, and compare common unit extraction methods. The results verify that our methods are more effective. We establish a mixed translation mining strategy based on the frequency-distance model, the surface pattern matching model and the summary-score model, add the models one by one, and then verify the influence of each model. The experiments use the OOV mining technique in search engines as the baseline and evaluate the results with TOP1. The results verify that the mixed translation mining method based on Wikipedia can achieve a correct translation rate of 0.6822, an improvement of 6.98% over the baseline. 0 0
A monolingual tree-based translation model for sentence simplification Zhu Z.
Bernhard D.
Iryna Gurevych
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference English 2010 In this paper, we consider sentence simplification as a special form of translation with the complex sentence as the source and the simple sentence as the target. We propose a Tree-based Simplification Model (TSM), which, to our knowledge, is the first statistical simplification model covering splitting, dropping, reordering and substitution integrally. We also describe an efficient method to train our model with a large-scale parallel dataset obtained from the Wikipedia and Simple Wikipedia. The evaluation shows that our model achieves better readability scores than a set of baseline systems. 0 0
Creating a Wikipedia-based Persian-English word association dictionary Rahimi Z.
Shakery A.
2010 5th International Symposium on Telecommunications, IST 2010 English 2010 One of the most important issues in cross language information retrieval is how to cross the language barrier between the query and the documents. Different translation resources have been studied for this purpose. In this research, we study using Wikipedia for query translation by constructing a Wikipedia-based bilingual association dictionary. We use English and Persian Wikipedia inter-language links to align related titles and then mine word by word associations between the two languages using the extracted alignments. We use the mined word association dictionary for translating queries in Persian-English cross language information retrieval. Our experimental results on the Hamshahri corpus show that the proposed method is effective in extracting word associations and that Persian Wikipedia is a promising translation resource. Using the association dictionary, we can improve the pure dictionary-based method, where the only translation resource is a bilingual dictionary, by 33.6% and its recall by 26.2%. 0 0
Cross-language retrieval using link-based language models Benjamin Roth
Dietrich Klakow
SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval English 2010 We propose a cross-language retrieval model that is solely based on Wikipedia as a training corpus. The main contributions of our work are: 1. A translation model based on linked text in Wikipedia and a term weighting method associated with it. 2. A combination scheme to interpolate the link translation model with retrieval based on Latent Dirichlet Allocation. On the CLEF 2000 data we achieve improvement with respect to the best German-English system at the bilingual track (non-significant) and improvement against a baseline based on machine translation (significant). 0 0
Evaluating cross-language explicit semantic analysis and cross querying Maik Anderka
Nedim Lipka
Benno Stein
Lecture Notes in Computer Science English 2010 This paper describes our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track. The task is to retrieve items from various multilingual collections of library catalog records, which are relevant to a user's query. Two different strategies are employed: (i) the Cross-Language Explicit Semantic Analysis, CL-ESA, where the library catalog records and the queries are represented in a multilingual concept space that is spanned by aligned Wikipedia articles, and, (ii) a Cross Querying approach, where a query is translated into all target languages using Google Translate and where the obtained rankings are combined. The evaluation shows that both strategies outperform the monolingual baseline and achieve comparable results. Furthermore, inspired by the Generalized Vector Space Model we present a formal definition and an alternative interpretation of the CL-ESA model. This interpretation is interesting for real-world retrieval applications since it reveals how the computational effort for CL-ESA can be shifted from the query phase to a preprocessing phase. 0 0
Japanese-chinese information retrieval with an iterative weighting scheme Lin C.-C.
Wang Y.U.-C.
Tsai R.T.-H.
Journal of Information Science and Engineering English 2010 This paper describes our Japanese-Chinese cross language information retrieval system. We adopt a query-translation approach and employ both a conventional Japanese-Chinese bilingual dictionary and Wikipedia to translate query terms. We propose that Wikipedia can be regarded as a good dictionary for named entity translation. According to the nature of the Japanese writing system, we propose that query terms should be processed differently based on their written forms. We use an iterative method for weight-tuning and term disambiguation, which is based on the PageRank algorithm. When evaluating on the NTCIR-5 test set, our system achieves as high as 0.2217 and 0.2276 in relax MAP (Mean Average Precision) measurement of T-runs and D-runs. 0 0
Revisiting context-based projection methods for term-translation spotting in comparable corpora Laroche A.
Philippe Langlais
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference English 2010 Context-based projection methods for identifying the translation of terms in comparable corpora have attracted a lot of attention in the community, e.g. (Fung, 1998; Rapp, 1999). Surprisingly, none of those works have systematically investigated the impact of the many parameters controlling their approach. The present study aims at doing just this. As a test case, we address the task of translating terms of the medical domain by exploiting pages mined from Wikipedia. One interesting outcome of this study is that significant gains can be obtained by using an association measure that is rarely used in practice. 0 0
When to cross over? Cross-language linking using Wikipedia for VideoCLEF 2009 Gyarmati A.
Jones G.J.F.
Lecture Notes in Computer Science English 2010 We describe Dublin City University (DCU)'s participation in the VideoCLEF 2009 Linking Task. Two approaches were implemented using the Lemur information retrieval toolkit. Both approaches first extracted a search query from the transcriptions of the Dutch TV broadcasts. One method first performed search on a Dutch Wikipedia archive, then followed links to corresponding pages in the English Wikipedia. The other method first translated the extracted query using machine translation and then searched the English Wikipedia collection directly. We found that using the original Dutch transcription query for searching the Dutch Wikipedia yielded better results. 0 0
WikiPics: Multilingual image search based on wiki-mining Daniel Kinzler WikiSym 2010 English 2010 This demonstration introduces WikiPics, a language-independent image search engine for Wikimedia Commons. Based on the multilingual thesaurus provided by WikiWord, WikiPics allows users to search and navigate Wikimedia Commons in their preferred language, even though images on Commons are annotated in English nearly exclusively. 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of the exponential increase in the size of multilingual information on the web. Wikipedia is a prime example of such a multilingual source of information on any conceivable topic in the world, which is edited by the readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler extracts the title along with the titles of the corresponding pages in other targeted languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
Crosslanguage Retrieval Based on Wikipedia Statistics Andreas Juffinger
Roman Kern
Michael Granitzer
Lecture Notes in Computer Science English 2009 In this paper we present the methodology, implementations and evaluation results of the crosslanguage retrieval system we have developed for the Robust WSD Task at CLEF 2008. Our system is based on query preprocessing for translation and homogenisation of queries. The presented preprocessing of queries includes two stages: firstly, a query translation step based on term statistics of co-occurring articles in Wikipedia; secondly, different disjunct query composition techniques to search in the CLEF corpus. We apply the same preprocessing steps for the monolingual as well as the crosslingual task, thereby treating these tasks fairly and in a similar way. The evaluation revealed that the similar processing comes at nearly no cost for monolingual retrieval but enables us to do crosslanguage retrieval and also a feasible comparison of our system performance on these two tasks. 0 0
Learning better transliterations Pasternack J.
Dan Roth
International Conference on Information and Knowledge Management, Proceedings English 2009 We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability (e.g. P(wash ⇒ BaIII) = 0.95). To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2 0 0
Overview of videoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content Larson M.
Newman E.
Jones G.J.F.
Lecture Notes in Computer Science English 2009 The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual language video (Dutch-language television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and ten thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS-feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, with a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks. 0 0
Parallel annotation and population: A cross-language experience Sarrafzadeh B.
Shamsfard M.
Proceedings - 2009 International Conference on Computer Engineering and Technology, ICCET 2009 English 2009 In recent years automatic Ontology Population (OP) from texts has emerged as a new field of application for knowledge acquisition techniques. In OP, instances of an ontology's classes are extracted from text and added under the ontology concepts. On the other hand, semantic annotation, which is a key task in moving toward the semantic web, tries to tag instance data in a text with their corresponding ontology classes; so the ontology population activity usually accompanies the generation of semantic annotations. In this paper we introduce a cross-lingual population/annotation system called POPTA which annotates Persian texts according to an English lexicalized ontology and populates the English ontology according to the input Persian texts. It exploits a hybrid approach, a combination of statistical and pattern-based methods as well as techniques founded on the web and search engines and a novel method of resolving translation ambiguities. POPTA also uses Wikipedia as a vast natural language encyclopedia to extract new instances to populate the input ontology. 0 0
Trdlo, an open source tool for building transducing dictionary Grac M. Lecture Notes in Computer Science English 2009 This paper describes the development of an open-source tool named Trdlo. Trdlo was developed as part of our effort to build a machine translation system between very close languages. These languages usually do not have available pre-processed linguistic resources or dictionaries suitable for computer processing. Bilingual dictionaries have a big impact on the quality of translation. The methods described in this paper attempt to extend existing dictionaries with inferable translation pairs. Our approach requires only 'cheap' resources: a list of lemmata for each language and rules for inferring words from one language to another. It is also possible to use other resources like annotated corpora or Wikipedia. Results show that this approach greatly improves the effectiveness of building a Czech-Slovak dictionary. 0 0
Using Wikipedia and Wiktionary in domain-specific information retrieval Muller C.
Iryna Gurevych
Lecture Notes in Computer Science English 2009 The main objective of our experiments in the domain-specific track at CLEF 2008 is utilizing semantic knowledge from collaborative knowledge bases such as Wikipedia and Wiktionary to improve the effectiveness of information retrieval. While Wikipedia has already been used in IR, the application of Wiktionary in this task is new. We evaluate two retrieval models, i.e. SR-Text and SR-Word, based on semantic relatedness by comparing their performance to a statistical model as implemented by Lucene. We refer to Wikipedia article titles and Wiktionary word entries as concepts and map query and document terms to concept vectors which are then used to compute the document relevance. In the bilingual task, we translate the English topics into the document language, i.e. German, by using machine translation. For SR-Text, we alternatively perform the translation process by using cross-language links in Wikipedia, whereby the terms are directly mapped to concept vectors in the target language. The evaluation shows that the latter approach especially improves the retrieval performance in cases where the machine translation system incorrectly translates query terms. 0 0
VideoCLEF 2008: ASR classification with wikipedia categories Kürsten J.
Richter D.
Eibl M.
Lecture Notes in Computer Science English 2009 This article describes our participation in the VideoCLEF track. We designed and implemented a prototype for the classification of the Video ASR data. Our approach was to regard the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification tasks 1 and 2. For the translation of the feeds to English (translation task) Google's AJAX language API was used. Although our experiments achieved only low precision of 10 to 15 percent, we assume those results will be useful in a combined setting with the retrieval approach that was widely used. Interestingly, we could not improve the quality of the classification by using the provided metadata. 0 0
WikiTranslate: Query translation for cross-lingual information retrieval using only wikipedia Dong Nguyen
Arnold Overwijk
Claudia Hauff
Trieschnigg D.R.B.
Djoerd Hiemstra
Franciska De Jong
Lecture Notes in Computer Science English 2009 This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline. 0 0
Cross-language retrieval with wikipedia Schonhofen P.
Benczur A.
Biro I.
Csalogany K.
Lecture Notes in Computer Science English 2008 We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia based technique with a method based on bigram statistics of pairs formed by translations of different source language terms. 0 0
Simultaneous multilingual search for translingual information retrieval Parton K.
McKeown K.R.
Allan J.
Henestroza E.
International Conference on Information and Knowledge Management, Proceedings English 2008 We consider the problem of translingual information retrieval, where monolingual searchers issue queries in a different language than the document language(s) and the results must be returned in the language they know, the query language. We present a framework for translingual IR that integrates document translation and query translation into the retrieval model. The corpus is represented as an aligned, jointly indexed "pseudo-parallel" corpus, where each document contains the text of the document along with its translation into the query language. The queries are formulated as multilingual structured queries, where each query term and its translations into the document language(s) are treated as synonym sets. This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary-based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone. Simultaneous multilingual search also has other important features suited to translingual search, since it can provide an indication of poor document translation when a match with the source document is found. We show how close integration of CLIR and SMT allows us to improve result translation in addition to IR results. Copyright 2008 ACM. 0 0
WikiBABEL: Community creation of multilingual data Kumaran A.
Saravanan K.
Maurice S.
WikiSym 2008 - The 4th International Symposium on Wikis, Proceedings English 2008 In this paper, we present a collaborative framework - wikiBABEL - for the efficient and effective creation of multilingual content by a community of users. The wikiBABEL framework leverages the availability of fairly stable content in a source language (typically, English) and a reasonable and not necessarily perfect machine translation system between the source language and a given target language, to create the rough initial content in the target language that is published in a collaborative platform. The platform provides an intuitive user interface and a set of linguistic tools for collaborative correction of the rough content by a community of users, aiding creation of clean content in the target language. We describe the architectural components implementing the wikiBABEL framework, namely, the systems for source and target language content management, mechanisms for coordination and collaboration and intuitive user interface for multilingual editing and review. Importantly, we discuss the integrated linguistic resources and tools, such as, bilingual dictionaries, machine translation and transliteration systems, etc., to help the users during the content correction and creation process. In addition, we analyze and present the prime factors - user-interface features or linguistic tools and resources - that significantly influence the user experiences in multilingual content creation. In addition to the creation of multilingual content, another significant motivation for the wikiBABEL framework is the creation of parallel corpora as a by-product. Parallel linguistic corpora are very valuable resources for both Statistical Machine Translation (SMT) and Crosslingual Information Retrieval (CLIR) research, and may be mined effectively from multilingual data with significant content overlap, as may be created in the wikiBABEL framework. 
Creation of parallel corpora by professional translators is very expensive, and hence the SMT and CLIR research have been largely confined to a handful of languages. Our attempt to engage the large and diverse Internet user population may aid creation of such linguistic resources economically, and may make computational linguistics research possible and practical in many languages of the world. 0 0
Korean-Chinese person name translation for cross language information retrieval Wang Y.-C.
Lee Y.-H.
Lin C.-C.
Tsai R.T.-H.
Hsu W.-L.
PACLIC 21 - The 21st Pacific Asia Conference on Language, Information and Computation, Proceedings English 2007 Named entity translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating person names, the most common type of named entity in Korean-Chinese cross language information retrieval (KCIR). Unlike other languages, Chinese uses characters (ideographs), which makes person name translation difficult because one syllable may map to several Chinese characters. We propose an effective hybrid person name translation method to improve the performance of KCIR. First, we use Wikipedia as a translation tool based on the inter-language links between the Korean edition and the Chinese or English editions. Second, we adopt the Naver people search engine to find the query name's Chinese or English translation. Third, we extract Korean-English transliteration pairs from Google snippets, and then search for the English-Chinese transliteration in the database of Taiwan's Central News Agency or in Google. The performance of KCIR using our method is over five times better than that of a dictionary-based system. The mean average precision is 0.3490 and the average recall is 0.7534. The method can deal with Chinese, Japanese, Korean, as well as non-CJK person name translation from Korean to Chinese. Hence, it substantially improves the performance of KCIR. 0 0
On the evolution of computer terminology and the SPOT on-line dictionary project Hynek J.
Brada P.
Openness in Digital Publishing: Awareness, Discovery and Access - Proceedings of the 11th International Conference on Electronic Publishing, ELPUB 2007 English 2007 In this paper we discuss the issue of ICT terminology and translations of specific technical terms. We also present SPOT - a new on-line dictionary of computer terminology. SPOT's web platform is adaptable to any language and/or field. We hope that SPOT will become an open platform for discussing controversial computer terms (and their translations into Czech) among professionals. The resulting on-line computer dictionary is freely available to the general public, university teachers, students, editors and professional translators. The dictionary includes some novel features, such as presenting translated terms used in several different contexts - a feature highly appreciated particularly by users who lack the technical knowledge to decide which of the offered dictionary terms should be used. 0 0
Wikifying your interface: Facilitating community-based interface translation Cameron Jones M.
Rathi D.
Twidale M.B.
Proceedings of the Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, DIS English 2006 We explore the application of a wiki-based technology and style of interaction to enabling the incremental translation of a collaborative application into a number of different languages, including variant English language interfaces better suited to the needs of particular user communities. The development work allows us to explore in more detail the design space of functionality and interfaces relating to tailoring, customization, personalization and localization, and the challenges of designing to support ongoing incremental contributions by members of different use communities. Copyright 2006 ACM. 0 1