Wiki Category Matrix Visualization

Wiki Category Matrix Visualization is a tool that generates a visual representation of data sizes across topics of a multi-level category hierarchy in matrix form. It provides a "big picture" overview of topics in terms of categorization.

The tool analyzes Wikipedia category data to determine the number of articles assigned to each category and the most similar parent category for each category. The resulting visualization takes the first two levels of categories from the given Wikipedia and plots them on both the x and y axes. At each intersection it draws a disc whose size represents the number of articles co-assigned to that pair of categories.

To illustrate this, the figure below shows an example (only an extract of the whole visualization is shown). A certain number of articles (about 100) are assigned to both category "Transportation" (highlighted in the figure below with number 1) and category "Engineering" (number 2). The visualization shows a proportionally sized disc (number 3) at the intersection of these two categories. Moreover, as the category "Transportation" belongs to parent category "Everyday life" (number 4), and category "Engineering" belongs to category "Science" (number 5), these 100 co-assigned articles would also contribute to the count of all articles co-assigned to categories "Everyday life" and "Science", shown as a larger disc of first level category co-assignments (number 6).

An example of matrix visualization
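
The counting described above can be sketched in a few lines of Java. This is a minimal illustration and not the tool's actual source code: the class name, the method names, and the in-memory maps are all hypothetical, and the roll-up rule (each article counted once per pair of first-level parents) is an assumption based on the description above. It needs Java 8 or later because it uses Map.merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; names and data layout are assumptions, not the tool's code.
class CoAssignmentSketch {

    /** Builds an order-independent key for a pair of category names. */
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + " | " + b : b + " | " + a;
    }

    /** Counts, for each unordered pair of distinct categories, the articles assigned to both. */
    static Map<String, Integer> countPairs(Map<String, List<String>> articleCategories) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> assigned : articleCategories.values()) {
            // De-duplicate and sort so each pair is visited once per article.
            List<String> cats = new ArrayList<>(new TreeSet<>(assigned));
            for (int i = 0; i < cats.size(); i++) {
                for (int j = i + 1; j < cats.size(); j++) {
                    counts.merge(pairKey(cats.get(i), cats.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Replaces each category by its first-level parent, then counts pairs of parents. */
    static Map<String, Integer> countParentPairs(Map<String, List<String>> articleCategories,
                                                 Map<String, String> parentOf) {
        Map<String, List<String>> lifted = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : articleCategories.entrySet()) {
            List<String> parents = new ArrayList<>();
            for (String category : entry.getValue()) {
                String parent = parentOf.get(category);
                if (parent != null) {
                    parents.add(parent); // categories without a known parent are skipped
                }
            }
            lifted.put(entry.getKey(), parents);
        }
        return countPairs(lifted);
    }

    public static void main(String[] args) {
        // Made-up sample data: two articles share both categories, one has only one.
        Map<String, List<String>> articles = new HashMap<>();
        articles.put("Article 1", Arrays.asList("Transportation", "Engineering"));
        articles.put("Article 2", Arrays.asList("Engineering", "Transportation"));
        articles.put("Article 3", Arrays.asList("Transportation"));

        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Transportation", "Everyday life");
        parentOf.put("Engineering", "Science");

        System.out.println(countPairs(articles).get("Engineering | Transportation"));
        System.out.println(countParentPairs(articles, parentOf).get("Everyday life | Science"));
    }
}
```

In this sketch the second-level count determines the size of the small disc (number 3 in the example) and the parent-level count determines the size of the larger disc (number 6).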

This tool was developed by Cheong-Iao Pang as part of his master's degree studies at the University of Macau, supervised by Dr. Robert P. Biuk-Aghai. Further improvements were made by Peter Kin-Fong Fong.

License

This tool is released under the Educational Community License, Version 2.0.

Requirements

  • Read access to the MediaWiki database (only the 'pages' and 'categorylinks' tables are needed)
  • Java SE 1.6 or above
  • Libraries (to be placed in lib directory)
    • MySQL Connector/J 5.1.22 or above
    • jopt-simple 4.3 or above

Usage instructions

Step 1: Download the libraries mentioned above and place them into the lib directory. Please go to the following web pages to get the files.

Step 2: Edit run.sh (Linux or Mac OS X users) or run.bat (Windows users). Change the following parameters to fit your setup:

  • dbconn: JDBC connection string, in the following format:
    jdbc:mysql://(host):(port)/(database)
  • dbuser: Username of a database user that has read access to the required database
  • dbpass: Password of that user
  • root_title: Title of the "root category", i.e. the category that contains all the other content categories. Different wikis usually use different root category titles, so look up the one used by your wiki.
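
As an illustration of the dbconn format, the following Java sketch assembles a connection string from its three parts. The class name, the method name, and the sample values ("localhost", "wikidb") are hypothetical. Port 3306 is the MySQL default. The tool itself reads the finished string from run.sh or run.bat, so this only shows what the value should look like.

```java
// Hypothetical helper; it only formats the string and does not open a connection.
class JdbcUrlSketch {

    /** Formats a connection string as jdbc:mysql://(host):(port)/(database). */
    static String buildMysqlUrl(String host, int port, String database) {
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("host must not be empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port must be between 1 and 65535");
        }
        if (database == null || database.isEmpty()) {
            throw new IllegalArgumentException("database must not be empty");
        }
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Sample values are placeholders; substitute those of your own wiki database.
        System.out.println(buildMysqlUrl("localhost", 3306, "wikidb"));
    }
}
```

With the sample values above, the resulting dbconn value is jdbc:mysql://localhost:3306/wikidb.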

Step 3: Run run.sh / run.bat to generate the matrix visualization. A few text files containing the category tree and similarity data are created in the process. The visualization image is written to output.png.

If an out-of-memory error occurs, try increasing the maximum heap size: replace -Xmx256M on the last line of the script with a larger value, such as -Xmx512M.
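
To confirm that a changed -Xmx value has taken effect, a small stand-alone Java program can print the maximum heap size the JVM reports. This is a generic check and not part of the tool; the class name is hypothetical. The reported figure is often slightly lower than the -Xmx setting.

```java
// Hypothetical stand-alone check, e.g. run with: java -Xmx512M HeapCheckSketch
class HeapCheckSketch {

    /** Returns the maximum heap size the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap available to the JVM: " + maxHeapMegabytes() + " MB");
    }
}
```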


Publications

Visualization of large category hierarchies

  • Author(s): Robert P. Biuk-Aghai, Cheong-Iao Pang, Felix Hon Hou Cheang
  • Keyword(s): Category, Hierarchical data, Information visualization, Large-scale data, Wiki
  • Published in: Visual Information Communication - International Symposium
  • Language: English
  • Date: 2011
  • Abstract: Large data repositories such as electronic journal databases, document corpora and wikis often organise their content into categories. Librarians, researchers, and interested users who wish to know the content distribution among different categories face the challenge of analysing large amounts of data. Information visualization can assist the user by shifting the analysis task to the human visual sub-system. In this paper we describe three visualization methods we have implemented, which help users understand category hierarchies and content distribution within large document repositories, and present an evaluation of these visualizations, pointing out each of their relative strengths for communicating information about the underlying category structure.

A method for category similarity calculation in Wikis

  • Author(s): Cheong-Iao Pang, Robert P. Biuk-Aghai
  • Keyword(s): Wiki, Category similarity
  • Published in: WikiSym
  • Language: English
  • Date: 2010
  • Abstract: Wikis, such as Wikipedia, allow their authors to assign categories to articles in order to better organize related content. This paper presents a method to calculate similarities between categories, illustrated by a calculation for the top-level categories in the Simple English version of Wikipedia.