12 August 2009

MediaWiki: Importing large batches of pages from Wikipedia

How to import/export to MediaWiki

After a few days of trying out various methods of importing (a large number of) pages into my local MediaWiki (including writing my own import script before I realized MediaWiki has export/import capabilities), I've come up with a procedure that works. Below are the steps, plus solutions to common problems that come up when importing.
  1. Go to the Special:Export page of the source wiki, e.g. http://en.wikipedia.org/wiki/Special:Export, select the articles you want to export, make sure "Include templates" is NOT selected, and click Export. If you select "Include templates", the XML file will simply include the template syntax in the pages which use templates, but _not_ the actual source templates. The result is a lot of ugly syntax in the pages you import, making you wonder whether the import worked. (By the way, it's interesting that most pages on Wikipedia seem to use templates extensively - I realized this when I imported using the "Include templates" option.)
  2. Now go to the Special:Import page of the target wiki, choose the XML file you just downloaded, and press Import.
Most likely, you will get one or more of these errors. Here are solutions to the ones I encountered:
  • Session timeout - the file is too big. Try importing smaller batches. (The same applies if you get a maximum-file-upload-size type of error.)
  • Fatal error: Allowed memory size of nnnnnnn bytes exhausted (tried to allocate nnnnnnnn bytes) - this is a PHP memory-limit error; the settings sketch after this list shows how to resolve it.
  • Error in fetchObject(): Illegal mix of collations for operation ' IN ' (localhost) - this is a MediaWiki problem; luckily there's a quick fix for it.
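For the memory and upload-size errors, here is a rough sketch of the PHP settings usually involved - the values are examples only, and where you set them depends on your hosting setup.

    <?php
    // A sketch only - adjust the values to your server.

    // "Allowed memory size exhausted": raise PHP's memory limit, either in
    // php.ini (memory_limit = 256M) or near the top of LocalSettings.php:
    ini_set( 'memory_limit', '256M' );

    // "Maximum file upload size" errors on Special:Import are governed by
    // php.ini rather than MediaWiki itself:
    //   upload_max_filesize = 20M
    //   post_max_size       = 20M
    // Long-running imports may also need a larger max_execution_time.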

For completeness, here's the Python import script I wrote. It seems to work, but no guarantees - it's always better to try the MediaWiki import first.

15 July 2009

Input validation, search sort and my word cloud algorithm

I've added client-side user input validation and turned the search result headings into links, so that clicking one re-sorts the results based on that criterion. For example, clicking "Title" sorts the results in alphabetical order, and clicking it again toggles between ascending and descending order. I'd like to optimize how this happens, though, because every time the results are re-sorted they are also re-calculated, which is unnecessary. One idea is to 'index' all the wiki pages before the extension runs, so that tags are generated for each page and stored in a database, which can then be searched much faster. That also means the script which indexes the wiki pages would have to be run periodically, as the content is not static.
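As a rough sketch of the idea (not the extension's actual code - the result array and column names here are made up), re-sorting the already-computed results by the clicked heading could look something like this:

    <?php
    // Sort cached results by a column instead of recomputing them.
    function sortResults( array $results, $column, $ascending ) {
        usort( $results, function ( $a, $b ) use ( $column, $ascending ) {
            if ( is_numeric( $a[$column] ) && is_numeric( $b[$column] ) ) {
                // numeric columns, e.g. a similarity score
                $cmp = ( $a[$column] < $b[$column] ) ? -1
                     : ( ( $a[$column] > $b[$column] ) ? 1 : 0 );
            } else {
                // text columns, e.g. "title"
                $cmp = strcasecmp( $a[$column], $b[$column] );
            }
            return $ascending ? $cmp : -$cmp;
        } );
        return $results;
    }

    // Clicking "Title" once sorts ascending; clicking it again flips the order:
    // $sorted = sortResults( $cachedResults, 'title', !$lastSortWasAscending );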

I've also spent some time playing around with creating word clouds via PHP's GD library for image creation and my implementation of this algorithm, but I realized that algorithm is too slow to be useful. The reason I'm writing my own implementation is that I previously used the Google WordCloud API, but the results it produces are not exactly what I wanted. The Google "word clouds" are simply lines of text (plain text, not an image), with a larger font size for words which occur more frequently. I wanted something more like wordle.net (but unfortunately that's closed-source). Since I couldn't find an open-source project which does the same thing (maybe I didn't search very extensively), today I wrote another implementation using the GD image library, and this time I created my own algorithm. Here's what it looks like with some random words:


And here's how it works using PHP and GD (a code sketch follows the steps):
  1. Get an array of (word => frequency) pairs.
  2. Randomize the array to produce different-looking word clouds on refresh. (And keep track of the relation between each word and its frequency).
  3. Initialize the image object, available colours, image background and font path via GD.
  4. Slice the image into imaginary lines. The line height is the maximum font size used (which I calculate by getting the highest word frequency and multiplying it by a resize factor).
  5. Keep track of the current line number, starting at 0. Place the first word on the current line (line 0). I vary the x-coordinate randomly between 1 and 8 pixels from the left image border (so that consecutive lines don't appear to line up). The y-coordinate is equal to the word height.
  6. Continue placing words on the current line, until the border is reached. Notice that the words stick to the top of the line, and they vary in size (proportional to their frequency), so it doesn't look like they're lined up. I also vary the colour randomly.
  7. Once the right border is reached, increment the variable storing the current line number, and repeat the process until all words are displayed.
  8. Finally, don't actually draw the imaginary lines - they're only used for placement.
    Things that need to be improved: the word cloud is sparse because the code that determines how far along a line to place the next word is a bit of a hack - I'm not sure how to calculate the varying font width.
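Here is a rough sketch of the algorithm above in PHP/GD - not the extension's actual code. It assumes GD was built with FreeType support and that $fontPath points to a TrueType font file; all names and default values are illustrative. Measuring each word with imagettfbbox() is one way to handle the varying font width mentioned in the note above.

    <?php
    function drawWordCloud( array $frequencies, $fontPath,
                            $width = 600, $height = 400, $scale = 4.0 ) {
        // 1-2. Shuffle the words while keeping the word => frequency relation.
        $words = array_keys( $frequencies );
        shuffle( $words );

        // 3. Image object, background colour and a small palette.
        $img        = imagecreatetruecolor( $width, $height );
        $background = imagecolorallocate( $img, 255, 255, 255 );
        imagefill( $img, 0, 0, $background );
        $palette = array(
            imagecolorallocate( $img,  30,  30,  30 ),
            imagecolorallocate( $img,   0,  90, 170 ),
            imagecolorallocate( $img, 170,  40,  40 ),
            imagecolorallocate( $img,  20, 120,  60 ),
        );

        // 4. Slice the image into imaginary lines as tall as the largest
        //    font size (highest frequency times the resize factor).
        $lineHeight = (int) ceil( max( $frequencies ) * $scale );

        // 5. Start on line 0, a few random pixels in from the left border.
        $line = 0;
        $x    = rand( 1, 8 );

        foreach ( $words as $word ) {
            $size = $frequencies[$word] * $scale;

            // Measure the word so we know when the right border is reached;
            // imagettfbbox() returns the corners of the text's bounding box.
            $box       = imagettfbbox( $size, 0, $fontPath, $word );
            $wordWidth = abs( $box[2] - $box[0] );

            // 6-7. If the word would cross the right border, move down a line.
            if ( $x + $wordWidth > $width ) {
                $line++;
                $x = rand( 1, 8 );
            }
            if ( ( $line + 1 ) * $lineHeight > $height ) {
                break; // the image is full
            }

            // Words stick to the top of their line. The y passed to
            // imagettftext() is the baseline, so add the word's own height.
            $y      = $line * $lineHeight + (int) ceil( $size );
            $colour = $palette[array_rand( $palette )];
            imagettftext( $img, $size, 0, (int) $x, $y, $colour, $fontPath, $word );

            $x += $wordWidth + rand( 4, 12 ); // small random gap between words
        }

        // 8. The imaginary lines themselves are never drawn; just output the image.
        header( 'Content-Type: image/png' );
        imagepng( $img );
        imagedestroy( $img );
    }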

03 July 2009

Adding features

Yesterday I finished implementing Ajax using MediaWiki's interface for Ajax calls. I'm happy to say it's working quite nicely now. I also rearranged some of the menus/links for (hopefully) easier navigation.

I also looked into Google's Word Cloud API and added word clouds of the articles' content as another easy way to give the user a visual representation of the similarities between articles. Google's documentation is fantastic, so it only took about 5 minutes to get working :) (By the way, to avoid depending on Google's server availability, I added the API's JS scripts and stylesheets to the server locally, so no connection to Google takes place when the user decides to display a word cloud.)

Next, Lucene.

After that, there are several things I need to do before the extension will be ready for testing:
  • Add user input validation
  • Write the project documentation
  • Tidy up the source code and comment it
  • Fix any bugs/issues that come up during testing

02 July 2009

AJAX and MW

I spent the last few days refactoring code and incorporating an XMLHttpRequest object to make the rendering of user-specified search options asynchronous. Since changing the options has to re-generate the search results, and generating the results was being handled by the main extension function, I had to either pass selected bits of the results to the external script called by XMLHttpRequest, which would then render the search options (for example, the weighting of textual and structural search), or move all the processing of the results into the external script. I decided on the second approach because it makes for more readable code and is a better design decision in the long run. Of course, that led to another problem: the MediaWiki API is not available to the external script (and I was using the API to make calls to the MW database in order to fetch article content for making comparisons). After a few days of searching for a solution, it turns out that MediaWiki comes with built-in support for AJAX extensions - a rough sketch of that interface is below.
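For reference, this is roughly what that interface looked like at the time ($wgUseAjax and $wgAjaxExportList); the function and parameter names below are made up for illustration and are not the extension's actual ones.

    <?php
    // In LocalSettings.php (or the extension's setup file): turn on MediaWiki's
    // AJAX support and register the function that the browser may call.
    $wgUseAjax = true;
    $wgAjaxExportList[] = 'efRenderSearchResults';   // illustrative name

    // The exported function runs inside MediaWiki, so the database and the
    // rest of the API are available to it (unlike in a standalone script).
    function efRenderSearchResults( $title, $textWeight, $structureWeight ) {
        $dbr = wfGetDB( DB_SLAVE );   // read-only database handle
        // ... fetch article content, recompute the similarity scores,
        //     and build the HTML for the re-weighted results ...
        $html = '<div class="results">...</div>';
        return $html;                 // sent back as the body of the AJAX response
    }

    // On the client side, the call goes through the sajax helper that ships
    // with MediaWiki, roughly:
    //   sajax_do_call( 'efRenderSearchResults',
    //                  [ title, textWeight, structureWeight ],
    //                  callbackOrTargetElement );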

On a different note, I asked some friends for name suggestions, and I got Myelink (pronounced "my link" - it's a play on myelin and link). Here's the explanation I got: "the Myelin Sheath is a layer that increases the speed of neurotransmission by... a lot.. 1000x at times depending on the animal and location. Myelinated neurons are also famous for being in the white matter of brains, which serves as an information pathway for the brain and connects different parts and facilitates communication. Which is what the extension is set out to do.." I like the connotation, and since what I'm currently focusing on is generating related pages with the aim of facilitating communication between the different contributors of scientific experiments/articles, I think it fits nicely :)