15 July 2009

Input validation, search sort and my word cloud algorithm

I've added client-side user input validation and turned the search result headings into links, so that clicking one re-sorts the results by that criterion. For example, clicking on "Title" sorts the results in alphabetical order, and clicking it again toggles between ascending and descending order. I'd like to optimize how this happens though, because every time the results are re-sorted they are also re-calculated, which is not needed. One idea is to 'index' all the wiki pages before the extension runs, so that tags are generated for each page and stored in a database, which can then be searched much faster. That also means that the script which indexes the wiki pages would need to be run periodically, as the content is not static.
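To illustrate the optimization I have in mind, here is a minimal sketch in Python (not the extension's actual code) of toggling the sort order on results that were computed once and kept around, rather than re-calculated on every click. The class name `ResultTable`, the method `click_heading` and the row fields are all hypothetical.

```python
from typing import Optional


class ResultTable:
    """Holds search results computed once; clicks only re-sort them."""

    def __init__(self, rows):
        self.rows = list(rows)            # cached results, never re-calculated here
        self.sort_key: Optional[str] = None
        self.descending = False

    def click_heading(self, key):
        """Sort by `key`; clicking the same heading again flips the direction."""
        if key == self.sort_key:
            self.descending = not self.descending
        else:
            self.sort_key = key
            self.descending = False       # a new heading starts in ascending order
        self.rows.sort(key=lambda row: row[key], reverse=self.descending)
        return self.rows


results = ResultTable([
    {"title": "Beta", "score": 2},
    {"title": "Alpha", "score": 5},
    {"title": "Gamma", "score": 1},
])
results.click_heading("title")   # ascending by title
results.click_heading("title")   # same heading again: descending by title
```

The same idea would apply whether the cached rows live in the session, in the page itself, or in the database of pre-generated tags.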

I've also spent some time playing around with creating word clouds via PHP's GD library for image creation and my implementation of this algorithm, but I realized that the algorithm is too slow to be useful. The reason I'm writing my own implementation is that I previously used the Google WordCloud API, but the results it produces are not exactly what I wanted. The Google "word clouds" are simply lines of text (plain text, not an image), with a larger font size for words which occur more frequently. I wanted something more like wordle.net (but unfortunately that's closed-source). I couldn't find an open-source project which does the same thing (maybe I didn't search very extensively), so today I wrote another implementation using the GD image library, and this time I created my own algorithm. Here's what it looks like with some random words:


And here's how it works using PHP and GD:
  1. Get an array of (word => frequency) pairs.
  2. Randomize the array to produce different-looking word clouds on refresh (while keeping track of the relation between each word and its frequency).
  3. Initialize the image object, available colours, image background and font path via GD.
  4. Slice the image into imaginary lines. The line height is the maximum font size used (which I calculate by getting the highest word frequency and multiplying it by a resize factor).


  5. Keep track of current line number, starting at 0. Place the first word on the current line (line 0). I vary the x-coordinate randomly between 1 and 8 pixels from the left image border (so that consecutive lines don't appear to line up). The y-coordinate is equal to the word height.


  6. Continue placing words on the current line until the right border is reached. Notice that the words stick to the top of the line, and they vary in size (proportional to their frequency), so it doesn't look like they're lined up. I also vary the colour randomly.


  7. Once the right border is reached, increment the variable storing the current line number, and repeat the process until all words are displayed.


  8. Finally, don't draw the imaginary lines in the final image.


    Things that need to be improved: the word cloud is sparse because the code which determines how far along a line to place each word is a bit of a hack - I'm not sure how to calculate the width of a word when the font size varies.
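The steps above can be sketched roughly as follows. This is a Python sketch of the layout logic only, not my PHP/GD code: it computes where each word would go and leaves the drawing out. The function name `layout_word_cloud`, the parameters `resize_factor`, `char_width_ratio` and `gap`, and their default values are assumptions for illustration. In particular, the width estimate is a crude stand-in for the part I called a hack; in PHP, GD's `imagettfbbox` may be one way to measure the rendered width of a word instead.

```python
import random


def layout_word_cloud(frequencies, image_width, resize_factor=4,
                      char_width_ratio=0.6, gap=6, rng=None):
    """Return (placements, line_height) for a dict of word => frequency."""
    if not frequencies:
        return [], 0
    rng = rng or random.Random()

    # Step 2: shuffle (word, frequency) pairs so each pair stays together.
    words = list(frequencies.items())
    rng.shuffle(words)

    # Step 4: the line height is the largest font size in use.
    line_height = max(frequencies.values()) * resize_factor

    placements = []
    line = 0
    words_on_line = 0
    x = rng.randint(1, 8)            # step 5: small random left margin

    for word, freq in words:
        size = freq * resize_factor  # font size proportional to frequency
        # Assumption: approximate the rendered width from the character count.
        width = int(len(word) * size * char_width_ratio)

        # Step 7: move to the next line once the right border is reached.
        if words_on_line > 0 and x + width > image_width:
            line += 1
            words_on_line = 0
            x = rng.randint(1, 8)

        # Steps 5-6: y is the text baseline, so the word sticks to the
        # top of its line (on line 0, y equals the word height).
        y = line * line_height + size
        placements.append(
            {"word": word, "size": size, "x": x, "y": y, "line": line}
        )
        x += width + gap
        words_on_line += 1

    return placements, line_height


placements, line_height = layout_word_cloud(
    {"php": 5, "gd": 3, "wiki": 2, "cloud": 4, "search": 1}, image_width=120
)
for p in placements:
    print(p["line"], p["x"], p["y"], p["size"], p["word"])
```

Each placement could then be passed to a text-drawing call with a randomly chosen colour. The positions differ on every run because of the shuffle and the random left margin.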
