Widefinder in Kamaelia?

December 09, 2007 at 05:55 PM | categories: python, oldblog | View Comments

Tim Bray's widefinder project is an interesting little problem. How do you find the top 10 most popular accesses to your website in a way that gains performance when you use more CPUs? Having done log analysis in the past, there's always a tradeoff Looking at the original Ruby source
counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

This really looks fairly Kamaelia friendly. It requires support not yet written ( :) ), but it certainly fits Kamaelia's model. (The code needing writing is relatively minimal now though, given we have basic multicore support)

So, what would this need to look like? Well, to my mind it'd look something like this: (pseudo kamaelia)
Pipeline(
    readfile,
    SplitWorkers(8, -> Create 8 Analyser workers, ideally in different CPUs
                 Pipeline(Analyser -> Builds a heap for counts, emits heap
                          Serialiser -> outputs values in order from the heap
                 ) -> Output from a worker is a tuple (id, value)
    )
    merge(8) -> merge heap streams (Can't start until it has values from all workers
    Head(10) -> output first 10 items
    ConsoleEchoer()
)

Even then, for this problem, this is suboptimal - we want the merge to not actually output the merged heap, but rather the values from the heaps, so that it doesn't do unnecessary work. Either way though, an interesting problem. May get round to implementing this after I've dealt with STM.

Aside from distribution across CPUs, and collation of results, this looks relatively simple to implement. Mainly blogging about this whilst I remember. The other thing here as well to note is that the results would never need to be completely sorted in this version. In fact, the more CPUs you have, the less sorted the data would be!
blog comments powered by Disqus