Monday, October 31, 2011

k-means and Octave

The lecture on unsupervised learning in Stanford's online AI course presents the k-means algorithm.

The algorithm is simple:

  • Bind each data point to the closest centroid;
  • Move each centroid to the mean of its bound data points;
  • Cycle through these two steps until nothing changes any more.
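The steps above can be sketched in a few lines of Python. This is only an illustration, not the Octave code from the repository: the function names, the random choice of starting centroids and the handling of empty clusters are all my own assumptions.

```python
import random


def closest(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Bind each data point to its closest centroid.
        new_labels = [closest(p, centroids) for p in points]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its bound points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    rng = random.Random(1)
    # Two made-up blobs of random 2-D points, around (0, 0) and (5, 5).
    data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
    data += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
    centers, labels = kmeans(data, k=2)
    print(centers)
```

The result depends on the starting centroids, so a different seed may give a different (and sometimes worse) clustering.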


Here is a quick example with random points:


The code is in my k-means-octave repository on GitHub.



Bayes network, variable independence and AI

One of the Stanford AI class homework assignments (now closed) asked the following question: given the following Bayes network, are B and C independent when A and D are known?


Following the usual rules, we are presented with a dilemma: A being known would imply that B and C are independent, but D being known implies that B and C are dependent.

From a probabilistic point of view, two variables A and B are independent if, for all values u and v,

Pr[A=v]=Pr[A=v|B=u]

So, to check the independence of two variables, one has to compute the conditional probability and compare it with the probability without the added condition.
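Here is a minimal sketch of that check, done by brute-force enumeration. It is not the script from my GitHub: the network structure (A -> B, A -> C, B -> D, C -> D) is an assumption that merely matches the rules quoted above, and every probability in the tables is made up for illustration.

```python
from itertools import product

# Hypothetical network: A -> B, A -> C, B -> D, C -> D.
# All numbers below are invented for the example.
P_A = 0.3                                    # P(A=True)
P_B = {True: 0.9, False: 0.2}                # P(B=True | A)
P_C = {True: 0.4, False: 0.7}                # P(C=True | A)
P_D = {(True, True): 0.95, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.05}  # P(D=True | B, C)


def bernoulli(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """Joint probability, factored along the edges of the network."""
    return (bernoulli(P_A, a) * bernoulli(P_B[a], b)
            * bernoulli(P_C[a], c) * bernoulli(P_D[(b, c)], d))


def prob(query, given=None):
    """P(query | given); both are dicts such as {"B": True}."""
    given = given or {}
    num = den = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip("ABCD", values))
        if all(world[k] == v for k, v in given.items()):
            p = joint(*values)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


def independent(x, y, given, tol=1e-12):
    """True if P(x | given) equals P(x | y, given) for both values of y."""
    base = prob({x: True}, given)
    return all(
        abs(base - prob({x: True}, dict(given, **{y: value}))) < tol
        for value in (True, False)
    )


if __name__ == "__main__":
    print("B, C independent given A:   ", independent("B", "C", {"A": True}))
    print("B, C independent given A, D:", independent("B", "C", {"A": True, "D": True}))
```

Enumerating all 16 worlds is wasteful but hard to get wrong, which is the point of a double check. For binary variables, testing x=True against both values of y is enough, because the x=False case follows by complement.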

A quick solution is to model the Bayes network with a Python script, which you can find on my GitHub. The result shows that B and C are dependent if A and D are known, but independent if only A is known.



P(B)= 0.590436
P(B|C)= 0.464383811907
P(C)= 0.629475
P(C|B)= 0.49508837537

P(C,(D))= 0.650509052716
P(C|B,(D))= 0.554403493324
P(B,(D))= 0.704196855276
P(B|C,(D))= 0.600159513419

P(C,(A))= 0.448997199205
P(C|B,(A))= 0.449069612702
P(B,(A))= 0.949959858907
P(B|C,(A))= 0.9501130668

P(C,(A,D))= 0.174715296242
P(C|B,(A,D))= 0.139904742675
P(B,(A,D))= 0.844759945818
P(B|C,(A,D))= 0.676448630334

In order to double check, I did the formal calculation for the case where A and D are known, which gave me the same results. 

Python rocks! And so does the Stanford AI course!



Wednesday, October 19, 2011

NMAP - using nmap scripting engine (NSE)

NMAP is one of the tools I find super useful. No need to introduce it: it's powerful, it's fast, and it has a ton of functions and features.

Recently, I've been playing with the NSE, or scripts, to offload some of my discovery to nmap rather than combine multiple tools. However, I got an error for "citrixxml" not being found. I tried to update the DB, same issue.

# export NMAPDIR=/usr/share/nmap
# nmap --script-updatedb
Starting Nmap 5.21 ( http://nmap.org ) at 2011-10-19 16:40 EDT
NSE: Updating rule database.
NSE: error while updating Script Database:
[string "local nse = ......"]:17: /usr/share/nmap/scripts//citrix-brute-xml.nse:35: module 'citrixxml' not found:
no field package.preload['citrixxml']
no file './citrixxml.lua'
no file '/usr/local/share/lua/5.1/citrixxml.lua'
no file '/usr/local/share/lua/5.1/citrixxml/init.lua'
no file '/usr/local/lib/lua/5.1/citrixxml.lua'
no file '/usr/local/lib/lua/5.1/citrixxml/init.lua'
no file '/usr/share/lua/5.1/citrixxml.lua'
no file '/usr/share/lua/5.1/citrixxml/init.lua'
no file '/usr/share/nmap/nselib/citrixxml.lua'
no file './citrixxml.so'
no file '/usr/local/lib/lua/5.1/citrixxml.so'
no file '/usr/lib/lua/5.1/citrixxml.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'assert'
[string "local nse = ......"]:17: in main chunk
A quick trip to /usr/share/nmap/nselib revealed that this particular file was missing. It is, however, available on the nmap website.

# cd $NMAPDIR/nselib
# wget http://nmap.org/svn/nselib/citrixxml.lua
After that, "nmap --script-updatedb" ran like a charm.
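To spot this kind of problem before nmap complains, one could scan the scripts for their require lines and check that a matching .lua file exists in nselib. The sketch below is my own helper, not something shipped with nmap: the function name is hypothetical, and the regular expression is only a rough match for Lua's require syntax. Modules compiled into nmap itself (the "nmap" module, for instance) have no .lua file and will show up as false positives.

```python
import os
import re

# Rough match for: require "name", require 'name' and require("name").
REQUIRE_RE = re.compile(r"""\brequire\s*\(?\s*["']([\w.-]+)["']""")


def missing_modules(scripts_dir, nselib_dir):
    """Map each module lacking a .lua file in nselib_dir to the scripts requiring it."""
    missing = {}
    for name in sorted(os.listdir(scripts_dir)):
        if not name.endswith(".nse"):
            continue
        path = os.path.join(scripts_dir, name)
        with open(path, encoding="utf-8", errors="replace") as fh:
            source = fh.read()
        for module in REQUIRE_RE.findall(source):
            if os.path.isfile(os.path.join(nselib_dir, module + ".lua")):
                continue
            scripts = missing.setdefault(module, [])
            if name not in scripts:
                scripts.append(name)
    return missing


if __name__ == "__main__":
    # Same default location as the NMAPDIR used above.
    base = os.environ.get("NMAPDIR", "/usr/share/nmap")
    scripts_dir = os.path.join(base, "scripts")
    nselib_dir = os.path.join(base, "nselib")
    if os.path.isdir(scripts_dir):
        for module, scripts in sorted(missing_modules(scripts_dir, nselib_dir).items()):
            print(module, "required by", ", ".join(scripts))
```

It only reports candidates; whether a reported module is really missing or simply built in still has to be checked by hand.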

Sunday, October 9, 2011

"Best Practices" ...

OK, I got one "Best Practices" too many. Consultants, colleagues, vendors: they all swear, breathe and live by these magic words. "Best Practices".

When I hear that expression, I can't help but have a bunch of questions popping into my mind: who made them? What is the reference platform? What's the test scenario? What are the constraints and trade-offs? What are the limits?

But it seems that these "Best Practices" are universal. Got that software? Here are the best practices: they cover everything, all scenarios, all cases, and have no constraints and no limits. From security through file servers and web servers to database servers, Best Practices are everywhere. You're going to deploy that server with that OS? Here are the best practices: everything will run nice and smooth and you'll never have to change anything.

Too often, I'm under the impression that these best practices are just a substitute for some people's inability to understand what they're doing, what they're working with or what their use cases are. These "Best Practices" are little more than "cookbooks" aimed at giving users something that will work OK in most cases, but that will never work "great".

Here is my list of "Best Practices". To be used with everything.

  • Understand your systems; most of all, know what the constraints and trade-offs are;
  • Understand your use cases and, if possible, keep a set of tests handy;
  • Read the "Best Practices", but don't be ruled by them;
  • Read all the white papers and use cases you can. Try to find similarities;
  • If possible, have a test system you can tweak and break;
  • Document what you did and, when possible, share with the community at large.