White Hot Chocolate, Mathematics and Computer Science: 2014

Wednesday, December 3, 2014

An Art Movement Where Art and Science Collide

Science Friday is one of my favorite podcasts. Filled with tons of interesting interviews, comments and reports, this is my weekend "radio station."

In last Friday's episode, one of the topics was "An Art Movement Where Art and Science Collide". The part that totally baffled me was the two tunes: one by Johann Sebastian Bach, the other generated by an algorithm written by David Cope (here are some other references: 1, 2 and 3). While I had no difficulty identifying the "real" Bach (I know that piece from having played it), I was really impressed by the likeness to the real works composed by JS Bach.

I do remember, in the olden days of the 3DO, a program by Sid Meier called "CPU Bach" that composed works "in the style of" Johann Sebastian Bach. This one goes a step further. I wonder if we should create a "Music Turing Test".

There is another piece on David Cope's page on the UCSC website. Though I think that some parts would not have been written that way by Johann Sebastian Bach, the end result is truly amazing.

Monday, December 1, 2014

PoS malware found targeting mass transit systems

The security company IntelCrawler has found a new malware strain that targets mass transit systems.

In the report, one sentence made my eyebrows rise and my jaw drop at the same time:

"During ongoing POS investigations it was determined that some operators of Point-of-Sale terminals have violated their own internal security policies and have used their terminal for gaming and WEB-surfing, checking e-mail from it, sending messages, and viewing social networks. These cases have a common denominator of weak passwords and logins, many of which were found in large 3rd party credential exposures."

It is almost 2015, and people operating Point of Sale terminals are still incapable of realizing that their actions can have dramatic consequences.

To the casual reader, this seems bad. To the security-minded, this is even worse: it means that these machines had, at the time of the breach, access to the Internet. This is in direct violation of the PCI standard.

Last Friday was Black Friday in the US. I am curious to discover how many retailers were compromised and how much money cybercriminals have amassed.

Saturday, November 22, 2014

AstroViz - Colliding Galaxies

If you don’t know the American Museum of Natural History, now is a good time to get acquainted. Among its various research groups, the astrophysics group, led by Dr. Tyson, is very active and has a nice visualization section.

Recently, it has been announced that our galaxy and the Andromeda galaxy are on a collision course. While this is the correct technical term, the result will mostly be that the 2 galaxies will go through each other a few times before merging and creating an even bigger galaxy. That is in about 3 billion years.

The AMNH has created a nice visual on this: Colliding Galaxies. Enjoy!

Tuesday, September 30, 2014

About the Home Depot breach

This is no longer a secret: the Home Depot was breached and scoundrels potentially got their hands on credit card information. What is interesting, though, are the bits of information that were published by Ars Technica:

The security architect had a run-in with the law for sabotaging the network of his previous company
Some of the personnel in the security team left because management ignored their warnings and recommendations

The former may be okay: I am all for giving people second chances. However, hiring someone who demonstrated a lack of maturity in handling a previous departure as the main security guy for a big store that handles millions of credit card transactions per day is risky, at the very least.

The second sounds like a broken record: security people got really concerned, put the information in an email or a document, and were ignored by management, which claims that security people cry "wolf!" all the time. That may be, but given the number of recent breaches, I think that we don't hear "wolf" often enough.

However, what concerned me the most is a sentence in the NY Times article:

Thefts like the one that hit Home Depot — and an ever-growing list of merchants including Albertsons, UPS, Goodwill Industries and Neiman Marcus — are the “new normal,” according to security experts.

That is like your banker claiming that it is normal for a bank to be robbed while refusing to close the vault, or a surgeon claiming that people die all the time while refusing to wash their hands before surgery.

It doesn't have to be this way, but security costs (a bit) and requires people to adapt. The latter is, from what I have encountered so far, the hardest: people don't change their habits even when these very habits are dangerous and put the company and its clients at risk. How many times have I heard "yeah, these servers absolutely need access to the Internet" or "yeah, all our employees can connect to the network any time of the day or the night, any day"?

I have read estimates that put the Target breach at around $1 billion for the resulting credit card fraud. The one from the Home Depot is slated around $3 billion. All together, that's $4 billion, roughly the cost of a team of 50 security specialists for more than 50 years. It would be naive to say that this is a victimless crime: in the long run, we all pay for the mistakes of these companies, through higher credit card bills and premiums.

Monday, September 8, 2014

Randall Munroe's "What If" book is out!

Randall Munroe - the talented author of xkcd - launched his book "What If?: Serious Scientific Answers to Absurd Hypothetical Questions." Check it out!

Friday, September 5, 2014

Neil deGrasse Tyson Is Worried That Humans Are Too Stupid For Aliens

Ah! This is a must see: Dr. deGrasse Tyson putting our modern technology in perspective. Just for giggles, on his scale, Voyager 1 would be about 280 miles away. Assuming Dr. Tyson sits in his office at the Hayden Planetarium, that is past Boston, MA. I also love the part where he explains why Dr. Stephen Hawking is a bit concerned by aliens landing on earth.

There is also the possibility that the first aliens to visit us will be microbes or viruses: something small that can withstand the cold and vacuum of space for a long time without dying.

Friday, August 15, 2014

Google, Asian telecoms to build $300 mln undersea cable to Japan

Interesting: Google has come to an agreement with five Asian telecom operators to deploy an undersea cable between the US and Asia.

Monday, August 11, 2014

NASA Space Sounds

Space is never truly empty: photons created in stars and in all kinds of particle interactions are all over the place, resulting in electromagnetic waves that can be recorded and converted to sound.

NASA probes recorded these electromagnetic waves and converted them to sounds that can be played. Here are a few samples.

Friday, August 8, 2014

P.F. Chang's Provides Data Breach Update, Confirms Compromised Locations

The Chinese restaurant chain has released the list of its compromised locations: here.

Monday, August 4, 2014

Happy Birthday Fortran (II)

In the previous article, I showed a small piece of code. Let's have a look at a larger one.

In the sixties, Edward Lorenz, who was working at MIT, found that a weather model he built exhibited a very strange behavior: two sets of inputs, only different by minute quantities, would initially behave similarly and then diverge widely.

After he studied the phenomenon, he showed an example of another system that had the same property, a system now known as the Lorenz attractor. Several texts have been published on the subject, so let's just recap the main points:

The system oscillates between two points (attractors)
Two different starting points will lead to two different trajectories
A trajectory never repeats itself

As a computer represents real numbers with finite precision, the second point is open to discussion: if two trajectories are so close that the difference between their closest points is below the machine precision, rounding may alter one of the trajectories. This is actually what happened and what started Lorenz's investigations: while the internal representation of the numbers had 6 decimals, Lorenz's printout only showed 3. When he stopped his experiment and restarted it later from an earlier point, using the printed values, the results started out the same, then diverged. The field of chaos theory was created ...

My second homage to Fortran is a program that generates the coordinates for the points of the Lorenz attractor. The code can be found in my GitHub repository.

Three methods are used to calculate the points: Explicit Euler (or Forward Euler), Implicit Euler (or Backward Euler) and a mixed method, which uses the average of a backward and a forward Euler step.

The Explicit Euler is quite simple and straightforward to implement: the three differential equations are explicit and do not require solving anything special.
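To make the idea concrete, here is a minimal Python sketch of the Explicit Euler scheme applied to the Lorenz equations. It is not the Fortran code from the repository: the function names are made up for the illustration, and the parameter values (sigma = 10, rho = 28, beta = 8/3), the step size and the starting point are the classic textbook choices, assumed here because the post does not state them. The last lines also illustrate the sensitivity to the starting point.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (assumed classic parameters)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def explicit_euler(state, h, steps):
    """Return all the points visited by the forward Euler scheme."""
    points = [tuple(state)]
    for _ in range(steps):
        x, y, z = points[-1]
        dx, dy, dz = lorenz(points[-1])
        # Explicit step: the new point only depends on the current one
        points.append((x + h * dx, y + h * dy, z + h * dz))
    return points


def distance(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Two starting points that differ by one millionth in x
first = explicit_euler((1.0, 1.0, 1.0), 0.01, 6000)
second = explicit_euler((1.0 + 1e-6, 1.0, 1.0), 0.01, 6000)
gaps = [distance(p, q) for p, q in zip(first, second)]
print("gap after 10 steps:", gaps[10])
print("largest gap over the run:", max(gaps))
```

The two runs stay close at first and end up far apart, which is the behavior Lorenz observed.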

The Implicit Euler, and by extension the mixed Implicit-Explicit Euler, is a bit more challenging: each step requires solving a system of three equations with three unknowns. This is done - in my case - using Newton's method, which in turn requires the inverse of the Jacobian matrix. My code computes it with Gauss-Jordan elimination: starting with the original matrix and the identity matrix, I apply all the row operations needed to transform the original matrix into the identity matrix. As the same operations are applied to the second matrix (which starts as the identity), the original matrix is transformed into the identity matrix, and the identity matrix is transformed into the inverse of the original matrix.

Both the Implicit and Explicit Euler are first order methods, whereas the mixed Implicit-Explicit Euler is a second order method: its error shrinks faster as the step size decreases, and it is more stable than the Explicit Euler. (Read here for a discussion of the Backward and Forward Euler methods.)
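The two ingredients described above can be sketched in a few lines of Python. Again, this is an illustration under assumptions, not a transcription of the Fortran program: the function names, the classic Lorenz parameters, the tolerance and the partial pivoting in the elimination are choices made for this example.

```python
def gauss_jordan_inverse(matrix):
    """Invert a square matrix by row-reducing the augmented matrix [A | I]."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Clear the column in every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    # The right half now holds the inverse
    return [row[n:] for row in aug]


def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]


def lorenz_jacobian(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [[-sigma, sigma, 0.0], [rho - z, -1.0, -x], [y, x, -beta]]


def backward_euler_step(state, h, tol=1e-12, max_iter=20):
    """Solve Y - state - h*f(Y) = 0 for Y with Newton's method."""
    guess = list(state)
    for _ in range(max_iter):
        f = lorenz(guess)
        residual = [guess[i] - state[i] - h * f[i] for i in range(3)]
        if max(abs(r) for r in residual) < tol:
            break
        jf = lorenz_jacobian(guess)
        # Jacobian of the residual: I - h * J_f
        jg = [[(1.0 if i == j else 0.0) - h * jf[i][j] for j in range(3)]
              for i in range(3)]
        inv = gauss_jordan_inverse(jg)
        delta = [sum(inv[i][j] * residual[j] for j in range(3)) for i in range(3)]
        guess = [guess[i] - delta[i] for i in range(3)]
    return guess


print(gauss_jordan_inverse([[2, 1], [1, 3]]))
print(backward_euler_step([1.0, 1.0, 1.0], 0.01))
```

Solving the linear system directly would avoid forming the inverse, but the sketch follows the inverse-based approach described in the text.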

Thursday, May 29, 2014

IP Informer - a program to look for several IPs in several black lists

So, you have a list of IP addresses you would like to quickly check against several black lists such as abuse.ch's ZeuS Tracker, SpyEye Tracker, Palevo Tracker, Feodo Tracker or the Malware Domain List. Here is a small tool written in Google's Go language that does exactly that: it takes a file containing a list of IP addresses (one per line) and checks each of them against these lists. Given that the black lists are defined in a configuration file, it is very easy to add or remove specific lists.

I keep adding new features, so feel free to check the GitHub repository from time to time.
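The core idea can be illustrated with a short Python sketch. This is not the Go tool itself: the function names, the list names and the "#" comment convention are assumptions made for the example, and the lists are passed in as text rather than downloaded. The addresses come from the ranges reserved for documentation.

```python
import ipaddress


def parse_list(text):
    """Return the set of valid IP addresses found in a one-per-line text."""
    found = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not entry:
            continue
        try:
            found.add(ipaddress.ip_address(entry))
        except ValueError:
            continue  # not an IP address (e.g. a domain name): skip it
    return found


def check(ip_text, blacklists):
    """Map each input IP to the names of the black lists that contain it."""
    parsed = {name: parse_list(content) for name, content in blacklists.items()}
    report = {}
    for ip in sorted(parse_list(ip_text), key=str):
        report[str(ip)] = sorted(name for name, ips in parsed.items() if ip in ips)
    return report


# Hypothetical lists, using documentation-only addresses
lists = {
    "list_a": "# sample list\n192.0.2.10\n203.0.113.5\n",
    "list_b": "192.0.2.10\nexample.org\n",
}
print(check("192.0.2.10\n198.51.100.7\n203.0.113.5\n", lists))
```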

Monday, May 19, 2014

Happy Birthday Fortran!

For some of us, the name "Fortran" or "FORTRAN" evokes a computer language closely associated with massive supercomputers and complex mathematical models. For others, it is reminiscent of a war with C++ for supremacy in the scientific computing world.

It has been continuously developed since its initial release in 1957, and the latest revision came out in 2010, with another minor revision planned for 2015. Fortran is not dead, far from it, even if it faces tough competition from other languages such as Haskell, Clojure or even Python.

Wikipedia has an extensive history of the language.

The first example of code in Fortran I will present is the determination of the fraction that generates a given pattern.

Let's take 0.12787878..., where the block 78 repeats ad infinitum. The fraction needed to obtain this value is 211/1650. For the rest of this post, I will call the part that repeats the repeated part and the part that does not repeat the prefix. The algorithm to find the fraction is well known, so let's focus on the code.

It contains three parts: computing the non-reduced fraction, computing the greatest common divisor (gcd) of the numerator and denominator, and reducing the fraction to a numerator and a denominator that are relatively prime. Let's start with the gcd.

For this, I use Euclid's algorithm. The code in Fortran 95 to achieve this is


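function gcd(a, b) result(c)
! Returns the GCD of a and b using Euclid's algorithm
 implicit none
 integer, intent(in) :: a, b
 integer :: c, u, l, m
 ! u holds the larger value, l the smaller one
 if ( a > b ) then
   u = a
   l = b
 else
   u = b
   l = a
 end if
 ! Replace (u, l) by (l, u mod l) until the remainder is 0
 do while (l > 0)
    m = modulo(u,l)
    u = l
    l = m
 end do
 c = u
 return
end function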

It consists of a few parts: making sure u contains the largest value and l the smallest, then looping until u modulo l is 0. The simplification subroutine is even simpler.

subroutine simplify(a, b, c, d)
! Returns the fraction a/b in its simplified form
! c/d where c and d are relatively prime
 integer, intent(in) :: a,b
 integer, intent(out) :: c,d
 integer :: n,gcd
 n=gcd(a,b)
 c=a/n
 d=b/n
 return
end subroutine

Now, the core of the problem is solved by two other functions - one that takes care of fractions with a prefix, the other one of fractions without a prefix. There are some issues in the code presented, but at this point and for a simple presentation, this is not important.

subroutine findfractionpref(pref, rept, mpref, a, b)
! Returns the fraction a/b such as its division gives the pattern
! prefreptreptrept ...
! With the necessary multiplier
 integer, intent(in) :: pref, rept, mpref
 integer, intent(out) :: a, b
 integer :: d1, d2, num, den
 ! d1 and d2 are the number of digits of the prefix and of the repeated part
 d1=1+floor(log10(real(pref)))
 d2=1+floor(log10(real(rept)))
 num=((pref*10**(d2)+rept)-pref)
 den=10**(d1+d2)-10**(d1)
 if ( mpref > 0) then
  den=den*10**mpref
 else
  ! mpref <= 0: use the positive exponent, as an integer 10**mpref
  ! with a negative mpref evaluates to 0
  num=num*10**(-mpref)
 end if
 call simplify(num, den,a , b)
 return
end subroutine

subroutine findfractionnopref(rept, a,b)
! Returns the fraction a/b such as its division gives the pattern reptreptrept...
 integer, intent(in) :: rept
 integer, intent(out) :: a,b
 integer :: d1, num, den
 ! d1 is the number of digits of the repeated part
 d1=1+floor(log10(real(rept)))
 num=rept
 den=10**d1-1
 call simplify(num,den,a,b)
 return
end subroutine

Lastly, the main part of the program, used to read the various information and call the necessary subroutines.

program fractionfinder
 implicit none
 integer :: pref,rept,a,b
 character :: c
 do
  print *, 'Does your fraction include a prefix (yY/nN) or Q to quit (qQ)?'
  read(*,'(A1)') c
  if ((c == 'y').or.(c == 'Y')) then
   print *, 'Prefix part?'
   read(*, '(I12)') pref
   print *, 'Repeated part?'
   read(*,'(I12)') rept
   call findfractionpref(pref,rept,0, a,b)
   print *, 'The requested fraction is ', a, '/', b
  else if ((c == 'n').or.(c == 'N')) then
   print *, 'Repeated part?'
   read(*, '(I8)') rept
   call findfractionnopref(rept,a,b)
   print *, 'The requested fraction is ', a, '/', b
  else if ((c == 'q').or.(c == 'Q')) then
   ! Leave the input loop
   exit
  end if
 end do
 print *, 'Bye bye!'
 stop
end program
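As a cross-check of the arithmetic, the same formula can be written in a few lines of Python with the standard fractions module. This is only a sketch: the function name is made up, and the digits are passed as strings (an assumption of this example, not something the Fortran program does) so that leading zeros are kept.

```python
from fractions import Fraction


def repeating_to_fraction(prefix, repeated):
    """Fraction equal to 0.<prefix><repeated><repeated>... (digits as strings)."""
    if not repeated.isdigit() or (prefix and not prefix.isdigit()):
        raise ValueError("prefix and repeated must be strings of digits")
    p, r = len(prefix), len(repeated)
    # Same formula as in the Fortran subroutines:
    # (prefix followed by repeated, minus prefix) / (10^(p+r) - 10^p)
    numerator = int(prefix + repeated) - int(prefix or "0")
    denominator = 10 ** (p + r) - 10 ** p
    return Fraction(numerator, denominator)  # Fraction reduces automatically


print(repeating_to_fraction("12", "78"))
print(repeating_to_fraction("", "3"))
```

The first call corresponds to the example used in this post, with prefix 12 and repeated part 78.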

Many thanks go to Rae Simpson for the help she provided with some of the terms in this post!

Sunday, April 27, 2014

New gTLD Applications

Do you know what "cruise", "pamperedchef", "通用电气公司" and "click" have in common? They are proposals for new generic top-level domain ("gTLD") names. We all know the "com", "org", "net", "uk", "be" and other suffixes from names such as google.com or wikipedia.org, but some companies and entities feel a bit constrained by having to fit into overly generic categories such as commercial, organizational or governmental entities.

A first wave of requests created some domain-specific TLDs, such as .aero, .biz, and so forth. Mikko Hypponen, the charismatic CRO of the Finnish security company F-Secure, recently posted on Twitter a few examples of the next wave. He also posted the wiki page that includes all the new gTLD applications.

Surprisingly enough, there is no ".omg", ".lolcat" or ".canIhazchezburger". It is also interesting to see that the internationalized domain names are making an appearance in the TLDs.

Wednesday, April 16, 2014

Global warming, python and statistics part 2

In the previous post, we stopped after establishing the long term trends for the high and low temperatures. This gave a very general overview of how the temperatures are evolving on average over a long period of time, in this case a bit more than a century.

Let's refine a bit and determine how the temperature evolves for the same day of the year, namely January the first, April the first, July the first and October the first, for the various years in the dataset.

The temperature curve for each day of the year is quite specific and well separated from the other days, both for the high and low temperatures. The weather on Earth follows a cycle with a period of about one year.

Now, let's detrend the original data: this means removing the long term growing trend to have a data set composed uniquely of the short term variations. We will also plot the curve representing the averages over a year for each day of the year (all the January 1st, all the January 2nd ...)

For the averages over a year, the detrended datasets were used. To account for the loss of the average in the detrending process, the initial dataset average has been added back. A sinusoidal fitting has been added for the high and low temperature curves. As it appears, that fits quite well.

At this point, we have (a) a long term linear trend and (b) a yearly cycle. This information is useful to give an idea of how temperatures evolve, but these are only averages. Let's have a look at how the temperature is distributed around that average.

In order to do that, the relevant days of the year (e.g. January 1st) will be corrected to remove the long term trend and the yearly cycle. This will leave the difference with the average temperature. Let's do this with the same days as before, January 1st, April 1st, July 1st and October 1st.

Be careful though: the long term trend AND the sinusoidal fitting both contain the average value - if these two are subtracted from the data, the result will be that the average value will be removed twice. In this case, I decided to ignore the offset value from the sinusoidal fitting. Other techniques are possible.
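The detrending and the sinusoidal fitting can be sketched with numpy as follows. The function name, the fixed period of 365.25 days and the synthetic data are assumptions made for this illustration; the fit used for the graphs may have been done differently. As described above, the constant term of the sinusoidal fit is left out so that the average is not removed twice.

```python
import numpy as np


def remove_trend_and_cycle(days, temps, period=365.25):
    """Remove a linear trend, then a yearly sinusoid, from a temperature series."""
    days = np.asarray(days, dtype=float)
    temps = np.asarray(temps, dtype=float)
    # Long term trend: straight line fitted by least squares
    slope, intercept = np.polyfit(days, temps, 1)
    detrended = temps - (slope * days + intercept)
    # Yearly cycle: offset + a*cos + b*sin is linear in its coefficients
    omega = 2.0 * np.pi / period
    design = np.column_stack(
        [np.ones_like(days), np.cos(omega * days), np.sin(omega * days)]
    )
    coeffs, *_ = np.linalg.lstsq(design, detrended, rcond=None)
    offset, a_cos, a_sin = coeffs
    # The offset is ignored on purpose: the trend already contains the average
    cycle = a_cos * np.cos(omega * days) + a_sin * np.sin(omega * days)
    amplitude = float(np.hypot(a_cos, a_sin))
    return float(slope), float(intercept), amplitude, detrended - cycle


# Synthetic check: ten years of a known trend plus a known yearly cycle
days = np.arange(3653)
temps = 12.0 + 1e-4 * days + 10.0 * np.cos(2.0 * np.pi * days / 365.25)
slope, intercept, amplitude, residual = remove_trend_and_cycle(days, temps)
print(slope, amplitude, float(np.abs(residual).max()))
```

On a short series the yearly cycle can bias the fitted slope, depending on where in the year the series starts and ends; over a century of data the effect is much smaller.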

Let's check how the temperature is distributed around the trend/cycle.

From this, it seems that the distribution is fairly normal. The parameters for the stacked distribution are a mean μ = 8.301619e-02 °C and a standard deviation σ = 4.411646e+00 °C.

Now, let's be careful: even if the distribution of the differences around the trend+cycle looks like a normal distribution, do not confuse this with an independent normal random variable. The weather, while it looks strangely random, is not independent of everything else, including its own history: the temperature on day D influences the temperature on day D+1. That is one of the reasons why weather forecasting is possible. If temperature were a purely random variable, there would be no forecasting of the weather.

To be perfectly complete, there is a slight difference between the distributions for the high temperatures and the low temperatures.

Temperature set	Mean [°C]	Standard Deviation [°C]
High	1.684078e-01	4.777461e+00
Low	-2.375521e-03	4.015207e+00

However, from now on, I will consider it as a "somewhat random variable."

[To be continued]

Monday, April 14, 2014

Global warming, python and statistics

Global warming ... some people believe in it, others don't. A growing number of scientists show data to prove its reality, while detractors show other data sets. However, debating who is right is not the point here.

The interest - for me at least - of global warming is that large sets of weather data are available. In this article, I will use the Global Historical Climatology Network - Daily data set (here for the readme.txt) hosted by the National Oceanic and Atmospheric Administration. This data set contains the daily measures for various atmospheric parameters such as maximum and minimum temperatures, precipitation and so forth, for various stations identified by their ID. In the data set I downloaded, each file contains the results for a single station.

The data is organized line by line, each line representing a month, with a fixed length. From the readme file, the structure is as follows:

------------------------------
Variable   Columns   Type
------------------------------
ID         1-11      Character
YEAR       12-15     Integer
MONTH      16-17     Integer
ELEMENT    18-21     Character
VALUE1     22-26     Integer
MFLAG1     27-27     Character
QFLAG1     28-28     Character
SFLAG1     29-29     Character
VALUE2     30-34     Integer
MFLAG2     35-35     Character
QFLAG2     36-36     Character
SFLAG2     37-37     Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------

Impossible days (such as February 30) and missing measures have a value set to -9999. For this article, I am interested in two variables: TMAX and TMIN. They are expressed in tenths of a degree Celsius.

As said, the file is organized line by line, with each line representing a month's worth of measures. Each line has 31 entries of the form value + 3 flags. The function readFile() reads each line and returns a tuple of 4 numpy arrays: the dates for the measures of the High temperatures, the high temperature measures, the dates for the measures of the Low temperatures and the low temperatures.
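To show how the column layout translates into code, here is a minimal sketch of a parser for one line. It is not the readFile() function mentioned above: the function name, the station ID and the sample line are invented for the illustration, and the division by ten only makes sense for elements stored in tenths of a degree, such as TMAX and TMIN.

```python
MISSING = -9999  # marker for impossible days and missing measures


def parse_ghcn_line(line):
    """Split one fixed-width monthly record into its fields."""
    station = line[0:11]      # columns 1-11
    year = int(line[11:15])   # columns 12-15
    month = int(line[15:17])  # columns 16-17
    element = line[17:21]     # columns 18-21
    values = []
    for day in range(31):
        # Each day takes 8 characters: a 5-character value and 3 flags
        start = 21 + 8 * day
        raw = int(line[start:start + 5])
        values.append(None if raw == MISSING else raw / 10.0)
    return station, year, month, element, values


# Hypothetical record: two measures, then 29 missing days, all flags blank
readings = [25, -31] + [MISSING] * 29
sample = "XX000000001" + "2014" + "01" + "TMAX" + "".join(
    f"{value:5d}   " for value in readings
)
station, year, month, element, values = parse_ghcn_line(sample)
print(station, year, month, element, values[:3])
```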

Let's start with the weather station in Central Park, New York NY.

The reason for returning the dates as well is that there may be missing measurements, i.e. days for which the Tmax, the Tmin or both may be missing from the file.

Here comes the first graph with two subplots: the Tmax is the top one.

It is quite difficult to say anything about the data besides that "they look the same, shifted by about 10°C". In order to remove some of the "hairy" behavior, we will apply a sliding average on the data, with windows of 7, 30, 182, 365 and 3650 days: a week, roughly a month, roughly six months, a year and roughly ten years.

A sliding average is simply the average calculated over the last N points. This is used to smooth out the possible variations due to either the randomness or the seasonal variations. Let's use an example.

The following is a straight line (slope: 1, intercept: 10) with a superimposed Gaussian noise (mean: 0, deviation: 5). The X-range goes from 0 to 10, using 1001 points. Here is a graph (not "the" graph - guess why!)

From it, it "looks like" there is an increase. However, this is not really visible. Let's apply a sliding window to it, with size 10, 20 and 50.

Even at sample size 10 (in blue), the trend is visible. At sample size 50 (in red), the trend is impossible to miss.

This sliding window is performed in python with the numpy.convolve() function. This function does the convolution of two vectors (one-dimension arrays).
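A minimal sketch of such a sliding average is shown below. The helper name, the random seed and the use of mode="valid" (which only keeps the positions where the window fits entirely) are choices made for this example; the graphs in this post may have been produced with different settings.

```python
import numpy as np


def sliding_average(values, window):
    """Average over `window` consecutive points, using numpy.convolve."""
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the window fits entirely
    return np.convolve(values, kernel, mode="valid")


# Synthetic data in the spirit of the example above:
# a line (slope 1, intercept 10) with Gaussian noise (mean 0, deviation 5)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 1001)
y = 10.0 + 1.0 * x + rng.normal(0.0, 5.0, size=x.size)
for window in (10, 20, 50):
    smooth = sliding_average(y, window)
    print(window, smooth.size)
```

Convolving with a kernel of N equal weights 1/N is exactly the average of N consecutive points, which is why numpy.convolve() does the job.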

Back to our temperature samples, let's use this tool to smooth the temperatures. So, what do we expect?

The sliding average over a week probably won't change a lot - the weather cycle on Earth is about a year, or 52 weeks. The same goes for a month. With a sliding window of six months, I expect the seasonal cycle to start showing. I really expect the trends to be visible with the one-year and ten-year averaging.

With the year sliding window, the trend appears but is still quite noisy. With 10 years, there is no doubt left: a trend appears for both the high and low temperatures. A visual estimation gives that for both measures, over the period the average has risen by about 2.25°C. Let's use a better method than "it looks like."

In the statistician's toolbox, a good tool to estimate the trend of a dataset is linear regression or linear fitting, usually using the least-squares method (minimising the sum of the squared vertical distances). I will not explain all the values returned by scipy.stats.linregress(). Of these, I will use the slope, the intercept and the R-value.

Data source	Slope [°C/day] (High)	Intercept [°C] (High)	R-value [-] (High)	Slope [°C/day] (Low)	Intercept [°C] (Low)	R-value [-] (Low)
Raw dataset	4.5933e-05	-1.6242e+01	6.4263e-02	3.2326e-05	-1.4788e+01	5.0001e-02
7-day sliding window average	4.5997e-05	-1.6287e+01	6.8779e-02	3.2374e-05	-1.4821e+01	5.2816e-02
30-day sliding window average	4.6027e-05	-1.6303e+01	7.1649e-02	3.2403e-05	-1.4837e+01	5.5005e-02
182-day sliding window average	4.6947e-05	-1.6924e+01	1.1339e-01	3.3059e-05	-1.5273e+01	8.7093e-02
365-day sliding window average	4.7696e-05	-1.7464e+01	6.8588e-01	3.3669e-05	-1.5716e+01	5.6593e-01
3650-day sliding window average	4.8371e-05	-1.7851e+01	9.0518e-01	3.2297e-05	-1.4709e+01	7.9373e-01

The first notable point is that the slopes for the high and the low temperatures differ by about 1.4e-5 to 1.6e-5 °C/day, depending on the window. Second - and this was expected - the R-value increases with the sliding window length: as the window grows, the data set gets closer and closer to a line. As a result, the linear regression model fits better and better.

If the values for the slope don't look like much, remember that they are per day: over the course of 100 years, the raw-data slopes represent about 1.68°C for the high temperature and 1.18°C for the low. The average temperature has risen by about 1.43°C over a century.
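For readers who want to see what is behind these three quantities, here is a small pure-Python sketch of a least-squares line fit. The function name is made up for the example; scipy.stats.linregress() reports the same slope, intercept and R-value, along with other statistics that are not computed here.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Least-squares line through (xs, ys): slope, intercept and R-value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Pearson correlation coefficient between xs and ys
    r_value = sxy / sqrt(sxx * syy)
    return slope, intercept, r_value


# A perfect line gives R = 1, a noisier set gives a smaller R
print(linear_fit([0, 1, 2, 3], [1, 3, 5, 7]))
print(linear_fit([0, 1, 2, 3], [1, 2, 1, 2]))
```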
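The conversion from a daily slope to a change per century is a one-line multiplication. In the sketch below, the slopes are the raw-dataset values from the table, and counting a century as 100 years of 365.25 days is an assumption of this example.

```python
DAYS_PER_CENTURY = 36525  # assumed: 100 years of 365.25 days

slope_high = 4.5933e-05  # °C/day, raw dataset, high temperatures
slope_low = 3.2326e-05   # °C/day, raw dataset, low temperatures

rise_high = slope_high * DAYS_PER_CENTURY
rise_low = slope_low * DAYS_PER_CENTURY
rise_mean = (rise_high + rise_low) / 2

print(f"high: {rise_high:.2f}, low: {rise_low:.2f}, mean: {rise_mean:.2f}")
```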

[To be continued]

Friday, April 11, 2014

Possible tetraquark particles spotted at the LHC

Scientists at the LHC may have found another type of matter in the form of tetraquarks, or particles made of 4 quarks. If confirmed, this would be an important step in understanding how matter behaves at its most fundamental level.

The paper on arXiv: arXiv:1404.1903v1 [hep-ex]

Wednesday, April 9, 2014

Chinese Physicists provide a lower bound for the speed of "spooky action at a distance"

This concerns the speed at which information is transferred between two entangled particles. The paper is really interesting and the lower bound - if correct - is four orders of magnitude higher than the speed of light.

The paper on arxiv: arXiv:1303.0614v2 [quant-ph].

Monday, April 7, 2014

Radioactive waste used to peek inside a star explosion - space - 03 April 2014 - New Scientist

The New Scientist article: Radioactive waste used to peek inside a star explosion - space - 03 April 2014 - New Scientist, and the original paper: DOI 10.1016/j.physletb.2014.03.003.

Wednesday, April 2, 2014

Neil Armstrong on Being a Nerd

From a great man ...

Saturday, March 29, 2014

Mac OS X file versus Mac Ports file

So recently, I started having an issue - the default "file" command on my Mac OS machine didn't identify any file, but returned an error message instead.

Trying the native Mac OS version worked, but the one installed by MacPorts (/opt/local/bin/file) would just report a regex error.

$ /usr/bin/file magic
magic: magic text file for file(1) cmd
$ /opt/local/bin/file magic
magic: ERROR: line 19439: regex error 17, (illegal byte sequence)

A closer look at the magic file that lives in /opt/local/share/misc/ revealed that line 19439 is

0 regex/s \\`(\r\n|;|[[]|\xFF\xFE)

Disabling the line with a "#" and recompiling the magic file with "file -C -m <magic file>" solved the issue.

Friday, March 14, 2014

An afternoon at the museum - Dark Matter

The American Museum of Natural History in New York is one of my top 20 museums in the world. Although I am not a big biology and mineralogy fan, I really enjoy walking among the collections of dinosaurs, primates and meteorites. My favourite section is the Hayden Planetarium, where presentations about half an hour long are shown inside the sphere. Currently, the show is about dark matter. If you are in New York for a few days and have an afternoon to spare, I recommend it.

Monday, March 3, 2014

Modelling Security Awareness

A nagging issue in security is to evaluate the awareness level of a group or of a single person: does that person know? How well does she understand the concepts? Does she realise what the consequences might be?

Traditionally, this is done through a series of tests/trainings: a company such as PhishMe offers a wide range of phishing tests that lead to trainings or information pages. For example, if a user provides his credentials, he is sent to a video that explains he would have potentially given access to the corporate network to cybercriminals. While there is huge value in doing this, it only assesses the ratio of people who failed the test, not the depth of awareness of people who succeeded: a person may pass the test because "providing my e-mail credentials when clicking on a link is wrong", but may fail to see that the PDF file attached to the next e-mail is malicious.

In order to evaluate how well security concepts are understood, I suggest using a different scale, IKUC. This stands for

I - Ignore
K - Know
U - Understand
C - Care

Ignore - the base level: the person has no knowledge of it. The term or concept may have been heard or read about, but the person can, at best, vaguely formulate it.

Know - the level at which a person can quote a precise definition or explain what the concept is, but this sounds like a mechanical regurgitation.

Understand - the person not only knows the definition but can also explain it and how the concept works.

Care - the goal level: the person understands the term or concept, but also the threat and impact that may result. This is the realisation that the term or concept is not merely words, but an actual attack that can affect the person or the company in various ways.

This is a progression: in order to understand something, you have to know it. In order to care, you have to understand it. While one may argue that it is possible to care for something that is known but not understood, I think that this is inefficient, as it quickly turns to recognising scenarios instead of the broader, underlying concept. This can be seen in the "don't click on links in e-mails" rule, which leaves the possibility of opening attached files or answering the e-mail with the information the attacker seeks.

By elevating the user from basic knowledge to understanding, not only will the concept be clearer and easier to recognise, but the user will also be able to relate a variety of threats as being really the same "thing." In the long run, this saves the company time and money by not having to develop a scenario for everything.

Caring is the next step: it is the realisation that not only is there a threat, but that the threat has an impact on the person or the firm. That is the realisation that "bad things don't happen at random." This is, for me, the "true awareness" and is summed up in the idiom "once burned, twice shy." However, "cyberburning" can be persistent (think "credit score damage") or even fatal (DigiNotar, Mt. Gox and an article from Fox Business). This is by far the hardest step: as human beings, we tend to downplay the risks or impacts when we want something (either to possess it or as a means to achieve a goal, such as performing one's duty), and to exaggerate the inconvenience of anything that may stand between us and these goals/things.

Unfortunately, this "magnification of inconvenience" and "downplaying of risks" clouds the step from "Understanding" to "Caring": "if it is inconvenient and not that risky, why should I care?" Sounds familiar? For me, way too much.

A good security awareness program has to address all three of the K, U and C states. It has to make sure everyone knows what is being explained (the "K"): if it is phishing, does everybody know what phishing is? Can it be defined in a simple way, without having to resort to various examples? From there, does everybody understand how this works and is everybody able to recognise such a scenario for what it is?

As I wrote, getting to the C is the hardest part, due to having to go over the ledge of perception of "rarity", "lack of danger" and "inconvenience of doing otherwise." It is also by far the most important step. This may be related to a speed limit on a street: we all know what a speed limit is, most of us understand why a speed limit may be placed somewhere, but some of us fail to care and just disregard the limit. From time to time, this leads to an accident, injuries and possibly death.

I think this is where all the "phishing" companies fail: they focus on bringing people to the C directly, regardless of the previous state. A more comprehensive process would be to make sure that everyone attending such a training has gone through the K and is at the U state before leaping to the C state.

Saturday, February 22, 2014

HHS Info for 2013

The U.S. Department of Health and Human Services (HHS) publishes on its website the list of breaches that affected at least 500 people. This is a trove of information about breaches in the health sector.

For the year 2013, there are 217 breaches that either started or ended, totalling 7,636,544 records, an average of 35,191.45 records per breach. The minimum is 500 (the minimum to be publicly reported), the maximum 4,029,530 records. The first quartile is 1,127 records and the third 6,332 records.
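As a quick sanity check on the figures above, here is a minimal sketch that recomputes the average from the two totals quoted in the paragraph. The `summarise` helper and the `sample` list are my own illustration (the sample is made up, not the HHS data), and the "inclusive" quartile method is an assumption: different tools use slightly different quartile definitions, so the quartiles quoted above may have been computed another way.

```python
import statistics

# Totals quoted in the post for the 2013 HHS list.
TOTAL_RECORDS = 7_636_544
BREACH_COUNT = 217

average = TOTAL_RECORDS / BREACH_COUNT
print(f"Average records per breach: {average:,.2f}")


def summarise(record_counts):
    """Return min, max, mean and quartiles for per-breach record counts."""
    # Assumption: "inclusive" quartiles; other definitions give other values.
    q1, _median, q3 = statistics.quantiles(record_counts, n=4, method="inclusive")
    return {
        "min": min(record_counts),
        "max": max(record_counts),
        "mean": statistics.mean(record_counts),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical sample for illustration only, NOT the real HHS rows.
sample = [500, 1_000, 2_000, 4_000, 8_000]
print(summarise(sample))
```

Fed with the real per-breach record counts exported from the HHS list, the same helper would produce the minimum, maximum, mean and quartiles in one pass.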

The breach that resulted in 4,029,530 records compromised affected Advocate Health and Hospitals Corporation, d/b/a Advocate Medical Group and was due to the theft of a desktop machine.

The following graph shows the geographical distribution of these breaches, with green being the least and red the most. The "white" states reported no breach affecting 500 records or more for 2013, which doesn't mean there were none: either each breach affected fewer than 500 users or the breaches were not reported to the authorities, which would be a clear violation of HIPAA.

The five states with the highest number of breaches are California (23), Texas (17), Florida (15), North Carolina (14) and Illinois (13). Together, these five states account for 82 of the 217 breaches, or about 37.8%.

In terms of number of records compromised, the map becomes:

The five states with the most affected records are Illinois (4,112,982), California (940,541), New Jersey (852,953), Texas (781,771) and Indiana (218,084). These five states represent 90% of all compromised records. It is to be noted that Illinois inherits the title of "highest number of records compromised state" due to the "Advocate Health & Hospitals Corporation" breach.

The most cited cause for a breach is "Theft" and related, with 92 breaches or 42% of all breaches, totalling 5,923,705 records. Interestingly enough, all the "Hacking/IT Incidents" represent only 17 breaches, or a bit short of 8%, for a total of 532,230 records. The average number of records compromised through thievery is 64,388 and through IT hacking 31,308.

There is already an interesting trend there: a breach is more likely to happen through a stolen device than through hacking, and with more severe consequences. However, it is also important to keep in mind that the gigantic breach that affected more than 4 million users drags that number way up. If it is removed, the average for theft goes down to 20,815 records per breach, below the average for a breach resulting from IT hacking.
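The effect of that single outlier can be reproduced from the totals quoted above. This is a minimal sketch; it assumes, as the text implies, that the 4,029,530-record Advocate breach is counted among the 92 theft incidents. The variable names are mine.

```python
# Figures quoted in the post.
theft_breaches = 92
theft_records = 5_923_705
hacking_breaches = 17
hacking_records = 532_230
largest_breach = 4_029_530  # assumed to be one of the 92 theft incidents

mean_theft = theft_records / theft_breaches
mean_hacking = hacking_records / hacking_breaches

# Drop the outlier from both the record total and the incident count.
mean_theft_without_largest = (theft_records - largest_breach) / (theft_breaches - 1)

print(f"Theft, all incidents:        {mean_theft:,.0f}")
print(f"Hacking/IT incidents:        {mean_hacking:,.0f}")
print(f"Theft, without the largest:  {mean_theft_without_largest:,.0f}")
```

A mean is very sensitive to one extreme value; the median would be a more robust way to compare the two categories, but it needs the per-breach rows rather than the totals.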

Out of the 92 incidents that involved theft in one form or another, 52 mention that the information was located on a laptop, more than 56%. If we add to that the category "Other portable devices", the number rises to 57 (62%). On average, an incident involving the theft of a laptop resulted in the disclosure of 33,827 records. The maximum reported is 839,711 compromised records for such an event.

It is interesting to notice that these 52 incidents represent the vast majority of all the breaches involving laptops. The following graph shows the type of breach for all the events concerning a laptop.

Geographically, a breach through thievery happened 12 times in California (52% of all CA breaches), 8 times in Florida (53% of all FL breaches), 7 times in Texas (41% of all TX breaches), and 6 times each in Ohio (55% of all OH breaches) and Georgia (55% of all GA breaches). It is interesting to notice that theft accounts for roughly half of the reported breaches, at least for these top 5.

But laptops and mobile devices are not the only things that can be stolen. These devices represent 57% of the stolen containers. The following graph shows the distribution for the non-laptop stolen devices/containers that led to a breach.

"Desktop Computer" and "Paper" represent the top two categories. There is no explanation of how these were stolen, but one could reasonably assume this resulted from a burglary or break-in.

But thievery is not the only cause of data breaches. The second most cited cause is "Unauthorized Access" with 58 occurrences (27% of all breaches). Altogether, "Theft" and "Unauthorized Access" represent 69% of all breaches. In terms of number of records, 435,880 records were breached due to improper access.

The location of the breached information changes dramatically: while laptops were the main location in the thievery scenario, in the unauthorized access scenario the most cited location is paper with 16 occurrences (28%), then "E-mail" and "Network Server", tied with 11 occurrences each (19%). A note: some entries list several values; I counted them under each category.

'Unauthorized Access' happened predominantly in Florida (8 incidents), Montana and North Carolina (5 incidents each), in California (4 incidents) and in Texas, Puerto Rico, Oregon, North Carolina and Illinois with 3 incidents each. These 9 states are responsible for about 60% of this type of breach.

The "unauthorized access" on paper information accounts for 32% of all breaches involving paper documents. Unfortunately, the main reason is often described as "Other", which means that the details are not available in the HHS database.

The "type of breach" represents the issue that permitted the breach. Several rows include multiple reasons, such as "Theft, Other". It is possible to extract seven "major themes":

Improper Disposal
Theft
Loss
Other
Hacking/IT Incident
Unauthorized Access/Disclosure
Unknown

The following figure presents the number of occurrences of each reason. A reason that includes multiple "simple" reasons will be counted for each category.
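The counting rule described above (a combined reason such as "Theft, Other" counts once for each simple reason) can be sketched as follows. The function name and the sample rows are hypothetical; the sketch assumes the combined reasons are comma-separated, as in the "Theft, Other" example.

```python
from collections import Counter


def count_breach_types(rows):
    """Count each simple reason once per row in which it appears."""
    counts = Counter()
    for raw in rows:
        # "Theft, Other" contributes one count to "Theft" and one to "Other".
        for reason in raw.split(","):
            reason = reason.strip()
            if reason:
                counts[reason] += 1
    return counts


# Hypothetical rows for illustration, NOT the real HHS data.
rows = [
    "Theft",
    "Theft, Other",
    "Unauthorized Access/Disclosure",
    "Hacking/IT Incident",
    "Loss, Theft",
]

for reason, count in count_breach_types(rows).most_common():
    print(f"{reason}: {count}")
```

Note that with this rule the per-category counts can add up to more than the number of breaches, since a single row may feed several categories.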

Clearly, and as already described, "Theft" is the biggest issue, then "Unauthorized Access/Disclosure." Unfortunately, the third one is "Other", which is not self-explanatory. "Hacking/IT Incident" comes fifth, between "Loss" and "Improper Disposal."

What can we conclude from this?

The Health industry ("HI") is still struggling with breaches, and more importantly, with "stupid" breaches such as theft and unauthorized access. Unfortunately, every time one happens, people's lives can be ruined. It is therefore of the utmost importance that the HI gives patient information the highest priority in terms of protection.

Almost a quarter of all breaches (in count or in number of affected individuals) result from the theft of a laptop. This is a lot! It points to the fact that some data is simply not meant to be carried on portable devices. However, it seems that the HI is still having difficulties with this concept. And this is not looking very promising in light of the current BYOD craze...

This could be solved by adopting a number of simple rules, such as "if it touches the network of a hospital, it is encrypted. If it works for a hospital, it is encrypted. If it has a hospital among its clients, it is encrypted." Yes, that means that lots of companies will have to invest in disk encryption technologies; I don't think this is a huge problem in 2014. It is more of a no-brainer.

Monday, February 17, 2014

Phishing Techniques, Consequences and Protection Tips

Phishing is now a prevalent attack on the Internet, and several "big cases" started with someone being tricked into either providing information, or clicking on a link or a document.

Rohyt Belani, CEO at PhishMe, gave an interview to Help Net Security some time ago. This is very interesting.

Friday, February 7, 2014

BYOD anybody?

If there is a question that keeps coming back these days like a broken record, it is Bring Your Own Device, or BYOD for short.

With the emergence of smartphones, tablets and affordable powerful laptops, employees have started demanding the right to use their personal gizmos at work: carrying and giving presentations to clients from a tablet, accessing the corporate contact list from a smartphone or using the "latest and super powerful" laptop to access corporate information systems. Or simply demanding to use the laptop "because the brand is different and I am more comfortable with it than with your corporate Windows 7 laptop."
Some employers also think this would be a great way to save money: the employee provides his own equipment, so there is no need to purchase a corporate laptop and a corporate phone for him, or to equip them with all the security measures normally applied to a corporate device.

That's where the endless list of issues starts.

First, let me describe the difference between my corporate laptop and my personal laptop. The former was issued by my organisation's IT team, and everything on it is patched through the corporate patch management tool. As it runs Windows 7, it is joined to the domain and I have to use my corporate account to access the internal resources. In addition, its local policies are pushed from the Active Directory infrastructure. It also has full-disk encryption software and antivirus software.

My personal laptop is maintained by myself: I patch it when the update client pops up. It has an antivirus and I use two files as TrueCrypt containers for my personal data. It doesn't have any local policy besides the default and is not joined to my organisation's Windows Domain.

Of course, my personal preference is to use my own equipment: it has a keyboard I have been using since I turned 17 and got my first computer, and it is also far more powerful and has four times the RAM. Oh, and it runs a non-Windows OS.

Yet, I accept the fact that I am not using it for work. Why?

Let's imagine I wanted to, and I am talking about really working inside the network, not going through a remote access solution such as Citrix. In order to protect the data at rest, I would need a full-disk encryption solution, but who is going to pay for it? Myself or my company? Second, upon connection to the network, checks should be made to guarantee that my machine is up to date (AV, system and applications) and safe. This calls for a network access control (NAC) solution. While this is always a good idea, in practice I haven't seen it deployed in a large number of organisations, but this is changing, partly because that's usually my first recommendation.
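To make the idea of connection-time checks concrete, here is a minimal sketch of the kind of posture test a NAC solution performs before admitting a machine. Everything in it (the `DevicePosture` fields, the function names) is hypothetical and only mirrors the checks listed above; it is not the API of any real NAC product, which would collect these facts through an agent or a network scan.

```python
from dataclasses import dataclass


@dataclass
class DevicePosture:
    """Hypothetical facts a NAC agent could report about a machine."""
    disk_encrypted: bool
    antivirus_up_to_date: bool
    system_patched: bool
    applications_patched: bool


def failed_checks(posture):
    """Return the names of the checks the device does not pass."""
    checks = {
        "full-disk encryption": posture.disk_encrypted,
        "antivirus up to date": posture.antivirus_up_to_date,
        "system patched": posture.system_patched,
        "applications patched": posture.applications_patched,
    }
    return [name for name, passed in checks.items() if not passed]


def may_join_network(posture):
    """Admit the device only if every check passes."""
    return not failed_checks(posture)


# Example: a personal laptop with file containers but no full-disk encryption.
personal_laptop = DevicePosture(
    disk_encrypted=False,
    antivirus_up_to_date=True,
    system_patched=True,
    applications_patched=True,
)
print(may_join_network(personal_laptop))
print(failed_checks(personal_laptop))
```

Returning the list of failed checks, rather than a bare yes/no, lets the device be sent to a remediation network with a clear message about what to fix.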

Then comes the issue of departure: it is always a sad moment in life when an employee and an employer part ways, but it happens, and for different reasons: the employer terminates the contract, the employee terminates the contract, or something happens that makes the employee unable to perform his duty, death being the obvious reason, but it can also be conviction, deportation or military duty.

So what happens in that case? For the "mobile" devices, namely phones and tablets, there are solutions to remotely wipe the device. The question of whether you'd accept losing your vacation pictures because you may have a contact list from your job is still being debated. But what about laptops, or devices that can't be accessed remotely? Usually, the BYOD contract specifies that you agree to delete the corporate data should you stop working for the company. But that presupposes that you are willing to comply. When everything works fine and everybody is happy, not a big deal. When the sky gets cloudy, different story.

Both Apple and Android products permit the synchronisation to a cloud service. When you get an e-mail or add a contact, a backup copy can be made on the vendor's service. This means that if you have all the corporate contacts on your phone and it is remotely wiped, you may still have the contacts in your backups, possibly accessible from a different device or even the same device after being reinstalled.

Different vendors have come up with a containerised solution: the corporate applications run in their own mini-environment and the data is kept there as well. That solves the encryption and backup-to-the-cloud issues, but it creates new demands, such as being able to work with the device's native applications. Egg or chicken?

Next, there is the risk of out-of-band communications: if I am allowed to use a personal device as a work device, I may consider it a work device and use it for work communications outside of the normal channels. This is especially true with phones: if you are allowed to use your phone for your corporate e-mail, why not call a client with it? Or text him?

Certain industries, such as the financial industry, have very strict rules when it comes to communication and require that certain types of discussion be filed. If an employee uses his own device, what are the chances he will drop the personal device, get his corporate phone and send a text? In order to be compliant with the SEC rules, all text messages from the personal device now have to go through a corporate gateway to be analysed before filing.

Lastly, there is the confidence factor: how many of us would feel safe or protected if a doctor were to tell us that "all your medical information is on my Google account" or "is stored on my iPad"? While I do trust Google and Apple to do an awesome job at securing their systems, I don't trust people when it comes to choosing strong passwords.

In conclusion, in my view BYOD is an aberration: it is a sore mistake and a very bad trend. It falls on corporate management to make sure that this trend is reversed and that employees are not allowed to use their personal devices. Combined, the Target and Neiman Marcus breaches totalled more than 50 million records. Let's not prepare for the next 100-million-record breach.

Wednesday, February 5, 2014

"Steve Jobs Shows the Mac", 1984

A nice piece of history: Steve Jobs showing the Mac at the Boston Computer Society in 1984. Some of the engineers answered questions from the public, and Steve "Woz" Wozniak joined.

The video is here.

Monday, February 3, 2014

"Senators Introduce Bill to Protect Against Data Breaches"

Senator Dianne Feinstein (D-Calif.) and three other senators have introduced a bill that would, if it passes, try to address the issue of companies being less than serious with personal information.

Following the breaches of Target and Neiman Marcus, it became clear that the current controls in place are far from being adequate in an increasingly adverse world. I am interested in the rules the FTC will develop.

More here.

Friday, January 31, 2014

Yahoo prompts users to change passwords

Yahoo prompted its users to change their passwords after a database of usernames and passwords was accessed by unnamed attackers. Yahoo claims that its own systems were not compromised, but that a third party's were.

More here.

Monday, January 20, 2014

The worst passwords of 2013

SplashData has compiled a list of the worst passwords for 2013 (okay, this is subjective). No comment.

Friday, January 17, 2014

IBM to invest some serious money into Watson

Do you remember Watson, IBM's Jeopardy winner? Well, after its triumphant appearance on the game show, IBM tried to position it as a medical advisor, but so far, success hasn't been there.

Recently, the (big) blue company announced it would pour $1 billion into the business development, to help place the cyber doctor/advisor. A few reasons are presented for why sales have not skyrocketed.

This is interesting, as there have been a number of initiatives to combine machine learning with medicine. In several cases, the machine was able to find a better treatment, i.e. a more efficient or cheaper one, than its flesh-and-bone counterpart. The underlying, unsaid reason (in my view) is that a machine doesn't partake in "sales" politics: it doesn't favour a specific brand, nor does it try to "treat without curing".

Anyway, I really hope Watson becomes more of a success: with the explosion of diseases, such as autoimmune diseases or cancers, we really need all the brainpower we can get, both hardware and wetware.

Wednesday, January 15, 2014

An introduction to Firmware Analysis [30c3]

For many, the term "firmware" refers to some kind of black-box software that no one really has access to. This talk explains how to analyse such an image. For example, that is how it was recently found that certain consumer routers have a default hardcoded username/password, or that some administrative pages were accessible without authentication.

A very good talk from Stefan Widmann. Enjoy!

Monday, January 13, 2014

Target breach worse than initially thought

I guessed the Target breach would prove worse than initially thought, but that much worse? Wow! No.

In addition to the 40 million credit and debit card records stolen, it seems that "at least 70 million PII records were also accessed." The Star Tribune also mentions the opinion of Jack Tomarchio, an attorney specializing in cybersecurity and data protection, who claims that if the credit and debit card breach was bad, the PII one is even worse: the banks can quickly revoke a credit or debit card, but people are usually unwilling to change where they live or their name.

And to get 2014 off to a good start, not only were Target and Neiman Marcus hit, but it appears that several other retailers suffered the same type of breach.

2014 already announces itself as the Year of the Permanent Credit Card Monitoring.