Linear Algebra Concepts in Machine Learning and Data Mining

In an earlier blog post, we pointed out the importance of a strong understanding of statistics for people working in the field of Data Science.

In this post, we highlight the importance of understanding concepts from another field of mathematics: Linear Algebra.

Each data instance can be represented as a d-dimensional vector, where the d components of the vector are the attributes (features) of the data instance.

The Vector Space Model, used extensively in Information Retrieval (ranked retrieval), recommendations, and text mining applications, has its foundations in the following concepts from Linear Algebra:

Vector Spaces

Euclidean Norm

Cosine Similarity

Distance Measures
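To make these concepts concrete, here is a minimal Ruby sketch of the Euclidean norm, cosine similarity, and Euclidean distance on plain arrays. The helper names and the toy term-count vectors are our own illustrative assumptions, not part of any library.

```ruby
# Hypothetical helpers; vectors are plain Ruby arrays of numbers.

# Dot product of two equal-length vectors.
def dot(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size
  a.zip(b).sum { |x, y| x * y }
end

# Euclidean (L2) norm: the length of the vector.
def euclidean_norm(v)
  Math.sqrt(dot(v, v))
end

# Cosine of the angle between two non-zero vectors.
def cosine_similarity(a, b)
  dot(a, b) / (euclidean_norm(a) * euclidean_norm(b))
end

# Euclidean distance between two points.
def euclidean_distance(a, b)
  euclidean_norm(a.zip(b).map { |x, y| x - y })
end

# Made-up term-count vectors over a three-word vocabulary.
DOC_A = [2, 1, 0]
DOC_B = [4, 2, 0] # same direction as DOC_A, twice the length
DOC_C = [0, 0, 3] # shares no terms with DOC_A

puts cosine_similarity(DOC_A, DOC_B)  # close to 1: same direction
puts cosine_similarity(DOC_A, DOC_C)  # 0: orthogonal documents
puts euclidean_distance(DOC_A, DOC_B) # non-zero, although the direction is identical
```

Note how DOC_A and DOC_B are far apart by Euclidean distance yet essentially identical by cosine similarity, which is why ranked retrieval usually compares directions rather than raw distances.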

Dimensionality Reduction techniques such as Singular Value Decomposition (SVD) and Principal Components Analysis (PCA), used in machine learning and data mining, are based on concepts of Eigenvalues and Eigenvectors (spectral methods).
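As a small illustration of the eigenvalue idea, the sketch below finds the dominant eigenvalue and eigenvector of a made-up 2x2 covariance matrix using power iteration. This is a deliberately simple technique chosen for readability; it is not how SVD or PCA libraries are typically implemented, and the function names are our own.

```ruby
# Multiply a square matrix (array of rows) by a vector.
def mat_vec(m, v)
  m.map { |row| row.zip(v).sum { |a, b| a * b } }
end

# Power iteration: repeatedly apply the matrix and renormalise.
# Returns [eigenvalue_estimate, unit_eigenvector_estimate].
def dominant_eigenpair(m, iterations = 100)
  v = Array.new(m.size, 1.0)
  iterations.times do
    w = mat_vec(m, v)
    norm = Math.sqrt(w.sum { |x| x * x })
    v = w.map { |x| x / norm }
  end
  # Rayleigh quotient v^T M v (v has unit length).
  lambda_est = v.zip(mat_vec(m, v)).sum { |a, b| a * b }
  [lambda_est, v]
end

# Made-up symmetric matrix; its eigenvalues are (5 + sqrt(5))/2 and (5 - sqrt(5))/2.
COVARIANCE = [[2.0, 1.0],
              [1.0, 3.0]]

value, vector = dominant_eigenpair(COVARIANCE)
puts value
puts vector.inspect
```

In PCA terms, the returned vector would be the first principal direction: the axis along which data with this covariance varies the most.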

The books mentioned in the references below are good starting points.


1. Gilbert Strang, Introduction to Linear Algebra,  4th edition, Wellesley-Cambridge, 2009.

2. Philip N. Klein, Coding the Matrix: Linear Algebra Through Applications to Computer Science, Newtonian Press, 2013.


The Importance of Statistics in School Math Curriculum

With the explosion in applications of machine learning and applied statistics to Big Data Analytics, Predictive Analytics, Recommendation Engines, Genetics, Bio-informatics, Data Mining etc., we frequently hear about a shortage of Data Scientists.  Data Science is at the intersection of system building and applied statistics/machine learning.  System building here refers to efficient means to store large volumes of data in a database (SQL/NoSQL), as well as programming paradigms to efficiently retrieve and process the large volumes of data.  People with system building skills often lack the necessary inferential statistics/machine learning background.  Unfortunately, gaining expertise in inferential statistics/machine learning is not simple, and requires concerted effort over a period of time.

In order for people to work effectively on Data Science, a strong grounding in probability and statistics is necessary.  The shortage of people with such skills can be traced to the way mathematics is taught at the school and college level.  Currently, concepts in numbers, algebra, geometry, and calculus are built up in a systematic manner over the school and college years.  These concepts stay with us long after we have completed school/college.  A similar emphasis needs to be placed on probability and statistics concepts from an early stage. Probability and statistics need to be made an integral part of the school mathematics curriculum, and need far greater attention.

Some topics that need emphasis, and need to be built up in a systematic manner over the school (and college) years, include:

Sample Space, Events, and Probability

Conditional Probability, Independence and Bayes Theorem

Random Variables

Probability Distributions

Populations, Samples, Statistics, and Sampling Distributions

Central Limit Theorem

Covariance and Correlation


Hypothesis Testing

Linear Models and Regression

Clearly, the mathematics curriculum at both the school and college level needs to change, in order to keep up with the needs of changing times.


1. William Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Edition, John Wiley, 1968.

2. Alexander M. Mood, Franklin A. Graybill, Duane C. Boes, Introduction to the Theory of Statistics, 3rd Edition, McGraw Hill, 1974.

3. Douglas C. Montgomery and George C. Runger, Applied Statistics and Probability for Engineers, 5th Edition, Wiley, 2011.


Message Passing in Smalltalk

Message passing (or method invocation) in Ruby has many similarities with message passing in Smalltalk.

A message is a (synchronous) method call on an object, and this call can take zero or more parameters.  The term “message passing” is typically used in concurrent/actor based computations, although this is not the computational model for Smalltalk (or Ruby for that matter).  There is some discussion here on how the term “message passing” came to be used for Smalltalk method invocations. 

All communication in Smalltalk has the form: receiver selector argument(s)

The receiver is the object to which the message is being sent.

The message selector indicates the operation to be performed.

When a message is sent to an object, the Smalltalk system looks to see if a method with that name exists for the object. If there is a method, it is executed. If no method of that name is defined in the object’s class, the system looks in the method dictionary for its immediate superclass. If there is no method with that name in the superclass it looks in the superclass’s superclass and so on.
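Since Ruby's method invocation works in a similar way, the lookup described above can be sketched in Ruby. The classes below are hypothetical and exist only for illustration.

```ruby
class Animal
  def describe
    "I am #{name}"
  end

  def name
    "an animal"
  end
end

class Dog < Animal
  # Overrides Animal#name; describe is inherited from Animal.
  def name
    "a dog"
  end
end

REX = Dog.new

# Calling a method and explicitly sending a message are equivalent.
puts REX.describe
puts REX.send(:describe)

# The classes searched, in order, when REX receives a message.
puts Dog.ancestors.inspect

# Which class in the chain actually supplies each method.
puts REX.method(:name).owner     # Dog
puts REX.method(:describe).owner # Animal
```

The describe method is found in the superclass, but the name message it sends is looked up again starting from the receiver's own class, so the Dog version is used.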

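Smalltalk has no control-flow keywords (only a handful of reserved pseudo-variables such as self, super, nil, true, and false), and all Smalltalk control constructs are implemented by message passing.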
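To give a feel for how a conditional can be just a message, here is a rough Ruby imitation of Smalltalk's ifTrue:ifFalse:. The class and method names are our own inventions, and the sketch assumes blocks are passed as lambdas; real Smalltalk does this with its True and False classes.

```ruby
# Hypothetical stand-ins for Smalltalk's True and False classes.
class SmalltalkTrue
  # Mirrors ifTrue:ifFalse: ; only the first block is evaluated.
  def if_true_if_false(true_block, _false_block)
    true_block.call
  end
end

class SmalltalkFalse
  # Only the second block is evaluated.
  def if_true_if_false(_true_block, false_block)
    false_block.call
  end
end

# Bridge from a Ruby boolean to the objects above. The ternary here is
# Ruby's own construct and is only needed because this is an imitation.
def st_bool(condition)
  condition ? SmalltalkTrue.new : SmalltalkFalse.new
end

result = st_bool(3 > 2).if_true_if_false(-> { "greater" }, -> { "not greater" })
puts result
```

The branch taken is decided purely by which object receives the message, which is ordinary method lookup rather than special syntax.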


Classes in Smalltalk

All Smalltalk objects are instances of classes. Classes themselves are instances of a metaclass.

Smalltalk objects follow 5 simple rules (described in an earlier post), which are restated here:

1. Everything is an object.

2. Every object is an instance of a class.

3. Every class has a superclass.

4. A class defines its behavior via public methods, and the structure of its instances via instance variables which are private to the instances.

5. Objects only communicate via message passing (i.e., method invocation).

A Class definition has four parts:

1. Class name.

2. Superclass name.

3. Declaration of local (instance) variables.

4. A set of methods that define how instances will respond to messages.

Classes are used to:

1. Create new instances.

2. Define what the instances of a class do.

3. Hold class information as class variables.

An object has:

1. A pointer to the class (to access the instance methods).

2. Values (instance variables).

Every class is an instance of a metaclass. A metaclass is:

1. Implicit and anonymous.

2. Created automatically when a class is defined.

3. Accessible through its sole instance (the class itself).

The metaclass provides methods for:

1. Instance creation.

2. Initialization of class variables.

3. Class information (Inheritance links, instance variables etc.).

Since classes are also objects, they can have their own instance variables, referred to as class instance variables, and their own methods, referred to as class methods.  Class instance variables and class methods are really no different from ordinary instance variables and instance methods.  Class instance variables are just instance variables defined by a metaclass, and class methods are just methods defined by the metaclass.

Another consequence of classes being objects is that they can be sent messages just like any other ordinary object. Method lookup follows the inheritance chain.
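Ruby borrows this idea, so a Ruby sketch may help. Here the singleton class of Account plays a role loosely comparable to a Smalltalk metaclass; the Account class and its methods are hypothetical.

```ruby
class Account
  # Instance variable of the class object itself
  # (one per class, not one per Account instance).
  @created = 0

  # Class method: defined on Account's singleton class.
  def self.create
    @created += 1
    new
  end

  def self.created
    @created
  end
end

# The class is an ordinary object that can be sent messages.
Account.create
Account.create

puts Account.created
puts Account.class # Class
puts Account.singleton_class.instance_methods(false).sort.inspect
puts Account.respond_to?(:create) # true
```

The class methods show up as instance methods of the singleton class, which mirrors the statement that class methods are just methods defined by the metaclass.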

Metaclasses follow 5 simple rules (described in an earlier post), which are restated here:

1. Every class is an instance of a metaclass.

2. The metaclass hierarchy parallels the class hierarchy.

3. Every metaclass inherits from Class and Behavior.

4. Every metaclass is an instance of Metaclass.

5. The metaclass of Metaclass is an instance of Metaclass.


1. Pharo by Example: The Smalltalk Object Model

2. Pharo by Example: Classes and metaclasses

3. Classes and Metaclasses in Smalltalk by Jim Althoff


Smalltalk’s Object Model

Ruby’s object model draws upon many ideas from the Smalltalk object model.

To better understand the Ruby object model, it is instructional to look at the Smalltalk object model in some detail.

The Smalltalk object model is based on a set of simple rules that are applied uniformly:

Rule 1. Everything is an object.

Rule 2. Every object is an instance of a class.

Rule 3. Every class has a superclass.

Rule 4. A class defines its behavior via public methods, and the structure of its instances via instance variables which are private to the instances.

Rule 5. Objects only communicate via message passing (i.e., method invocation). When an object receives a message, the corresponding method is looked up in the class of the receiver. If the method is not found in this class, the search continues in the receiver class’s superclasses.  In other words, method lookup follows the inheritance chain.

Rule 6. Every class is an instance of a metaclass.  A metaclass is created automatically whenever a class is created.  Metaclasses are implicit, since the programmer does not need to do anything explicit with them.  A class and its metaclass are two separate classes, even though the class is an instance of the metaclass.  The metaclass has exactly one instance.  Metaclasses are anonymous. However, they can be referred to through the class that is their sole instance.

Rule 7. The metaclass hierarchy parallels the class hierarchy.  For example, assume that a class X inherits from class Y.  Then the class X is the sole instance of the metaclass of X and class Y is the sole instance of the metaclass of Y.  In addition, the metaclass of X inherits from the metaclass of Y.

Rule 8. Every metaclass inherits from Class and Behavior.

Rule 9. Every metaclass is an instance of Metaclass.

Rule 10. The metaclass of Metaclass is an instance of Metaclass.
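Ruby's singleton classes give a rough analogue of rules 6 and 7, which can be checked directly. The sketch below uses two hypothetical classes. Note one difference: Ruby has no separate Metaclass class, so singleton classes are themselves instances of Class.

```ruby
class Shape; end
class Circle < Shape; end

shape_meta  = Shape.singleton_class
circle_meta = Circle.singleton_class

# Analogue of rule 6: the class is an instance of its own singleton class.
puts Circle.is_a?(circle_meta)      # true
puts circle_meta.singleton_class?   # true

# Analogue of rule 7: the singleton-class hierarchy parallels the class hierarchy.
puts Circle.superclass == Shape            # true
puts circle_meta.superclass == shape_meta  # true

# Loose analogue of rule 8: singleton classes of classes inherit from Class
# (and from Module, which is roughly comparable to Smalltalk's Behavior).
puts circle_meta.ancestors.include?(Class)  # true
puts circle_meta.ancestors.include?(Module) # true

# Where Ruby departs from rules 9 and 10: there is no Metaclass class.
puts circle_meta.class # Class
```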

More details can be found in this very readable book: Squeak by Example


Ruby Class Hierarchy (core classes)


Programming Language Paradigms in Ruby

The Ruby programming language has very elegantly and cleverly woven together a number of well-known and well-understood programming paradigms.  We list some of these paradigms below.  In a series of following posts, we highlight some of the Ruby language features as viewed within the context of these paradigms.

1. Imperative Paradigm:

Statements: The program is a sequence of statements, where each statement operates on the current state of the program (represented by variables), and transforms the state. Statements include assignments, control flow (conditionals, loops), and function/procedure calls.

Expressions: Values, variables, and functions connected using operators and operator precedence rules.

Assignments: The value of an expression is assigned to a variable.  The assignment operator is typically either “=” or “:=”.

Control Flow: Includes conditionals (if-then-else, case) and loops (while, for).

Procedures: A construct to structure the statements of a program.  The statements within the procedure are executed by calling the procedure with zero or more parameters, and the procedure may return one or more values.  Procedures may or may not have side-effects.

Scope: Determines the visibility and accessibility of variables during the execution of the program.

2. Dynamic Typing:

Type concept: The type of a value/variable constrains the values and operations on the value/variable.

Static Typing: The values which a variable can hold are constrained by the type of that variable through a type declaration.  The language implementation enforces the type constraints, and these constraints can be determined at compile time.

Dynamic Typing: Values have types. Variables do not have any type declarations and the variable can hold objects of different types.  Type checking occurs at runtime.

Strong and Weak Typing: In a strongly typed language, almost all operations are checked for type correctness.  In a weakly typed language, type checking is less strict, and the language usually supports some form of implicit type conversion and overloading.

Method (or Dynamic) Dispatch: Method invocations are viewed as messages sent to an object.  The actual method to invoke is determined at runtime (also referred to as dynamic binding).

Duck Typing: The type of an object is determined by what it can do, and not by its class.  A statement calling a method “m” on an object does not rely on the declared type of the object; it only requires that the object, of whatever type, implements the method “m”.
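The typing ideas above can be seen in a few lines of Ruby. This is only a sketch; the Duck and Robot classes and the helper methods are hypothetical.

```ruby
# Dynamic typing: a variable has no declared type; the values it holds do.
value = 42
puts value.class
value = "forty-two"
puts value.class

# Strong typing: mixing incompatible types is rejected at runtime.
def add_number_and_text
  1 + "2"
rescue TypeError => e
  "TypeError: #{e.message}"
end
puts add_number_and_text

# Duck typing: anything that responds to "speak" is acceptable.
class Duck
  def speak
    "Quack"
  end
end

class Robot
  def speak
    "Beep"
  end
end

# No type declaration or common superclass is required of "thing".
def make_it_speak(thing)
  thing.speak
end

# Checks capability at runtime instead of checking the class.
def try_to_speak(thing)
  thing.respond_to?(:speak) ? thing.speak : "cannot speak"
end

puts make_it_speak(Duck.new)
puts make_it_speak(Robot.new)
puts try_to_speak(Object.new)
```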

3. Object-Oriented Paradigm:

Class: A class defines the structure and common behavior of a group of instances.  In other words, a class is a template for its instances.  A class has data definitions (class and instance variables) and method definitions (class and instance methods).  In some languages (e.g., Ruby, Python), classes are also objects.

Object: An instance of a class. 

Metaclass: A metaclass is a class whose instances are classes.  Just as an object is an instance of its class, a class is an instance of its metaclass.

Class Inheritance:  A sub-class can inherit properties from a superclass.

Mixin:  A mixin (or module) is a collection of methods and, occasionally, instance variables.  Mixins are not instantiable.  Typically, mixins do not have instance variables; a mixin that requires its own state should be written as a class.  A class can include one or more mixins.  Mixins get inserted into the class hierarchy just above the including class.  Since a mixin acts like a sort of superclass, a mixin cannot override a method in the class that includes it.  Mixins allow for code reuse without some of the problems of multiple inheritance.
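A short Ruby sketch of the mixin behavior described above, using a hypothetical Greeter module and Person class:

```ruby
module Greeter
  def greet
    "Hello from #{name}"
  end

  def name
    "the mixin"
  end
end

class Person
  include Greeter

  # Defined in the including class, so it takes precedence over Greeter#name.
  def name
    "the class"
  end
end

# The module sits just above the including class in the lookup chain.
puts Person.ancestors.first(2).inspect # [Person, Greeter]

# greet comes from the mixin, but name resolves to the class's own method.
puts Person.new.greet # Hello from the class

# Modules cannot be instantiated.
puts Greeter.respond_to?(:new) # false
```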

4. Functional Paradigm:

Functions: A function (y = f(x)) takes inputs and returns an output.  Functions don’t change state, and there are no side-effects (they do not modify input parameters or global variables).  A function called again with the same inputs will return the same output.  Code is reused by function composition (e.g., tan(x) = sin(x) / cos(x)).

First-class objects: An entity that can be assigned to a variable, constructed at runtime (dynamic creation), passed as a parameter, and returned from a subroutine.  For example, values of data types such as integer, float, string, symbol, array, and hash are first-class objects.  In functional programming languages, functions are first-class objects.  In some languages, classes can also be first-class objects.

Closures: First class functions which can close over variables in their surrounding environment at creation time.  Closure = <function, environment>.

Map: Takes a sequence/collection, transforms each item in the sequence with a conversion function, returns another sequence.

Filter: Takes a sequence/collection, returns another sequence which is filtered by a predicate.

Reduce: Takes a sequence/collection, iterates through each item in the sequence, and builds up a result value.
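The functional ideas above map onto Ruby fairly directly. The following is a minimal sketch; the constant and method names are our own, and Ruby's select plays the role of filter.

```ruby
# A lambda is a first-class object: it can be stored, passed, and returned.
SQUARE = ->(x) { x * x }

# Takes a function as a parameter and applies it twice.
def apply_twice(fn, x)
  fn.call(fn.call(x))
end

# Returns a closure that keeps access to the local variable "count"
# even after make_counter has returned.
def make_counter
  count = 0
  -> { count += 1 }
end

NUMBERS = [1, 2, 3, 4, 5]

# Map: transform each item.
SQUARES = NUMBERS.map(&SQUARE)

# Filter: keep the items that satisfy a predicate.
EVENS = NUMBERS.select(&:even?)

# Reduce: build up a single result value.
TOTAL = NUMBERS.reduce(0) { |sum, n| sum + n }

puts SQUARES.inspect        # [1, 4, 9, 16, 25]
puts EVENS.inspect          # [2, 4]
puts TOTAL                  # 15
puts apply_twice(SQUARE, 3) # 81

counter = make_counter
counter.call
puts counter.call # 2
```

Note that the counter closure deliberately carries mutable state, so it is a closure but not a pure function in the sense described under Functions.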


Locale Errors on Ubuntu

Frequently, the following locale errors are seen on Ubuntu:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

The solution is to install the appropriate language pack (see: http://packages.ubuntu.com/maverick/translations/, replace maverick with whichever release you are on), if not already installed.  Then, generate the locale for the appropriate language(s).

For example:

$ sudo apt-get install language-pack-en
$ sudo locale-gen en_US.UTF-8


Ruby 1.9.2 Install from Source on Ubuntu

Ruby 1.9.2 became available recently (http://www.ruby-lang.org/en/news/2010/08/18/ruby-1-9.2-released/).  The performance of Rails 3 with Ruby 1.9.2 is great.  Here are the steps to install Ruby 1.9.2 from source on Ubuntu.

We installed on Ubuntu 10.04 LTS Lucid at Slicehost.  These steps should work fine on other versions of Ubuntu as well:

$ sudo apt-get install build-essential readline-common libreadline-dev libssl-dev libpcre3-dev libxml2 libxml2-dev libxslt-dev

$ wget http://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.2-p0.tar.gz
$ sudo mv ruby-1.9.2-p0.tar.gz /usr/local/src
$ cd /usr/local/src
$ sudo tar xvf ruby-1.9.2-p0.tar.gz
$ cd ruby-1.9.2-p0
$ sudo ./configure
$ sudo make
$ sudo make install

$ gem env
RubyGems Environment:
  - RUBY VERSION: 1.9.2 (2010-08-18 patchlevel 0) [x86_64-linux]
  - INSTALLATION DIRECTORY: /usr/local/lib/ruby/gems/1.9.1
  - RUBY EXECUTABLE: /usr/local/bin/ruby
  - EXECUTABLE DIRECTORY: /usr/local/bin
  - RUBYGEMS PLATFORMS:
    - ruby
    - x86_64-linux
  - GEM PATHS:
     - /usr/local/lib/ruby/gems/1.9.1
     - /home/admin/.gem/ruby/1.9.1
  - GEM CONFIGURATION:
     - :update_sources => true
     - :verbose => true
     - :benchmark => false
     - :backtrace => false
     - :bulk_threshold => 1000
  - REMOTE SOURCES:
     - http://rubygems.org/

To install rails, simply do:

$ sudo gem install --no-rdoc --no-ri rails


Missing the Rails 2.3.8 gem (with thin)

We started getting this strange error on a development server where we had recently updated the gems.  We are using thin (http://code.macournoyer.com/thin/) as the app server.  It turned out that rack 1.2 was installed, and this does not play well with Rails 2.3.8.  Since thin also depends on rack, starting thin loads the most recent version of the rack gem (1.2), which causes Rails to fail.  The solution is to remove all versions of the rack gem (gem uninstall -a rack), and then install only version 1.1.0 of the rack gem (gem install rack -v 1.1.0).