Lecture Notes on Least Squares and Data Fitting  (Friday, October 20)

We are given measured data about cancer rates and meat consumption in various countries:
 
       Country          Meat Consumption            Cancer Rate
                       (grams/day/person)        (per 100,000 women/year)
--------------------------------------------------------------------------------
       Japan                  26                        7.5
      Finland                101                        9.8
       Israel                124                       16.4
    Great Britain            205                       23.3
        US                   284                        34
How closely related is meat consumption to cancer rate?  At first glance there seems to be some connection: as meat consumption goes up, the cancer rate appears to rise with it.  We should take a closer look.  We can use MATLAB to plot the data:
 
     >> x = [ 26, 101, 124, 205, 284 ]
     >> y = [ 7.5, 9.8, 16.4, 23.3, 34]
     >> plot(x,y,'*',x,y)


Based on this graph, is it reasonable to think that the data should lie along a straight line?  Adjusting somehow for experimental error, could we fit this data onto a straight line?  We get a better idea if we adjust the data so that our theoretical line would run through the origin.  To do this we compare to the means: we compute the average meat consumption and the average cancer rate, and for each country we look at how far above or below the average it is.  Again MATLAB is a help:

     >> xm = x - sum(x)/5
     >> ym = y - sum(y)/5
tells us that in Japan meat consumption per person per day is 122 grams below the average for all 5 countries.  In general we get a new table
 
 
       Country       Meat Consumption - Average     Cancer Rate - Average
                        (grams/day/person)        (per 100,000 women/year)
--------------------------------------------------------------------------------
       Japan                -122                        -10.7
      Finland                -47                         -8.4
       Israel                -24                         -1.8
    Great Britain             57                          5.1
        US                   136                         15.8


Question:  Is the rise in cancer rates we see here proportional to the increase in meat consumption?

If so, then these two vectors in 5-space would be multiples of each other.  This is not the case: 122/10.7 = 11.4019, whereas 47/8.4 = 5.5952 and 24/1.8 = 13.3333.  On the other hand, it is a bit much to expect measured data to demonstrate this relationship perfectly.  Mathematically we can ask: how close do these vectors come to being multiples of each other?  To answer this question we can compute the angle between the two vectors.  If this angle is small, then the vectors are nearly multiples of each other.  We can determine this angle by computing the dot product of the vectors, dividing by their lengths, and taking the inverse cosine:

     >> costheta = sum( xm.* ym)/(norm(xm)*norm(ym))
     % Note that xm.* ym multiplies the vectors xm and ym componentwise.
     % Applying sum to this product vector gives the dot product of xm and ym
     % As we might expect norm(xm) is just the length of xm.
     >> theta = acos( costheta )
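Since MATLAB's acos returns the angle in radians, one extra line (a small addition, assuming theta as computed above) converts it to degrees:

     >> theta_degrees = theta*180/pi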
In this way we find that the angle between these two vectors in 5-space is about 12 degrees, and so we can view this data as evidence of a direct relationship between meat consumption and cancer rates.  Based on our data, what should the constant of proportionality be?  If there were no meat consumption at all, would we still expect to see occurrences of cancer?  What can we deduce from this very limited data?

Assume that there is a linear relationship between meat consumption and cancer rates, that is
 

              Cancer Rate = a*(Meat Consumption Rate) + b
for some constants a and b.  This translates into a matrix equation
              [ 26    1 ]               [  7.5 ]
              [ 101   1 ] * [ a ]   =   [  9.8 ]
              [ 124   1 ]   [ b ]       [ 16.4 ]
              [ 205   1 ]               [ 23.3 ]
              [ 284   1 ]               [  34  ]
This system is inconsistent, as we already know.  Otherwise the data points (26,7.5), (101,9.8), etc. would all have fallen on a line of slope a and intercept b when we plotted them.  We are assuming that their failure to lie on a line is due to experimental error.  We want to correct for that error by finding a slope a and an intercept b that come as close as possible to our measured data.  How can we do this?
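As a quick check in MATLAB (a sketch reusing the vectors x and y entered earlier), we can build the coefficient matrix and compare ranks to confirm the inconsistency:

     >> A = [ x' ones(5,1) ]       % first column is meat consumption, second column is all ones
     >> rank(A)                    % 2
     >> rank([ A  y' ])            % 3, so y' is not in the image of A and the system has no solution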

More generally, if we wish to solve Ax = b but we find that it has no solutions, we want to find the best approximate solution.  We must consider all possible choices of x and pick out the one so that Ax comes closest to b.

Claim:  The best approximation is found when we choose x so that Ax is the orthogonal projection of b onto the image of A.

Explanation:  Let v be any vector in the image of A.  Let p be the orthogonal projection of b onto the image of A.  Then b - p is in the orthogonal complement of the image of A.  Note that

              b - p is orthogonal to p since p is in the image of A.
              b - p is orthogonal to v since v is in the image of A.
              b - p is therefore orthogonal to p - v as well.
Thus we have a right triangle whose edges are the vectors b - p and p - v and their vector sum b - v.  By the Pythagorean theorem we know that
       || b - v ||^2 = || b - p ||^2 + || p - v ||^2
Therefore, || b - v || is greater than or equal to || b - p || for all v in image(A).  The vector v in the image of A that makes the norm of b - v smallest is therefore the projection vector p.

Upshot:  The best approximation to b in the image of A is the orthogonal projection of b onto the image of A.  We must choose our approximate solution x* so that

                     Ax* = p
How do we do that?  If x* makes Ax* = p, then b - Ax* is orthogonal to the image of A.  Now we claim that the orthogonal complement of the image of A is just the kernel of A^T.  To see this, write
     A = [ v1 | v2 | ...| vn]    where vi is the ith column of A.
Since the image of A is just the span of the columns of A, a vector x is in the orthogonal complement of the image of A if and only if its dot product with any linear combination of the vi's is 0.  But this is true if and only if the dot product of x with all the individual vi's is zero.  In matrix terms this is exactly the same as saying that x is in the kernel of the matrix whose ith row is vi.  Thus x is in the orthogonal complement of the image of A if and only if x is in the kernel of A^T.
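As a numerical illustration in MATLAB (a sketch using the built-in null function and the matrix A from our example):

     >> A = [ x' ones(5,1) ];
     >> N = null(A')               % columns of N form a basis for the kernel of A^T
     >> A'*N                       % essentially zero, so each basis vector is orthogonal to every column of A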

Going back to our problem of finding the best approximate solution to Ax = b, we need to choose x* so that b - Ax* will be in the kernel of A^T.  Thus
 

          A^T (b - Ax*) = 0  ----->  A^T b = A^T A x*
In our example we find that although
 [ 26   1 ]             [  7.5 ]
 [101   1 ] *[ a ]  =   [  9.8 ]
 [124   1 ]  [ b ]      [ 16.4 ]
 [205   1 ]             [ 23.3 ]
 [284   1 ]             [  34  ]
has no solution, we can find the best approximate solution by solving
                                  [ 26   1 ]                                           [  7.5 ]
 [ 26   101   124   205   284 ]   [101   1 ] *[ a ]  = [ 26   101   124   205   284 ] *[  9.8 ]
 [  1     1    1     1     1  ] * [124   1 ]  [ b ]    [  1     1    1     1     1  ]  [ 16.4 ]
                                  [205   1 ]                                           [ 23.3 ]
                                  [284   1 ]                                           [  34  ]


   [ 148934       740 ] *[ a ]   =  [ 17650.9 ]
   [    740         5 ]  [ b ]      [    91   ]
instead.   This we could solve by hand or in MATLAB to get that the line of best fit to the original data is
                 y = 0.1061*x + 2.4932
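For instance, in MATLAB (a sketch reusing the vectors x and y from before; the backslash operator solves the square 2 x 2 system):

     >> A = [ x' ones(5,1) ];
     >> coeffs = (A'*A) \ (A'*y')  % returns the slope and intercept, approximately [ 0.1061 ; 2.4932 ]

In fact, applying backslash to the rectangular system directly, as in A\y', returns this same least-squares solution.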
In general we see that if we have n data points (xi,yi) and we want to find the line of best fit y = ax + b, then we will end up with a 2 x 2 linear system of the form
            [  sum of xi^2      sum of xi   ]*[ a ] = [ sum of xi*yi  ]
            [  sum of xi           n        ] [ b ]   [ sum of yi     ]
and in MATLAB we could set up and solve the above system using
     >> n = length(x)                       % number of data points (5 in our example)
     >> M = [ sum(x.*x)    sum(x)   sum(x.*y) ;
              sum(x)        n       sum(y)    ]
     >> rref(M)                             % the last column of the result is [ a ; b ]
 A final question: will this always work out?  Will there always be a unique line of best fit, and will the method above give it to us?  This is the same as asking whether the system
                        A^T b = A^T A x*
will always have a unique solution.  The matrix A^T A will be square, so the equation above will have exactly one solution x* if and only if this matrix is invertible.  But we claim that this matrix will be invertible if and only if the columns of A are independent.

Theorem:  The matrix A^T A is invertible if and only if the columns of A are independent.
Proof:
First notice that A^T A is invertible if and only if the kernel of A^T A contains only the zero vector, and similarly the columns of A are independent if and only if the kernel of A contains only the zero vector.  So it suffices to show that

                     kernel(A^T A) = kernel(A)
If x is in the kernel of A, then Ax = 0 and therefore A^T A x = 0.  Thus the kernel of A is a subset of the kernel of A^T A.  Conversely, if x is in the kernel of A^T A, then A^T A x = 0 and therefore x^T A^T A x = x^T 0 = 0.  But the left hand side of this equation is the same as Ax dotted with itself.  In other words, if x is in the kernel of A^T A, then Ax must be a vector whose dot product with itself is 0, so Ax must be the zero vector and x is in the kernel of A.  Thus the kernel of A^T A is a subset of the kernel of A.  We conclude that they are equal.
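As a numerical illustration (a sketch using the matrix A from our example, whose two columns are clearly independent):

     >> A = [ x' ones(5,1) ];
     >> det(A'*A)                  % 197070, which is nonzero, so A^T A is indeed invertible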