M571 Fall 2014, Lecture 6

1. Agenda

  1. Least Squares: theory and algorithms (part I)
  2. Least Squares: application of QR factorization (and others)

In this section, we discuss in more depth how to solve least squares problems. You may remember that we discussed least squares problems briefly when we discussed the Moore-Penrose pseudo-inverse of a matrix and motivated why we might need an orthonormal basis for the range of a matrix. We will discuss least squares problems first from a theoretical perspective and then discuss how to solve them directly using the QR matrix factorization. This will not be our only discussion of least squares problems and algorithms, but it is our first real algorithm for the problem.
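As a preview of the algorithm, here is a minimal NumPy sketch (the random test problem and variable names are my own) of solving a least squares problem via QR: factor $A = QR$ and solve the triangular system $Rx = Q^*b$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # tall matrix: overdetermined system
b = rng.standard_normal(20)

# Reduced QR factorization: A = QR with Q (20 x 5) having orthonormal
# columns and R (5 x 5) upper triangular
Q, R = np.linalg.qr(A)

# The least squares solution satisfies R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# Sanity check against NumPy's built-in least squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ref))  # True
```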


M571 Fall 2014: Lecture 5



1. Agenda

  1. QR Factorization, Gram-Schmidt (classical and modified)
  2. Householder QR and reflectors

We ended our discussion of projectors with several important observations:

  • If we have a collection of orthonormal vectors $q_1, \dots, q_n$ onto which we want an orthogonal projector $P$, then we can write $P$ as $P = QQ^*$, where $Q$ is the matrix whose columns are the $q_j$.

    This decomposition is not just mathematical decoration: this format is computationally easy to apply to vectors. To multiply $Pv$, we first multiply $Q^*v$ (i.e., take the dot product of each $q_j$ with $v$) and then multiply $Q(Q^*v)$ (i.e., form the linear combination of the $q_j$ with the vector of dot products as the coefficients). We don't form $P = QQ^*$ explicitly and then multiply that matrix by $v$.

  • If we need an orthoprojector onto the span of some linearly independent but not orthonormal vectors $a_1, \dots, a_n$, we need an algorithm for extracting one.
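The first observation above can be sketched in NumPy (the matrices here are illustrative examples of my own): we apply $P = QQ^*$ as $Q(Q^*v)$, never forming $P$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Orthonormal columns q_1, ..., q_5 from a reduced QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((10, 5)))
v = rng.standard_normal(10)

# Apply P = Q Q^* to v without forming P:
coeffs = Q.T @ v        # dot products of each q_j with v
w = Q @ coeffs          # linear combination of the q_j

# P is a projector: applying it twice changes nothing
assert np.allclose(Q @ (Q.T @ w), w)
# And it agrees with the explicitly formed (but more expensive) P = Q Q^T
assert np.allclose((Q @ Q.T) @ v, w)
```

For a 10 x 5 matrix $Q$, the two-step form costs two matrix-vector products rather than forming and storing a 10 x 10 matrix.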

To that end, we discuss Gram-Schmidt, an algorithm for constructing an orthonormal basis for the span of a collection of linearly independent vectors. We will discuss several variants, their properties, and their uses.
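As a preview, here is a minimal sketch of the modified variant in NumPy (a sketch for well-conditioned input; breakdown handling for dependent columns is omitted):

```python
import numpy as np

def mgs(A):
    """Modified Gram-Schmidt: orthonormalize the columns of A.

    Returns Q (orthonormal columns) and upper-triangular R with A = QR.
    Assumes the columns of A are linearly independent.
    """
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        # Immediately remove the q_j component from all remaining columns
        for k in range(j + 1, n):
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return Q, R

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))
Q, R = mgs(A)
assert np.allclose(Q.T @ Q, np.eye(4))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = QR
```

The key design point, which the lecture develops, is that each remaining column is orthogonalized against $q_j$ immediately, rather than all at once at the end as in classical Gram-Schmidt; this improves numerical stability.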


M571 Fall 2014, Lecture 4



1. Agenda.

  1. Projectors and orthogonal projections
  2. Moore-Penrose generalized inverse

In this lecture, we discuss projectors and orthogonal projectors. These are matrices that project vectors onto subspaces in certain geometric fashions. We begin with basic properties of projectors and then move to orthogonal projectors (or ortho-projectors). We toggle back and forth between the algebraic properties of these maps and the more geometric ones. We finish with a proof that demonstrates an important algorithmic idea that we will exploit in the next lecture (and in many subsequent ones).


Strata Conference 2014

I spoke in the Hardcore Data Science session today at the O’Reilly Strata Conference (http://strataconf.com/) “where cutting-edge data science and new business fundamentals intersect—and merge.” It was a fascinating experience and a great session. I’ll post more on the talks later. In the meantime, here are my slides from the session Big Data: Efficient collection and processing.

A little background on the Keith Haring images. They are from his book entitled “Big.” In the early 2000s, I was working on streaming algorithms for Big Data. My daughter was a baby and I bought this book both to read to her and to entertain myself with the lovely artwork and synonyms for “big.” Next time someone talks about Big Data, think, “Gigantic extra-long sleeves.”


Finally, no rats were harmed in the making of this presentation.

M571 Fall 2014, Lecture 3



1. Agenda

  1. Spectral radius and the norm of a matrix
  2. SVD: a key matrix factorization

We ended the last lecture with a procedure for computing the $1$- and $\infty$-norms of a matrix. Unfortunately, we do not have such straightforward results for the $2$-norm. We do have a relationship between the spectral radius of a matrix and its $2$-norm (or, more generally, any induced matrix norm) and, in the case of a Hermitian matrix, we can refine this relationship even more.

In the second part of the lecture, we discuss the Singular Value Decomposition (SVD), a key matrix factorization that we will use repeatedly both in algorithms and analysis thereof. It’s important to understand conceptually what the SVD does and to do so from many different perspectives, as it is used everywhere in NLA.
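A quick NumPy illustration (the test matrix is arbitrary) of the SVD and its connection to the $2$-norm, which equals the largest singular value:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

# Reduced SVD: A = U diag(s) V^*, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# The 2-norm of A is its largest singular value
assert np.isclose(np.linalg.norm(A, 2), s[0])
```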


Math 571 Fall 2014, Lecture 2



1. Agenda

  1. Orthogonal matrices and vectors
  2. Norms

In this lecture, we cover some basic but powerful definitions. Hiding amongst these definitions are powerful relationships amongst the norms of a vector and the (induced) norms of a matrix. These relationships are used repeatedly in the analysis of NLA algorithms. It is quite useful to develop an intuition about the geometry of the different unit balls in high dimensions. We start with the basic definition of a dot-product and what it means for vectors and matrices to be orthogonal. Again, there are a number of fundamental results from linear algebra that we will cover quickly today. Please refresh your memory if these concepts are not immediately familiar to you. We will use them repeatedly throughout the course.


M571 Fall 2014, Lecture 1


1. Agenda

  1. Introduction
  2. Class overview and syllabus
  3. PageRank
  4. Matrix-vector and Matrix-matrix multiplication
  5. Note: This material has been edited from a previous post M571 Fall 2011, Lecture 1.

2. PageRank

Imagine a library with 25 billion documents but no centralized organization and no librarians. Anyone can add a document at any time without telling anyone else. You are sure that one of the documents has a piece of information that is vitally important to you. Goal: find the document that contains this piece of information and find it fast!

Our library is a reasonable model for the world wide web: it is huge, it is highly disorganized, it contains files in many different formats, and it is distributed.

Solution: the PageRank algorithm. Actually, the algorithm we’re going to describe solves a slight variant of the problem above; the description below will make the differences clear. Let’s specify our problem a bit more. We want to find the “best” document that contains the phrase “numerical linear algebra.” Here is a rough outline of how Google or some other information retrieval engine might go about this.

  1. Google runs spiders that crawl the WWW, retrieve pages, index the words in each document, and store this information highly efficiently,
  2. user types in phrase “numerical linear algebra,”
  3. search engine determines all the pages in database that contain phrase,
  4. engine ranks pages automatically,
  5. returns the highest ranked page.

Question: (Assuming that you already have a fast way of determining all the pages that contain the phrase of interest,) how do you determine how important a page is? In other words, how do you rank the pages? (Addressing that first assumption is, in itself, an interesting and challenging algorithmic question.)

Answer: When you create a page, you frequently include links to other pages that contain valuable, reliable information. This affirms the importance of the page that you link to. We summarize this intuition with the following definition: the importance of a page is judged by the number of pages linking to it and the importance of the referring pages.


  • $P_k$ = page $k$,
  • $x_k$ = importance of $P_k$, and
  • $\ell_k$ = number of links outgoing from page $P_k$;
  • the idea is that if $P_j$ links to $P_k$, then it should confer $1/\ell_j$ of its importance to $P_k$.

Set $B_k$ = the set of all pages linking to $P_k$; then $x_k = \sum_{j \in B_k} x_j / \ell_j$.

This definition presents a bit of a dilemma: how do we determine $x_k$ without determining the $x_j$ first?

To figure out how to define $x$ properly, let us construct the hyperlink matrix $H$.

Definition 1 The hyperlink matrix $H$ has entries $H_{kj} = 1/\ell_j$ if page $P_j$ links to page $P_k$, and $H_{kj} = 0$ otherwise.

Properties of $H$:

  1. $H$ has all non-negative entries
  2. the sum of the entries in each column of $H$ is 1 (unless the page has no links!)

Let us set $x$ = the vector of page ranks (or importances) $x_k$, and we conclude that $x = Hx$;

or, $x$ is an eigenvector of $H$ (with eigenvalue 1).

Challenge: Given $H$, compute $x$. Unfortunately, $H$ has some good and bad properties that make computing the eigenvector associated with eigenvalue 1 challenging.

  1. $H$ is square and 25 billion $\times$ 25 billion
  2. many entries of $H$ are 0 (i.e., it’s sparse)
  3. webpages have on average about 10 outgoing links, so each column of $H$ has only about 10 non-zero entries

Solution: The solution we will use is called the Power Method, and it’s a simple algorithm to describe:

  • 1. pick an initial vector $x^{(0)}$
  • 2. iterate: $x^{(k+1)} = H x^{(k)}$, $k = 0, 1, 2, \dots$

General principle: if $x^{(k)} \to x$, then $x = Hx$; i.e., we have a stationary vector.
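Here is a small NumPy sketch of the power method on a toy 3-page web of my own choosing (with no dangling nodes, so the plain iteration behaves):

```python
import numpy as np

# Hyperlink matrix for a toy 3-page web:
# page 1 links to pages 2 and 3; page 2 links to page 3; page 3 links to page 1.
# Column j holds 1/ell_j in the rows of the pages that page j links to.
H = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

x = np.full(3, 1.0 / 3.0)   # initial guess: equal importance
for _ in range(100):
    x = H @ x               # iterate x^(k+1) = H x^(k)

# The iterates settle at a stationary vector x = Hx
assert np.allclose(H @ x, x)
print(x)   # roughly [0.4, 0.2, 0.4]: pages 1 and 3 are most important
```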

Questions: At this point, you should have a number of questions about what we just described. In fact, a large component of this class is to learn what questions are appropriate to “ask” of an algorithm, what answers one should expect or desire, and what questions one can answer (and with what degree of rigor).

  1. Does the sequence $x^{(0)}, x^{(1)}, x^{(2)}, \dots$ always converge?
  2. Does the limit depend on the initial guess $x^{(0)}$?
  3. Do the importance rankings contain the information we want?
  4. What is the computational cost of this method? What are the bottlenecks?

Actually, the algorithm above (as stated) fails the first three questions. We have to fix this up — more on this later!

These are the types of questions/solutions/algorithms that we’ll look at in this class, along with implementation and applications to modern computing. Frequently, we’ll discuss mathematical or analytical issues and computational or implementation issues at the same time.

Fixing up the algorithm. The problem with the algorithm/set-up above is that if a page has no outgoing links, we have a “dangling” node.

Example 1 Consider a set of two pages whose link structure forms the simple graph in which $P_1$ links to $P_2$ but $P_2$ has no outgoing links, so that

$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}.$

If $x^{(0)} = (1, 0)^T$, then $x^{(1)} = (0, 1)^T$ and $x^{(k)} = (0, 0)^T$ for all $k \geq 2$.

Whatever importance $P_1$ conveys to $P_2$ is absorbed by $P_2$ and isn’t re-directed to any other pages. It drains importance from the rest of the web.

Probabilistic model of web-surfing. We need a different formulation of the hyperlink matrix which gives rise to a different model of traversing the web. In fact, there are a number of different models and the random web surfer below is just one simple one.

  • Surf the web at random: when on a page, randomly follow a link on that page to another page; i.e., if page $P_j$ has $\ell_j$ outgoing links, choose one of them uniformly at random.
  • For those pages that have no outgoing links, pick any other page uniformly at random. In the hyperlink matrix $H$, we replace the “empty” columns with columns whose entries all equal $1/n$ (where $n$ is the total number of pages).

The effect of changing our model is that the resulting matrix $S$ has an eigenvector $x = (1, 2)^T$ with eigenvalue 1, which suggests that page $P_2$ is twice as important as page $P_1$. The entries of $S$ are non-negative and its column sums are 1; i.e., $S$ is a (column) stochastic matrix. Note: there are still some issues with using the power method that we will discuss later when we cover it.
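A NumPy sketch of this fix, using a hypothetical two-page web in which page 1 links to page 2 and page 2 has no outgoing links:

```python
import numpy as np

n = 2
# Raw hyperlink matrix: page 1 links to page 2; page 2 is a dangling node,
# so its column is empty (all zeros).
H = np.array([[0.0, 0.0],
              [1.0, 0.0]])

# Fix: replace each empty column with equal probabilities 1/n
S = H.copy()
dangling = S.sum(axis=0) == 0
S[:, dangling] = 1.0 / n

# S is column stochastic, and the power method now converges
assert np.allclose(S.sum(axis=0), 1.0)
x = np.full(n, 1.0 / n)
for _ in range(100):
    x = S @ x
assert np.allclose(S @ x, x)
print(x)   # [1/3, 2/3]: page 2 is twice as important as page 1
```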

3. Matrix-vector and Matrix-matrix multiplication.

Definition 2 Matrix-vector product $b = Ax$

  • $A$ an $m \times n$ matrix (either over $\mathbb{R}$ or $\mathbb{C}$)
  • $x$ a vector of length $n$
  • $b$ a vector of length $m$
  • $b_i = \sum_{j=1}^{n} A_{ij} x_j$ for $i = 1, \dots, m$

  • Note that $b = Ax$ is a linear combination of the columns of $A$; the coefficients in the linear combination are the entries of $x$
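The column interpretation in the note above can be checked directly in NumPy (an arbitrary small example):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # m x n = 3 x 2
x = np.array([10.0, -1.0])   # length n

# Entrywise definition: b_i = sum_j A_ij x_j
b = A @ x

# Column interpretation: b is a linear combination of the columns of A,
# with the entries of x as coefficients
b_cols = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(b, b_cols)
print(b)   # [ 8. 26. 44.]
```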



Definition 3 Matrix-matrix product $B = AC$

  • $B = AC$ is a matrix-matrix product
  • dimensions: $A$ is $\ell \times m$, $C$ is $m \times n$, and $B$ is $\ell \times n$
  • the entries of $B$ are given by

    $B_{ij} = \sum_{k=1}^{m} A_{ik} C_{kj}$

  • each column of $B$ is a linear combination of the columns of $A$; the coefficients in the linear combination are given by the entries of the corresponding column of $C$
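And the matrix-matrix analogue: each column of $B = AC$ is $A$ applied to the corresponding column of $C$ (an arbitrary small example):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))   # ell x m
C = rng.standard_normal((4, 2))   # m x n
B = A @ C                         # ell x n

# Column j of B is a linear combination of the columns of A,
# with coefficients from column j of C -- equivalently, B[:, j] = A C[:, j]
for j in range(C.shape[1]):
    assert np.allclose(B[:, j], A @ C[:, j])
```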


3.1. Critical theorem from linear algebra.

We will repeatedly make use of the following theorem from linear algebra. Please re-familiarize yourself with it.

Theorem 4 Let $A$ be an $n \times n$ matrix. The following are equivalent (TFAE):

  1. $A$ is invertible; i.e., there exists $A^{-1}$ s.t. $A^{-1}A = AA^{-1} = I$.
  2. $\det(A) \neq 0$.
  3. $\text{range}(A) = \mathbb{C}^n$.
  4. $\text{rank}(A) = \dim(\text{range}(A)) = n$.
  5. $\text{null}(A) = \{0\}$.
  6. the eigenvalues of $A$ are all non-zero.



  • When we write $x = A^{-1}b$, it’s important to think of $x$ as the unique vector that solves $Ax = b$.
  • That is, $x$ is the vector of coefficients in the unique linear expansion of $b$ in the basis of the columns of $A$; multiplication by $A^{-1}$ is a change of basis operation.
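This change-of-basis point of view, sketched in NumPy (with a random matrix, which is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))   # square, invertible with probability 1
b = rng.standard_normal(4)

# Think of x = A^{-1} b as the unique solution of Ax = b,
# computed without ever forming A^{-1}
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)

# x holds the coefficients of b expanded in the basis of columns of A
expansion = sum(x[j] * A[:, j] for j in range(4))
assert np.allclose(expansion, b)
```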