- Class overview and syllabus
- Matrix-vector and Matrix-matrix multiplication
- Note: This material has been edited from a previous post M571 Fall 2011, Lecture 1.
Imagine a library with 25 billion documents but no centralized organization and no librarians. Anyone can add a document at any time without telling anyone else. You are sure that one of the documents has a piece of information that is vitally important to you. Goal: find the document that contains this piece of information and find it fast!
Our library is a reasonable model for the world wide web—it is huge, is highly disorganized, contains files in many different formats, and it is distributed.
Solution: PageRank algorithm. Actually, the algorithm we’re going to describe solves a slight variant on the problem above. Careful description below illustrates differences. Let’s specify our problem a bit more. We want to find the “best” document that contains the phrase “numerical linear algebra.” Here is a rough outline of how Google or some information retrieval engine might go about this.
- Google runs spiders that crawl WWW, retrieve pages, index words in document, store information highly efficiently,
- user types in phrase “numerical linear algebra,”
- search engine determines all the pages in database that contain phrase,
- engine ranks pages automatically,
- returns the highest ranked page.
Question: (Assuming that you already have a fast way of determining all the pages that contain phrase of interest,) how do you determine how important a page is? In other words, how do you rank the pages? (Addressing the first assumption is, in itself, an interesting and challenging algorithmic question.)
Answer: When you create a page, you frequently include links to other pages that contain valuable, reliable information. This affirms the importance of page that you link to. We summarize this intuition with the following definition. The importance of a page is judged by the number of pages linking to it and the importance of the refereeing pages.
- = page ,
- importance of , and
- number of links (outgoing from page )
- idea is that if links to , then it should confer of its importance to
Set set of all pages linking to .
This definition presents a bit of a dilemma; how do we determine without determining first?
To figure out how to define properly, let us construct the hyperlink matrix.
Definition 1 hyperlink matrix
Properties of :
- has all non-negative entries
- sum of entries in each column (unless the page has no links!)
Let us set vector of of page ranks or importances and we conclude
or, is an eigenvector of (with eigenvalue 1).
Challenge: Given , compute . Unfortunately, has some good and bad properties that make computing the eigenvector associated with eigenvalue 1 challenging.
- is square and 25 million 25 million
- many entries are 0 (i.e., it’s sparse)
- webpages have on average 10 links
Solution: The solution we will use is called the Power Method and it’s a simple algorithm to describe (simply).
- 1. pick an initial vector
- 2. iterate:
General principle: ; i.e., we have a stationary vector.
Questions: At this point, you should have a number of questions about what we just described. In fact, a large component of this class is to learn what questions are appropriate to “ask” of an algorithm, what answers one should expect or desire, and what questions one can answer (and with what degree of rigor).
- Does the sequence always converge?
- Does the limit depend on initial guess?
- Do the importance rankings contain the information we want?
- What is the computational cost of this method? What are the bottlenecks?
Actually, the algorithm above (as stated) fails the first three questions. We have to fix this up — more on this later!
These are the types of questions/solutions/algorithms that we’ll look at in this class. Along with implementation and applications to modern computing. Frequently, we’ll discuss both mathematical or analytical issues and computational or implementation issues all at the same time.
Fixing up the algorithm. The problem with the algorithm/set-up above is that if a page has no links, we have a “dangling” node.
Example 1 The set of pages with the simple graph below
If , then
Whatever importance conveys to is absorbed by and isn’t re-directed to any other pages. It drains importance from the rest of the web.
Probabilistic model of web-surfing. We need a different formulation of the hyperlink matrix which gives rise to a different model of traversing the web. In fact, there are a number of different models and the random web surfer below is just one simple one.
The effect of changing our model is that the matrix has eigenvector which suggests that page is twice as important as page . The entries of are non-negative and column sums are 1; i.e., is a stochastic matrix. Note: there are still some issues with using the power method that we will discuss later when we cover it.
3. Matrix-vector and Matrix-matrix multiplication.
Definition 2 Matrix-vector product
Definition 3 Matrix-matrix product
3.1. Critical theorem from linear algebra.
We will repeatedly make use of the following theorem from linear algebra. Please re-familiarize yourself with it.
Theorem 4 Let be an matrix. TFAE
- is invertible; i.e., there exists s.t. .
- rank dim range .
- the eigenvalues of are non-zero.
- When we write , it’s important to think of as the unique vector that solves .
- That is, is the vector of coefficients in the unique linear expansion of in the basis of columns of . is a change of basis operation.