Linear algebra comes up in almost every area of modern research, but which applications have you used most often? Is it SVD for latent semantic analysis, computing the principal eigenvector for PageRank, or something else? Discuss.
I apologize for threadjacking but you guys might be able to help.
I have a different problem: I would like to compute an approximate SVD of a very large sparse matrix (for spectral clustering), but I can't find a good implementation that works for datasets too large to fit in core. This is a Hadoop-scale problem. What's the best way to do this?
Of course, finding all the singular values/vectors is out of the question, but I just need the top hundred or so.
Does anyone here have any suggestions for how to do this? The obvious strategy is just to construct a rank-100 approximation and optimize the singular values and vectors so that the reconstruction gets as close as possible to the real matrix. I guess gradient descent or something like that would work. Are there existing packages that do this with Hadoop?
Gradient descent is a good solution for approximate SVD, I'm using it as part of my data mining final project (working on the netflix prize). I'm using this guy's code: http://www.timelydevelopment.com/demos/NetflixPrize.aspx, modified to print out the singular vectors when it finishes. It took about 32 hours (can't quite remember) to find the first 64 singular values* on the netflix dataset (480000x18000, 1.2% non-zero (or is it 1.8%?)) using a 2.2GHz Opteron and ~2 gigs of ram.
I'm sure there are better methods, but this one is easy and is producing great results. If you have any questions, you can shoot me an email at sbuss at cise dot ufl dot edu.
As for Hadoop, I don't know of any existing parallel implementations of this, but I don't think it would be /too/ hard to parallelize the gradient descent approach. Just split the error calculation up into several smaller chunks. If you get it running in parallel, let me know.
*edit: changed "vectors" to "values" in first paragraph.
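In case it's useful to anyone, here's the approach stripped down to a toy sketch in Python/numpy (this is not the timelydevelopment code, which is C++, and the learning rate/regularization numbers are made up). It factors the matrix as U times V-transpose, fitting only the observed entries:

    import random
    import numpy as np

    def sgd_factorize(triples, n_rows, n_cols, rank=64,
                      lr=0.002, reg=0.02, epochs=30, seed=0):
        """Approximate a sparse matrix by U @ V.T, Funk-SVD style,
        training only on the observed (row, col, value) triples."""
        random.seed(seed)
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_rows, rank))
        V = 0.1 * rng.standard_normal((n_cols, rank))
        for _ in range(epochs):
            random.shuffle(triples)              # triples is a list of (i, j, value)
            for i, j, a_ij in triples:
                err = a_ij - U[i] @ V[j]         # error on this observed entry
                U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                              V[j] + lr * (err * U[i] - reg * V[j]))
        return U, V

The factors aren't orthonormal, so if you want honest singular values/vectors you orthonormalize afterwards (QR on each factor, then an SVD of the small rank-by-rank middle matrix). And the natural place to parallelize, as mentioned above, is the inner loop: partition the triples, accumulate the errors per chunk, and merge.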
I've been using PROPACK lately to perform SVD on a gigantic matrix, and it kicks ass. It's a Fortran/MATLAB package for SVD of large sparse or "structured" matrices. http://soi.stanford.edu/~rmunk/PROPACK/
You just have to provide a matrix-vector product function and specify a few parameters (how many singular values to find, whether it should compute the singular vectors, the maximum number of iterations, etc.), and it takes care of the rest. It uses the Lanczos iteration approach mentioned in sibling comments, and it seems like a far nicer implementation than SVDPACK and SVDLIBC.
Let me know if you want a copy of my C interface to PROPACK.
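For anyone who'd rather prototype before touching Fortran: this isn't PROPACK, but scipy's svds has the same "all I need from you is a matrix-vector product" contract, so the shape of the interface looks roughly like this (the random matrix is just a stand-in for real data):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import LinearOperator, svds

    # Stand-in for a real dataset: a large-ish random sparse matrix.
    A = sparse.random(100000, 20000, density=0.001, format="csr", random_state=0)
    m, n = A.shape

    # The solver only ever asks for A @ v and A.T @ v.
    op = LinearOperator((m, n),
                        matvec=lambda v: A @ v,
                        rmatvec=lambda v: A.T @ v,
                        dtype=float)

    # Top 100 singular triplets; nothing dense of size m x n is ever formed.
    U, s, Vt = svds(op, k=100)

The only thing the solver needs from the matrix is the product function, which is what makes the huge/out-of-core case thinkable: the singular vectors stay in memory, the matrix itself doesn't have to.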
As long as the 100 vectors fit in memory, it shouldn't be too bad. I think the commonly used algorithm to find the first singular vectors is Lanczos iteration, where the key operation is multiplying the 100 vectors by the large matrix. I don't know if there's an existing library for this though...
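That said, even without a library, a crude block power (subspace) iteration already has that structure and is easy to write yourself. This is a plain-numpy sketch, not Lanczos proper (it needs more multiplies to converge), but the only thing that ever touches the big matrix is "multiply this block of ~100 vectors by A or A^T", which is exactly the pass you'd farm out to Hadoop over the stored nonzeros:

    import numpy as np

    def top_k_svd_subspace(matvec, rmatvec, n, k=100, iters=20, seed=0):
        """Block power (subspace) iteration for the top-k singular triplets.
        matvec(X)  must return A @ X   for an (n, k) block X;
        rmatvec(Y) must return A.T @ Y for an (m, k) block Y.
        These two block products are the only contact with the big matrix."""
        rng = np.random.default_rng(seed)
        V = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random orthonormal start
        for _ in range(iters):
            U = np.linalg.qr(matvec(V))[0]     # orthonormal basis for A @ V
            V = np.linalg.qr(rmatvec(U))[0]    # orthonormal basis for A.T @ U
        B = U.T @ matvec(V)                    # small k x k projection of A
        Ub, s, Vbt = np.linalg.svd(B)          # cheap dense SVD of the projection
        return U @ Ub, s, Vbt @ V.T            # approximate U, sigma, V^T

    # e.g. for a scipy.sparse matrix A:
    #   U, s, Vt = top_k_svd_subspace(lambda X: A @ X, lambda Y: A.T @ Y, A.shape[1])

Lanczos usually converges in far fewer passes, but the data-access pattern is identical, so anything that can run this on Hadoop can run Lanczos too.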
I've used this; it works pretty quickly and produces exact results, unlike a gradient descent approach. Unfortunately, assuming I did everything correctly, it doesn't ignore the zeros in the data. I suspect this because after centering the data (subtracting the mean and dividing by the standard deviation for each row of the matrix), it just doesn't finish. I must have left it running for two days before I gave up on it. My assumption is that it was trying to approximate all the zeros which, due to centering, were now seen as the average rating for that user. I'm sure you could modify it to ignore those values, though.
edit: Please note that I could be way off in my interpretation of why the program didn't finish. If someone knows I'm wrong, please tell me.
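edit 2: If the problem really was the zeros turning into non-zeros, one workaround (untested by me on this data, so take it as a sketch) is to never form the centered matrix at all. With row means mu and row standard deviations sigma, the centered matrix is D^-1 (A - mu 1^T) where D = diag(sigma), so the centering can be folded into the matrix-vector product as a rank-one correction while A itself stays sparse. In Python/scipy terms, with a random matrix standing in for the data:

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import LinearOperator, svds

    A = sparse.random(50000, 10000, density=0.001, format="csr", random_state=0)
    m, n = A.shape

    mu = np.asarray(A.mean(axis=1)).ravel()                          # per-row means (zeros included)
    sigma = np.sqrt(np.asarray(A.multiply(A).mean(axis=1)).ravel() - mu**2)
    sigma[sigma == 0] = 1.0                                          # guard against constant rows

    # Centered/scaled matrix A_c = D^-1 (A - mu 1^T), applied implicitly:
    def matvec(v):
        return (A @ v - mu * v.sum()) / sigma

    def rmatvec(y):
        y = y / sigma
        return A.T @ y - (mu @ y) * np.ones(n)

    A_c = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=float)
    U, s, Vt = svds(A_c, k=100)

That keeps the cost proportional to the number of nonzeros instead of m*n.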
Maybe you encountered some numerical instability in SVDLIBC, but I would guess there was probably a bug in the parameters you passed to it. You might want to try it again with a smaller matrix and compare the results against MATLAB or something to flush out any bugs.
I'm not in software, but rather in algorithm development for determining radio positions (like GPS, but including other methods for cell phones). I have used eigenanalysis a number of times, a lot of orthogonal basis shifting of various kinds, and, implicitly, a lot of matrix inversions and pseudoinversions to solve things. I do most of my R&D using MATLAB, and have learned a lot of linear algebra reading its help files. I use Hilbert spaces, and what I learned about them 30 years ago, almost constantly.
You can save a ton of time by formulating a linear algebra problem as a shortest-path problem; you can end up using Dijkstra's algorithm to enumerate the possible solutions.
He's basically saying that instead of manipulating a bunch of data algebraically to get the exact, provably right answer, you can use this other algorithm to get a damn good answer faster. (I think)
I probably use multi-linear regression most often, but occasionally I'll encounter a Linear Programming problem.
I'm currently looking for a good Partial Least Squares algorithm in C/C++. Any suggestions? I know R has a popular PLS algorithm, but I was hoping to avoid the learning curve.
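Not C/C++, but the core recursion for the single-response case (NIPALS-style PLS1) is short enough that porting it yourself might beat the R learning curve. Here's my understanding of it sketched in Python/numpy; treat it as a starting point and check it against a known implementation before trusting it:

    import numpy as np

    def pls1(X, y, n_components):
        """Partial least squares regression, single response (NIPALS-style PLS1).
        X (n x p) and y (n,) are assumed already centered.
        Returns coefficients b so that y_hat ~= X @ b."""
        X = X.astype(float).copy()
        y = y.astype(float).copy()
        n, p = X.shape
        W = np.zeros((p, n_components))      # weight vectors
        P = np.zeros((p, n_components))      # X loadings
        q = np.zeros(n_components)           # y loadings
        for a in range(n_components):
            w = X.T @ y
            w /= np.linalg.norm(w)           # direction of max covariance with y
            t = X @ w                        # scores
            tt = t @ t
            p_a = (X.T @ t) / tt
            q_a = (y @ t) / tt
            X -= np.outer(t, p_a)            # deflate X
            y -= q_a * t                     # deflate y
            W[:, a], P[:, a], q[a] = w, p_a, q_a
        return W @ np.linalg.solve(P.T @ W, q)

Porting that to C/C++ is mostly a matter of having a matrix library for the products (anything BLAS-backed will do).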
Generating Jacobians for implicit Euler integration in a numerical simulation project. SVD for rigid-body center-of-mass calculations. Conjugate gradient for solving Ax=b on 9000x9000 matrices.
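Conjugate gradient is one of those algorithms whose implementation is barely longer than its pseudocode. For reference, here is the standard unpreconditioned iteration sketched in Python/numpy (it assumes A is symmetric positive definite; A can be dense, sparse, or anything supporting A @ v):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
        """Solve A x = b for symmetric positive definite A."""
        n = b.shape[0]
        max_iter = max_iter or 10 * n
        x = np.zeros(n, dtype=float)
        r = b - A @ x                      # residual
        p = r.copy()                       # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)      # step length along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:      # converged
                break
            p = r + (rs_new / rs_old) * p  # new A-conjugate direction
            rs_old = rs_new
        return x

On a 9000x9000 sparse system each iteration is just one matrix-vector product plus a handful of vector operations, which is why it's attractive at that scale when the matrix is sparse.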
I've poked around a bit in the CGAffineTransform functions that Apple uses in their Core Animation framework, but unless you're doing something out of the ordinary, you don't really need to know what's going on under the hood.