Linear algebra comes up in almost every area of modern research, but which applications have you used most often? Is it SVD for latent semantic analysis, computing the principal eigenvector for PageRank, or something else? Discuss.
I apologize for threadjacking but you guys might be able to help.
I have a different problem: I would like to compute an approximate SVD of a very large sparse matrix (for spectral clustering), but I can't find a good implementation that works for datasets too large to fit in core. This is a Hadoop-scale problem. What's the best way to do this?
Of course, finding all the singular values/vectors is out of the question, but I just need the top hundred or so.
Does anyone here have any suggestions for how to do this? The obvious strategy is just to construct a rank-100 approximation and optimize the singular values and vectors so that the reconstruction gets as close as possible to the real matrix. I guess gradient descent or something like that would work. Are there existing packages that do this with Hadoop?
Gradient descent is a good solution for approximate SVD, I'm using it as part of my data mining final project (working on the netflix prize). I'm using this guy's code: http://www.timelydevelopment.com/demos/NetflixPrize.aspx, modified to print out the singular vectors when it finishes. It took about 32 hours (can't quite remember) to find the first 64 singular values* on the netflix dataset (480000x18000, 1.2% non-zero (or is it 1.8%?)) using a 2.2GHz Opteron and ~2 gigs of ram.
I'm sure there are better methods, but this one is easy and is producing great results. If you have any questions, you can shoot me an email at sbuss at cise dot ufl dot edu.
As for Hadoop, I don't know of any existing parallel implementations of this, but I don't think it would be /too/ hard to parallelize the gradient descent approach. Just split the error calculation up into several smaller chunks. If you get it running in parallel, let me know.
*edit: changed "vectors" to "values" in first paragraph.
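In case it's useful to anyone, here's the approach stripped down to a toy sketch in Python/numpy (this is not the timelydevelopment code, which is C++, and the learning rate/regularization numbers are made up). It factors the matrix as U times V-transpose, fitting only the observed entries:

    import random
    import numpy as np

    def sgd_factorize(triples, n_rows, n_cols, rank=64,
                      lr=0.002, reg=0.02, epochs=30, seed=0):
        """Approximate a sparse matrix by U @ V.T, Funk-SVD style,
        training only on the observed (row, col, value) triples."""
        random.seed(seed)
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n_rows, rank))
        V = 0.1 * rng.standard_normal((n_cols, rank))
        for _ in range(epochs):
            random.shuffle(triples)              # triples is a list of (i, j, value)
            for i, j, a_ij in triples:
                err = a_ij - U[i] @ V[j]         # error on this observed entry
                U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                              V[j] + lr * (err * U[i] - reg * V[j]))
        return U, V

The factors aren't orthonormal, so if you want honest singular values/vectors you orthonormalize afterwards (QR on each factor, then an SVD of the small rank-by-rank middle matrix). And the natural place to parallelize, as mentioned above, is the inner loop: partition the triples, accumulate the errors per chunk, and merge.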
I've been using PROPACK lately to perform SVD on a gigantic matrix, and it kicks ass. It's a Fortran/MATLAB package for SVD of large sparse or "structured" matrices. http://soi.stanford.edu/~rmunk/PROPACK/
You just have to provide a matrix-vector product function and specify a few parameters (how many singular values to find, whether it should compute the singular vectors, the maximum number of iterations, etc.), and it takes care of the rest. It uses the Lanczos iteration approach mentioned in sibling comments, and it seems like a far nicer implementation than SVDPACK and SVDLIBC.
Let me know if you want a copy of my C interface to PROPACK.
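For anyone who'd rather prototype before touching Fortran: this isn't PROPACK, but scipy's svds has the same "all I need from you is a matrix-vector product" contract, so the shape of the interface looks roughly like this (the random matrix is just a stand-in for real data):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import LinearOperator, svds

    # Stand-in for a real dataset: a large-ish random sparse matrix.
    A = sparse.random(100000, 20000, density=0.001, format="csr", random_state=0)
    m, n = A.shape

    # The solver only ever asks for A @ v and A.T @ v.
    op = LinearOperator((m, n),
                        matvec=lambda v: A @ v,
                        rmatvec=lambda v: A.T @ v,
                        dtype=float)

    # Top 100 singular triplets; nothing dense of size m x n is ever formed.
    U, s, Vt = svds(op, k=100)

The only thing the solver needs from the matrix is the product function, which is what makes the huge/out-of-core case thinkable: the singular vectors stay in memory, the matrix itself doesn't have to.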
As long as the 100 vectors fit in memory, it shouldn't be too bad. I think the commonly used algorithm to find the first singular vectors is Lanczos iteration, where the key operation is multiplying the 100 vectors by the large matrix. I don't know if there's an existing library for this though...
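That said, even without a library, a crude block power (subspace) iteration already has that structure and is easy to write yourself. This is a plain-numpy sketch, not Lanczos proper (it needs more multiplies to converge), but the only thing that ever touches the big matrix is "multiply this block of ~100 vectors by A or A^T", which is exactly the pass you'd farm out to Hadoop over the stored nonzeros:

    import numpy as np

    def top_k_svd_subspace(matvec, rmatvec, n, k=100, iters=20, seed=0):
        """Block power (subspace) iteration for the top-k singular triplets.
        matvec(X)  must return A @ X   for an (n, k) block X;
        rmatvec(Y) must return A.T @ Y for an (m, k) block Y.
        These two block products are the only contact with the big matrix."""
        rng = np.random.default_rng(seed)
        V = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random orthonormal start
        for _ in range(iters):
            U = np.linalg.qr(matvec(V))[0]     # orthonormal basis for A @ V
            V = np.linalg.qr(rmatvec(U))[0]    # orthonormal basis for A.T @ U
        B = U.T @ matvec(V)                    # small k x k projection of A
        Ub, s, Vbt = np.linalg.svd(B)          # cheap dense SVD of the projection
        return U @ Ub, s, Vbt @ V.T            # approximate U, sigma, V^T

    # e.g. for a scipy.sparse matrix A:
    #   U, s, Vt = top_k_svd_subspace(lambda X: A @ X, lambda Y: A.T @ Y, A.shape[1])

Lanczos usually converges in far fewer passes, but the data-access pattern is identical, so anything that can run this on Hadoop can run Lanczos too.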
I've used this; it works pretty quickly and produces exact results, unlike a gradient descent approach. Unfortunately, assuming I did everything correctly, it doesn't ignore the zeros in the data. I suspect this because after centering the data (subtracting the mean and dividing by the standard deviation for each row of the matrix), it just doesn't finish. I must have left it running for two days before I gave up on it. My assumption is that it was trying to approximate all the zeros which, due to centering, were now seen as the average rating for that user. I'm sure you could modify it to ignore those values, though.
edit: Please note that I could be way off in my interpretation of why the program didn't finish. If someone knows I'm wrong, please tell me.
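edit 2: If the problem really was the zeros turning into non-zeros, one workaround (untested by me on this data, so take it as a sketch) is to never form the centered matrix at all. With row means mu and row standard deviations sigma, the centered matrix is D^-1 (A - mu 1^T) where D = diag(sigma), so the centering can be folded into the matrix-vector product as a rank-one correction while A itself stays sparse. In Python/scipy terms, with a random matrix standing in for the data:

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import LinearOperator, svds

    A = sparse.random(50000, 10000, density=0.001, format="csr", random_state=0)
    m, n = A.shape

    mu = np.asarray(A.mean(axis=1)).ravel()                          # per-row means (zeros included)
    sigma = np.sqrt(np.asarray(A.multiply(A).mean(axis=1)).ravel() - mu**2)
    sigma[sigma == 0] = 1.0                                          # guard against constant rows

    # Centered/scaled matrix A_c = D^-1 (A - mu 1^T), applied implicitly:
    def matvec(v):
        return (A @ v - mu * v.sum()) / sigma

    def rmatvec(y):
        y = y / sigma
        return A.T @ y - (mu @ y) * np.ones(n)

    A_c = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=float)
    U, s, Vt = svds(A_c, k=100)

That keeps the cost proportional to the number of nonzeros instead of m*n.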
Maybe you encountered some numerical instability in SVDLIBC, but I would guess there was probably a bug in the parameters you passed to it. You might want to try it again with a smaller matrix and compare the results against MATLAB or something to flush out any bugs.
I'm not in software, but rather in algorithm development for determining radio positions (like GPS, but including other methods for cell phones). I have used eigenanalysis a number of times, a lot of orthogonal basis shifting of various kinds, and, implicitly, a lot of matrix inversions and pseudoinversions to solve things. I do most of my R&D using MATLAB, and have learned a lot of linear algebra reading its help files. I use Hilbert spaces, and what I learned about them 30 years ago, almost constantly.
You can save a ton of time by formulating a linear algebra problem as a shortest-path problem; you can end up using Dijkstra's algorithm to enumerate the possible solutions.
He's basically saying that instead of manipulating a bunch of data algebraically to get the exact, provably right answer, you can use this other algorithm to get a damn good answer faster. (I think)
I probably use multi-linear regression most often, but occasionally I'll encounter a Linear Programming problem.
I'm currently looking for a good Partial Least Squares algorithm in C/C++. Any suggestions? I know R has a popular PLS algorithm, but I was hoping to avoid the learning curve.
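Not C/C++, but the core recursion for the single-response case (NIPALS-style PLS1) is short enough that porting it yourself might beat the R learning curve. Here's my understanding of it sketched in Python/numpy; treat it as a starting point and check it against a known implementation before trusting it:

    import numpy as np

    def pls1(X, y, n_components):
        """Partial least squares regression, single response (NIPALS-style PLS1).
        X (n x p) and y (n,) are assumed already centered.
        Returns coefficients b so that y_hat ~= X @ b."""
        X = X.astype(float).copy()
        y = y.astype(float).copy()
        n, p = X.shape
        W = np.zeros((p, n_components))      # weight vectors
        P = np.zeros((p, n_components))      # X loadings
        q = np.zeros(n_components)           # y loadings
        for a in range(n_components):
            w = X.T @ y
            w /= np.linalg.norm(w)           # direction of max covariance with y
            t = X @ w                        # scores
            tt = t @ t
            p_a = (X.T @ t) / tt
            q_a = (y @ t) / tt
            X -= np.outer(t, p_a)            # deflate X
            y -= q_a * t                     # deflate y
            W[:, a], P[:, a], q[a] = w, p_a, q_a
        return W @ np.linalg.solve(P.T @ W, q)

Porting that to C/C++ is mostly a matter of having a matrix library for the products (anything BLAS-backed will do).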
Generating Jacobians for implicit Euler integration in a numerical simulation project. SVD for rigid-body center-of-mass calculations. Conjugate gradient for solving Ax=b on 9000x9000 matrices.
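Conjugate gradient is one of those algorithms whose implementation is barely longer than its pseudocode. For reference, here is the standard unpreconditioned iteration sketched in Python/numpy (it assumes A is symmetric positive definite; A can be dense, sparse, or anything supporting A @ v):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
        """Solve A x = b for symmetric positive definite A."""
        n = b.shape[0]
        max_iter = max_iter or 10 * n
        x = np.zeros(n, dtype=float)
        r = b - A @ x                      # residual
        p = r.copy()                       # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)      # step length along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:      # converged
                break
            p = r + (rs_new / rs_old) * p  # new A-conjugate direction
            rs_old = rs_new
        return x

On a 9000x9000 sparse system each iteration is just one matrix-vector product plus a handful of vector operations, which is why it's attractive at that scale when the matrix is sparse.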
I've poked around a bit in the CGAffineTransform functions that Apple uses in their Core Animation framework, but unless you're doing something out of the ordinary, you don't really need to know what's going on under the hood.