Ask HN: What linear algebra do you use most often for practical problems?
57 points by vang3lis on April 10, 2009 | 30 comments
Linear algebra comes up in almost every area of modern research, but which applications have you used most often? Is it SVD for latent semantic analysis, computing the principal eigenvector for PageRank, or something else? Discuss.



I'm using SVD in building a porn recommendation engine. (Very NSFW, and the recommendation engine itself isn't live yet: http://fapseek.com)


You hiring?


Email me.


Now there is a quality name for a porn recommendation engine...


maybe you could open-source your dataset? ;) (no, no, I'm after votes, not images)


Not anytime soon, but I might eventually provide a representative subset and run something like the Netflix Prize.


I apologize for threadjacking but you guys might be able to help.

I have a different problem: I would like to compute an approximate SVD of a very large sparse matrix (for spectral clustering), but I can't find a good implementation that works for datasets too large to fit in core. This is a Hadoop-scale problem. What's the best way to do this?

Of course, finding all the singular values/vectors is out of the question, but I just need the top hundred or so.

Does anyone here have any suggestions for how to do this? The obvious strategy is just to construct a rank-100 approximation and optimize the singular values and vectors so that they get as close as possible to the real matrix. I guess gradient descent or something like that would work. Are there existing packages that do this with Hadoop?


Gradient descent is a good solution for approximate SVD; I'm using it as part of my data mining final project (working on the Netflix Prize). I'm using this guy's code: http://www.timelydevelopment.com/demos/NetflixPrize.aspx, modified to print out the singular vectors when it finishes. It took about 32 hours (can't quite remember) to find the first 64 singular values* on the Netflix dataset (480000x18000, 1.2% non-zero (or is it 1.8%?)) using a 2.2GHz Opteron and ~2 gigs of RAM.

I'm sure there are better methods, but this one is easy and is producing great results. If you have any questions, you can shoot me an email at sbuss at cise dot ufl dot edu.

As for Hadoop, I don't know of any existing parallel implementations, but I don't think it would be /too/ hard to parallelize the gradient descent approach. Just split the error calculation into several smaller chunks. If you get it running in parallel, let me know.
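To make the approach concrete, here's a toy numpy sketch of the same idea (not the linked code, just the bare-bones version): learn U and V so that U @ V.T matches the observed entries only.

    import numpy as np

    def sgd_factorize(rows, cols, vals, n_rows, n_cols, rank=64,
                      lr=0.002, reg=0.02, epochs=30):
        """Fit A ~ U @ V.T by SGD over the observed (i, j, value)
        triples only, so zeros/missing entries are ignored."""
        rng = np.random.default_rng(0)
        U = rng.normal(0.0, 0.1, (n_rows, rank))
        V = rng.normal(0.0, 0.1, (n_cols, rank))
        for _ in range(epochs):
            for i, j, v in zip(rows, cols, vals):
                err = v - U[i] @ V[j]
                ui = U[i] + lr * (err * V[j] - reg * U[i])
                vj = V[j] + lr * (err * U[i] - reg * V[j])
                U[i], V[j] = ui, vj
        return U, V

Note that U @ V.T is only a low-rank approximation, not a true SVD; if you need orthonormal singular vectors you can orthogonalize afterwards (e.g. QR on U and V followed by a small dense SVD).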

*edit: changed "vectors" to "values" in first paragraph.


Thanks for the link - these results are encouraging. I'll think about it - if I come up with anything I'll let you know. Good luck.


"Distributed, Large-Scale Latent Semantic Analysis by Index Interpolation" by Sebastiano Vigna

http://tinyurl.com/d99779 [PDF]


thanks to all, you guys are awesome


I've been using PROPACK lately to perform SVD on a gigantic matrix, and it kicks ass. It's a Fortran and/or MATLAB package for SVD of large sparse or "structured" matrices. http://soi.stanford.edu/~rmunk/PROPACK/

You just have to provide a matrix-vector product function and specify a few parameters (how many singular values to find, whether to compute the singular vectors, the maximum number of iterations, etc.), and it takes care of the rest. It uses the Lanczos iteration approach mentioned in sibling comments, and it seems like a far nicer implementation than SVDPACK and SVDLIBC.
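For the Python crowd, scipy's sparse SVD has roughly the same shape of interface (Lanczos-style, driven by matrix-vector products). A minimal sketch, assuming your matrix fits scipy's sparse formats (the random matrix is just a stand-in):

    import numpy as np
    from scipy.sparse import rand as sparse_rand
    from scipy.sparse.linalg import svds

    A = sparse_rand(100000, 10000, density=0.001, format="csr",
                    random_state=0)       # stand-in for the real data
    U, s, Vt = svds(A, k=100)             # top 100 singular triplets
    order = np.argsort(s)[::-1]           # svds returns ascending order
    U, s, Vt = U[:, order], s[order], Vt[order]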

Let me know if you want a copy of my C interface to PROPACK.


I'd be interested in trying it out, can you send it my way a la sbuss a cise d ufl d edu

Cheers


Hey, I'm at UFL too! I'm taking Dr. Gader's class in machine learning right now. Are you a grad student?


I think there are a couple of gators on HN.


As long as the 100 vectors fit in memory, it shouldn't be too bad. I think the commonly used algorithm to find the first singular vectors is Lanczos iteration, where the key operation is multiplying the 100 vectors by the large matrix. I don't know if there's an existing library for this though...
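To make that concrete, here is a toy block power iteration (a simpler cousin of Lanczos). All it ever needs from the matrix is products with a tall, skinny block, so on Hadoop each multiply could be a map-reduce job and only the k dense vectors have to stay in memory:

    import numpy as np

    def top_k_svd(matvec, rmatvec, n_cols, k=100, iters=50, seed=0):
        """Block power iteration for the top-k singular triplets.
        matvec(X) = A @ X and rmatvec(Y) = A.T @ Y are the only
        access to A, so they can be out-of-core or map-reduce jobs."""
        rng = np.random.default_rng(seed)
        V, _ = np.linalg.qr(rng.standard_normal((n_cols, k)))
        for _ in range(iters):
            U, _ = np.linalg.qr(matvec(V))    # shape (m, k)
            V, _ = np.linalg.qr(rmatvec(U))   # shape (n, k)
        B = U.T @ matvec(V)                   # small k-by-k core
        Ub, s, Vbt = np.linalg.svd(B)
        return U @ Ub, s, Vbt @ V.T           # A ~ U' diag(s) Vt'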


There is indeed: http://tedlab.mit.edu/~dr/svdlibc/ is a sparse SVD solver using the Lanczos approach.

I've used this; it works pretty quickly and produces exact results, unlike a gradient descent approach. Unfortunately, assuming I did everything correctly, it doesn't ignore the zeros in the data. I think this because, after centering the data (subtracting the mean and dividing by the standard deviation for each row of the matrix), it just doesn't finish. I must have left it running for two days before I gave up on it. My assumption is that it was trying to approximate all the zeros, which, due to centering, were now seen as the average rating for that user. I'm sure you could modify it to ignore those values, though.

edit: Please note that I could be way off in my interpretation of why the program didn't finish. If someone knows I'm wrong, please tell me.
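If the hang really was centering destroying the sparsity, one fix that keeps the sparse structure is to never form the centered matrix at all: fold the rank-one mean correction and the scaling into the matrix-vector product. A sketch with scipy (names illustrative; mu and inv_std are the per-row mean and 1/std vectors):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, svds

    def centered_operator(A, mu, inv_std):
        """Acts like diag(inv_std) @ (A - mu[:, None]) without ever
        densifying A; the correction to sparse A is only rank one."""
        m, n = A.shape
        ones = np.ones(n)
        mv = lambda x: inv_std * (A @ x - mu * x.sum())
        rmv = lambda y: A.T @ (inv_std * y) - ones * (mu @ (inv_std * y))
        return LinearOperator((m, n), matvec=mv, rmatvec=rmv,
                              dtype=np.float64)

    # U, s, Vt = svds(centered_operator(A, mu, inv_std), k=64)

Note this still treats the zeros as data (just centered), which matches what SVDLIBC was doing, only without the blowup; actually ignoring the missing entries is what the gradient-descent approach upthread buys you.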


Maybe you encountered some numerical instability in SVDLIBC, but my guess is that there was a bug in the parameters you passed it. You might want to try it again with a smaller matrix and compare against MATLAB or something to flush out any bugs.


I'm not in software, but rather in algorithm development for determining radio positions (like GPS, but including other methods for cell phones). I have used eigenanalysis a number of times, a lot of orthogonal basis shifting of various kinds, and, implicitly, a lot of matrix inversions and pseudoinversions to solve things. I do most of my R&D in MATLAB and have learned a lot of linear algebra reading its help files. I use Hilbert spaces, and what I learned about them 30 years ago, almost constantly.
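(For anyone curious what that looks like in practice: the core step is usually a linearized least-squares solve via the pseudoinverse. Toy numbers below; the Jacobian and residuals are stand-ins for a real measurement model.)

    import numpy as np

    # Solve the overdetermined update A @ dx ~ r in the least-squares
    # sense; A is the measurement Jacobian, r the measurement residuals.
    A = np.array([[1.0, 0.2],
                  [0.3, 1.0],
                  [0.9, 0.8]])
    r = np.array([0.5, -0.1, 0.4])
    dx = np.linalg.pinv(A) @ r                      # pseudoinverse
    dx2, *_ = np.linalg.lstsq(A, r, rcond=None)     # same answer, cheaper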


The simplex algorithm, for a resource optimization solver.

http://en.wikipedia.org/wiki/Simplex_algorithm
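A minimal resource-allocation example with scipy's LP solver (numbers made up; scipy's default method these days is HiGHS rather than textbook simplex, but the model is the same):

    from scipy.optimize import linprog

    # Maximize 3x + 2y s.t. x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
    # linprog minimizes, so negate the objective.
    res = linprog(c=[-3, -2],
                  A_ub=[[1, 1], [1, 3]], b_ub=[4, 6],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)    # optimum at (4, 0) with value 12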


Non-negative matrix factorization for text clustering: http://www.procoders.net/?p=128
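A minimal sketch of the idea with scikit-learn (toy corpus; not the code from the link):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "dogs and cats are pets",
            "stock markets fell today", "investors sold their shares"]
    X = TfidfVectorizer().fit_transform(docs)    # docs x terms, nonneg
    nmf = NMF(n_components=2, random_state=0)
    W = nmf.fit_transform(X)      # document-topic weights
    H = nmf.components_           # topic-term weights
    print(W.argmax(axis=1))       # crude cluster label per document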


SVD, PCA, all kinds of matrix transformations, all for bioinformatics.


You can sometimes save a ton of time by reformulating a linear algebra problem as a shortest-path problem; you can end up using Dijkstra's algorithm to enumerate the possible solutions.

http://en.wikipedia.org/wiki/Knapsack_problem
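If it helps, the usual concrete version of this is the knapsack DP: each (item, remaining capacity) pair is a node in a small DAG, and the best value is a longest path through it, found by enumerating states rather than doing any algebra. A toy sketch:

    def knapsack(weights, values, capacity):
        """0/1 knapsack via DP over (item, capacity) states,
        equivalent to a longest path in the state DAG."""
        best = [0] * (capacity + 1)
        for w, v in zip(weights, values):
            for c in range(capacity, w - 1, -1):  # reverse: item used once
                best[c] = max(best[c], best[c - w] + v)
        return best[capacity]

    print(knapsack([3, 4, 5], [4, 5, 6], 8))  # -> 10 (weights 3 and 5)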


I don't get it (at all). Can you elaborate?


He's saying, basically, that instead of manipulating a bunch of data algebraically to get the exact, provably right answer, you can use this other algorithm to get a damn good answer faster. (I think.)


accurate


I probably use multiple linear regression most often, but occasionally I'll encounter a linear programming problem.

I'm currently looking for a good partial least squares (PLS) algorithm in C/C++. Any suggestions? I know R has a popular PLS package, but I was hoping to avoid the learning curve.
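For reference, textbook NIPALS PLS1 fits in a few lines of numpy and ports to C/C++ almost line for line. A sketch of the generic algorithm, not R's package (assumes X and y are already centered):

    import numpy as np

    def pls1(X, y, n_components):
        """NIPALS PLS1 (single response). Returns coefficients B
        such that y ~ X @ B. Assumes X and y are centered."""
        X, y = X.copy(), y.copy()
        W, P, q = [], [], []
        for _ in range(n_components):
            w = X.T @ y
            w /= np.linalg.norm(w)
            t = X @ w
            tt = t @ t
            p = X.T @ t / tt
            qk = (y @ t) / tt
            X -= np.outer(t, p)   # deflate X
            y -= qk * t           # deflate y
            W.append(w); P.append(p); q.append(qk)
        W, P, q = np.array(W).T, np.array(P).T, np.array(q)
        return W @ np.linalg.solve(P.T @ W, q)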


Generating Jacobians for implicit Euler integration in a numerical simulation project. SVD for rigid-body center-of-mass calculations. Conjugate gradient for solving Ax=b on 9000x9000 matrices.
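Conjugate gradient is compact enough to write out in full; a textbook sketch for symmetric positive-definite A, needing only matrix-vector products:

    import numpy as np

    def cg(matvec, b, tol=1e-8, maxiter=1000):
        """Solve A @ x = b for symmetric positive-definite A,
        given only matvec(x) = A @ x."""
        x = np.zeros_like(b)
        r = b - matvec(x)
        p = r.copy()
        rs = r @ r
        for _ in range(maxiter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x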


It's not a very interesting answer, but it's technically linear algebra: transformation matrices for rotating and scaling graphics...
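For instance, a rotate-and-scale in 2D is just a product of two small matrices (toy numpy version):

    import numpy as np

    theta = np.pi / 6                        # rotate by 30 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([2.0, 0.5])                  # scale x by 2, y by 0.5
    p = np.array([1.0, 1.0])
    q = R @ S @ p                            # scale first, then rotate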


I've poked around the CGAffineTransform classes that Apple uses in their Core Animation framework a bit, but unless you're doing something out of the ordinary, you don't really need to know what's going on under the hood.



