I'm watching one of the two-hour videos from the Stanford professor. He looks like a college freshman! No offense intended.
Do you have any resources on how to use these tools (especially the Python library), and examples of implementation? I'm interested in learning more and using these tools on real data, but don't necessarily want to spend the time learning all of the theory behind it.
Yup, he is young. He did a postdoc for a year at Cornell [under Prof. Kleinberg] after his PhD and went directly to Asst. Prof at Stanford.
Well, you can start by reading this book http://www.cs.cornell.edu/home/kleinber/networks-book/ for an overview. As far as applications of these techniques are concerned, you can look for papers at recent WWW, NIPS, and ICML conferences. The most popular and well-studied areas are Link Prediction and Community Detection. The SNAP library comes with some good example code. You can also have a look at the Divisi project at the MIT Media Lab if you are interested in reasoning/analogy over networks.
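If you just want a feel for what link prediction looks like in code, here is a rough sketch using NetworkX (not SNAP's bundled examples): it scores non-adjacent node pairs by how many neighbors they share, which is one of the standard baselines. The toy edge list is made up purely for illustration.

    import itertools
    import networkx as nx

    # Toy friendship graph, purely illustrative.
    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"),
                      ("b", "d"), ("c", "d"), ("d", "e")])

    # Common-neighbors link prediction: score each non-edge by the
    # number of neighbors its two endpoints share, then rank.
    scores = []
    for u, v in itertools.combinations(G.nodes(), 2):
        if not G.has_edge(u, v):
            common = len(set(G.neighbors(u)) & set(G.neighbors(v)))
            scores.append((common, u, v))

    for common, u, v in sorted(scores, reverse=True):
        print(u, v, common)

Pairs with more shared neighbors are the most likely "missing" edges; the papers layer fancier scores (Adamic-Adar, Katz, supervised models) on top of this same skeleton.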
For real datasets, there are a lot of encyclopedic ones such as Wikipedia / DBpedia / Semantic Web / MusicBrainz, as well as social ones such as the Twitter follower network dataset. If you are at a university, you can even get the full Web graph from Yahoo [for research use only].
> But the authors conclude the type of data mining that government bureaucrats would like to do -- perhaps inspired by watching too many episodes of the Fox series 24 -- can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."
So did something change in the last two years? I mean, Palantir is making a crap ton of money for presumably something similar.
The problems Schneier is talking about have very different characteristics from the ones in this article, so they are still very real. But solving little pieces of the problem is definitely possible, especially with a good amount of human mediation.
Most Palantir installations don't automatically generate or analyze the graphs. It's installed as essentially a very large, hand-curated semantic graph store and retrieval system with a nice client interface. Most of the work done with Palantir is manual.
If you watch some demos on their site you'll see that the workflow is almost entirely manual save for the moments when they retrieve sections of the semantic graph from the datastore.
Also notice that they never talk about the excruciatingly long process of how that data got jammed into the system. Imagine the target goal for an entire day's work for one person is to import 10 news articles into the system, and you'll understand why they aren't doing the kinds of things Schneier was talking about (even if their brilliant marketing and sales departments sell it otherwise).
edit: word on the street is that they are still cash-flow negative, which is very worrisome considering all the fundraising they've been doing, plus they've almost saturated the government market
Finding out "Who is a terrorist, and what are they going to bomb?" is a much tougher problem than "What information can I glean from Facebook, MySpace, and Twitter to indicate this person might be a higher credit risk?"
For instance, I imagine a high number of updates on a Facebook profile containing the word "drunk" might lead to a higher risk profile for auto insurance.
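Something like this toy sketch is all the "mining" would take once you had the updates in hand; the sample posts and the cutoff are completely made up for illustration.

    # Hypothetical risk flag: count status updates mentioning "drunk".
    # 'updates' stands in for whatever the insurer scraped or bought.
    updates = [
        "so drunk last night lol",
        "great dinner with the family",
        "drunk again, who's driving me home?",
    ]

    drunk_mentions = sum("drunk" in u.lower() for u in updates)

    # Made-up cutoff; a real actuarial model would be far subtler.
    if drunk_mentions >= 2:
        print("flag profile for higher auto-insurance risk")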
Reading a social network analysis textbook, one is struck by how all the examples involve rigorous data collection by social scientists with clipboards: hundreds or thousands of hours of work for a small graph. Modern social networks have given us an abundance of data against which to leverage a backlog of techniques that were previously constrained by manual data collection.
Phone companies use this stuff in ways that may be unethical, but it can just as easily be used to empower you to leverage your own network. There is exciting potential we are just beginning to see.
Keep in mind that your privacy is only as secure as that of your correspondents. I kind of gave up on the idea of keeping email private when I realized that most of the people I write to are non-techy and use Gmail, Facebook, etc. naively.
This is only true if your correspondents are all using the same service. If you use gmail, then google knows the entirety of your email network. If you host your own mail, then google only knows your links to your correspondents who use gmail, but not your links to those using hotmail, facebook, etc. (unless that info is revealed to google by putting multiple recipients in the To or CC fields of the message).
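To make that concrete, here is a toy sketch of which edges of your email graph a single provider can see, assuming it sees a message iff the sender or any recipient is its own user; the addresses and the helper function are invented for illustration.

    # Toy model: which edges of your email graph can a given provider see?
    messages = [
        ("me@mydomain.org", ["alice@gmail.com"]),
        ("me@mydomain.org", ["bob@hotmail.com"]),
        ("me@mydomain.org", ["alice@gmail.com", "bob@hotmail.com"]),  # CC reveals bob too
    ]

    def visible_edges(messages, provider_domain):
        edges = set()
        for sender, recipients in messages:
            participants = [sender] + recipients
            # The provider sees the message only if one participant uses it,
            # but then it learns the links among *all* participants.
            if any(p.endswith(provider_domain) for p in participants):
                for a in participants:
                    for b in participants:
                        if a < b:
                            edges.add((a, b))
        return edges

    print(visible_edges(messages, "@gmail.com"))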
thanks, I knew about hushmail, but their usability and search function are not up to par. It's funny - I'd gladly pay for a technologically sophisticated alternative to gmail, just not to have all my email data with google.
There's Lavabit.com -- a nice little place with a strong stance supporting end-user privacy. Not the most full-featured web interface, but it's a nice service if you like to access your mail via IMAP or POP.
Here are a few that I know:
For small networks (up to a million or two million nodes, such as the Wikipedia link graph from 2009), the following libraries provide code to handle and manipulate network datasets:
1: SNAP by Prof. Jure Leskovec [ http://snap.stanford.edu ], written in C++
2: NetworkX by LANL [ http://networkx.lanl.gov/ ], written in Python, esp. good for fast prototyping (a short sketch of basic usage follows below)
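As the sketch promised above, here is a minimal NetworkX example that builds a tiny graph from a made-up edge list and pulls out a few of the usual statistics; nothing here is specific to any particular dataset.

    import networkx as nx

    # Build a small undirected graph from an illustrative edge list.
    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])

    print(G.number_of_nodes(), G.number_of_edges())   # basic size
    print(nx.degree_histogram(G))                     # degree distribution
    print(nx.clustering(G))                           # per-node clustering coefficient
    print(list(nx.connected_components(G)))           # connected components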
There are a few databases for storing networks, e.g. Neo4j [ http://neo4j.org/ ].
Additionally, there is a graph processing language called Gremlin [ http://wiki.github.com/tinkerpop/gremlin/ ].
For large networks with millions or billions of nodes, one can use Hadoop / MapReduce or Apache Hama [still in a nascent stage]. Google has an internal system known as Pregel which it uses to perform scalable computations over large networks.
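To give a rough idea of the Pregel-style model (this is only a single-machine sketch of the vertex-centric idea, not Google's actual API): each vertex runs a small compute step per superstep and passes messages along its out-edges. The graph and constants below are made up for illustration; here the per-vertex computation is PageRank.

    # Minimal single-process sketch of a Pregel-style superstep loop,
    # computing PageRank on a tiny made-up directed graph.
    out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {v: 1.0 / len(out_edges) for v in out_edges}
    damping, supersteps = 0.85, 20

    for _ in range(supersteps):
        # "Message passing": each vertex sends rank/out-degree to its neighbors.
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t].append(rank[v] / len(targets))
        # "Compute": each vertex updates its value from its incoming messages.
        rank = {v: (1 - damping) / len(out_edges) + damping * sum(msgs)
                for v, msgs in inbox.items()}

    print(rank)

In the real systems the vertices are partitioned across machines and the inboxes travel over the network, but the superstep / message-passing structure is the same idea.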