I'm watching one of the two-hour videos from the Stanford professor. He looks like a college freshman! No offense intended.
Do you have any resources on how to use these tools (especially the Python library), and examples of implementation? I'm interested in learning more and using these tools on real data, but don't necessarily want to spend the time learning all of the theory behind it.
Yup, he is young. He did a postdoc for a year at Cornell [under Prof. Kleinberg] after his PhD and went directly to Asst. Prof at Stanford.
Well, you can start by reading this book http://www.cs.cornell.edu/home/kleinber/networks-book/ for an overview. As far as applications of these techniques are concerned, you can look for papers at recent WWW, NIPS, and ICML conferences. The most popular and well-studied areas are Link Prediction and Community Detection. The SNAP library comes with some good example code. You can also have a look at the Divisi project at the MIT Media Lab if you are interested in reasoning/analogy over networks.
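If you just want a feel for what link prediction looks like in code, here is a rough sketch using NetworkX (not SNAP's bundled examples): it scores non-adjacent node pairs by how many neighbors they share, which is one of the standard baselines. The toy edge list is made up purely for illustration.

    import itertools
    import networkx as nx

    # Toy friendship graph, purely illustrative.
    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"),
                      ("b", "d"), ("c", "d"), ("d", "e")])

    # Common-neighbors link prediction: score each non-edge by the
    # number of neighbors its two endpoints share, then rank.
    scores = []
    for u, v in itertools.combinations(G.nodes(), 2):
        if not G.has_edge(u, v):
            common = len(set(G.neighbors(u)) & set(G.neighbors(v)))
            scores.append((common, u, v))

    for common, u, v in sorted(scores, reverse=True):
        print(u, v, common)

Pairs with more shared neighbors are the most likely "missing" edges; the papers layer fancier scores (Adamic-Adar, Katz, supervised models) on top of this same skeleton.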
For real datasets, there are a lot of encyclopedic ones such as Wikipedia / DBpedia / Semantic Web / MusicBrainz, as well as social ones such as the Twitter follower network dataset. If you are at a university, you can even get the full Web graph from Yahoo [for research use only].
> But the authors conclude the type of data mining that government bureaucrats would like to do -- perhaps inspired by watching too many episodes of the Fox series 24 -- can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."
So did something change in the last two years? I mean, Palantir is making a crap ton of money for presumably something similar.
The problems Schneier is talking about have very different characteristics from the ones in this article, so they are still very real. But solving little pieces of the problem is definitely possible, especially with a good amount of human mediation.
Most Palantir installations don't automatically generate or analyze the graphs. It's installed as essentially a very large, hand-curated semantic graph store and retrieval system with a nice client interface. Most of the work done with Palantir is manual.
If you watch some demos on their site you'll see that the workflow is almost entirely manual save for the moments when they retrieve sections of the semantic graph from the datastore.
Also notice that they never talk about the excruciatingly long process of how that data got jammed into the system. Imagine the target goal for an entire day's work for one person is to import 10 news articles into the system, and you'll understand why they aren't doing the kinds of things Schneier was talking about (even if their brilliant marketing and sales departments sell it otherwise).
edit: word on the street is that they are still cash-flow negative, which is very worrisome considering all the fundraising they've been doing, plus they've almost saturated the government market
Finding out "Who is a terrorist, and what are they going to bomb?" is a much tougher problem than "What information can I glean from Facebook, MySpace, and Twitter to indicate this person might be a higher credit risk?"
For instance, I imagine a high number of updates on a Facebook profile containing the word "drunk" might lead to a higher risk profile for auto insurance.
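Something like this toy sketch is all the "mining" would take once you had the updates in hand; the sample posts and the cutoff are completely made up for illustration.

    # Hypothetical risk flag: count status updates mentioning "drunk".
    # 'updates' stands in for whatever the insurer scraped or bought.
    updates = [
        "so drunk last night lol",
        "great dinner with the family",
        "drunk again, who's driving me home?",
    ]

    drunk_mentions = sum("drunk" in u.lower() for u in updates)

    # Made-up cutoff; a real actuarial model would be far subtler.
    if drunk_mentions >= 2:
        print("flag profile for higher auto-insurance risk")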
Reading a social network analysis textbook, one is struck by how all the examples involve rigorous data collection by social scientists with clipboards: hundreds or thousands of hours of work for a small graph. Modern social networks have given us an abundance of data against which to leverage a backlog of techniques that were previously constrained by manual data collection.
Phone companies use this stuff in ways that may be unethical, but it can just as easily be used to empower you to leverage your own network. There is exciting potential we are just beginning to see.
Keep in mind that your privacy is only as secure as that of your correspondents. I kind of gave up on the idea of keeping email private when I realized that most of the people I write to are non-techy and use Gmail, Facebook, etc. naively.
This is only true if your correspondents are all using the same service. If you use gmail, then google knows the entirety of your email network. If you host your own mail, then google only knows your links to your correspondents who use gmail, but not your links to those using hotmail, facebook, etc. (unless that info is revealed to google by putting multiple recipients in the To or CC fields of the message).
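To make that concrete, here is a toy sketch of which edges of your email graph a single provider can see, assuming it sees a message iff the sender or any recipient is its own user; the addresses and the helper function are invented for illustration.

    # Toy model: which edges of your email graph can a given provider see?
    messages = [
        ("me@mydomain.org", ["alice@gmail.com"]),
        ("me@mydomain.org", ["bob@hotmail.com"]),
        ("me@mydomain.org", ["alice@gmail.com", "bob@hotmail.com"]),  # CC reveals bob too
    ]

    def visible_edges(messages, provider_domain):
        edges = set()
        for sender, recipients in messages:
            participants = [sender] + recipients
            # The provider sees the message only if one participant uses it,
            # but then it learns the links among *all* participants.
            if any(p.endswith(provider_domain) for p in participants):
                for a in participants:
                    for b in participants:
                        if a < b:
                            edges.add((a, b))
        return edges

    print(visible_edges(messages, "@gmail.com"))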
thanks, I knew about hushmail, but their usability and search function are not up to par. It's funny - I'd gladly pay for a technologically sophisticated alternative to gmail, just not to have all my email data with google.
There's Lavabit.com -- a nice little place with a strong stance supporting end-user privacy. Not the most full-featured web interface, but it's a nice service if you like to access your mail via IMAP or POP.
Here are a few that I know:
For small networks (up to a million or two million nodes, such as the Wikipedia link graph from 2009), the following libraries provide code to handle and manipulate network datasets:
1: SNAP by Prof. Jure Leskovec [ http://snap.stanford.edu ], written in C++
2: NetworkX by LANL [ http://networkx.lanl.gov/ ], written in Python, esp. good for fast prototyping (a short sketch of basic usage follows below)
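As the sketch promised above, here is a minimal NetworkX example that builds a tiny graph from a made-up edge list and pulls out a few of the usual statistics; nothing here is specific to any particular dataset.

    import networkx as nx

    # Build a small undirected graph from an illustrative edge list.
    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])

    print(G.number_of_nodes(), G.number_of_edges())   # basic size
    print(nx.degree_histogram(G))                     # degree distribution
    print(nx.clustering(G))                           # per-node clustering coefficient
    print(list(nx.connected_components(G)))           # connected components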
There are a few databases for storing networks, e.g. Neo4j [ http://neo4j.org/ ].
Additionally, there is a graph processing language called Gremlin [ http://wiki.github.com/tinkerpop/gremlin/ ].
For large networks with millions or billions of nodes, one can use Hadoop / MapReduce or Apache Hama [still in a nascent stage]. Google has an internal system known as Pregel which it uses to perform scalable computations over large networks.
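To give a rough idea of the Pregel-style model (this is only a single-machine sketch of the vertex-centric idea, not Google's actual API): each vertex runs a small compute step per superstep and passes messages along its out-edges. The graph and constants below are made up for illustration; here the per-vertex computation is PageRank.

    # Minimal single-process sketch of a Pregel-style superstep loop,
    # computing PageRank on a tiny made-up directed graph.
    out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {v: 1.0 / len(out_edges) for v in out_edges}
    damping, supersteps = 0.85, 20

    for _ in range(supersteps):
        # "Message passing": each vertex sends rank/out-degree to its neighbors.
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t].append(rank[v] / len(targets))
        # "Compute": each vertex updates its value from its incoming messages.
        rank = {v: (1 - damping) / len(out_edges) + damping * sum(msgs)
                for v, msgs in inbox.items()}

    print(rank)

In the real systems the vertices are partitioned across machines and the inboxes travel over the network, but the superstep / message-passing structure is the same idea.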