US Patent #7,650,331: System & method for efficient large-scale data processing

profquail · on Jan 19, 2010

The software patent situation is getting more and more ridiculous by the day. This patent is so general, it describes basically any data-mining software (that may or may not be distributed over multiple computers.)

I think that all patents should have to go through the peer review process, unless there is some kind of extra-ordinary reason that it can't be made public:

http://www.peertopatent.org/

cperciva · on Jan 19, 2010

This patent is so general, it describes basically any data-mining software (that may or may not be distributed over multiple computers.)

Bullshit. Here are the two independent claims:

1. A system for large-scale processing of data, comprising: a plurality of processes executing on a plurality of interconnected processors; the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes; the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes; each of a first plurality of the worker processes including an application-independent map module for retrieving a respective input data block assigned to the worker process by the master process and applying an application-specific map operation to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a key/value pair, and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in parallel on distinct, respective input data blocks; a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes all key/value pairs for a distinct set of respective keys, and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes; and each of a second plurality of the worker processes including an application-independent reduce module for retrieving data, the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying an application-specific reduce operation to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in parallel on multiple respective subsets of the produced intermediate data values.

9. A method of performing a large-scale data processing job, comprising: executing a plurality of processes on a plurality of interconnected processors, the plurality of processes including a master process for coordinating the large-scale data processing job for processing a set of input data, and worker processes; in the master process, in response to a request to perform the large-scale data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes; in each of a first plurality of the worker processes, executing an application-independent map module to retrieve a respective input data block assigned to the worker process by the master process and to apply an application-specific map operation to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a key/value pair, and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in parallel on distinct, respective input data blocks; using a partition operator to process the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes all key/value pairs for a distinct set of respective keys, and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes; and in each of a second plurality of the worker processes, executing an application-independent reduce module to retrieve data, the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying an application-specific reduce operation to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in parallel on multiple respective subsets of the produced intermediate data values.

This very clearly only covers computations distributed over multiple systems (a plurality of processes executing on a plurality of interconnected processors) and is narrow enough that there are lots of parallel systems which don't fall under those claims.

tumult · on Jan 19, 2010

It looks like doing a foldr with key pairs and then a foldl violates that patent. In parallel Haskell, it can run on multiple processors, with the assistance of the RTS ('master process').

It's bullshit.

ajross · on Jan 19, 2010

No offense, but shouldn't the standard be a little higher than "there are lots of parallel systems which don't fall under those claims"? It looks quite shockingly broad to me.

cperciva · on Jan 20, 2010

The standard for patenting is higher than that. I was responding to the claim that the patent "describes basically any data-mining software", which is simply untrue.

profquail · on Jan 19, 2010

"a plurality of processes executing on a plurality of interconnected processors" could describe both interconnected machines, and multicore processors. For example, does this patent cover doing a data reduction on a GPU? It sounds like it would -- for example, the Thrust library:

http://code.google.com/p/thrust/

does basically those two "claims", just on a GPU (which falls under your definition).

In any case, my complaint is more about the wording than the actual technology that the patent covers. Patents are granted for such generic technology these days, that it's going to scare away people from trying to create new things (lest it already be covered under a generic patent that really doesn't relate to the new product.)

tghw · on Jan 19, 2010

They are patenting MapReduce. Seems pretty specific to me.

loup-vaillant · on Jan 19, 2010

The Idea behind Map Reduce itself is extremely general (and not brand new):

    map  :: (a -> b) -> [a] -> b
    fold :: (a -> a -> a) -> a -> [a] -> a  -- the first argument must be associative if you want a parallel fold.
    
    mapReduce :: (a -> b) -> (b -> b -> b) -> b -> [a] -> b
    mapReduce f g e l = fold g e (map f l)  -- conceptually. You have to write an actually parallel version.

If the patent isn't more specific than this (by being limited to MapReduce, for instance), I bet it can be overruled by prior art. Now, I admit I don't have the courage to check.

neilc · on Jan 19, 2010

A cursory examination of the patent would tell you that it says very little about map and reduce functions as such -- the emphasis is on the system for executing such functions in parallel for large-scale data transformation.

ramchip · on Jan 20, 2010

the emphasis is on the system for executing such functions in parallel for large-scale data transformation

As someone mentioned: the Glasgow Haskell Compiler and its RTS is such a system...

mark_l_watson · on Jan 19, 2010

Google could make themselves look better right now by quickly granting a free perpetual nonexclusive patent license to the Apache Hadoop project.

cabalamat · on Jan 19, 2010

I think patent attorneys should be required to write patents in understandable English, on pain of receiving a plurality of punches in the face.

cperciva · on Jan 20, 2010

on pain of receiving a plurality of punches in the face

I think you mean Claim #1: A painful punishment; Claim #2: Claim #1 wherein the punishment includes physical force; Claim #3: Claim #2 wherein the physical force comprises multiple punches; Claim #4: Claim #3 wherein the punches are directed at the face of the patent attorney.

astrodust · on Jan 20, 2010

I'm pretty sure you could patent that!

jordyhoyt · on Jan 20, 2010

They are very aware of how they are speaking, and in person, they are the most precise people you will ever meet. I recently had occasion to speak with a patent attorney, and not only did he explain the patent in plain English, but I was floored at how unambiguous he was. We spoke over the phone and he was able to guide several people through the details of the patent quickly and clearly.

I do agree though, the way these are written is almost completely unreadable. Almost like how we use English words in programming languages, but without the domain knowledge, it is meaningless.

cabalamat · on Jan 20, 2010

Ptents in theory are meant to blanace a public bad (a monopoly) with a public good (teaching practitioners how to do something). I am a programmer, and I do not understand software patents. I'm sure I'm not the oonly one. If a typical practitioner of an art, when seeing a patent, says its hard to understand or confusingly worded, the patent should be void.

10ren · on Jan 20, 2010

I think there's a good argument for that approach being applied to the description of the patent. That's the part that a person skilled in the art should be able to use to make the invention.

However, it's not so applicable to the claims of the patent. This part is for lawyers to use to determine the exact extent of the legal protection conferred by the patent. The description is of just one embodiment of the invention; it is natural that slight variations from that embodiment should also be covered - but exactly how much variation is covered? How general (or how abstract) is the coverage? Especially when you consider that the given embodiment isn't necessarily the "center" of the inventions - it is not the embodiment, just an embodiment. This extent is difficult to specify, and special language is needed.

To use jordyhoyt's analogy, it would be like expecting Erlang to be readable by a layman. Of course, we can do better. The point is that it's hard to serve many masters.

cabalamat · on Jan 20, 2010

> To use jordyhoyt's analogy, it would be like expecting Erlang to be readable by a layman.

If lawyerese for a formally-defined language with a compiler and everything, I'd have less problems with it. But from whetre I'm standing, it just looks like obfuscated English.

ZachPruckowski · on Jan 20, 2010

I don't think the problem is that they are unreadable, it's that while your patent attorney friend can provide a clear explanation of the patent on narrow grounds and get it approved, some other attorney can come along a year later and file a suit based on a much looser interpretation. So a patent is dual-natured: it's strictly interpreted at approval, and loosely interpreted while enforced.

ajb · on Jan 20, 2010

Unfortunately, patent attorneys are required to write patents in understandable English. This is what 'understandable English' has turned into after decades of legal precedent on what that means.

For example, you will see numbers written out longhand in patents - eg, 'two hundred and fifty six', because that is supposed to be more understandable than 256.

notaddicted · on Jan 19, 2010

This book has 2 chapters on patents: http://oreilly.com/catalog/9780596517960

I read the first half of the book and found it enlightening.

mikeliu · on Jan 20, 2010

Patents should just be made with a set of interpretable expressions, that way no one will have to read it except a computer.

siculars · on Jan 19, 2010

Don't be Evil. Indeed. Well, you know, someone needs to keep IBM from a C&D against Hadoop.

lmkg · on Jan 19, 2010

While I'm normally the first to pile on and complain about Google, in this case I'm going to hold off unless/until they try to enforce their patent. The big companies like Google and Microsoft need to preemptively patent everything they can, in order to stave off the real patent trolls.

Of course, the necessity for defensive patenting itself just goes to show how silly the current patent system is.

houseabsolute · on Jan 19, 2010

Well, it's not a very good defense against patent trolls, because those guys typically don't exercise their own patents or anyone else's. However, it does help somewhat against companies like Apple or Nokia who are fine with litigating against anyone who tries to compete on an even footing. For example, if Yahoo or Microsoft sued Google for some search related patent, probably Google could use this patent against them.

siculars · on Jan 19, 2010

I agree. The defensive patent is becoming perhaps more important than the offensive patent, re trolls. Being able to use a patent in one area to not only defend against competitors locking you out of your own technology but also counter a potential patent suit in another technology is becoming all too common, re apple v nokia.

coderdude · on Jan 19, 2010

Has Hadoop been out since 2004?

abscondment · on Jan 19, 2010

No; it seems to have been inspired by Google papers as a solution to Nutch's scaling issues. See http://research.yahoo.com/files/cutting.pdf [pdf]

:(

10ren · on Jan 20, 2010

You can't patent something that has been publicly disclosed, including disclosure by yourself. Commercial use counts as disclosure.

This was filed June 18, 2004, so given the one year grace allowed under US law, if Google made commercial use of this system before June 18, 2003, then they cannot patent it.

I'm pretty sure Google used distributed map-reduce from the beginning... so I'm thinking that this patent must cover a refined method that differs in some way from their initial approach.

It's a lot of work to read and understand a patent (I spent a couple of days on one this week), so if anyone else wants to do so and let us know what the patented algorithm actually is...

Todd · on Jan 20, 2010

You may apply for a US patent within the first year after the invention was made publicly known (e.g., published). If you intend to file foreign patents, the invention must not be made publicly known before filing abroad. You have one year from the date of filing your US patent to file your foreign application(s). So public disclosure prior to filing will only stop foreign applications.

WalterGR · on Jan 20, 2010

> You can't patent something that has been publicly disclosed, including disclosure by yourself. Commercial use counts as disclosure.

Incorrect. IANAL, so I will not attempt to correct you. I suspect YANAL either.

cromulent · on Jan 20, 2010

It seems to be true in the US.

"If the invention has been described in a printed publication anywhere, or has been in public use or on sale in this country more than one year before the date on which an application for patent is filed in this country, a patent cannot be obtained."

http://www.uspto.gov/web/offices/pac/doc/general/index.html#...

tybris · on Jan 20, 2010

The original MapReduce paper was published in December 2004. 6 months after the patent was filed.

WalterGR · on Jan 20, 2010

Year-to-file is obviously different than cannot-file.

cromulent · on Jan 20, 2010

It's in the same section.

"if it was known or used by others in this country before the date that the applicant made his/her invention, a patent cannot be obtained".

Mind you, they don't seem to apply this very consistently.

WalterGR · on Jan 20, 2010

making an invention != disclosure != filing != obtaining a patent.

I was originally pointing out a fact lest others be led astray, but I'm not going to reply further.

cromulent · on Jan 20, 2010

I don't think anyone would doubt your inequation.

However, 10ren's statement seems to be backed up pretty well by the USPTO site - if you have already gone public or sold your invention, no patent for you. I'm not sure why you think it is incorrect.

Cheers

WalterGR · on Jan 20, 2010

Rereading the thread, I think I misunderstood 10ren's point. I think we are all actually in vehement agreement. :) Apologies.

alttab · on Jan 19, 2010

Even if they wanted to enforce it they'd have to prove it.

Otherwise patenting map reduce - which is practically one of the foundations of functional data processing - is down right silly

neilc · on Jan 19, 2010

The relevant prior art is not the idea of a "map" or "reduce" function as such (that isn't what is significant about MR, anyway).

The patent is about parallelizing M and R operators for large-scale data analysis; I think that the relevant prior art would include 1980s work on parallel database systems (e.g. Gamma, Bubba).

bitdiddle · on Jan 19, 2010

I agree completely. I would think there is enormous prior art in this area that makes this patent absurd.

sjunkin · on Jan 20, 2010

I wonder what AbInitio, Informatica, et al think about this patent, is the HN community aware of their claims to similar IP?

fexl · on Jan 19, 2010

μολον λαβε

jcl · on Jan 19, 2010

For the curious: http://en.wikipedia.org/wiki/Molon_labe

fexl · on Jan 19, 2010

Ah, I should have used omega there instead of omicron. I'm sure I got dinged for the defiant sentiment though.

zandorg · on Jan 19, 2010

Is that APL?

profquail · on Jan 19, 2010

Probably.