I made an app that lets you split a file into horcruxes (github.com/jesseduffield)
243 points by jesseduffield on Aug 2, 2020 | 80 comments



One related scheme is fountain codes, where you can split a file into a pseudo infinite stream of blocks such that finding N blocks will almost certainly allow reconstruction. Very useful in UDP / satellite transmission, where you can keep broadcasting these blocks and clients can listen at their convenience.

The state of the art codec is RaptorQ. I've got a Go library that uses the slightly older Raptor standard to do chunking:

https://github.com/sudhirj/pump
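Not the actual Raptor/RaptorQ algorithm, but the core idea can be sketched in a few lines: treat each encoded block as a random XOR (a GF(2) linear combination) of the k source chunks, identified by its seed, and decode by Gaussian elimination once enough blocks arrive. Everything below (function names, the coin-flip degree distribution) is invented for illustration; real codecs use carefully tuned degree distributions so a fast "peeling" decoder works instead.

```python
# Toy fountain-style code (illustration only, NOT Raptor/RaptorQ).
# Each encoded block = XOR of a pseudo-random subset of the k source
# chunks; ~k received blocks almost always suffice to decode.
import random

def encode_block(chunks, seed):
    """Return (subset mask, XOR of that subset), derived from the seed."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in chunks]
    if not any(mask):                        # avoid the empty combination
        mask[rng.randrange(len(chunks))] = True
    out = bytearray(len(chunks[0]))
    for m, c in zip(mask, chunks):
        if m:
            for i, b in enumerate(c):
                out[i] ^= b
    return mask, bytes(out)

def decode(received, k):
    """Gauss-Jordan elimination over GF(2). Returns the k source chunks,
    or None if the received equations don't yet span all k unknowns."""
    pivots = {}                              # pivot column -> (mask, data)
    for mask, data in received:
        mask, data = list(mask), bytearray(data)
        for col, (pm, pd) in pivots.items():
            if mask[col]:                    # reduce against known pivots
                for c in range(k):
                    mask[c] ^= pm[c]
                for i in range(len(data)):
                    data[i] ^= pd[i]
        lead = next((c for c in range(k) if mask[c]), None)
        if lead is None:
            continue                         # redundant block, nothing new
        for pm, pd in pivots.values():       # keep all rows fully reduced
            if pm[lead]:
                for c in range(k):
                    pm[c] ^= mask[c]
                for i in range(len(pd)):
                    pd[i] ^= data[i]
        pivots[lead] = (mask, data)
    if len(pivots) < k:
        return None
    return [bytes(pivots[c][1]) for c in range(k)]

chunks = [b"spli", b"t me", b" int", b"o k!"]    # k = 4 source chunks
blocks, seed = [], 0
while decode(blocks, 4) is None:                 # listen until we have enough
    blocks.append(encode_block(chunks, seed))
    seed += 1
assert decode(blocks, 4) == chunks
```

Note the "pseudo infinite stream" property: the sender just keeps incrementing the seed, and any sufficiently large subset of the resulting blocks decodes.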



What happens if you can't get N blocks, but say N-1 or N-2? Can you get a partial reconstruction, or nothing at all?


It's actually probabilistic, though the probabilities very quickly approach 0 and 1 on each side.

But to answer your question: yes, you could recover some of the file if you didn't get as many pieces as you needed. It's been a year or so since I did any work with fountain codes, but I believe most implementations send all the real chunks, followed by n error-correction chunks, so it would depend on how many real chunks you got. The error-correction chunks on their own wouldn't get you anywhere, though.


Online codes [0] just generate random linear equations over the source chunks and send both the generated right-hand side and the seed that generated the equation. This way there's no state to keep (no "now, which equations did I send?"). Most generated equations are of degree 2 (so an equation with two ones); some (about 1%, IIRC) are of degree 1 (so pure data). Having less than the critical mass of equations will give you partial reconstruction, but all bets are off regarding how much you will get. They were used in storage by Amplidata in their Amplistor at some point.

[0] https://en.wikipedia.org/wiki/Online_codes


Fountain codes are seeing use in DNA storage encoding schemes. That's how I use them, at any rate.


I know nothing about this but it sounds super interesting. Care to share more?


I was almost posting a snarky comment about how this is just Shamir's Secret Sharing, which is in no way new.

But, hey, this is really cool. It was probably really fun to write, and it illuminates a cool scheme that too few know anything about.


Turns out the only other Shamir's Secret Sharing app I know of that's actually usable (ssss, or some such name) only does the deed for keys, while this does it for files.

This is actually pretty useful, but sort of sad it drags the whole go ecosystem with it (who knows what go will look like in 20 years and if this app will still compile and work).


> This is actually pretty useful, but sort of sad it drags the whole go ecosystem with it (who knows what go will look like in 20 years and if this app will still compile and work).

This is something which the CS community needs to solve, imho. I.e. provide an executable language with a formal specification that is guaranteed to be available indefinitely. And to make this more useful, other programming languages should provide back-ends targeting this language.


For better and worse, isn't that what WASM is/has become/will become?


1. Write your will;

2. Split it into N + 1 horcruxes: distribute N of them to your N children, and the remaining piece to a lawyer; and

3. Force them to all come together to decrypt the will for fairness.


Then what happens if one child, who knows they won't get anything, refuses to provide their piece?

BTW, K-of-N cryptography is well known as Shamir's Secret Sharing.

https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing


> Then what happens if one child who knows they won't get anything, refuses to provide their piece?

Don’t give that child a piece in the first place!


Even worse, what if one or more children lose their bits and can’t provide their piece?


Have each child make horcuxes of their horcrux :)


Not all K of N schemes are "secret sharing" (the most well known ones aren't), though I agree that for this will case you probably want a secret sharing algorithm and not say, an erasure code (as you want the property of zero information transfer); however, certainly not all secret sharing algorithms are Shamir's specific secret sharing algorithm.


This is what is used in hashicorp vault


And in TFA


If you can trust the lawyers with a single piece, can't you trust them with the whole thing?


But that's not an overly complicated process with no benefit... In before the blockchain folks talk about wills.


> 1. Write you will;

I thought this was a command!

> 3. Force them to all come together to decrypt the will for fairness.

This seems like a good recipe for ensuring integrity of the will, but it's hard (for me) to see how it ensures, or even promotes, fairness.

(Also, compelling presence seems like a strange side effect: the sort of person who would skip the reading of a will might not be motivated to ensure the fairness or integrity of the will anyway, so requiring their presence to verify it seems counterproductive.)


"Fairness" as in nobody can claim somebody else has tampered with the will.

Note: thanks for spotting the typo. I corrected it.


You could just write your will and sign it with a known public key. I don't see any reason why you'd want to keep the will hidden, or require participation to read it.


Getting the will is the easy part, no? Wills are public record in the UK at least.

A cautionary tale: someone in my family died before getting his will sorted out, leaving a tangled mess of accounts and businesses. My mother was one of his executors, and they had to in effect remove themselves from the will because it was written before his children were born and they weren't mentioned. The lesson being: if you have children or anyone else you care about, get your will sorted out sooner rather than later. In this particular case he was found dead in the morning, without warning.


IANAL and this is anecdotal, but:

I heard of an individual who very specifically left one dollar to one of their children. The logic being, if said child hadn't been mentioned in the will at all, there could be a claim that they'd been inadvertently left out...

I'm super curious how that works in practice.


In France you have a fixed part of your estate you have to share between children.

For instance if you have two of them you can freely dispose of 1/3 of your estate, the rest is shared equally between your children and spouse.

Relevant law (in French): https://www.service-public.fr/particuliers/vosdroits/F606


This is true, but usually you leave a large enough amount that people can't argue you were swayed. I've heard of $2000 to $4000 being the amount for a larger estate.


This was used as a plot line in Better Call Saul. I wasn't sure it was something real but my understanding is that most of the cases used in the show are either real cases or composites.


To be honest, that TV show is heralded as being very accurate about lawyers


To people in the US reading this, I used RocketLawyer to make a will valid in New York that specified that I currently had no children, but directed how my assets would be distributed among my future children.


Interesting. Feels like you could accomplish the same effect with regular encryption though.

Base64 encode the key, pad it with random data that matches the size of splitting into N-1 parts. Then split the encrypted file into N-1 b64 encoded parts. For lowish values of 'N', you could then just decrypt with each "key" until something readable emerges. The key size, algo, etc, could be prepended to each part in plaintext.

Or, if you want a variation where no parts are optional, put a piece of the key in every split part, with a sufficiently long key.


You're looking for secret sharing algorithms, which is what this is using under the hood. The classic (used in the OP) is Shamir's secret sharing algorithm [1]. The general gist of it, simplified to the 2-of-N case, is:

1. The encryption key is the y-intercept of a line.

2. Each shared key ("horcrux" in the OP's terminology) is a point along the line.

3. More keys are simply more points on the line.

Once you have any two keys, you can fully define the line and recover the Y intercept.

This gives you really strong guarantees because each point on its own reveals nothing about the y-intercept; without satisfying the M-of-N threshold, the value could still be anything.

Generalizing this to the M-of-N case simply involves increasing the order of the curve (ie line -> parabola -> ...)

[1] https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
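The 2-of-N picture above fits in a few lines of code. This is a toy worked over a prime field for clarity (real implementations, including presumably the OP's, typically work byte-wise over GF(2^8); the names and field choice here are my own):

```python
# Toy 2-of-N Shamir secret sharing over a prime field (sketch only).
# The secret is the y-intercept of a random line; shares are points on it.
import random

P = 2**127 - 1                     # a Mersenne prime, big enough for a demo

def split_2_of_n(secret, n):
    """Shares are points (x, y) on the line y = slope*x + secret (mod P)."""
    slope = random.randrange(1, P)
    return [(x, (slope * x + secret) % P) for x in range(1, n + 1)]

def recover(share_a, share_b):
    """Lagrange interpolation at x = 0 from any two distinct shares:
    y-intercept = (y1*x2 - y2*x1) / (x2 - x1)  (mod P)."""
    (x1, y1), (x2, y2) = share_a, share_b
    inv = pow(x2 - x1, -1, P)      # modular inverse (Python 3.8+)
    return (y1 * x2 * inv - y2 * x1 * inv) % P

secret = 123456789
shares = split_2_of_n(secret, 5)
assert recover(shares[0], shares[3]) == secret   # any 2 of the 5 suffice
```

A single share reveals nothing: one point is consistent with a line through every possible y-intercept, which is exactly the "the value could still be anything" guarantee described above.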


Does your scheme support M-of-N recovery?


No, that's a good point. I was misled by some of the comments that assumed the posted scheme didn't either. However, I'm assuming "all parts present" is one of many desired use cases.


See also Dark Crystal [https://darkcrystal.pw], which uses Shamir's Secret Sharing to break your secret into "shards", letting you set the number of friends needed to recreate the secret (less than the total number of shards). Sharing is done over your social network (currently Briar and Secure Scuttlebutt).


There is also the CLI tool ssss, which uses Shamir secret sharing to split data:

https://linux.die.net/man/1/ssss


Huh, so the predecessor to Horcrux is also an oblique Harry Potter reference, in that it is named in Parseltongue.


Boooo. Hiss.


> Q) This isn't really in line with how horcruxes work in the harry potter universe!

> A) It's pretty close! You can't allow any one horcrux to be used to resurrect the original file (and why would you that would be useless) but you can allow two horcruxes to do it (so only off by one). Checkmate HP fans.

Not buying it, and the fact that this is the first FAQ entry is evidence that the author doesn't really buy it either. A better fit for Tom Riddle's horcrux would simply be a lossily compressed copy of the file. Which would admittedly be pretty useless, unless maybe the copy contains a lossy copy of your soul.

But then Virgin Galactic is also a pretty good name even though they haven't yet left the solar system. That should be his defense: it's just a cool name.


Also I don’t think this program requires the murder of an innocent to create each horcrux.


Depending on your definition of “innocent”, neither does a Horcrux, really.


> A better fit to Tom Riddle's horcrux would simply be a lossy compression copy of the file.

Correct me if I'm wrong, but my memory is that a horcrux is literally a piece of your original, whole soul. So, if you create one horcrux (as seems to have been the standard practice prior to Tom Riddle), you continue living from then on with only half a soul. Voldemort's capacity for evil came partly from having only 1/7 of a soul in his body.

So I actually think the name kind of fits. Where it loses me a bit is that you can recreate the original file without all of the Horcruxes. Simply splitting the bytes into separate files would be closer IMO.


> Voldemort's capacity for evil came partly from having only 1/7 of a soul in his body.

He was plenty evil before he made horcruxes.

Splitting the bytes would be closer only for file types that still work (maybe not to the fullest) as 1/n pieces. The OP misunderstood and thought that Tom Riddle's soul was restored in full when he reconstructed his body, but it was actually only 1/7, and terribly maimed.


> a lossy copy of your soul

Name of my next poetry anthology, right there.


This would be an absolutely perfect album name for an ambient music band called Digital Horcrux.


What don't you buy here? I'd say "It's pretty close!" is pretty accurate!


It'd be more accurate if the horcrux files were embedded into other files that had to be deleted, using an arcane command from the horcrux program, to resurrect the original file... while the original file remains in a half-existing, unreadable state until resurrected.

ETA: for even more accuracy, increase the 'readability' of the file (for text only, I guess) with each horcrux deleted. Allow them to be added one by one so the original file slowly 'increases its power'.


> even though they haven't yet left the solar system

Virgin Galactic hasn't even got to the Kármán line - they've topped out at 90 km.


Very cool! I built something similar for a patent application about 5 years ago, for proximity image encryption. The idea was that an image was split into X encrypted pieces, each still being a valid image (distorted, or something custom). If you wanted to see the image again, you had to be in close proximity to the other parties holding those pieces. A BLE beacon served as the proximity check for the prototype.


A friend of mine made something like this for a blockchain hackathon once, around 2 years ago. The techniques he used were relatively simple: it starts with some elliptic-curve cryptography math to split a single main key into multiple keys. Every person would then have a full copy of the encrypted files, and when enough people combine their keys on the blockchain, they get the main key to decrypt the files; from then on it would be public that the files have been opened, and by whom.

This app seems to use Shamir's Secret Sharing, which is something I'm not familiar with, but as far as I understand the Wikipedia article about it, it works roughly the same way but is more general.

I'm interested to see if people will actually use this. If anyone has additional explanations of the differences between these algorithms, that would be very appreciated.


what is the difference between reed solomon erasure codes and shamir secret sharing?

i.e. if I just split the secret into a number of reed solomon error correcting blocks (where n blocks are sufficient to recover the full data), is that fundamentally different?


Yes, it's very different: if you split some error coded thing across several files, each individual file is going to give you some chunk of the data. It's not secret in any way, it's just that you've only got, like, 1/7th of it.

With Shamir's secret sharing, having anything less than the required number of files is useless, you can't decrypt any of the data unless you reach the required number.


I wouldn't say it's "very" different as long as you make the code non-systematic. In a systematic code the first n blocks are the input data, but there's no reason you need to have it this way.

Even the par2 format will let you do what you want here. Just toss out the original input data and give each person 1/k of the required amount of parity blocks. Given a large enough file size [1], there is a vanishing probability that you can recover additional bits of the input with fewer blocks than needed. This also allows you to transmit fewer bits in total (each target gets 1/k as many bits).

Of course, if you don't like this argument, you can always just encrypt the par2 blocks and use Shamir to share the key. This reduces the overhead to a multiple of the (hopefully much-shorter) key. EDIT: Commentary further down the thread indicates that a better idea is just to use an all-or-nothing transform.

[1] We have information-theoretic security in the sense that you can't recover n bytes from (k-1)n/k bytes, so what I'm saying is that we need large enough n/k.


hmm, googling around after my Q found this

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80....

where the authors explicitly write: "Shamir's scheme for sharing secrets is closely related to Reed-Solomon coding schemes."

now, my eyes glaze over when it comes to the math, so I'll need to look it through a few times before I understand what they are saying.


You could also look at http://web.eecs.utk.edu/~plank/plank/papers/FAST-2011.pdf

You can combine the reconstruction properties of Reed Solomon (you need k of n pieces) with the All or Nothing Transform. Encrypting data so that you need all of the data to decrypt.

Then you can essentially do Shamir secret sharing without the storage overhead. A 1 MB file split so you need 5 pieces would have 200 KB pieces instead of 1 MB pieces. This is at the cost of exchanging information-theoretic security for computational security.
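A toy version of the all-or-nothing idea, loosely in the spirit of Rivest's "package transform" (sketch only; SHA-256 stands in here for the keyed cipher a real AONT would use, and all names are invented for illustration). The random key is XORed into a final block together with a digest of all ciphertext blocks, so losing any single block leaves you unable to recover the key, and hence unable to decrypt anything:

```python
# Toy all-or-nothing transform (sketch; SHA-256 as the keyed PRF).
import hashlib
import os

def prf(key, i, length):
    """Keyed pseudo-random pad for block i (toy construction)."""
    return hashlib.sha256(key + i.to_bytes(8, "big")).digest()[:length]

def aont_encode(blocks):
    key = os.urandom(32)
    cts = [bytes(a ^ b for a, b in zip(m, prf(key, i, len(m))))
           for i, m in enumerate(blocks)]
    digest = hashlib.sha256(b"".join(cts)).digest()
    tail = bytes(a ^ b for a, b in zip(key, digest))  # key depends on ALL blocks
    return cts + [tail]

def aont_decode(cts_with_tail):
    *cts, tail = cts_with_tail
    digest = hashlib.sha256(b"".join(cts)).digest()
    key = bytes(a ^ b for a, b in zip(tail, digest))
    return [bytes(a ^ b for a, b in zip(c, prf(key, i, len(c))))
            for i, c in enumerate(cts)]

blocks = [b"16-byte block 01", b"16-byte block 02"]
assert aont_decode(aont_encode(blocks)) == blocks
```

RS-encode the transformed blocks and you get the k-of-n scheme described above: fewer than k pieces can't reassemble all AONT blocks, so (computationally) nothing decrypts.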


Presumably the secret sharing algorithm is more resistant to brute forcing if you have one less than the required number of pieces. They solve opposite problems.


Could you make one equivalent to the other, by R-S encoding a key? And if you have less than the whole key (assuming it's large enough that the missing portion is still bruteforce-resistant), then even a partial reconstruction of the key still gives you none of the data.


Yes you can! Apply an All or Nothing Transform before RS encoding so that you need all of the original data to decrypt. Then you can only decrypt if you have k of n pieces. You save by a factor of k in piece size over SSS.

http://web.eecs.utk.edu/~plank/plank/papers/FAST-2011.pdf


so is this an alternative to multipar / quickpar / par2 from the usenet days?

Looks good. I always try to build redundancy into my offline backups, as if the backup in my hand were the last one that hasn't been cooked / melted / flooded etc., because one day it just might be! We're talking worst-worst-case scenarios here, with triple redundancy: online-offsite (can be ransomwared), offline-onsite (can be flooded/burned), and online-onsite (the first line of defense, i.e. Syncthing or a NAS).


Par2 is current, still receiving bugfixes, and is used every day as part of my file backup system.


RAR files having a recovery record is also really helpful for archiving. I add 3% redundancy to any archive I create with it, because I've had some byte errors in the past.


How efficient are these splits? If a 100 MB file is split into 5, with 3 needed to recombine, we'd expect the pieces to be at least 33.3 MB each, won't we?


Each share is (roughly) the same size as the original. In effect, all of them are just encrypted versions of the original.


Ah, this is slightly disappointing.

For some reason, my gut was telling me each piece would be smaller than the original.

I wonder if this (horcruxes of size ~ 1/N) is actually possible.


Without some form of compression, I don't think it is.

In particular, consider your example (5 horcruxes with 3 needed to reconstruct). View the original file as the interval (0, N) and view reconstruction as a set covering problem. If each horcrux covers an interval of size N/3, then whenever any pair overlaps, no third horcrux can complete the covering (the three together cover strictly less than N). This is a contradiction, because 5 intervals of size N/3 inside (0, N) must overlap somewhere.


If you want to split a file into 5 horcruxes _and_ you require all 5 horcruxes must be present to reconstitute the original file, then each horcrux will be one fifth the size of the original. However if you allow a subset of horcruxes to reconstitute the file then each horcrux will be as big as the file :)


It would be possible to just make the encrypted file publicly available and only distribute the key shares to each "horcrux". The shares themselves should be the size of the key that is used for encryption.
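A minimal sketch of the simplest flavor of that key-share idea: the n-of-n case, done with XOR one-time pads. Each share alone is uniformly random, and all n are needed to recover the key. (This is not Shamir's scheme, which additionally gives k-of-n thresholds; names are illustrative.)

```python
# n-of-n secret splitting via XOR one-time pads (sketch).
# Each share is random on its own; XOR of all n shares recovers the secret.
import os
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret, n):
    pads = [os.urandom(len(secret)) for _ in range(n - 1)]
    last = reduce(xor, pads, secret)       # secret XOR all pads
    return pads + [last]

def combine(shares):
    return reduce(xor, shares)             # pads cancel, secret remains

shares = split(b"my secret key", 5)
assert combine(shares) == b"my secret key"
```

Each share is the size of the key, not the file, which is exactly why publishing the ciphertext and sharing only the key is so much cheaper.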


Chunks of size 1/M are possible. 1/N would violate information theory.

If your data is compressible, you should do that first.


Would be simpler to do RAID6 encoding and then encrypt the results.


Translation:

RAID 6 uses some linear algebra to allow m of n copies to reconstruct the original data, where typically m = n - 2, but the math works for any numbers you like. So if you split up the data using the exact same algorithm, and pre- or post-encrypt the blocks, you should get the same thing being attempted here, and only inflate storage by n/m.
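The single-parity case (RAID 5-style, m = n - 1) is just XOR and fits in a few lines; RAID 6 adds a second, independent parity computed over GF(2^8) so any two blocks can be missing. A sketch of the XOR case:

```python
# RAID 5-style single parity (sketch): one XOR parity block lets you
# reconstruct any one missing data block from the survivors.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(data)              # d0 ^ d1 ^ d2

# lose data[1]; recover it from the surviving blocks plus parity
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```

Encrypting the blocks before (or after) this step is what would add the secrecy the parent comment describes; the parity math itself leaks the data freely.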


Dropbox does something similar for cold storage: https://dropbox.tech/infrastructure/how-we-optimized-magic-p...


I love when the universe throws my own ideas back at me but with a slightly better implementation

The backstory on this was me freaking out when I had a newborn coming and I wanted my legacy to be handed to him at the right age if something happened to me.



Heh, you seem to have messed up the bracket order for the markdown links. I memorise it with the mnemonic "square bracket": first the square [], then the bracket ().


[I have claimed something](here is my proof)


Those brackets get me every time. Fixed :)


It would be a fun modification to require 6/7 or 5/7 files, so that you needed to bring a certain number of pieces but not every piece. Inspired by the RAID 5 algorithm, which has enough parity to allow one drive failure in a group of 3 or more.


It looks like this is exactly what the linked software does?


Right you are, it does support thresholds.



