Show HN: Mandala – Automatically save, query and version Python computations (github.com/amakelov)
100 points by amakelov 6 months ago | 30 comments
`mandala` is a framework I wrote to automate tracking ML experiments for my research. It differs from other experiment tracking tools by making persistence, querying and versioning logic a generic part of the programming language itself, as opposed to an external logging tool you must learn and adapt to. The goal is to be able to write expressive computational code without thinking about persistence (as in an interactive session), and still have the full benefits of a versioned, queryable storage afterwards.

Surprisingly, it turns out that this vision can pretty much be achieved with two generic tools:

1. a memoization+versioning decorator, `@op`, which tracks inputs, outputs, code and runtime dependencies (other functions called, or global variables accessed) every time a function is called. Essentially, this makes function calls replace logging: if you want something saved, you write a function that returns it. Using (a lot of) hashing, `@op` ensures that the same version of the function is never executed twice on the same inputs.

Importantly, the decorator encourages/enforces composition. Before a call, `@op` functions wrap their inputs in special objects, `Ref`s, and return `Ref`s in turn. Furthermore, data structures can be made transparent to `@op`s, so that an `@op` can be called on a list of outputs of other `@op`s, or on an element of the output of another `@op`. This creates an expressive "web" of `@op` calls over time.

2. a data structure, `ComputationFrame`, which can automatically organize any such web of `@op` calls into a high-level view by grouping calls with a similar role into "operations" and their inputs/outputs into "variables". It can detect "imperative" patterns - like feedback loops, branching/merging, and grouping multiple results in a single object - and surface them in the graph.

`ComputationFrame`s are a "synthesis" of computation graphs and relational databases, and can be automatically "exported" as dataframes, where columns are variables and operations in the graph, and rows contain values and calls for (possibly partial) executions of the graph. The upshot is that you can query the relationships between any variables in a project in one line, even in the presence of very heterogeneous patterns in the graph.
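In rough outline (the Colab quickstart below has complete, runnable examples, and the exact names like `storage.cf(...)` and `.df()` may shift while the API is in alpha), using these two pieces looks something like this:

    from mandala.imports import Storage, op  # import path per the current docs

    @op
    def increment(x: int) -> int:
        return x + 1

    @op
    def add(x: int, y: int) -> int:
        return x + y

    storage = Storage()        # in-memory; pass db_path="file.db" for a persistent one

    with storage:              # calls inside the block are saved & memoized
        a = increment(1)
        b = increment(2)
        c = add(a, b)          # @ops compose on each other's Ref outputs

    cf = storage.cf(add)       # ComputationFrame around the `add` calls
    print(cf.df())             # export the web of calls as a dataframe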

I'm very excited about this project - which is still in an alpha version being actively developed - and especially about the `ComputationFrame` data structure. I'd love to hear the feedback of the HN community.

Colab quickstart: https://colab.research.google.com/github/amakelov/mandala/bl...

Blog post introducing `ComputationFrame`s (can be opened in Colab too): https://amakelov.github.io/mandala/blog/01_cf/

Docs: https://amakelov.github.io/mandala/




Does it survive restarts? You mention that they are exported as dataframes - can they be reimported? Does this mean we can run mandala on many machines and merge dataframes together to get collective memoization?

Do you support persisting into external stores?

You mention IncPy in the README - have you discussed this project with Philip Guo? https://pg.ucsd.edu/

What is the memory and CPU overhead?

How does the framework handle dependencies on external libraries or system-level changes that might affect reproducibility?

How do you roll back state when it has memoized a broken computation? How does one decide which memoizations to invalidate vs keep?


In order,

1. Yes, you can choose to create a persistent storage by passing `db_path` to `Storage()`. The current implementation is just an SQLite file. To run on many machines, you don't really need to re-import from a dataframe (dumping to a dataframe is meant as an exit point from `mandala`, so you can do downstream analyses in a format more familiar than `ComputationFrame`). Instead, `ComputationFrame`s can be merged via the union (`|`) operator - see https://amakelov.github.io/mandala/blog/01_cf/#tidy-tools-me... for an example, and the small sketch at the end of this comment. Storages don't support merging yet, but it's certainly possible!

2. Already answered in 1.

3. Nope, but I'd be happy to (though I feel like `mandala` took memoization in a substantially different direction). Are you in a position to make an introduction?

4. This project is currently not optimized for performance, though I've used it in projects spanning millions of memoized calls. The typical use case is to decorate functions that take a long time to compute, so the overhead of memoization amortizes. A very quick benchmark on my laptop shows ~6ms per call for in-memory storage, ~9ms for a persistent storage, with a simple arithmetic function that otherwise takes ~0 time.

5. Great question - currently, the dependency tracer is restricted to user-chosen functions, to avoid tracking the function calls an imported library makes. You could use a bit of magic (import-time automatic decoration) to track all functions in a file or a directory, though this is not implemented right now. The reasoning is that, for a typical multi-month ML project, you usually have a single conda environment, so you want to ignore library changes. Similarly, system-level state (e.g. environment variables) is not tracked either. I think a very useful feature would be to at least record the versions of each imported library, so that storages can be ported between environments with some guarantees (or warnings).

6. If an `@op` call was memoized, the underlying Python function call succeeded, so in this sense it can't be "broken"; it's possible, however, that there was a bug. In this case, you can delete the affected calls and all values that depend on them (if you keep these values, you're left with "zombie" values that don't have a proper computational history). The `ComputationFrame` supports declarative deletion - you build a `ComputationFrame` that captures the calls you want to delete and call `.delete_calls()` - though there's still no example of this in the tutorial :) Alternatively, you can change the affected function and mark this as a new version; then you should be able to delete all calls using the previous version (though this isn't supported at the moment).

- How the cache is invalidated is detailed here: https://github.com/amakelov/mandala?tab=readme-ov-file#how-i...
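To make 1. concrete, here's roughly what persistent storage and `ComputationFrame` merging look like (modulo exact method names, which may shift while the API is in alpha):

    from mandala.imports import Storage, op

    @op
    def double(x: int) -> int:
        return 2 * x

    @op
    def square(x: int) -> int:
        return x * x

    storage = Storage(db_path="experiments.db")  # persistent SQLite file; survives restarts

    with storage:
        doubled = [double(i) for i in range(3)]
        squared = [square(i) for i in range(3)]

    # ComputationFrames over different ops can be combined with the union operator
    cf = storage.cf(double) | storage.cf(square)
    print(cf.df())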


We have a framework at <job> to do memoization as well as distributed compute - in fact the memoization was mostly a happy side effect of the need to transfer serialized function arguments to other machines.

Your addition of code/runtime dependencies intrigues me. I will probably take a look at your code to try to understand this better.

I somehow doubt there's enough overlap for us to open source our work and try to merge with yours, but it's really cool to see other people working on similar concepts. I predict we'll see a lot more frameworks like these that lean on mathematical principles like functional purity in the future.


This blog post gives an overview of the core dependency tracking logic: https://amakelov.github.io/blog/deps/


thank you!


Is there a way to use this to capture just hashes of the code (transitive) that implements a particular function?

I can't buy into a whole framework in my current context -- but I would really like a way to roll my own content hashing for ad hoc caching within an existing system -- where the hash automatically incorporates the specific implementation logic involved in producing the data I want to cache (so that the hash automatically changes when the implementation does).

e.g. -- given a Python function foo -- I want a hash of the code that implements foo (transitive within my project is fine)


Great question! The versioning system does something essentially equivalent to what you describe. It currently works as follows:

- When a call to an `@op` is executed, it keeps a stack of @track-decorated functions that are called (you can add some magic - not implemented currently - via import-time decoration to automatically track all functions in your project; I've opted against this by default to make the system more explicit).

- The "version" of the call is a hash of the collection of hashes of the code of the `@op` itself and the code of the dependencies that were accessed

- Note that this is much more reliable than static analysis, and much more performant/foolproof than using `sys.settrace`; see this blog post for discussion: https://amakelov.github.io/blog/deps/

- When a new call is made, it is first checked against the saved calls. The inputs are hashed, and if the system finds a previous call on these hashes where all the dependencies had the same (or compatible) code with the current codebase, this call is re-used. *Assuming deterministic dependencies*, there can be at most 1 such call, so everything is well-defined. I don't think this is an unrealistic assumption, though you should keep it in mind - it is pretty fundamental to how this system works. Otherwise, disambiguating versions based on the state of the code alone is impossible.

- When a dependency changes, you're alerted which `@op`s' versions are affected, and given a choice to either ignore this change or not (in the latter case, only calls that actually used this dependency will be recomputed).

The versioning system is mostly a separate module (not currently importable independently of the rest, but making it so should be pretty doable). I'd love to hear more about your use case. It may not be too difficult to export just the versioning as its own thing - though from what you describe, it should also have some component of memoization? As in, you seem to be interested in using this information to invalidate/keep things in the cache?


Forgot to mention: yes, the dependency tracking is transitive, i.e. if your @op calls a @track-decorated function, which in turn calls another @track-decorated function, then both dependencies will show up, etc.
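In code, the transitive case looks roughly like this (assuming the import path shown in the docs):

    from mandala.imports import Storage, op, track

    @track
    def scale(x: float, factor: float) -> float:   # indirect dependency
        return x * factor

    @track
    def preprocess(x: float) -> float:             # direct dependency: calls scale()
        return scale(x, 0.5)

    @op
    def featurize(x: float) -> float:
        # both preprocess and scale get recorded as dependencies of this call,
        # so editing either one affects the version of featurize
        return preprocess(x)

    storage = Storage()
    with storage:
        y = featurize(4.0)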


You can directly hash the code using inspect: https://docs.python.org/3/library/inspect.html#inspect.getso...

And do something similar for the arguments (maybe pickling them to get a bytes object you can hash without relying on specific types). Using just the built-in hash function could come with footguns for mutable objects.
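Something like this as a starting point (note it's not transitive - getsource only sees the function's own body):

    import hashlib, inspect, pickle

    def call_key(fn, *args, **kwargs) -> str:
        # hash fn's own source together with its (pickled) arguments;
        # not transitive: functions called by fn are not part of the hash
        source = inspect.getsource(fn).encode()
        payload = pickle.dumps((args, sorted(kwargs.items())))
        return hashlib.sha256(source + payload).hexdigest()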


Cool! Looks pretty professional.

I explored a similar idea once (also implemented in Python, via decorators) to help speed up some neuroscience research that involved a lot of hyperparameter sweeps. It's named after a Borges story about a man cursed to remember everything: https://github.com/taliesinb/funes

Maybe one day we'll have a global version of this, where all non-private computations are cached on a global distributed store somehow via content-based hashing.


Thanks! Indeed, the ultimate (but very ambitious from the point of view of coordination and infrastructure) vision would be to build the "planetary computer" where everyone can contribute and every computation is transparently reproducible. While researching `mandala`, I ran into some ideas along those lines from the folks at Protocol Labs, e.g. https://www.youtube.com/watch?v=PBo1lEZ_Iik, though I'm not aware of the current status.


Oh also totally missed the Borges mention the first time - I'm a big fan of his stories!


This is really nice. I'll take a much closer look when I get time later. I'm very interested in how you think about incremental computation - I find cache invalidation to be incredibly annoying and time-consuming when running experiments.

We wrote our own version of this (I think many or all quant firms do; I know such a thing existed at $prev_job) but we use type annotation inspection to make things efficient (I had ~1-2 days to write it, so had to keep the design simple as well). It's a contract: if you write the type annotation, we can store it. Socially, this incentivizes all the complex code to become typed. We generally work with timeseries data, which makes some things easier and some things harder, and the output is largely storable as parquet (we handle objects, but inefficiently).

One interesting subproblem that is relevant to us is the idea of "computable now", which adds a kind of third variant beyond the usual None/Some (i.e. is something intrinsically missing, or is it just not knowable yet?). For example, if you call total_airline_flights(20240901), that should (a) return something like an NA, and (b) not cache that value, since we will presumably be able to compute it on 20240902. But if total_airline_flights(20240101) returns NA, we might want to cache that, so we don't pay the cost again.
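A toy illustration of the idea (not our actual implementation):

    import datetime as dt
    import functools

    NOT_YET = object()   # "not knowable yet", as opposed to a genuine (cacheable) NA

    def cache_unless_not_yet(fn):
        # cache real values and NAs, but never cache NOT_YET results,
        # so a later call gets a fresh chance to compute them
        memo = {}
        @functools.wraps(fn)
        def wrapper(*args):
            if args in memo:
                return memo[args]
            result = fn(*args)
            if result is not NOT_YET:
                memo[args] = result
            return result
        return wrapper

    @cache_unless_not_yet
    def total_airline_flights(yyyymmdd: int):
        date = dt.datetime.strptime(str(yyyymmdd), "%Y%m%d").date()
        if date >= dt.date.today():
            return NOT_YET   # not knowable yet: don't cache
        return None          # stand-in for the real computation; a true NA would be cached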

We sidestep this problem in our own implementation (time constraints) but I think the version at $prevjob had a better solution to this to avoid wasting compute.

(side note: hey Alex! I took 125 under you at Harvard; very neat that you're working on this now)


Thanks Rachit! Great running into you after all these years!

Being aware of types is certainly a must in a more performance-critical implementation; this project is not at this stage though, opting for simplicity and genericity instead. I've found this best for maintenance until the core stabilizes; plus, it's not a major pain point in my ML projects yet.

Regarding incremental computation: the main idea is simple - if you call a function on some inputs, and this function was called on the same (modulo hashing) inputs in the past and used some set of dependencies that currently have the same code as before (or compatible code, if manually marked by the user), the past call's outputs are reused (key question here: will there be at most one such past call? yes, if functions invoke their dependencies deterministically).
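As a toy sketch of that lookup (a simplification that keeps only one version per input and ignores the bookkeeping for "compatible" code markers):

    import hashlib, pickle

    def content_hash(obj) -> str:
        return hashlib.sha256(pickle.dumps(obj)).hexdigest()

    def call_or_reuse(fn, args, dep_sources, call_log):
        # reuse a past call iff the input hash matches AND every dependency recorded
        # for that call has the same source hash as the current codebase;
        # dep_sources: dependency name -> current source code
        # call_log: input hash -> (recorded dependency hashes, outputs)
        key = content_hash(args)
        if key in call_log:
            dep_hashes, outputs = call_log[key]
            if all(content_hash(dep_sources[name]) == h for name, h in dep_hashes.items()):
                return outputs   # same inputs, same code: reuse
        outputs = fn(*args)
        call_log[key] = ({n: content_hash(s) for n, s in dep_sources.items()}, outputs)
        return outputs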

You can probably add some tools to automatically delete old versions and everything that depends on them, but this is definitely a decision the user must make (e.g., you might want to be able to time-travel back to look at old results). I'm happy to answer more nuanced questions about the incremental computation implementation in `mandala` if you have any!


This is great! Really good work. It reminds me of working in a Smalltalk environment, coding inside the debugger while an exception is being thrown and restarting the computation.

I believe that this path can be supported as it is right now, and the next step would be to store a computation on some server. If an uncaught exception is raised, store all the computation along with the state, transfer it to your local machine, and restore the state of the machine as it was when the exception was thrown. This way, you can debug the state of the program with all the live variables as it was being run.


Thanks! Indeed, despite the fact that the main goal is to track ML experiments, the approach taken in `mandala` has a lot in common with e.g. time-travel debugging (https://en.wikipedia.org/wiki/Time_travel_debugging).

In reality, there are many similarities between experiment tracking, debugging, and high-level computation graphs - they're all different ways of getting a handle on what a program did when it was run.


How well does Mandala extend beyond the Numpy ecosystem?

I’m experimenting with Python CAD programming using the CadQuery and Build123d libraries. I’d like to speed up iteration time and intelligent caching would help.

These libraries are pretty opinionated, which makes it a bit challenging to imagine how to squeeze cache decorators in there. They have a couple of different APIs, all of which ultimately use the Open CASCADE (OCCT) kernel via https://github.com/CadQuery/OCP

CadQuery has a fluent programming design that relies on long method chains [0]. It also has an experimental Free Function API [1].

Build123d iterates on CadQuery [2] with the goal of integrating better with the Python programming language. It has two supported APIs. The Builder API uses Python’s context manager (‘with’ blocks) heavily. The secondary Algebraic API is more functional, using arithmetic operators to define geometric operations [3].

The simplest way to integrate Mandala would probably be to use Build123d’s Algebraic API, wrapping subassemblies in functions decorated with @op.
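E.g. something like this, going from the Build123d Algebra docs (I haven't verified that OCCT solids hash/serialize cleanly under Mandala's storage, and the exact constructor signatures are from memory):

    from mandala.imports import Storage, op
    from build123d import Box, Cylinder

    @op
    def plate_with_hole(length: float, width: float, thickness: float, hole_r: float):
        # a subassembly built with the Algebra API, cached/versioned as a unit
        return Box(length, width, thickness) - Cylinder(hole_r, thickness)

    storage = Storage(db_path="cad_cache.db")
    with storage:
        plate = plate_with_hole(60, 40, 3, 4)   # recomputed only if inputs or code change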

However, it would be even better to proactively cache function/argument pairs provided by the underlying APIs. For example, if I change 50% of the edges passed to a Fillet() call, it would be nice to have it complete in half the time. I guess this would require me to fork the underlying library and integrate Mandala at that level.

[0] https://cadquery.readthedocs.io/en/latest/intro.html

[1] https://cadquery.readthedocs.io/en/latest/free-func.html

[2] https://build123d.readthedocs.io/en/latest/introduction.html...

[3] https://build123d.readthedocs.io/en/latest/key_concepts_alge...


Very cool! Looking forward to trying it out - the graphs reminded me of a toy project I'd done a while back to better understand deterministic and reproducible execution in Python, as seen in marimo.io notebooks: https://github.com/vrtnis/python-notebook-simulator


Ah, yes, the notorious state problem in notebooks. In your project, do you find the dependencies statically or dynamically?


Statically - basically just parsing the code into an AST and then walking through the tree to collect information about variable usage and definitions.
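A stripped-down illustration of that kind of AST walk (not the actual project code):

    import ast

    class NameUsage(ast.NodeVisitor):
        # collect which names a cell defines and which it reads
        # (simplified: ignores imports, function args, attributes, ...)
        def __init__(self):
            self.defined, self.used = set(), set()

        def visit_Name(self, node):
            if isinstance(node.ctx, ast.Store):
                self.defined.add(node.id)
            else:
                self.used.add(node.id)
            self.generic_visit(node)

    cell = "y = model(x)\nz = y + 1"
    usage = NameUsage()
    usage.visit(ast.parse(cell))
    print(usage.defined, usage.used)   # {'y', 'z'} {'model', 'x', 'y'}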


Congratulations, good job! The chaos of notebooks needs some tracking indeed.

7 years ago I made a project with 100 calculation dependencies (in Python & SQL scripts), and the only thing that kept me from losing track was Makefile + Graphviz.

I wanted to make something similar in Rust -- a static visualizer of dependencies between structs. Things turned out way harder than expected.


Thanks! Yes I think a caching solution like this is great for notebooks, because it makes it very cheap to re-run the whole thing (as long as you reasonably organized long-running computations into `@op`s), overcoming the notorious state problem.

Graphviz is indeed a lifesaver; and you can similarly think of `mandala` as a "build system for Python objects" (with all the cool and uncool things that come with that; serializing arbitrary Python objects with strong guarantees is hard https://amakelov.github.io/mandala/tutorials/gotchas/).

I've no experience with Rust, but I'd be curious to hear about the difficulties that came up. I'd expect to see some overlap with Python!


What I can remember immediately:

1. Imports are more complex than in Python, because a module can be just a block in code, not necessarily a separate file/folder. E.g. `pub mod my_module { <code with functions and constants> }` is a module inside another module, so you don't need a folder and `__init__` inside to have inner modules.

Also, `use something` may mean an external crate.

`use super::something` means importing from the upper level of the module tree, but that's not necessarily a folder.

2. I can parse what types my functions require in their signatures, or what types structs require in their members (but I must have resolved where those names really point), and there's also type inference -- i.e. you don't need to explicitly write the type of every var, since it's deduced from what is assigned. For example, `let my_var = some_func(...)` will make `my_var` have the return type of some_func, so now I must also keep track of all functions and what they return.

And then, there are generics:

    let my_var: Vec<MyType> = vec![...];
Vec is generic, and in this case it has MyType inside. Well, this may be enough to just register `MyType` on my deps list. But then I may do some more calls:

    let my_other_var = my_var.pop().unwrap().my_method();
Here, `pop()` returns `Option<MyType>`, unwrap returns `MyType`, and then in the end, my_method may return whatever, and I essentially need something like a compiler to figure out what it returns.

This starts to look like a little compiler or language server.


Used something similar to this in the past: https://github.com/bmabey/provenance. Curious to see similarities/differences. Also reminds me of Unison at a conceptual level: https://github.com/unisonweb/unison


Thanks for sharing! This is a great project. It is quite close to the memoization part of `mandala` and I'll add it to the related work in the README. I think the similarities are:

- using `joblib` to hash arbitrary objects (a good fit for ML, which involves a lot of numpy arrays that joblib is optimized for)

- how composition of decorated functions is emphasized - I think that's very important

- wrapping outputs of memoized functions in special objects: this encourages composition, and also makes it possible to run pipelines "lazily" by retracing memoized calls without actually loading large objects in memory

- versioning: in a past version of `mandala`, I used the solution you provided (which is now subsumed by the versioning system, but it's still quite helpful)

The differences:

- w.r.t. memoization, in `mandala` you can represent data structures in a way transparent to the system. E.g., you can have a memoized function return a list of things, and each thing will have an independent storage address and be usable by downstream memoized functions. Most importantly, this is tracked by the provenance system (though I'm not sure - maybe this is also possible in `provenance`?)

- one big finding (for me) while doing this project is that memoization on its own is not sufficient to manage a complex project; you need some declarative way to understand what has been computed. This is what all the `ComputationFrame` stuff is about.

- finally, the versioning system: as you mention in the `provenance` docs, it's almost impossible to figure out what a Python function depends on, but `mandala` bites this bullet in a restricted sense; you can read about it here: https://amakelov.github.io/blog/deps/

Re:Unison - yes definitely; it is mentioned in the related work on github! A major difference is that Unison hashes the AST of functions; `mandala` is not that smart (currently) and hashes the source code.


This is a very innovative take on infra for ML observability. Shreya Shankar and collaborators at Berkeley came up with https://github.com/loglabs/mltrace, which treads the same ground - perhaps you've looked at it already?


Thanks! And thanks for sharing the pointer - I think I've seen `mltrace` at some point in the past. The tool has some similarities, but seems different from `mandala` on a philosophical level - `mltrace` seems more opinionated and domain-specific to ML. `mandala`'s goal is to make persistence logic a more generic part of Python semantics, and there's much more emphasis on composing `@op`s freely in whatever complex ways you want. Python functions are great and everyone knows what they do - let's give them more superpowers!

If this is something you're interested in, I can probably give a more detailed comparison if I find the time.


Very cool concept, thanks for sharing this.

Is mandala designed only for notebook-style interactive environments, or also for running Python scripts more traditionally? In the latter case, could this be integrated into a GitOps-like environment? (push to git, new run in CI/CD, export log metrics as an artifact with an easy way to compare to previous runs)


Great question - personally, I mostly use it from notebooks, and I think it's a great fit for that. Bundling experiment tracking with incremental computation makes a lot of sense in a notebook (or any other interactive) environment, because it solves the problem of state: if all your computations are end-to-end memoized, re-running the entire notebook is cheap (I routinely do this "retracing" to just get to some results I want to look at).

That being said, nothing prevents you from running this in a script too, and there are benefits of doing this as well. If your script mostly composes `@op` calls (possibly with some light control flow logic as needed), you get resumability "for free" on the level of individual `@op` calls after a crash. However, the workflow you're describing may run into some features that aren't implemented (yet) if your runs write to different storages. `mandala` makes it easy to get a "holistic" picture of a single `Storage` and compare things there. Comparing across storages will be more awkward. But it shouldn't be too hard to write a function that merges storages (and it's a very natural thing to do, as they're basically big tables).
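Schematically, a sweep script like this resumes where it left off when re-run after a crash, because already-saved `@op` calls are simply retraced (the training function here is a stand-in):

    from mandala.imports import Storage, op

    @op
    def train(seed: int, lr: float) -> float:
        # stand-in for a long-running training job
        return 1.0 / (1.0 + lr * seed)

    storage = Storage(db_path="sweep.db")   # persistent, so it survives the crash

    with storage:
        for seed in range(10):
            for lr in (1e-3, 1e-2):
                acc = train(seed, lr)       # already-memoized calls are just retraced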


Thanks for the in-depth explanation! I’ll keep an eye on it :)

Fine-grained crash recovery does sound like a great application as well.



