Temporal .NET – Deterministic Workflow Authoring in .NET (temporal.io)
152 points by kodablah on May 3, 2023 | 49 comments



For those not familiar with workflows as code: a workflow is a method that is executed in a way that can't fail—each step the program takes is persisted, so that if execution is interrupted (the process crashes or the machine loses power), execution continues on a new machine, from the same step, with all local/instance variables, threads, and the call stack intact. It also transparently retries network requests that fail.

So it's great for any code you want to be sure runs reliably. But having methods that can't fail also opens up new possibilities. For example, you can:

- Write a method that implements a subscription, charging a card and sleeping for 30 days in a loop. The `await Workflow.DelayAsync(TimeSpan.FromDays(30))` is transparently translated into a persisted timer that continues executing the method when it fires, and in the meantime consumes no resources beyond the timer record in the database. (See the sketch after this list.)

- Store data in variables instead of a database, because you can trust that the variables will be accurate for the duration of the method execution, and execution spans server restarts!

- Write methods that last indefinitely and model an entity, like a customer, that maintains their loyalty program points in an instance variable. Workflows can receive RPCs called Signals and Queries for sending data to the method ("User just made a purchase for $30, so please add 300 loyalty points") and getting data out of the method ("What's the user's points total?").

- Write a saga that maintains consistency across services / data stores without manually setting up choreography or orchestration, with a simple try/catch statement. (A workflow method is like automatic orchestration.)
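
For instance, the subscription bullet above might look roughly like this in the .NET SDK. This is a sketch only: BillingActivities, ChargeCardAsync, and the timeout are made up, and the activity-invocation style is approximate (see the Ref discussion downthread).

    using System;
    using System.Threading.Tasks;
    using Temporalio.Activities;
    using Temporalio.Workflows;

    public class BillingActivities
    {
        [Activity]
        public Task ChargeCardAsync(string customerId) =>
            Task.CompletedTask; // call the payment API here
    }

    [Workflow]
    public class SubscriptionWorkflow
    {
        [WorkflowRun]
        public async Task RunAsync(string customerId)
        {
            while (true)
            {
                // Retried automatically per the activity's retry policy.
                await Workflow.ExecuteActivityAsync(
                    (BillingActivities a) => a.ChargeCardAsync(customerId),
                    new ActivityOptions { StartToCloseTimeout = TimeSpan.FromMinutes(5) });

                // Persisted timer; no process needs to stay alive for 30 days.
                await Workflow.DelayAsync(TimeSpan.FromDays(30));
            }
        }
    }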


What happens if you have a workflow in progress whose implementation you want to change? E.g., if I had a workflow that was waiting for 30 days but then decided I wanted that interval to be 7 days instead?


Temporal determines a workflow is non-deterministic if, upon code replay, the high-level commands don't match what happened during the original execution. In this case it's technically safe to change the code this way, because both versions still result in timer commands. But the duration the timer was first created with is what applies to existing runs (though there are reset approaches, as mentioned in another comment). However, if you changed the implementation to, say, start a child workflow or activity before the timer, the commands would mismatch and you'd get a non-determinism error upon replay.

There are multiple strategies to update live workflows; see https://community.temporal.io/t/workflow-versioning-strategi.... Most often, for small changes, people use the `Workflow.Patched` API.
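
A minimal sketch of what that looks like for the 30-day-to-7-day change (the patch ID is arbitrary, and the surrounding workflow is made up):

    using System;
    using System.Threading.Tasks;
    using Temporalio.Workflows;

    [Workflow]
    public class BillingWorkflow
    {
        [WorkflowRun]
        public async Task RunAsync()
        {
            // Patched returns true for new executions (recording a marker in
            // history) and false when replaying histories from before the
            // patch, so in-flight runs keep their original 30-day timer.
            var interval = Workflow.Patched("shorter-billing-interval")
                ? TimeSpan.FromDays(7)
                : TimeSpan.FromDays(30);
            await Workflow.DelayAsync(interval);
        }
    }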


After you deploy the new workflow code, you can reset [1] a workflow's execution state to before the DelayAsync statement was called, and then the workflow will sleep for 7 days from now.

That doesn’t take into account the time it’s already been waiting. For when you want to do something on a schedule and be able to edit the schedule later, there’s a Schedules feature that lets you periodically start a workflow, like cron but more flexible. [2] In this case, the workflow code would be simpler: just ChargeCustomer() and SendEmailNotification().

[1] https://docs.temporal.io/cli/workflow#reset

[2] https://docs.temporal.io/workflows#schedule


Ah, so that's what the technique is called. I've seen a similar approach used for point of sale systems that basically persisted on every button press, so if one crashed you could bring up exactly the same state on a different one simply by logging in.


It is a replica of Amazon Simple Workflow as far as I can see. Which is not a bad thing; SWF is great and does not get much attention.


That's no coincidence: Temporal was founded by the creators of Amazon Simple Workflow. See https://temporal.io/about.


What's the difference between workflows as code and using an event bus? Or is it the same thing? With RabbitMQ, if a machine dies while processing a message, the message is automatically requeued so that another consumer can process it.


The major differences between an event bus and a workflow engine are that the workflow engine:

- Creates and updates events/messages for you

- Maintains consistency between the events and timers and the state of the processing flow. More info on why this is important for correctness/resilience: https://youtu.be/t524U9CixZ0

The difference between workflow code and using an event bus is that with the former, the above is done automatically for you, while with the latter it's done manually, which can be a lot of code to write and get right, and makes it harder to track, visualize, and debug what happened in production. It would also take a lot of events to get an equivalent degree of reliability: the message processor would need to do a single step and then write the next event to the bus, so a 10-line workflow-as-code function would translate to 10 different events in the bus route.
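
To make that concrete, here's roughly what a single hand-rolled step over a bus looks like (every type here is hypothetical):

    using System.Threading.Tasks;

    public interface IBilling { Task ChargeAsync(string customerId); }
    public interface IProgressStore { Task SaveAsync(string customerId, string step); }
    public interface IEventBus { Task PublishAsync(object evt); }

    public record ChargeCardRequested(string CustomerId);
    public record SendReceiptRequested(string CustomerId);

    public class ChargeCardHandler
    {
        private readonly IBilling billing;
        private readonly IProgressStore store;
        private readonly IEventBus bus;

        public ChargeCardHandler(IBilling billing, IProgressStore store, IEventBus bus) =>
            (this.billing, this.store, this.bus) = (billing, store, bus);

        public async Task HandleAsync(ChargeCardRequested evt)
        {
            await billing.ChargeAsync(evt.CustomerId);        // the actual step
            await store.SaveAsync(evt.CustomerId, "charged"); // manual checkpoint
            await bus.PublishAsync(new SendReceiptRequested(evt.CustomerId)); // hand off
        }
    }

Multiply that bookkeeping by every step, plus the retry/timeout/visibility logic, to get the equivalent of one workflow method.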

Also, the event bus route doesn't have the new possibilities I listed in the parent comment.


Seems like this would depend on some storage guarantees, but I can’t find anything about that.


Correct: the workflow's guarantee to always complete executing, independent of process/hardware failures, depends on the database not losing data. You host your workflow code with Temporal's Worker library, which talks to an instance of the Temporal Server [1], an open-source set of services (hosted by you or by Temporal Cloud) backed by Cassandra, MySQL, or Postgres. [2] So, for instance, increasing Cassandra's replication factor increases your resilience to disk failure.

[1] https://github.com/temporalio/temporal

[2] https://docs.temporal.io/clusters#persistence


Workflows, generally speaking the "boxes with arrows between them" pseudo-state machine (really a Turing machine), are an interesting thing historically in the "enterprise computing" realm.

Sure, you can claim this is a fundamentally lower-level mechanism, but really it's the same thing.

It is probably important, like all enterprise software quagmires, to consider how they are sold.

Your typical IT manager with low code skills has all the documented high-level processes his group manages. Hey look, Visio diagrams with point-to-point flows of boxes with arrows, and ... maybe ... some interactions of state with databases and/or systems domains.

They know just enough that the actual nuts and bolts are buried in all this ... code. Actual unintelligible code to them, even if they have some inkling of coding.

To him, it's a simple flow, and the PHBs above him can understand what he does with the simple flows.

Then some workflow vendor walks in the door, pops up some visual editor, and wows him with basically "you don't need all that code, you put it into the workflow tool and it will just work, and BOOM your coding environment IS your simple visio document".

WOW SIGN ME UP TAKE $$$$$$$! Then comes the pilot flows.

Error handling and retries?

Distributed State and even worse, Transactional update of Distributed state?

Load distribution?

Branching, looping?

Systems integration?

It goes back to the theory of computation. State machines have limitations in processing. The stack machine addresses some of that, but it eventually runs out, requiring ... the Turing machine.

As it turns out, almost all enterprise data flows or processes require Turing machines. That's why they are coded, at some level, in Turing-complete languages.

Superficially at a high level, you start to see a basic state machine model on top of that ... but it is an illusion.

You move the Turing machine into the workflow engine (and the workflow engine IS a Turing machine ... they all have them: state, looping, branching) and the "simple point-to-point" flow becomes spaghetti ... tool-locked-in spaghetti, with fixed limits on what it can do.

The current evolution of workflows is the "directed acyclic graph" workflow engine. This has been an improvement, mostly by constraining the actual use of workflow engines to task organizations they can handle, and by trying to keep people from going "full Turing" in the workflow engine.

It can still loop ... most do it by recursive calls to subflows ... which gets pretty spaghetti. And you still have the fundamental features all PHBs will want in the workflow: on error, retry, or run a recovery flow, that sort of thing. Still a huge amount of complexity to get the workflow working properly.

And yet the visual workflow editing tool can have enormous value. Enormous. The visual nature of the flow, the ability to visually diagnose suspended/failed executions. And workflows are everywhere: batch processes, code builds, deployments, automated maintenance, backups/restores, etc.

And I haven't even gotten into the mess of automated rules-engine-based stuff.

The only value in structuring low-level code along the lines of what "enterprise workflow" has evolved into after decades (useful, but not a holy grail) is if it gives you a fundamentally better way to visualize the execution of the code, which can happen under constrained use of workflow engines.

UML was a massive disaster, another tangential relation to what appears to be being done here. There, your "workflows" or code diagrams were code-generated into code.

Alas, the final problem of workflow engines is their balkanization. XML standardization (BPEL) failed miserably for all the usual corporate-standardization reasons (it functioned as lock-in for the existing players, lowest-common-denominator abilities, ugliness, XKCD protocol+1).

If only ... if only there were a well-designed representation scheme and a wide variety of good open-source visualization and execution engines. But there aren't.

I think what is discussed here is a step towards a potential solution: it comes from the IDE tooling side, which is something a workflow always was (in the vein of the now-defunct CASE, Computer-Aided Software Engineering, days). Standard tooling that coders demand and IDEs provide as a baseline. But IDEs are single-machine things, and workflows are distributed entities ... sigh, never mind that thought.

Ok, maybe we just need a good, more universal visualization tool first. Don't care about the creation side; just something that can "plug in" and represent non-workflow system interactions AS workflows. "Enterprise execution visualization." A REALLY good system for that has never existed, IMO, and is universally needed.


Temporal (and similar systems like Cadence, AWS SWF, and Azure Durable Functions) gives you more expressiveness and better DX than defining DAGs in a UI or markup file. You can write (almost) arbitrary code, and the library translates the code's actions into workflow steps.

> The Visual nature of the flow, ability to visually diagnose suspended / failed executions.

Temporal has a web UI in which you can see which executions are failing, and see on which step they're failing:

https://temporal.io/blog/temporal-ui-beta


A lot of this, especially the good visualization you're referring to, is already possible with Netflix Conductor. It's OSS and free, and it also has a company called Orkes backing it.

I think they also have a C# SDK, which I can't vouch for because I haven't used it.


We’ve been using Temporal (the Python SDK) for some new projects at Internet Archive. It’s early days, but we’re very excited. We run our own infrastructure, and we get more power-loss events and other intermittent issues than most. The durable execution of workflows that Temporal promises seems like it was made for us. And once code is “temporalized”, we get to eject a bunch of ad-hoc resiliency stuff into the sun, and what’s left is a lot clearer.

There’s a lot to learn with it. I’ve seen ramp-up take a few months per engineer, though we’re also making it a little harder on ourselves by self-hosting, being early adopters of the Python SDK (Go, Java, and TypeScript are the most mature, I think), and dealing with a mix of Python async and multiprocessing (a bunch of CPU-bound activities in the mix). The docs are solid, and the team is responsive to community users.


Can I ask what issues or difficulties you faced going the self-hosting route? We’re a small outfit thinking of self-hosting Temporal to save costs, and would be interested to hear about your experience.


Our department is still in a world of Ansible and VMs, so we can't yet take advantage of some of the work that's gone into making it easy to run in k8s. We're using Postgres for Temporal's persistence because we're already familiar with operating it (and it's pretty cool that you already have some choice in DB). We've hit a bug in the newest version of Temporal server that doesn't play nicely with pgbouncer in transaction-pooling mode, but they've responded to our ticket and seem to have a solution; we're running on the previous version for now. Other than that, it's just been the up-front cost of building the Ansible playbooks. We haven't pushed it to any kind of load limits yet, or built out a real high-availability deployment of it. Day-to-day operation at our utilization level has been no drama so far.


Six years ago we built a workflow engine similar to Temporal (which unfortunately is not open source), with the explicit goal of weaving as much of the workflow async-ness as possible into seemingly synchronous code. Engineers no longer needed to write bespoke persistence, retry, wakeup, or idempotency handling, and we saw great results in terms of dev productivity while easily meeting the performance requirements for our workloads. On top of that, we could easily trace execution, scale up and down dynamically, and assign a price to individual executions (and therefore customers).

Glad to see that Temporal follows a similar approach and gets you the same benefits. For every coder out there currently using AWS SWF: if your day job involves actually building workflows rather than just handholding a handful of them, take a look at Temporal. You'll never look back.

(To be fair, I am still grumpy that they separate "deciders" and activities, but I can see the benefits of that.)

If you want to use a GUI to design workflows: equally useful, but probably with a different target audience.


Temporal looks great, but it takes a long time to find out what it is. The website has a lot of information, but the navigation is not good.


I am currently scouting Temporal for a personal project of mine, and after about two weeks of reading documentation and trying out the samples, it's almost second nature to me.

The TypeScript SDK is amazing. It strikes a nice balance: it's straightforward to use while avoiding most of the common pitfalls related to non-determinism, thanks to the SDK using Node's VM module to execute code with patched sources of non-determinism, like `Date` and `Math.random`.

The only caveat is running it (Temporal Server). After mulling it over, for this project I stuck with running everything on a single VPS. If for some reason I get more users one day, I might consider a managed Kubernetes service. Temporal Cloud is also an option, just not for me at the moment (region constraints, plus it's a pet project, so no money).

But the local developer experience is actually amazing. Temporal is a joy all around, to be honest.


I didn't really understand the part about 'What Problem Does 'Ref' Solve?'

What would be the downside of writing it such as:

`ExecuteActivityAsync<OneClickBuyWorkflow>(e => e.DoPurchaseAsync, etc...)`?


That's basically what is happening here, except you create `e` ahead of time instead of in a do-anything-you-want lambda. They're essentially the same thing, though I think `ExecuteActivityAsync(MyActivities.Ref.DoPurchaseAsync, etc...)` is clearer that you are referencing an activity. And since it's just a delegate that an attribute is on, you can write a helper to do what you have instead.


If you're worried about someone doing something other than calling a method in the lambda, it's pretty straightforward to use an Expression<Func<T, TResult>> and validate the contents of the expression. Seems like that would make for a much better dev experience than this static Ref property pattern, which you don't see anywhere else in C#.
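
Something along these lines (a sketch; GetActivityName is a hypothetical helper, not any SDK's API):

    using System;
    using System.Linq.Expressions;
    using System.Threading.Tasks;

    public static class ActivityRef
    {
        // Accepts only a direct method call like a => a.DoPurchaseAsync(x)
        // and extracts its name without constructing or invoking anything.
        public static string GetActivityName<T>(Expression<Func<T, Task>> call)
        {
            if (call.Body is not MethodCallExpression method)
                throw new ArgumentException(
                    "Expression must be a single method call.", nameof(call));
            return method.Method.Name;
        }
    }

Calling ActivityRef.GetActivityName<PurchaseActivities>(a => a.DoPurchaseAsync(p)) returns "DoPurchaseAsync" without ever instantiating PurchaseActivities.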


It's not just that worry; it's the other reasons detailed in the post on top of that. But note that GP's post doesn't _call_ the method, it just references it, which is basically what we are doing in the post. For things like Durable Entities, which _do_ call the method, yes, we can prevent multi-use of the lambda arg, but there are other problems (forcing interface/virtual, can't allow other args, sync vs. async, etc.).


I'm not really arguing that your pattern _doesn't work_; I just think you've created an unnecessary new pattern to solve an already-solved problem. Using expressions with lambdas is pretty standard in C# when you need to reference a method or property without calling it immediately. Entity Framework is an obvious example, as are mocking libraries like Moq, and at least one library that I know of (Hangfire) uses expressions to serialize info about a method so that it can be invoked later, possibly on a different machine or even in a totally different app. All of the things you mention can be validated in an expression, if you feel the need.


Thanks! Another commenter has also convinced me to investigate the expression approach. I was hoping not to force people to create lambdas for each thing they want to invoke, from a code-readability POV, but it sounds like the ecosystem wants to force that.


The main reason I'm curious is that they write

'We solve this problem by allowing users to create instances of the class/interface without invoking anything on it. For classes this is done via FormatterServices.GetUninitializedObject and for interfaces this is done via Castle Dynamic Proxy. This lets us "reference" (hence the name Ref) methods on the objects without actually instantiating them with side effects. Method calls should never be made on these objects (and most wouldn't work anyways).'

Which sounds like a lot of heavy lifting. It seems something like

    public class WorkflowBuilder<T> where T : class
    {
        // Somehow workflow gets the real instance
        private T Instance = default!;
    
        public async Task<TResult> ExecuteActivityAsync<TResult>(Func<T, Task<TResult>> func) => await func(Instance);
    
        public async Task<TResult> ExecuteActivityAsync<TResult>(Func<Task<TResult>> func) => await func();
    
        public async Task ExecuteActivityAsync(Func<Task> func) => await func();
    
        public async Task ExecuteActivityAsync(Func<T, Task> func) => await func(Instance);
    }
    
    public record Purchase(string ItemID, string UserID);
    
    public class PurchaseActivities
    {
        public static WorkflowBuilder<PurchaseActivities> OneClickBuyWorkflow => new WorkflowBuilder<PurchaseActivities>();
    
        public async Task DoPurchaseAsync(Purchase purchase)
        {
            await OneClickBuyWorkflow.ExecuteActivityAsync(e => e.DoPurchaseAsync(purchase));
        }
    
        public static async Task DoPurchaseAsyncStatic(Purchase purchase)
        {
            await OneClickBuyWorkflow.ExecuteActivityAsync(() => DoPurchaseAsyncStatic(purchase));
        }
    }

Would pretty much achieve the same thing.


> Which sounds like a lot of heavy lifting

How is `e` created for the `e => e.DoPurchaseAsync(purchase)` lambda? You're going to have to do that lifting anyway to create an instance for `e` that isn't really a usable instance, unless you use source generators, which we plan on doing.

I think what you have there is a lot more heavy lifting. Also note that workflows and activities are unrelated to each other: a workflow can invoke any activities. The code you have is a bit confusing because `PurchaseActivities` should be completely unrelated to workflows.


>You're going to have to do that lifting anyways to create an instance for `e` that isn't really a usable instance.

But this should never happen; there is no reason to create an unusable instance. The real instance should be resolved the same way the current workflow resolves it.


Activities may actually run on completely different systems than the one where the workflow executes. From the workflow's perspective, all "execute activity" does is tell the Temporal server to execute an activity with a certain string name and serializable arguments. Everything else is sugar. So if you're going to use a type caller-side to refer to the name and argument types, you can't instantiate it fully (its constructor may have side effects intended for when it really runs).
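
Roughly, what the sugar reduces to (a sketch; the string-name overload and types shown here are illustrative):

    using System;
    using System.Threading.Tasks;
    using Temporalio.Workflows;

    public record Purchase(string ItemId, string UserId);

    [Workflow]
    public class RawCallWorkflow
    {
        [WorkflowRun]
        public async Task RunAsync(Purchase purchase)
        {
            // The typed/Ref helpers boil down to: a registered activity name,
            // serializable arguments, and options. Nothing caller-side needs
            // a real, constructed activity instance.
            await Workflow.ExecuteActivityAsync(
                "DoPurchase",
                new object?[] { purchase },
                new ActivityOptions { StartToCloseTimeout = TimeSpan.FromMinutes(5) });
        }
    }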

You can jump through a bunch of hoops like requiring interfaces which is what some frameworks do. But in our case, we just decided to make it easy to reference the method without invoking it or its instance.


Lambdas/Funcs/Expressions do exactly this in C#; there is no instantiation required. Creating an unusable Ref object is jumping through hoops.

You can parse an expression to serialize it and run it on a different server, etc.

See e.g. https://github.com/6bee/Remote.Linq


Hrmm, I will investigate using an expression tree for this (now's the time, while it is alpha). I was hoping to avoid making people create lambdas. I hope I don't run into overload ambiguity with the existing `ExecuteActivity` calls, where you can just pass an existing method as a Func<T, TResult> param. I will investigate this approach, thanks!

Of course, the "Ref" pattern is user choice / a suggested pattern; it's not a requirement in any of our calls, which just take simple delegates, however you create them. So I may be able to work it in there.


As another point of reference, Hangfire also uses expressions for a similar use-case and it seems to work quite well.

E.g. https://docs.hangfire.io/en/latest/background-methods/passin...

   BackgroundJob.Enqueue<EmailSender>(x => x.Send(13, "Hello!"));


Yup!!!


If it's possible to support both approaches without cluttering your API, that may be the best solution. I know a lot of more junior devs who have a difficult time wrapping their heads around expressions in C#. Excellent library!


After looking at it, I am concerned it is not possible to support both with a clean experience. We have to give guidance and samples, and we have to choose one way of referencing methods in those. Having ambiguous approaches is a bit rough. We may have to just move to expressions.


I completely understand. Either way, I still think the library is great and appreciate the work done to create it.


Wonder how it compares to https://github.com/UiPath/CoreWF


Hrmm, I admit to not knowing a lot about Windows Workflow Foundation, but from a quick glance I suspect they both have roots in the same place: deterministic workflow authoring that invokes activities, like Amazon SWF and the Azure Durable Task Framework. So they are probably quite similar, yet CoreWF seems to prefer XAML or a limited code-based DSL, whereas Temporal prefers to provide the full language with determinism constraints.


I’m confused about something. What’s the difference between Temporal and something like Dagster/Airflow/Prefect? Is there a difference between ETL and workflows? Or is it just branding?


From my limited understanding, Dagster and Airflow are more limited, DAG-based tools, whereas Temporal lets you write your logic in the programming language directly (loops, conditionals, coroutines, other libraries, etc.) so long as you remain deterministic. I'm not that familiar with Prefect either, so I can't tell how limited the logic in their "flows" can be in Python, but they do not seem to support things like signals/queries at first glance.

One big difference is that those three are Python-only, while Temporal supports Go, Java, JS/TypeScript, Python, PHP, and .NET. Another usual difference between most workflow tooling and Temporal (granted, I have not checked with these three) is the care taken in handling errors, retries, replay, cancellation, etc., so that would be worth checking as well.


ETL tools target that specific use case, whereas Temporal workflows are much more general purpose—for any backend code you want to run reliably. I wrote some more about this here: https://community.temporal.io/t/what-are-the-pros-and-cons-o...


Another library that is doing the same thing: https://github.com/Azure/durabletask


That framework was created by a cofounder of Temporal.


Temporal has been amazing to work with so far. We've been looking forward to the .NET SDK for a while now so we can maintain parity with our Java and Node ecosystems.


For those commenting here who don't seem to quite understand the point of a workflow library:

It's for writing code with steps that involve humans, e.g. sending an email notification for someone to approve a document, that kind of thing.

These are inherently slow and asynchronous, because humans operate on timescales of hours or days, not milliseconds.


Traditional workflow systems are often used for specific types of business processes, particularly asynchronous ones like human-in-the-loop approvals.

Temporal, while it uses the "workflow" terminology, is a new type of thing. At a basic level, the question is: do you want your backend code to run reliably? If yes, and you're okay with the latency hit of each step getting persisted for you, then the answer is "use Temporal to write your backend code." It's a new programming model that lets you develop at a higher level of abstraction, where you don't have to be concerned about faults in the hardware or network, or downstream services / third-party APIs being temporarily down. You no longer have to code retries or timeouts, use task queues, or use a message bus to communicate between services. And oftentimes you don't even need a database.


Congrats, Chad! Great to see you ship this. I wonder if anyone familiar with Azure Durable Functions would be willing to give this SDK a side-by-side comparison!


One must imagine Sisyphus happy.



