Durable Execution Is Moving Into Postgres

Microsoft's pg_durable, DBOS, and Temporal are pushing crash-proof, exactly-once workflows into Postgres in 2026. Here's how to choose.

Microsoft open-sourced a PostgreSQL extension this month that runs durable workflows entirely inside the database. No external orchestrator, no sidecar, no separate service to keep alive at 3am. That fact alone tells you where this category is heading.

For most of the last decade, "durable execution" (a function that can crash, get redeployed, or lose its network and still resume exactly where it left off, without repeating work) meant standing up a cluster. Temporal defined the pattern and proved it at real scale. It also set the operational price: a separate control plane, a persistence store, a worker fleet, and a mental model that lived outside your application. In 2026 that price is getting hard to justify for a lot of teams, because the thing they already run can now do the job. That thing is Postgres.

What Microsoft actually shipped

The new extension is called pg_durable, and it's in preview. Built with pgrx, it exposes a SQL DSL for defining function graphs and registers a Postgres background worker to execute them durably. Each step gets checkpointed, so after a crash, a restart, or a failed step, execution resumes from the last durable point instead of starting over.
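To make "resumes from the last durable point" concrete, here is a minimal sketch of step checkpointing in Python. It is not pg_durable's SQL DSL or its implementation: the `checkpoints` table, the `run_workflow` helper, and the step names are all hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs anywhere.

```python
import sqlite3


def run_workflow(conn, workflow_id, steps):
    """Run steps in order, skipping any step whose result is already checkpointed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " workflow_id TEXT, step_name TEXT, result TEXT,"
        " PRIMARY KEY (workflow_id, step_name))"
    )
    results = {}
    for name, fn in steps:
        row = conn.execute(
            "SELECT result FROM checkpoints WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, name),
        ).fetchone()
        if row is not None:
            # Step finished in an earlier run: reuse its stored result.
            results[name] = row[0]
            continue
        result = fn()
        with conn:  # commit the checkpoint before moving to the next step
            conn.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?)",
                (workflow_id, name, result),
            )
        results[name] = result
    return results


calls = []  # records every real execution of a step


def fetch_order():
    calls.append("fetch")
    return "order-42"


def flaky_charge():
    calls.append("charge")
    if calls.count("charge") == 1:
        raise RuntimeError("simulated crash")  # first attempt dies mid-step
    return "charged"


conn = sqlite3.connect(":memory:")  # stand-in for Postgres (assumption)
steps = [("fetch", fetch_order), ("charge", flaky_charge)]

try:
    run_workflow(conn, "wf-1", steps)  # crashes during the second step
except RuntimeError:
    pass

final = run_workflow(conn, "wf-1", steps)  # resume: "fetch" is not re-executed
```

On the second run, `fetch_order` is skipped because its result is already stored, while the step that was in flight when the crash happened runs again. That re-run is why the guarantees around side effects matter so much.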

Underneath it sit two Rust libraries: duroxide, the orchestration runtime that handles deterministic replay, checkpoints, sub-orchestrations, and timers; and duroxide-pg, a Postgres-backed state provider. The design choice worth noticing is that the workflow engine doesn't talk to Postgres. It is Postgres. Your workflow durability becomes your database's durability, full stop.

I'd treat preview as preview. An in-database engine from Microsoft is a loud signal about direction, but "background worker executing replayable graphs inside your primary database" is exactly the kind of thing you prototype on a scratch instance and watch, not the thing you put under production load this quarter. More on that below.

The "just use Postgres" argument got real teeth

pg_durable isn't alone. DBOS embeds durable execution as a library backed by any Postgres instance. Restate, Inngest, and Hatchet are all crowding into the space Temporal opened. And at the minimalist end, Armin Ronacher's Absurd Workflows showed you can get surprisingly far with a single table and SELECT ... FOR UPDATE SKIP LOCKED, no framework at all.

That last one matters more than it looks. Before you adopt any engine, it's worth reading Absurd Workflows, because it shows you the floor: what a durable, resumable workflow actually requires when you strip the branding away. Once you've seen the one-table version, you can judge what each engine is really selling you on top of it.
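For a feel of what the one-table floor looks like, here is a rough sketch of the claim step built on SELECT ... FOR UPDATE SKIP LOCKED. It is not Absurd Workflows' actual schema or code: the `tasks` table, its `status` and `payload` columns, and the `claim_next_task` helper are hypothetical, and `conn` is assumed to be any DB-API connection to Postgres (psycopg, for example).

```python
# Hypothetical table: tasks(id, status, payload). Names are illustrative only.
CLAIM_SQL = """
UPDATE tasks
SET status = 'running'
WHERE id = (
    SELECT id
    FROM tasks
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED  -- skip rows another worker has already locked
)
RETURNING id, payload
"""


def claim_next_task(conn):
    """Atomically claim one pending task, or return None if none is available.

    `conn` is assumed to be a DB-API connection to a Postgres database.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()  # None when nothing is pending or all rows are locked
        conn.commit()         # make the claim durable before doing any work
        return row
    finally:
        cur.close()
```

Concurrent workers running this query each get a different row instead of blocking on one another. What the sketch leaves out (lease timeouts for workers that die after claiming, retries, step-level checkpoints) is a fair first approximation of what an engine adds on top.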

What changed in 2026 is the weight of evidence behind the database-first camp. DBOS shipped a Go SDK and, on April 7, announced a partnership with Databricks. For teams whose workflows are already Postgres-centric, the case for running a second distributed system just to get durability got noticeably weaker.

DBOS versus Temporal: not a tie, a fork in the road

These two get compared constantly, and the comparison usually goes wrong because people treat them as competitors at the same point on the curve. They aren't.

DBOS is a library. You import it, pass a Postgres connection string, and decorate functions as workflows or steps. State lives in plain Postgres tables. There's no cluster to deploy, because there's no cluster.

Temporal is a cluster. Frontend, History, and Matching services, a persistence store behind them, and a worker fleet you deploy separately. That architecture is the reason it scales the way it does, and it's also the reason it costs what it costs to run.

The scale numbers make the trade-off concrete. DBOS tops out at roughly a few thousand workflow state transitions per second before Postgres contention starts to bite, because the database is the ceiling. Temporal scales with cluster size into tens of thousands of state transitions per second and beyond. If your workflow volume lives comfortably under that DBOS ceiling (and most internal systems do), the cluster buys you headroom you won't use, paid for in operational surface you definitely will.

Exactly-once is two different promises

Here's the distinction that bites teams who skim the marketing, and the one I'd want every engineer on my platform team to be able to recite before they pick a tool.

DBOS gives you transactional exactly-once when your side effects write to the same Postgres that stores workflow state. The state update and the business write commit together or not at all. One transaction, one outcome. That's a genuinely strong guarantee, and it deletes a whole category of "did we already charge this card?" bugs by construction.
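The "commit together or not at all" property can be sketched in a few lines. This is not the DBOS API: the `workflow_steps` and `charges` tables and the `charge_step` function are hypothetical, and in-memory SQLite stands in for Postgres so the example is runnable as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres (assumption)
conn.execute(
    "CREATE TABLE workflow_steps ("
    " workflow_id TEXT, step TEXT, PRIMARY KEY (workflow_id, step))"
)
conn.execute("CREATE TABLE charges (order_id TEXT PRIMARY KEY, amount_cents INTEGER)")


def charge_step(conn, workflow_id, order_id, amount_cents, crash_before_commit=False):
    """Write the business row and the step-completion row in one transaction."""
    done = conn.execute(
        "SELECT 1 FROM workflow_steps WHERE workflow_id = ? AND step = 'charge'",
        (workflow_id,),
    ).fetchone()
    if done:
        return "skipped"  # step already committed: do not charge again
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO charges VALUES (?, ?)", (order_id, amount_cents))
        if crash_before_commit:
            raise RuntimeError("simulated crash")  # both writes are rolled back
        conn.execute("INSERT INTO workflow_steps VALUES (?, 'charge')", (workflow_id,))
    return "charged"


try:
    charge_step(conn, "wf-1", "order-42", 1999, crash_before_commit=True)
except RuntimeError:
    pass

rows_after_crash = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
first = charge_step(conn, "wf-1", "order-42", 1999)   # retry after the crash
second = charge_step(conn, "wf-1", "order-42", 1999)  # replay of a finished step
total_charges = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

The crash leaves no charge row behind, the retry writes exactly one, and a later replay is a no-op. The guarantee holds only because the charge lives in the same database as the step record. Swap the insert for a call to an external payment API and the transaction can no longer cover it.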

Temporal guarantees exactly-once replay of workflow decisions. Its activities, the steps that touch the outside world, are at-least-once by default. So every external API call still needs an idempotency key, because Temporal can and will re-run an activity it isn't sure completed.
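Here is a minimal sketch of what an idempotency key buys you under at-least-once execution. Nothing here is the Temporal SDK: `FakePaymentProvider`, `idempotency_key`, and `charge_activity` are hypothetical names, and the provider is an in-memory stand-in for an external API that deduplicates on a client-supplied key.

```python
import uuid


class FakePaymentProvider:
    """Hypothetical external API that deduplicates requests by idempotency key."""

    def __init__(self):
        self._receipts = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._receipts:
            # Repeat of a request already processed: return the original receipt.
            return self._receipts[idempotency_key]
        self.charges_made += 1
        receipt = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._receipts[idempotency_key] = receipt
        return receipt


def idempotency_key(workflow_id, step_name):
    """Derive the key from stable identifiers so every retry sends the same value."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{workflow_id}/{step_name}"))


def charge_activity(provider, workflow_id, amount_cents, attempts):
    """Simulate at-least-once execution by running the same call `attempts` times."""
    key = idempotency_key(workflow_id, "charge")
    receipt = None
    for _ in range(attempts):
        receipt = provider.charge(key, amount_cents)
    return receipt


provider = FakePaymentProvider()
receipt = charge_activity(provider, "wf-1", 1999, attempts=3)
```

Three executions produce one charge because the key is derived from the workflow ID and step name rather than generated fresh on each attempt. A random key per attempt would defeat the purpose.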

In practice, this is the line where people get burned. I've watched a team adopt a durable execution framework, assume "exactly-once" meant their downstream payment call would fire once, and ship it. It doesn't mean that unless the side effect shares a transaction with the state. The framework was doing exactly what it promised; the team had bought the wrong promise. The lesson is cheap if you learn it from a blog post and expensive if you learn it from a duplicate-charge incident: know whether your engine gives you transactional exactly-once or replay exactly-once, because they ask completely different things of you.

Why agents are forcing the issue

The pressure behind all of this isn't classic backend sagas, though those benefit too. It's agentic AI. An autonomous agent that chains dozens of LLM and tool calls, taking real actions on the web or inside internal systems, cannot afford to lose its place when a pod restarts or an LLM call times out. Lost state in an agent isn't a slow retry. It's a half-finished action with no record of where it stopped.

The DBOS-Databricks partnership was aimed straight at this. DBOS checkpoints agent workflows into Databricks' serverless Postgres (Lakebase) in real time, so an agent resumes automatically after a failure with no data loss and no duplicated actions. The integration claims this comes "with no extra infrastructure or coding changes required." Yutori, which builds autonomous web agents, already runs DBOS on Databricks-managed Postgres to keep always-on agents resilient.

Read past the press release and the real claim is this: for production agents, reliability is becoming the differentiator, not raw model capability. A slightly weaker model that never loses its place beats a smarter one that forgets what it was doing every time infrastructure hiccups. That reframes the build decision. The durability layer is an architecture choice you make before you ship the agent, not a reliability patch you bolt on after the first lost-state postmortem.

So what should you actually do

Strip out the news cycle and you're left with a few decisions you can make on Monday.

Default to Postgres-native durable execution for most services. If your workflows already sit behind a Postgres-centric service and you're a small team, a library like DBOS (or pg_durable once it's past preview) gets you crash-resilient, exactly-once workflows with no new distributed system to run, no new on-call surface, and the option to commit workflow state and business data in the same transaction. That same-transaction property is the single biggest practical win here. Exploit it where you can.

Reach for Temporal when the shape of your problem genuinely demands a cluster: multiple workflow-heavy services, heavy external API fan-out, multi-region or hard multi-tenant isolation, or real tens-of-thousands-per-second throughput. Those are good reasons. "It's the name I recognize" is not.

If you pick a model whose external steps are at-least-once, invest in idempotency keys on day one, not after the first duplicate side effect. And if your team is currently maintaining bespoke retry-and-resume logic (status columns, cron sweepers, dead-letter queues, hand-rolled idempotency checks), the 2026 tooling is mature and lightweight enough that migrating to a real durable primitive is likely a net reduction in code and incidents. That's the rare migration that makes your system smaller.

The honest counterpoint: none of this is free, and the database-first approach concentrates risk. When your workflow engine is your primary Postgres, a workflow-volume spike is now a database-load spike on the same instance serving your application queries. Temporal's separate cluster is operational weight, but it's also a blast radius boundary. Decide which failure mode you'd rather own before the extension makes the choice for you.