Provisioning Flow

This document walks through how host-system tenants and users come to exist inside shiftagent — and how they stop existing when the host removes them. It covers the one-time bootstrap, the per-tenant cold path, the per-request warm path, the concurrency model that makes all of it safe without any client-side coordination, and the recovery semantics that heal partial failures.

The design serves one governing constraint: the adapter stores nothing. Every mapping between host identities and shiftagent resources lives in the shiftagent database, keyed by external_id. Provisioning is therefore just-in-time and convergent — any request can trigger it, any step can be replayed, and two concurrent replays always converge on the same state.

Normative source: the OpenAPI specification at openapi/openapi.yaml defines every request and response shape referenced here. This document names operations by operationId and never re-specifies schemas. Adapter-side duties (caching policy, sweep cadence, failure handling) are specified in the Adapter Design Spec; this document covers the API-side flow they drive.

| Layer | Who runs it | When | What it creates |
| --- | --- | --- | --- |
| One-time bootstrap (§2) | Operator (or adapter, at install time) | Once per deployment | Credential registry entry + repository registry entry |
| Cold path (§4) | Adapter, automatically | First request ever from a given host tenant | Tenant, repository attachment, role, user |
| Warm path (§5) | Adapter, automatically | Every subsequent request | Nothing: refreshes enrichment fields, then proceeds to business calls |

The cold and warm paths are the same code path (§6). The adapter never asks “have I seen this tenant before?” — it has no memory to ask. It calls the upsert primitive and branches on the status code the API returns.

2. One-time bootstrap: credential and repository registry

Repositories are top-level, pre-authenticated registry entries — they are registered once, with their git credential, and then assigned to tenants down the cascade. The cold path never creates repositories; it only attaches ones that already exist. That separation is deliberate: credential material is handled once, at install time, by an operator — never on the hot provisioning path.

Bootstrap is two calls:

| Step | Operation | Method / path | Notes |
| --- | --- | --- | --- |
| 1 | createCredential | POST /credentials | Registers the git access token. The secret field is write-only: vaulted on arrival, never echoed by any response. Returns a crd_ handle. |
| 2 | registerRepository | POST /repositories | Registers the repository referencing the credential by credential_id. Kicks off an asynchronous skill sync. |
POST /credentials
Authorization: Bearer sk_int_...
Idempotency-Key: bootstrap-credential-git-main

{ "name": "git-main-token", "type": "git_pat", "secret": "<token>" }

POST /repositories
Authorization: Bearer sk_int_...
Idempotency-Key: bootstrap-repository-field-ops

{
  "name": "field-ops",
  "repo_url": "https://git.example.com/agent-skills/field-ops.git",
  "branch": "main",
  "provider": "generic",
  "credential_id": "crd_01hzx8gitmain"
}

Registration returns 201 with sync.state: "pending" or "syncing". The sync scans the repository for skills (pending → syncing → ready, or error with sync.error set). Poll getRepository (GET /repositories/{repository_id}) until ready, then verify the catalog with listRepositorySkills (GET /repositories/{repository_id}/skills). Later re-scans go through syncRepository (POST /repositories/{repository_id}/sync), or pass ?refresh=true to listRepositorySkills for a synchronous fresh read.
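The poll-until-ready step can be sketched as below. This is a minimal illustration, not the adapter's actual code: `wait_until_ready`, the `fetch_repository` callable (standing in for a getRepository call), the timeout, and the polling interval are all assumptions. Only the `sync.state` values and the `sync.error` field come from the text above.

```python
import time
from typing import Callable


def wait_until_ready(fetch_repository: Callable[[], dict],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll a repository until sync.state is 'ready'.

    fetch_repository is a hypothetical callable that performs one
    getRepository request and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        repo = fetch_repository()
        sync = repo["sync"]
        if sync["state"] == "ready":
            return repo
        if sync["state"] == "error":
            # sync.error is set by the API when the scan fails.
            raise RuntimeError(f"repository sync failed: {sync.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("repository sync did not reach 'ready' in time")
        sleep(interval_s)  # still 'pending' or 'syncing'
```

Injecting `sleep` keeps the loop testable without real delays; a production script would likely add backoff as well.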

Replay safety. Both bootstrap POSTs accept an Idempotency-Key header (§8), and both name fields are registry-unique: a crashed bootstrap that retries without its idempotency key gets a deterministic 409 name-conflict carrying conflicting_resource_id — fetch that resource and continue. A bootstrap script is therefore safe to re-run from the top at any time.

The adapter discovers the registry entry it should attach by configuration (DEFAULT_REPOSITORY_NAME), resolved once per process via listRepositories (GET /repositories?name=...) — see the Adapter Design Spec §4.2.

3. The provisioning primitive: PUT …/by-external-id/{external_id}

Two operations share identical semantics and together form the entire tenant/user provisioning surface:

  • upsertTenantByExternalId: PUT /tenants/by-external-id/{external_id}
  • upsertUserByExternalId: PUT /tenants/{tenant_id}/users/by-external-id/{external_id}

Both are idempotent get-or-create-or-refresh merge-upserts:

| Property | Semantics |
| --- | --- |
| 201 | The resource did not exist and was created. This is the cold-path trigger. |
| 200 | The resource existed; provided fields were merged. This is the warm path. |
| Merge rules | Per field: provided → replaced; omitted → unchanged; explicit null → cleared (nullable fields only). |
| Empty body | {} is valid: external-ID-only creation always succeeds (§9). |
| Races | Concurrent upserts of the same external ID collapse on a database uniqueness constraint: the winner gets 201, the loser gets 200 with the winner’s record. No 409 is possible on this path (§7.1). |
| No reactivation | Upserting an existing suspended tenant or user returns 200 with status: "suspended" and does not reactivate it. Reactivation is an explicit updateTenant / updateUser call, never a side effect of provisioning (§10.3). |
| No Idempotency-Key needed | PUT retries are inherently safe; the header exists for POSTs only. |
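The per-field merge rules in the table can be modelled in a few lines. This is a sketch of the semantics only, not the server implementation: `merge_upsert` and its `nullable` parameter are hypothetical names, and rejecting null on a non-nullable field is an assumption (the table only says that clearing applies to nullable fields).

```python
def merge_upsert(existing: dict, body: dict, nullable: frozenset) -> dict:
    """Return the record that results from merging body into existing.

    provided -> replaced, omitted -> unchanged, explicit None -> cleared.
    """
    merged = dict(existing)  # omitted fields carry over unchanged
    for field, value in body.items():
        if value is None:
            if field not in nullable:
                # Assumption: the real API would reject this request.
                raise ValueError(f"{field} is not nullable")
            merged[field] = None  # explicit null clears a nullable field
        else:
            merged[field] = value  # provided value replaces the whole field
    return merged
```

Note that replacement is per field and wholesale: a provided list such as `role_ids` overwrites the stored list rather than being appended to it.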

External IDs are opaque to shiftagent (byte-exact comparison after trim, max 255 chars, case-sensitive) and MUST be namespaced by the adapter at derivation — {ns}:tenant:{id} and {ns}:user:{id} (e.g. acme:tenant:128231, acme:user:9f27c1) — so multiple host systems or environments never collide. Canonicalization (e.g. lowercasing GUIDs) is the adapter’s job; shiftagent never normalizes.
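A derivation helper along these lines keeps the namespacing and the length limit in one place. It is a sketch under stated assumptions: `derive_external_id` and its `lowercase` flag are hypothetical, and whether lowercasing is the right canonicalization depends on the host's ID format. Only the `{ns}:tenant:{id}` / `{ns}:user:{id}` shape, the trim, and the 255-character limit come from the text.

```python
MAX_EXTERNAL_ID_LENGTH = 255


def derive_external_id(ns: str, kind: str, host_id: str,
                       lowercase: bool = False) -> str:
    """Build a namespaced external ID such as 'acme:tenant:128231'."""
    if kind not in ("tenant", "user"):
        raise ValueError(f"unknown kind: {kind!r}")
    canonical = host_id.strip()
    if lowercase:
        # Adapter-side canonicalization; shiftagent compares byte-exact.
        canonical = canonical.lower()
    if not ns or not canonical:
        raise ValueError("namespace and host ID must be non-empty")
    external_id = f"{ns}:{kind}:{canonical}"
    if len(external_id) > MAX_EXTERNAL_ID_LENGTH:
        raise ValueError("external ID exceeds 255 characters")
    return external_id
```

Because comparison is case-sensitive, the same canonicalization has to be applied on every request, or one host identity will map to two shiftagent records.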

4. Cold path — first request from a host tenant

Triggered when upsertTenantByExternalId returns 201. Four steps, in fixed order, so that at any failure point the visible state is a strict prefix (never a user without a tenant, never a role assignment without a role):

| Step | Operation | Method / path | Outcome |
| --- | --- | --- | --- |
| 1 | upsertTenantByExternalId | PUT /tenants/by-external-id/{external_id} | 201: tenant created (body may be {}) |
| 2 | attachTenantRepository | PUT /tenants/{tenant_id}/repositories/{repository_id} | 201: registry repository attached with { "is_default": true }, which also sets tenant.default_repository_id |
| 3 | createRole | POST /tenants/{tenant_id}/roles | 201: role created with skill_access picked from listRepositorySkills (or { "mode": "all" }) |
| 4 | upsertUserByExternalId | PUT /tenants/{tenant_id}/users/by-external-id/{external_id} | 201: user created with role_ids referencing step 3’s role; storage bucket auto-attached |

Concretely:

PUT /tenants/by-external-id/acme:tenant:128231
{ "name": "Acme Field Services" }
201 { "object": "tenant", "id": "tnt_01hzx8acme001", "default_repository_id": null, ... }

PUT /tenants/tnt_01hzx8acme001/repositories/rep_01hzx8fieldops
{ "is_default": true }
201 { "object": "repository_attachment", "is_default": true, ... }

POST /tenants/tnt_01hzx8acme001/roles
Idempotency-Key: prov-acme:tenant:128231-role-csr
{
  "name": "csr",
  "description": "Customer service representative",
  "skill_access": { "mode": "selected", "skill_ids": ["skl_01hzx8dispatch", "skl_01hzx8invoice"] }
}
201 { "object": "role", "id": "rol_01hzx8csr001", ... }

PUT /tenants/tnt_01hzx8acme001/users/by-external-id/acme:user:9f27c1
{ "email": "jane.doe@acme.example.com", "display_name": "Jane Doe", "role_ids": ["rol_01hzx8csr001"] }
201 { "object": "user", "id": "usr_01hzx8jane001",
      "storage": { "provider": "platform", "bucket_uri": "s3://..." }, ... }

From here the adapter proceeds to tokenExchange (POST /auth/token-exchange) for a per-user platform JWT and on to conversation operations — the cold path costs one extra round of calls on exactly one request per host tenant, and never again.

Role assignment alternatives. Passing role_ids on the creation PUT (shown above) and assigning via assignUserRole (PUT /users/{user_id}/roles/{role_id}) after a 201 are both correct on the cold path. The Adapter Design Spec (§4.3) recommends the dedicated assignUserRole route as standing discipline, because it removes any possibility of the warm path clobbering role assignments — see the footgun below.

sequenceDiagram
    autonumber
    participant H as Host system
    participant A as Adapter (stateless)
    participant S as shiftagent Integration API

    H->>A: Request (host JWT)
    A->>A: deriveIdentity() → external tenant + user IDs
    A->>S: PUT /tenants/by-external-id/{eid} (upsertTenantByExternalId)
    alt 201 — tenant created (COLD PATH)
        S-->>A: 201 Tenant (default_repository_id: null)
        A->>S: PUT /tenants/{tid}/repositories/{rid} {is_default: true} (attachTenantRepository)
        S-->>A: 201 RepositoryAttachment
        A->>S: POST /tenants/{tid}/roles (createRole)
        S-->>A: 201 Role
        A->>S: PUT /tenants/{tid}/users/by-external-id/{eid} {email, display_name, role_ids} (upsertUserByExternalId)
        S-->>A: 201 User (storage auto-attached)
    else 200 — tenant existed (WARM PATH)
        S-->>A: 200 Tenant (provided fields merged)
        A->>S: PUT /tenants/{tid}/users/by-external-id/{eid} {email, display_name} — role_ids OMITTED (upsertUserByExternalId)
        S-->>A: 200 User (role assignments untouched)
    end
    A->>S: POST /auth/token-exchange (tokenExchange)
    S-->>A: 200 short-lived platform JWT
    A->>S: Conversation operations (listConversations / createConversation / createMessage)
    S-->>A: Responses / NDJSON stream
    A-->>H: Response

5. Warm path

There is no separate warm-path implementation. The adapter runs the identical code; the API answers 200 instead of 201, and the tenant-bootstrap steps (attach repository, create role) simply don’t run because the 201 trigger never fired:

PUT /tenants/by-external-id/acme:tenant:128231
{ "name": "Acme Field Services" }
200 (tenant existed; name refreshed if it drifted)

PUT /tenants/tnt_01hzx8acme001/users/by-external-id/acme:user:9f27c1
{ "email": "jane.doe@acme.example.com", "display_name": "Jane Doe" }
200 (user existed; enrichment fields refreshed; role_ids untouched)

The two warm PUTs double as enrichment refresh: merge-upsert semantics let the latest name / email / display_name from the host flow into shiftagent on every provisioning pass without disturbing anything else. A new user appearing in an existing tenant is handled by the same path automatically — the tenant PUT returns 200 (skip bootstrap), the user PUT returns 201 (create the user, assign the default role).

⚠️ The role_ids footgun — read this twice.

upsertUserByExternalId uses merge semantics: omitted fields are left unchanged, provided fields are replaced — as a whole. role_ids is not additive; sending it replaces the user’s entire role set.

An adapter that naively re-sends the cold-path body (role_ids: ["rol_…"]) on every warm-path request will silently wipe every role an operator granted since — on every single request, forever, with 200 responses all the way. Nothing errors. The user just quietly loses access.

The rule: on the warm path, send only the fields you own — enrichment attributes (email, display_name). Never send role_ids, default_repository_id, storage, or metadata you did not set. Role membership changes go through the dedicated idempotent endpoints assignUserRole (PUT /users/{user_id}/roles/{role_id}) and unassignUserRole (DELETE same path), which touch exactly one assignment each.

The same rule applies to upsertTenantByExternalId: a warm-path body containing default_repository_id or metadata will overwrite operator changes just as silently.
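One way to make the rule hard to violate is to build the warm-path body from an allow-list instead of forwarding whatever the host profile contains. The sketch below assumes a hypothetical `warm_user_body` helper and a host profile represented as a plain dict. Dropping `None` values is an extra assumption: it prevents a host that omits a field from clearing it through an explicit null.

```python
# Fields the adapter owns on the user upsert, per the rule above.
WARM_USER_FIELDS = frozenset({"email", "display_name"})


def warm_user_body(host_profile: dict) -> dict:
    """Keep only adapter-owned enrichment fields that carry a value."""
    return {
        field: value
        for field, value in host_profile.items()
        if field in WARM_USER_FIELDS and value is not None
    }
```

With this in place, a profile that happens to include `role_ids` or `metadata` still produces a body that leaves operator-managed state alone.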

6. One code path: the convergent provisioning algorithm

Putting §4 and §5 together, the adapter’s entire provisioning logic is:

ensure_provisioned(external_tenant_id, external_user_id, enrichment):
    tenant = PUT /tenants/by-external-id/{external_tenant_id}        # body: enrichment we own
    if tenant.status_code == 201:                                    # cold path
        bootstrap_tenant(tenant)                                     # attach repo, ensure role
    user = PUT /tenants/{tenant.id}/users/by-external-id/{external_user_id}
                                                                     # body: email, display_name ONLY
    if user.status_code == 201:
        PUT /users/{user.id}/roles/{default_role.id}                 # idempotent assign

bootstrap_tenant(tenant):
    PUT /tenants/{tenant.id}/repositories/{repo.id} {is_default: true}   # idempotent
    POST /tenants/{tenant.id}/roles {name: DEFAULT_ROLE_NAME, ...}
        on 409 name-conflict: GET /roles/{conflicting_resource_id}       # adopt, continue

Every step is idempotent or conflict-recoverable, and individually retryable. There is no transaction and no rollback; a partially provisioned tenant is not an error state, just a state the next pass converges from.
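To see the convergence property in action, the algorithm can be run against an in-memory stand-in for the API. Everything below is illustrative: `FakeApi` and its method names are hypothetical and model only the status codes discussed in this document, with no HTTP, authentication, or validation. Looking up the default role with `find_role` before falling back to the bootstrap chain is one possible reading of the flow, not the only one.

```python
import itertools


class FakeApi:
    """In-memory stand-in for the Integration API (illustration only)."""

    def __init__(self):
        self.tenants, self.users, self.roles = {}, {}, {}
        self.users_by_id, self.roles_by_id = {}, {}
        self.attached = set()
        self._seq = itertools.count(1)

    def upsert_tenant(self, external_id, body):
        status = 200
        if external_id not in self.tenants:
            self.tenants[external_id] = {"id": f"tnt_{next(self._seq)}"}
            status = 201
        self.tenants[external_id].update(body)  # merge provided fields
        return status, self.tenants[external_id]

    def attach_repository(self, tenant_id, repository_id):
        self.attached.add((tenant_id, repository_id))  # idempotent

    def create_role(self, tenant_id, name):
        existing = self.roles.get((tenant_id, name))
        if existing:
            return 409, {"conflicting_resource_id": existing["id"]}
        role = {"id": f"rol_{next(self._seq)}", "name": name}
        self.roles[(tenant_id, name)] = self.roles_by_id[role["id"]] = role
        return 201, role

    def get_role(self, role_id):
        return self.roles_by_id[role_id]

    def find_role(self, tenant_id, name):
        return self.roles.get((tenant_id, name))

    def upsert_user(self, tenant_id, external_id, body):
        key, status = (tenant_id, external_id), 200
        if key not in self.users:
            user = {"id": f"usr_{next(self._seq)}", "role_ids": []}
            self.users[key] = self.users_by_id[user["id"]] = user
            status = 201
        self.users[key].update(body)
        return status, self.users[key]

    def assign_role(self, user_id, role_id):
        role_ids = self.users_by_id[user_id]["role_ids"]
        if role_id not in role_ids:  # idempotent: one assignment at most
            role_ids.append(role_id)


def bootstrap_tenant(api, tenant_id, repo_id, role_name):
    api.attach_repository(tenant_id, repo_id)
    status, body = api.create_role(tenant_id, role_name)
    if status == 409:  # name-conflict: adopt the existing role
        return api.get_role(body["conflicting_resource_id"])
    return body


def ensure_provisioned(api, repo_id, role_name, ext_tenant, ext_user,
                       tenant_fields, user_fields):
    status, tenant = api.upsert_tenant(ext_tenant, tenant_fields)
    if status == 201:  # cold path
        bootstrap_tenant(api, tenant["id"], repo_id, role_name)
    status, user = api.upsert_user(tenant["id"], ext_user, user_fields)
    if status == 201:  # new user: give it the default role
        role = (api.find_role(tenant["id"], role_name)
                or bootstrap_tenant(api, tenant["id"], repo_id, role_name))
        api.assign_role(user["id"], role["id"])
    return tenant, user
```

Calling `ensure_provisioned` twice with the same external IDs yields one tenant, one role, and one user, and the second call leaves any roles granted in between untouched because `user_fields` never carries `role_ids`.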

Re-entrancy triggers. Besides the 201-cold trigger, treat downstream signals of an incomplete bootstrap — 422 role-required on createConversation, a listRoles ?name= miss for the default role, or a user upsert response with empty role_ids — as a cue to re-run bootstrap_tenant from the top. Because every step is idempotent (PUT attach, 409-recoverable role create, PUT role assignment), replaying the chain is always safe. This is what heals half-completed cold paths: the crashed-winner scenario in §7.3 resolves without any adapter remembering anything.

7. Concurrency model

Multiple adapter replicas (or multiple in-flight requests on one replica) can race to provision the same tenant or user. The design resolves every race server-side: there is no client-side coordination, locking, or leader election anywhere in this flow.

7.1 Upsert races: the database is the lock

PUT …/by-external-id/{external_id} is backed by a uniqueness constraint (tenants.external_id globally; users(tenant_id, external_id) per tenant). Concurrent upserts of the same external ID collapse on that constraint:

  • exactly one caller creates the resource and receives 201;
  • every other caller receives 200 with the winner’s record — indistinguishable from an ordinary warm path;
  • neither errors. There is no 409 on this path, by design: a get-or-create client with no state has nothing useful to do with a conflict error, so the API never raises one.

The status code split is not just informational — it is the election mechanism: the 201 winner is thereby elected to run the tenant bootstrap, and every 200 loser skips it. One create, one bootstrap, zero coordination.
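The election-by-status-code behaviour can be demonstrated with a small thread-based simulation. This is a toy model, not the server: `TenantTable` is a hypothetical class, and its `threading.Lock` merely stands in for the atomicity that the database uniqueness constraint provides.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TenantTable:
    """Toy table whose upsert is atomic per call."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for the unique index

    def upsert(self, external_id):
        with self._lock:
            if external_id in self._rows:
                return 200, self._rows[external_id]  # loser: winner's record
            row = {"id": f"tnt_{len(self._rows) + 1}", "external_id": external_id}
            self._rows[external_id] = row
            return 201, row  # winner: elected to run the bootstrap


def race(table, external_id, callers=8):
    """Fire concurrent upserts for one external ID and collect the results."""
    with ThreadPoolExecutor(max_workers=callers) as pool:
        return list(pool.map(lambda _: table.upsert(external_id), range(callers)))
```

However the threads interleave, exactly one caller observes 201 and all callers observe the same record, which is the property the bootstrap election relies on.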

7.2 Named sub-resource races: 409 + fetch-and-continue

POST-created resources with unique names — roles (createRole, unique per tenant), repositories (registerRepository), skills (createRepositorySkill), credentials (createCredential, all registry-unique) — race differently: the loser gets a deterministic 409 name-conflict whose problem body carries conflicting_resource_id, the ID of the resource that won.

Recovery is mechanical: fetch the conflicting resource by ID and continue as if you had created it.

POST /tenants/tnt_01hzx8acme001/roles
{ "name": "csr", ... }
409 application/problem+json
{ "type": ".../problems/name-conflict", "status": 409,
  "conflicting_resource_id": "rol_01hzx8csr001", ... }

GET /roles/rol_01hzx8csr001    # adopt the winner's role
200 { "object": "role", "id": "rol_01hzx8csr001", ... }

If the conflicting ID is ever lost (e.g. recovering out-of-band, after a crash between the 409 and the fetch), listRoles (GET /tenants/{tenant_id}/roles?name=csr) and listRepositories (GET /repositories?name=field-ops) provide exact-match name filters for the same lookup.

This is why role names being unique per tenant is load-bearing: uniqueness is what makes the 409 deterministic and the recovery unambiguous. The conflict is not a failure — it is the API telling a replayed or racing cold path “this step is already done, here’s the result.”
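The fetch-and-continue recovery fits in one small helper. The sketch below is transport-agnostic and hypothetical: `create_or_adopt`, `create`, and `fetch` are illustrative names, with `create` standing for any POST that returns a status code plus decoded body, and `fetch` for the matching get-by-ID call. Matching on the suffix of the problem `type` is an assumption based on the example body above.

```python
from typing import Callable, Tuple


def create_or_adopt(create: Callable[[], Tuple[int, dict]],
                    fetch: Callable[[str], dict]) -> dict:
    """Create a named resource, or adopt the one that already holds the name."""
    status, body = create()
    if status == 201:
        return body
    is_name_conflict = (
        status == 409
        and str(body.get("type", "")).endswith("name-conflict")
        and "conflicting_resource_id" in body
    )
    if is_name_conflict:
        # Not a failure: the step is already done, so use its result.
        return fetch(body["conflicting_resource_id"])
    # Any other outcome (including idempotency-key-conflict) is not
    # recoverable by adoption.
    raise RuntimeError(f"unexpected response: {status} {body.get('type')}")
```

The same helper works for roles, repositories, credentials, and skills, since all four share the 409 contract.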

7.3 The interleaving edge: loser outruns winner

The one genuinely subtle race: replica A wins the tenant upsert (201) and starts the bootstrap; replica B loses (200), skips the bootstrap, and races ahead to its user upsert before A has created the role. B’s user is created with role_ids: [] (the spec explicitly tolerates this), and B’s createConversation then fails 422 role-required — which is exactly the §6 re-entrancy trigger. B re-runs the bootstrap chain itself, collides with A’s (by now created) role via 409 name-conflict, adopts it, assigns it, and proceeds. The same recovery handles the worse variant where A crashed mid-bootstrap and nobody finished it.

sequenceDiagram
    autonumber
    participant A as Adapter replica A
    participant B as Adapter replica B
    participant S as shiftagent Integration API

    par Both replicas provision the same host tenant
        A->>S: PUT /tenants/by-external-id/acme:tenant:128231
    and
        B->>S: PUT /tenants/by-external-id/acme:tenant:128231
    end
    Note over S: Unique index on external_id —<br/>the database is the lock
    S-->>A: 201 Tenant (winner — elected to bootstrap)
    S-->>B: 200 Tenant (loser — same record, skips bootstrap)

    A->>S: PUT .../repositories/{rid} {is_default: true} (attachTenantRepository)
    S-->>A: 201 RepositoryAttachment

    B->>S: PUT .../users/by-external-id/acme:user:9f27c1 (upsertUserByExternalId)
    S-->>B: 201 User (role_ids: [] — role does not exist yet, tolerated)
    B->>S: POST /conversations (createConversation)
    S-->>B: 422 role-required (bootstrap incomplete)

    Note over B: Re-entrancy trigger —<br/>re-run the bootstrap chain (all steps replay-safe)
    B->>S: PUT .../repositories/{rid} {is_default: true}
    S-->>B: 200 (already attached — idempotent no-op)

    A->>S: POST .../roles {name: "csr", ...} (createRole)
    S-->>A: 201 Role rol_01hzx8csr001
    B->>S: POST .../roles {name: "csr", ...} (createRole)
    S-->>B: 409 name-conflict {conflicting_resource_id: rol_01hzx8csr001}
    B->>S: GET /roles/rol_01hzx8csr001 (getRole)
    S-->>B: 200 Role — adopted, continue as if created

    B->>S: PUT /users/{uid}/roles/rol_01hzx8csr001 (assignUserRole)
    S-->>B: 204 (idempotent)
    B->>S: POST /conversations (createConversation)
    S-->>B: 201 Conversation — converged

Every arrow in the recovery branch is idempotent or conflict-recoverable, so it does not matter how many replicas run it, in what order, or how many times.

Mock-server note: this exact interleaving is scripted as a validation scenario (examples/05-race-double-provision), asserting exactly one tenant creation and full convergence for both racers. Because the mock runs on a single-threaded runtime, the scenario exercises async interleaving rather than true parallelism — the semantics under test are identical.

8. Idempotency keys

PUT upserts and DELETEs are idempotent by construction. Every POST operation additionally accepts an optional Idempotency-Key header (any unique string, max 255 chars) for network-level retry safety:

  • Responses are cached 24 hours per (key principal, operation, idempotency key).
  • A replay returns the original status and body, flagged with the response header Idempotency-Replayed: true.
  • The same key with a different request payload responds 409 idempotency-key-conflict — a hard client bug signal, never retried.

Recommended key discipline:

| POST | Key shape | Rationale |
| --- | --- | --- |
| Provisioning steps (createRole, registerRepository, createCredential, createRepositorySkill) | Deterministic: derived from the step and natural key, e.g. prov-{external_tenant_id}-role-{name} | Any replica retrying the same logical step replays the original 201 instead of racing to a 409 |
| User-action POSTs (createConversation, createMessage) | Random UUID per user action | Each action is distinct; the key protects only against transport-level retries of that action |

The two safety nets are complementary, not redundant: Idempotency-Key makes pure retries (same caller, same key) exact replays, while the 409 + conflicting_resource_id contract (§7.2) handles independent callers (different replicas, or a caller that lost its key in a crash) converging on the same named resource. A robust adapter uses both.
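The two key shapes from the table can be produced by helpers like these. The function names are hypothetical; the deterministic format follows the `prov-{external_tenant_id}-role-{name}` example, generalized with a `step` parameter as an assumption.

```python
import uuid

MAX_KEY_LENGTH = 255


def provisioning_key(external_tenant_id: str, step: str, name: str) -> str:
    """Deterministic key: every replica derives the same value for a step."""
    key = f"prov-{external_tenant_id}-{step}-{name}"
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError("Idempotency-Key exceeds 255 characters")
    return key


def user_action_key() -> str:
    """Random key: generated once per user action, reused on its retries."""
    return str(uuid.uuid4())
```

The random key only helps if the caller generates it before the first attempt and reuses it for every retry of that attempt; generating a fresh UUID per retry defeats the purpose.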

9. Bare-minimum provisioning and later enrichment

Both upsert primitives accept an empty body — creation requires nothing but the external ID in the path:

PUT /tenants/by-external-id/acme:tenant:128231
{}
201 (tenant exists; name null, no repository, default settings)

PUT /tenants/tnt_01hzx8acme001/users/by-external-id/acme:user:9f27c1
{}
201 (user exists; storage bucket still auto-attached)

This is a guarantee, not a degraded mode: provisioning cannot fail for lack of profile data. A host JWT that yields only identifiers still provisions successfully; whatever enrichment the host can supply arrives whenever it arrives:

  • Merge-upsert enrichment: later PUTs with name / email / display_name fill fields in place (200; provided fields are replaced, omitted fields stay unchanged). This happens organically on every warm pass.
  • Explicit updates: updateTenant (PATCH /tenants/{tenant_id}) and updateUser (PATCH /users/{user_id}) cover fields the upsert body deliberately excludes: status changes, default_repository_id, linking a host-owned storage bucket over the auto-attached platform one.

Everything a user needs to function is attached server-side regardless of body content: the storage bucket on creation, the tenant’s default repository and role via the bootstrap chain. The only hard prerequisite for conversations is at least one role (§6’s role-required trigger).

10. Lifecycle reconciliation — how things stop existing

Provisioning is lazy; deprovisioning cannot be — a tenant offboarded from the host must stop working in shiftagent without anyone remembering to clean up. Three mechanisms compose (the adapter implements all three; full duties, cadences, guardrails, and the suspend-vs-delete policy are specified in the Adapter Design Spec §5):

| Mechanism | Latency | Role |
| --- | --- | --- |
| Periodic sweep | Hours (cadence-bound) | Safety net: catches everything, eventually |
| Host webhook push | Seconds | Optimization: immediate, but delivery is lossy |
| Lazy enforcement | Next request | Backstop: revoked identities cannot obtain tokens |

The list endpoints return external_id on every item, so a stateless diff needs nothing else:

  1. Page listTenants (GET /tenants) collecting tenant external_ids; per tenant, page listTenantUsers (GET /tenants/{tenant_id}/users) collecting user external_ids. (Cross-tenant listUsers, GET /users?tenant_id=, is an equivalent alternative.)
  2. Diff against the host system’s live directory.
  3. Deprovision what shiftagent has and the host no longer does:
    • tenant gone → deleteTenantByExternalId (DELETE /tenants/by-external-id/{external_id}) — one call cascades: archives conversations, deactivates users, detaches repositories, destroys vaulted conversation secrets, soft-deletes the tenant; or suspend first via updateTenant (status: "suspended") per policy
    • user gone → deactivateUser (DELETE /users/{user_id}) — soft; conversations and audit trails are preserved, access is cut

All deprovision calls are idempotent in effect, so webhook-triggered and sweep-triggered deprovisioning may overlap freely — the webhook path maps host lifecycle events onto exactly the same calls.
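The diff step of the sweep reduces to set arithmetic over external IDs. The sketch below assumes both directories have already been paged into memory as a mapping from tenant external ID to a set of user external IDs; `plan_sweep` and the `Directory` alias are hypothetical names, and the grace windows and guardrails from the Adapter Design Spec are deliberately left out.

```python
from typing import Dict, List, Set, Tuple

# tenant external_id -> set of user external_ids
Directory = Dict[str, Set[str]]


def plan_sweep(shiftagent: Directory,
               host: Directory) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (tenants to deprovision, (tenant, user) pairs to deactivate)."""
    tenants_gone = sorted(set(shiftagent) - set(host))
    users_gone = sorted(
        (tenant, user)
        for tenant in set(shiftagent) & set(host)  # surviving tenants only
        for user in shiftagent[tenant] - host[tenant]
    )
    return tenants_gone, users_gone
```

Users of a removed tenant are not listed individually, because the tenant delete cascades to them. Tenants that exist only on the host side are ignored as well: just-in-time provisioning creates them on their first request.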

A deprovisioned tenant is gone: a second deleteTenantByExternalId returns 404, and a later upsertTenantByExternalId for the same external ID creates a new tenant with a new tnt_ ID — it does not resurrect the old record, its conversations, or its secrets. Offboarding is final by design; anything worth keeping must be exported before the sweep’s grace window closes.

The invariant that keeps lazy JIT provisioning from undoing an offboarding:

  • Absent — no record for the external ID → the upsert creates one. JIT applies.
  • Deactivated/suspended — a record exists with status: "suspended" → the upsert returns 200 with that status unchanged. It never reactivates. tokenExchange (POST /auth/token-exchange) fails 403 for suspended users, and getUserByExternalId (GET /tenants/{tenant_id}/users/by-external-id/{external_id}) makes the distinction visible by returning the record with its status rather than 404.

An adapter must treat a suspended status as terminal for that identity until an operator (or a host lifecycle event) explicitly reactivates it — re-provisioning around it would silently undo a revocation.
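In adapter code, this invariant amounts to checking the returned status before branching on the status code. The sketch below uses a hypothetical `next_action` function with illustrative return labels; only the 201/200 split and the `status: "suspended"` field come from this document.

```python
def next_action(status_code: int, record: dict) -> str:
    """Decide what to do with an upsert response.

    Returns 'halt' for a suspended identity, 'bootstrap' after a create,
    and 'proceed' for an ordinary warm-path hit.
    """
    if record.get("status") == "suspended":
        # Terminal until explicitly reactivated; never provision around it.
        return "halt"
    return "bootstrap" if status_code == 201 else "proceed"
```

Checking suspension first matters: a suspended record arrives with an ordinary 200, so branching on the status code alone would treat a revoked identity as a healthy warm-path hit.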

RFC 9457 application/problem+json throughout; every problem carries request_id. The subset relevant to this flow:

| Problem type | Status | Raised by | Recovery |
| --- | --- | --- | --- |
| validation-error | 422 | Any malformed body/params | Fix the request; not retryable as-is |
| name-conflict | 409 | createRole, registerRepository, createCredential, createRepositorySkill | Fetch conflicting_resource_id, continue (§7.2) |
| external-id-conflict | 409 | createTenant (plain create) with a taken external_id | Use the upsert primitive instead, or adopt conflicting_resource_id |
| role-required | 422 | createConversation for a user with no/multiple roles and no explicit role_id | Re-run the bootstrap chain (§6) or pass role_id |
| resource-in-use | 409 | Guarded deletes (deleteRepository, detachTenantRepository, deleteRole, deleteCredential) | Reassign or detach dependents first |
| idempotency-key-conflict | 409 | Any POST reusing a key with a different payload | Client bug: fix key derivation, do not retry |
| tenant-suspended | 403 | Conversation/message writes on a suspended tenant | Expected post-offboarding state; do not re-provision |
| not-found | 404 | Lookups (getTenantByExternalId, getUserByExternalId) for absent resources | On the provisioning path, absence is what the upsert primitive handles; prefer PUT |
  • Adapter Design Spec — the adapter’s side of everything here: identity derivation, cache policy, the request-lifecycle state machine, reconciliation duties, failure modes.
  • Integration Guide — the architecture frame, auth model, and external-ID conventions this flow builds on.
  • Streaming Contract — what happens after provisioning: the NDJSON conversation stream the cold and warm paths both lead into.
  • openapi/openapi.yaml — the normative spec for every operation referenced above.