`tare upgrade`

Upgrade an installed TARS dataplane to the version embedded in the current tare binary, with zero downtime when the data plane is HA.

Synopsis

tare upgrade <identity-file> [flags]

Description

tare upgrade performs the operator workflow for moving an existing TARS dataplane release to a newer version. Unlike tare install, it does not perform a fresh install and does not re-prompt for configuration the operator already supplied at install time. Instead, it reads the live release's Helm values via helm get values and carries forward the operator-supplied config.

Phases:

Preflight (Helm 3+, kubectl, cluster connectivity; warns when metrics-server is absent and HA is on)
Discover existing release in the target namespace (refuses to run if none is found: that path is tare install)
Read existing release values and carry forward registry, customer, environment, deployment mode, serve URL, image pull secret name, and the OTel collector enable + endpoint + auth-header config
HA preflight: read envoyHpa.minReplicas from the EnvoyProxy resource; refuse to upgrade if it is < 2 unless --allow-downtime is set
Apply CRDs from the embedded chart (Helm's crds/ directory is install-only, so an upgrade has to upsert CRDs explicitly)
helm upgrade --install --atomic --wait against the embedded chart

helm and kubectl are downloaded automatically on first run into ~/.tare/tools/ and reused on subsequent runs.

The new tare binary's embedded chart supplies the new chart defaults, notably the HA-by-default flip and the EnvoyProxy.spec.shutdown block. Existing operator-supplied values override chart defaults where they overlap.

Image sync is out of scope. tare upgrade does not pull new images into your registry. Sync first, then upgrade:
tare install identity.json --image-sync <REG> --sync-only
tare upgrade identity.json

Usage

Standard upgrade

After downloading a newer tare binary and placing it in ~/.tare/bin/:

tare upgrade identity.json

tare upgrade reads the existing release values, carries them forward, applies CRD changes from the new chart, and runs helm upgrade --install --atomic --wait. A failed rollout auto-rolls back to the previous revision.

Single-replica install (lab / non-production)

If the existing install is single-replica (the pre-ADR-041 default, or tare install --ha=false), tare upgrade refuses to proceed because rolling the single data-plane Envoy pod would drop in-flight LLM streams:

data-plane Envoy is single-replica (envoyHpa.minReplicas=1).
Upgrading now will RST in-flight LLM streams.

Fix HA first (idempotent, takes ~1m):
  tare install --ha identity.json

Or force this upgrade with downtime:
  tare upgrade identity.json --allow-downtime

The recommended path is to migrate to HA first via tare install --ha (which is idempotent: it calls helm upgrade --install and bumps the HPA minReplicas from 1 to 2). That migration step itself causes a brief one-time disruption while the second pod comes up, but every subsequent tare upgrade is zero-downtime.

Override carried-forward values

CLI flags take precedence over carryover. Override only what you need to change, e.g., to move to a new image registry as part of the upgrade:

# 1. Sync to the new registry
tare install identity.json --image-sync newregistry.example.com --sync-only

# 2. Upgrade using the new registry
tare upgrade identity.json --image-registry newregistry.example.com

Disable atomic rollback

When debugging a stuck release, run without --atomic to leave the half-deployed release in place:

tare upgrade identity.json --no-atomic

The default --atomic is the safer behavior for production. Use this flag only when you need to inspect a failed rollout.

Flags

Main

Flag	Default	Description
`--allow-downtime`	`false`	Proceed even when the data plane is single-replica. In-flight LLM streams will be dropped.
`--ha`	`true`	HA-safe defaults for the data-plane Envoy proxy (HPA min 2, PDB min 1). Pass `--ha=false` to keep single-replica.
`--drain-timeout-seconds`	`300`	`EnvoyProxy.spec.shutdown.drainTimeout` (seconds). Maximum time Envoy waits for in-flight requests (long LLM streams) to finish before SIGKILL.
`--no-atomic`	`false`	Disable `helm --atomic` (no auto-rollback on failure). Leaves stuck releases in place for debugging.
`--timeout`	`10m`	`helm upgrade --wait` timeout. Should exceed `drainTimeout × replicas` to allow serial drain.
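
As a rough worked example of the `--timeout` guidance above, the sketch below computes the serial-drain lower bound `drainTimeout × replicas`. The function name `min_helm_timeout` is hypothetical and not part of tare; the replica count is whatever your data plane actually runs.

```shell
# Hypothetical helper (not part of tare): lower bound, in seconds, for
# --timeout under the "drainTimeout x replicas" rule from the flag table.
# $1: drain timeout in seconds, $2: number of data-plane replicas
min_helm_timeout() {
  echo $(( $1 * $2 ))
}

# Defaults from the table: 300s drain, 2 replicas (the HA minimum)
min_helm_timeout 300 2   # prints 600
```

With the defaults, the bound is 600 seconds, which equals the default `--timeout` of `10m` rather than exceeding it. If your streams routinely use the full drain window, a larger value such as `tare upgrade identity.json --timeout 15m` leaves some headroom.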

Telemetry

Flag	Default	Description
`--enable-otel-collector`	carry-forward	Override the OTel collector enable state. Default carries the existing release's value.
`--otel-collector-endpoint <URL>`	carry-forward	Override OTel OTLP endpoint.
`--otel-exporter-auth-headers <H>`	carry-forward	Override OTel Authorization header value.

Advanced / hidden

Hidden from --help but accepted for advanced use. Flags marked carry-forward default to the existing release's value; the others use the defaults listed below.

Flag	Default	Description
`--release-name`	`tars`	Helm release name.
`--system-namespace`	`tars-system`	Kubernetes system namespace.
`--dataplane-namespace`	`tars-dataplane`	Kubernetes dataplane namespace.
`--chart-path`	embedded	Path or OCI/HTTP reference to chart.
`--chart-version`	none	Chart version (for remote/OCI charts).
`--image-registry`	carry-forward	Image registry override.
`--image-tag`	embedded manifest	Image tag override.
`--deployment-mode`	carry-forward	`enterprise` or `saas`.
`--customer`	carry-forward	Customer label override.
`--environment`	carry-forward	Environment label override.
`--serve-url`	carry-forward	Serve URL override.
`--redis-type`	`cluster`	`cluster` or `hosted`.
`--no-rate-limiting`	`false`	Disable rate limiting.
`--skip-preflight`	`false`	Skip Helm/kubectl/cluster preflight (also `TARE_SKIP_PREFLIGHT=1`).

Carryover behavior

tare upgrade reads the existing release's user-supplied values (helm get values returns only what was passed via -f or --set at install time, not chart defaults). The following fields are carried forward into the regenerated values:

Helm values path	CLI flag override
`global.imageRegistry`	`--image-registry`
`global.customer`	`--customer`
`global.environment`	`--environment`
`global.deploymentMode`	`--deployment-mode`
`global.serveUrl`	`--serve-url`
`global.imagePullSecretName`	(no override; lives in operator-supplied pull secret)
`otelCollector.enabled`	`--enable-otel-collector`
`otelCollector.exporters.otlp.endpoint`	`--otel-collector-endpoint`
`otelCollector.exporters.otlp.headers.Authorization`	`--otel-exporter-auth-headers`

Fields not in the carryover list fall back to embedded chart defaults (from the new tare binary). Pass the corresponding flag to override.
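
The precedence described in this section (CLI flag, then carried-forward value, then embedded chart default) can be sketched as a small shell function. This is an illustration only, not tare's implementation: `resolve_value` is a hypothetical name, `registry.example.com` stands in for a chart default, and an empty string is treated as "not set" for simplicity.

```shell
# Hypothetical helper (not tare's code): pick the effective value.
# Precedence: CLI flag > carried-forward value > embedded chart default.
# Simplification: an empty string counts as "not set".
# $1: CLI flag value, $2: carried-forward value, $3: chart default
resolve_value() {
  if [ -n "$1" ]; then
    echo "$1"
  elif [ -n "$2" ]; then
    echo "$2"
  else
    echo "$3"
  fi
}

# No flag passed: the carried-forward registry wins over the chart default.
resolve_value "" "gcr.io/acme/tars" "registry.example.com"
# prints gcr.io/acme/tars

# Flag passed: it overrides the carried-forward value.
resolve_value "newregistry.example.com" "gcr.io/acme/tars" "registry.example.com"
# prints newregistry.example.com
```

To see the user-supplied values that carryover starts from, `helm get values tars -n tars-system` prints them for the default release name and namespace; adding `--all` also includes the computed chart defaults.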

Single-replica preflight

tare upgrade reads spec.provider.kubernetes.envoyHpa.minReplicas from the EnvoyProxy resource in the system namespace. The check has three outcomes:

minReplicas ≥ 2: proceed.
minReplicas < 2 and --allow-downtime not set: abort with the migration message.
EnvoyProxy resource not found, or kubectl call fails: print a warning and skip the preflight (operator's responsibility to verify HA).

The preflight does not currently inspect live Deployment.spec.replicas because the EnvoyProxy resource's HPA configuration is the source of truth for how the data plane scales over time. Inspecting both would tighten the check; that is a future enhancement.
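
The preflight outcomes can be sketched as a decision function. This is a hedged illustration, not tare's actual code: `ha_preflight` and its return strings are hypothetical, and the `kubectl` command in the comment assumes a single EnvoyProxy resource in the default `tars-system` namespace.

```shell
# Hypothetical sketch of the preflight decision (not tare's code).
# $1: minReplicas read from the EnvoyProxy resource, or "" if the read failed
# $2: "true" if --allow-downtime was passed
#
# One way to read the value by hand (assumes exactly one EnvoyProxy resource):
#   kubectl get envoyproxy -n tars-system \
#     -o jsonpath='{.items[0].spec.provider.kubernetes.envoyHpa.minReplicas}'
ha_preflight() {
  if [ -z "$1" ]; then
    echo "skip"                   # resource missing or kubectl failed: warn only
  elif [ "$1" -ge 2 ]; then
    echo "proceed"                # HA: rolling the pods is safe
  elif [ "$2" = "true" ]; then
    echo "proceed-with-downtime"  # operator accepted dropped streams
  else
    echo "abort"                  # print the migration message and stop
  fi
}

ha_preflight 1 false   # prints abort
```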

Atomic rollback

The default helm upgrade --atomic makes the upgrade transactional: if the new pods do not all pass their readiness probes within --timeout, Helm rolls the release back to the previous revision and exits non-zero. The data plane returns to its previous state without operator intervention.

--no-atomic disables this, which is useful for inspecting a stuck release mid-rollout but can leave traffic affected until the release is fixed or rolled back manually.
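
In automation, the exit status is the simplest signal that a rollback happened. The wrapper below is a minimal sketch, assuming tare propagates Helm's non-zero exit status; `run_upgrade` is a hypothetical name, and the command is passed as arguments so the function can be tried without a cluster.

```shell
# Hypothetical wrapper (not part of tare): run an upgrade command and report
# the outcome from its exit status.
run_upgrade() {
  if "$@"; then
    echo "upgrade succeeded"
  else
    status=$?
    echo "upgrade failed (exit $status)"
    return "$status"
  fi
}

# Real use, assuming the default release name and namespace:
#   run_upgrade tare upgrade identity.json || helm history tars -n tars-system
```

After a failed atomic upgrade, `helm history` should show both the failed revision and the rollback revision that followed it.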

Output

tare upgrade writes phase-prefixed status lines ([i/N] <phase>) to stderr. Final lines include the rollback hint and inspection commands:

[1/5] Preflight (Helm / kubectl / cluster)
[2/5] Discover existing release
  release tars found in tars-system
  carried forward: registry="gcr.io/acme/tars" customer="acme" env="" otel=true
[3/5] HA preflight
  EnvoyProxy minReplicas=2 (HA)
   Using embedded chart
[4/5] Apply / update CRDs
  3 created, 12 updated, 4 skipped
[5/5] Helm upgrade
...
Upgrade completed.

Rollback (if needed):
  helm rollback tars -n tars-system
Inspect:
  kubectl get pods -n tars-system
  kubectl get pods -n tars-dataplane
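
Because the status lines go to stderr with a `[i/N]` prefix, a captured log can be scanned for the last phase reached. The helper below is a sketch under that assumption; `last_phase` is a hypothetical name and `upgrade.log` is an example file name.

```shell
# Hypothetical helper (not part of tare): print the last "[i/N] <phase>"
# header found on stdin. Indented detail lines are ignored.
last_phase() {
  grep -E '^\[[0-9]+/[0-9]+\] ' | tail -n 1
}

# Example with a fragment of the sample output above:
printf '%s\n' \
  '[1/5] Preflight (Helm / kubectl / cluster)' \
  '[2/5] Discover existing release' \
  '  release tars found in tars-system' \
  '[3/5] HA preflight' | last_phase
# prints [3/5] HA preflight
```

In practice, capture stderr first with `tare upgrade identity.json 2> upgrade.log`, then run `last_phase < upgrade.log` to see where a failed run stopped.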

Verification

# Pods in both namespaces should be Running with the new image tag
kubectl get pods -n tars-system
kubectl get pods -n tars-dataplane

# EnvoyProxy HA configuration
kubectl get envoyproxy -n tars-system -o yaml | yq .spec.provider.kubernetes.envoyHpa

# Active rollout status
kubectl get deploy -n tars-dataplane -o wide

Management-plane event reporting

tare upgrade reports the upgrade as an event to the management plane via the same in-cluster ConfigMap outbox that tare install uses. After a successful helm upgrade, the CLI writes a ConfigMap labelled tars.io/component=install-event to tars-system carrying operation=upgrade and the new Helm revision. The tare-doctor CronJob forwards it on its next tick (default every 5 minutes); the dashboard's deployment-history timeline shows the row at helm_revision = N+1 with from_version and to_version populated when chart versions differ.

The event POST never blocks the upgrade and never changes the exit code.

Failed atomic-rollback upgrades (helm upgrade --atomic that triggers a rollback) produce two dashboard rows:

One for the failed upgrade attempt: operation: upgrade, status: failed, error_summary populated.
One for the auto-rollback: operation: rollback, status: success, on the new helm revision the rollback created. This is the tare-doctor reconciler recognising helm's "Rollback to N" description and labelling the artefact correctly, instead of attributing the system reversion to the operator who triggered the upgrade.

Rollback

tare upgrade defaults to --atomic, so failed upgrades roll back automatically. To roll back a successful but problematic upgrade:

# View revision history
helm history tars -n tars-system

# Roll back to the previous revision
helm rollback tars -n tars-system

# Or to a specific revision number
helm rollback tars 5 -n tars-system

Note: helm rollback does not roll back CRDs. Any CRD schema changes applied during the upgrade remain in place. This is rarely an issue in practice because CRD changes are additive (new fields, no removed fields) in supported upgrade paths.

Migration: single-replica → HA

Customers whose tare install ran before the ADR-041 default flip are on single-replica installs. The migration path is one idempotent install call that bumps the EnvoyProxy resource's HPA minReplicas from 1 to 2:

tare install --ha identity.json

This calls helm upgrade --install, so it does not require uninstall and does not change any other configuration. The Envoy Gateway controller scales the data-plane Deployment up to 2 pods (a one-time brief disruption while the second pod comes up).

After this migration, tare upgrade is zero-downtime.

Where to go next

tare install

Fresh-install the dataplane and sync images before upgrading.

tare uninstall

Tear down the dataplane when an upgrade is not the goal.
