Stop Burning Cash: How to Automate GCP Idle Resource Cleanup

Learn how to automate GCP idle resource cleanup using TypeScript and Cloud Scheduler. Stop paying for neglected staging databases and idle Compute Engine instances.

The scenario is painfully familiar. A developer spins up a 16-core Compute Engine instance or an over-provisioned Cloud SQL database to test a heavy migration or debug a production issue on a staging branch. They finish their work, close their laptop on Friday afternoon, and go enjoy the weekend.

By Monday morning, that idle database has burned through hundreds of dollars. Multiply this across a team of fifty engineers, and you are wasting thousands of dollars every month on idle staging environments, orphaned disks, and forgotten GKE node pools.

Relying on human discipline to clean up temporary infrastructure is a losing strategy. "I'll delete it later" is the single most expensive lie in cloud engineering. If you want to stop flushing your engineering budget down the drain, you must build programmatic guardrails.

The Cost of Neglect: Why Manual GCP Idle Resource Cleanup Fails

The root of the problem is that cloud providers make provisioning frictionless but decommissioning complex. In Google Cloud Platform (GCP), resources are often deeply coupled. Deleting a VM sounds simple, but did you also delete its boot disk? Did you release its static external IP address? Did you tear down the associated Cloud SQL replica?

When engineers attempt manual cleanups, they often miss these secondary resources. Worse, many teams try to write brittle, unmaintained Bash scripts triggered by local cron jobs to clean up environments. These scripts suffer from several critical flaws:

  1. Silent Failures: A gcloud command fails due to an expired OAuth token, and the script exits silently without alerting anyone.
  2. Lack of Granularity: Bash scripts often use blunt instruments, like shutting down all VMs in a project at 6:00 PM, breaking the workflow of remote engineers in different time zones.
  3. No Safety Nets: One bad regex match in a grep statement can accidentally target production resources.
  4. No Audit Trail: You have no central log of why a resource was stopped, who owned it, or if it can be safely permanently deleted.

To solve this permanently, we need to automate GCP idle resource cleanup using a declarative, label-driven architecture.


The Pattern: Label-Driven Lifecycle Management

Instead of guessing which resources are safe to turn off, we shift the responsibility of declaring resource lifespans to the creator, enforced by automated policies. Every non-production resource must be tagged with metadata that defines its owner, its purpose, and its lifespan.

We define three core labels for every resource:

  • env: Must be development, staging, or testing (production is explicitly ignored by our cleanup engine).
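  • ttl: A Unix timestamp (in seconds) or a date-only ISO string such as 2025-06-30, indicating when the resource is safe to delete or stop. GCP label values allow only lowercase letters, digits, underscores, and hyphens, so a full ISO 8601 timestamp with colons and an uppercase T is not a valid label value.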
  • idle-action: Either stop (for VMs and Cloud SQL) or terminate (for ephemeral test runners).

Our cleanup engine runs on a serverless schedule. It queries the GCP APIs, filters out any resource missing these labels (or violating policy), evaluates the current time against the ttl label, and executes the designated idle-action.

This approach decouples policy execution from human memory. If an engineer needs a resource to persist longer, they simply update the ttl label on the resource. If they fail to do so, the engine safely shuts it down.
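The evaluation order described above can be sketched as a pure function that takes a resource's labels and the current time and returns a decision. This is a minimal sketch, not the engine itself: the names `decide`, `Decision`, and `MANAGED_ENVS` are hypothetical, and it assumes an all-digit ttl means Unix seconds.

```typescript
// Hypothetical types and names for illustration only.
type IdleAction = 'stop' | 'terminate';
type Decision =
  | { kind: 'skip'; reason: string }
  | { kind: 'act'; action: IdleAction };

const MANAGED_ENVS = ['development', 'staging', 'testing'];

function parseTtlMillis(ttl: string): number | undefined {
  // Assumption: all-digit values are Unix seconds; anything else goes to Date.parse.
  const millis = /^\d+$/.test(ttl) ? Number(ttl) * 1000 : Date.parse(ttl);
  return Number.isNaN(millis) ? undefined : millis;
}

function decide(labels: Record<string, string>, nowMillis: number): Decision {
  // 1. Only managed, non-production environments are ever considered.
  if (!MANAGED_ENVS.includes(labels['env'] ?? '')) {
    return { kind: 'skip', reason: 'env label missing or not managed' };
  }
  // 2. A missing or unparseable ttl is skipped rather than guessed at.
  const ttl = labels['ttl'];
  if (!ttl) return { kind: 'skip', reason: 'ttl label missing' };
  const expiresAt = parseTtlMillis(ttl);
  if (expiresAt === undefined) return { kind: 'skip', reason: 'ttl label unparseable' };
  // 3. Nothing happens until the ttl has passed.
  if (nowMillis <= expiresAt) return { kind: 'skip', reason: 'ttl not reached' };
  // 4. Fall back to the less destructive action when idle-action is absent or unknown.
  const action: IdleAction = labels['idle-action'] === 'terminate' ? 'terminate' : 'stop';
  return { kind: 'act', action };
}
```

Keeping the decision separate from the API calls makes the policy easy to unit test without touching a real project.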


Implementing the Cleanup Engine in TypeScript

Let's write a production-grade TypeScript application designed to run inside a GCP Cloud Function (Gen 2). It uses the official @google-cloud/compute client library for Compute Engine and the googleapis client for the Cloud SQL Admin API to safely scan and stop idle resources.

The implementation below processes resources one at a time to keep the logic easy to follow. For a large fleet, executing these API requests sequentially is slow, but blindly using Promise.all can trigger GCP API rate limits, so the safer route is to cap the number of requests in flight. For a deeper dive on managing high-throughput asynchronous execution safely, see our guide on advanced TypeScript async/await patterns.
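One way to get concurrency without an unbounded Promise.all is a small worker-pool helper. The sketch below is an assumption about how you might do it, not part of any Google library: `mapWithConcurrency` is a hypothetical name, and it collects per-item outcomes so that one failed request does not abort the rest.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error('limit must be a positive integer');
  }
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index], index) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  // Start no more runners than there are items.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

In the engine, the per-instance stop or delete call would be the worker, with a limit chosen conservatively against your project's API quota.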

The Before: Brittle Bash Script (Do Not Do This)

bash
#!/bin/bash
# A naive, dangerous cleanup script that engineers run on cron
echo "Stopping all dev VMs..."
for vm in $(gcloud compute instances list --filter="name~'dev-'" --format="get(name)"); do
  # Brutal shutdown with no safety checks, no TTL validation, and no logging
  gcloud compute instances stop "$vm" --zone="us-central1-a" --quiet
done

The After: Production-Grade TypeScript Cleanup Engine

First, install the required dependencies:

bash
npm install @google-cloud/compute googleapis google-auth-library
npm install --save-dev typescript @types/node

Here is the robust, dry-run-capable TypeScript cleanup implementation:

typescript
import { InstancesClient } from '@google-cloud/compute';
import { google } from 'googleapis';
import { JWT } from 'google-auth-library';
 
// Configuration interface
interface EngineConfig {
  dryRun: boolean;
  targetEnv: string[];
}
 
const config: EngineConfig = {
  dryRun: process.env.DRY_RUN === 'true',
  targetEnv: ['development', 'staging', 'testing'],
};
 
// Initialize clients
const computeClient = new InstancesClient();
 
/**
 * Parses and validates the TTL label.
 * Returns true if the resource has expired.
 */
function isExpired(ttlValue?: string): boolean {
  if (!ttlValue) return false;
  
  // Numeric values are Unix seconds; anything else (e.g. 2025-06-30) goes through Date.parse
  const ttlParsed = isNaN(Number(ttlValue)) ? Date.parse(ttlValue) : Number(ttlValue) * 1000;
  
  if (isNaN(ttlParsed)) {
    console.warn(`[Validation Warning] Invalid TTL format: ${ttlValue}`);
    return false;
  }
  
  return Date.now() > ttlParsed;
}
 
/**
 * Scans and manages Compute Engine (VM) Instances
 */
async function processComputeInstances(project: string, zone: string) {
  console.log(`Scanning Compute instances in project: ${project}, zone: ${zone}`);
  
  const [instances] = await computeClient.list({
    project,
    zone,
  });
 
  for (const instance of instances) {
    const name = instance.name || 'unknown';
    const labels = instance.labels || {};
    const env = labels['env'] || '';
    const ttl = labels['ttl'];
    const action = labels['idle-action'] || 'stop';
 
    // Guardrail: Never touch production or resources missing target environment labels
    if (!config.targetEnv.includes(env)) {
      continue;
    }
 
    if (isExpired(ttl)) {
      if (instance.status === 'RUNNING') {
        console.log(`[EXPIRED] VM "${name}" (Env: ${env}) has expired (TTL: ${ttl}). Action: ${action}`);
        
        if (config.dryRun) {
          console.log(`[DRY RUN] Would execute "${action}" on VM "${name}"`);
          continue;
        }
 
        try {
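          // Note: these calls resolve once the API accepts the request; the underlying
          // long-running operation may still be in progress at that point.
          if (action === 'terminate') {
            console.log(`Deleting VM "${name}"...`);
            await computeClient.delete({ project, zone, instance: name });
          } else {
            console.log(`Stopping VM "${name}"...`);
            await computeClient.stop({ project, zone, instance: name });
          }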
        } catch (error) {
          console.error(`Failed to execute action on VM "${name}":`, error);
        }
      }
    }
  }
}
 
/**
 * Scans and manages Cloud SQL Instances
 */
async function processSqlInstances(project: string) {
  console.log(`Scanning Cloud SQL instances in project: ${project}`);
  
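  // Use Application Default Credentials (the function's service account) for the Cloud SQL Admin API.
  // A bare JWT client without a key or client email cannot authenticate.
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const sqladmin = google.sqladmin({ version: 'v1beta4', auth });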
 
  try {
    const res = await sqladmin.instances.list({ project });
    const instances = res.data.items || [];
 
    for (const instance of instances) {
      const name = instance.name || 'unknown';
      const settings = instance.settings || {};
      const labels = settings.userLabels || {};
      const env = labels['env'] || '';
      const ttl = labels['ttl'];
 
      if (!config.targetEnv.includes(env)) {
        continue;
      }
 
      if (isExpired(ttl)) {
        // Check if DB is currently running (ACTIVATION_POLICY = ALWAYS)
        if (settings.activationPolicy === 'ALWAYS') {
          console.log(`[EXPIRED] SQL Instance "${name}" (Env: ${env}) has expired (TTL: ${ttl}). Stopping instance.`);
 
          if (config.dryRun) {
            console.log(`[DRY RUN] Would stop Cloud SQL instance "${name}"`);
            continue;
          }
 
          try {
            // Stopping Cloud SQL is done by setting activationPolicy to NEVER
            await sqladmin.instances.patch({
              project,
              instance: name,
              requestBody: {
                settings: {
                  activationPolicy: 'NEVER',
                },
              },
            });
            console.log(`Successfully stopped Cloud SQL instance "${name}"`);
          } catch (error) {
            console.error(`Failed to stop Cloud SQL instance "${name}":`, error);
          }
        }
      }
    }
  } catch (error) {
    console.error('Error listing Cloud SQL instances:', error);
  }
}
 
/**
 * Main execution entry point for the Cloud Function
 */
export async function runCleanup() {
  const project = process.env.GCP_PROJECT_ID;
  const zone = process.env.GCP_ZONE || 'us-central1-a';
 
  if (!project) {
    throw new Error('GCP_PROJECT_ID environment variable is required.');
  }
 
  console.log(`Starting cleanup run. Dry-run mode: ${config.dryRun}`);
  
  await processComputeInstances(project, zone);
  await processSqlInstances(project);
  
  console.log('Cleanup run completed.');
}

Enforcing Least Privilege Access Control

Running a script with sweeping permissions like roles/editor is a massive security risk. If your cleanup function is compromised, an attacker could wipe out your entire cloud infrastructure.

To prevent this, you must run this Cloud Function under a dedicated Google Service Account (GSA) bound strictly to the minimal permissions required to read and modify VM and SQL states.

When designing your deployment pipeline, make sure to audit these configurations. For a comprehensive security assessment of your cloud-connected backends, review our production-grade backend API security checklist.

Create a custom IAM role with only these permissions:

yaml
title: "Custom Idle Resource Cleanup Role"
description: "Allows stopping and deleting expired non-production resources"
stage: "GA"
includedPermissions:
  - compute.instances.list
  - compute.instances.stop
  - compute.instances.delete
  - cloudsql.instances.list
  - cloudsql.instances.update

Bind this custom role to the service account executing your Cloud Function, and restrict its scope to non-production GCP folders or projects.


Trade-offs and Operational Edge Cases

Automating resource destruction is highly effective, but it introduces operational risks that must be carefully managed.

1. The Monday Morning Cold Start

When you stop developers' databases and VMs over the weekend, Monday morning can become a bottleneck. Engineers arrive at 9:00 AM only to find their development environments offline, forcing them to manually restart resources via the console or CLI.

The Solution: Implement a "Warm-Up" companion job. Just as Cloud Scheduler triggers a shutdown at Friday 8:00 PM, configure a companion schedule to trigger an automated startup at Monday 7:00 AM for any resource that has an always-on-during-business-hours label.
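The selection step of such a warm-up job might look like the sketch below. It assumes the label is set to the value `true` and that you have already listed instances into a simplified shape. `InstanceSummary` and `selectWarmUpCandidates` are hypothetical names, and the real API objects carry many more fields.

```typescript
// Minimal shape of the fields used here (hypothetical; not the full Compute API type).
interface InstanceSummary {
  name: string;
  status: string;
  labels: Record<string, string>;
}

const WARM_UP_LABEL = 'always-on-during-business-hours';
const WARM_UP_ENVS = ['development', 'staging', 'testing'];

function selectWarmUpCandidates(instances: InstanceSummary[]): string[] {
  return instances
    // Compute Engine reports a stopped VM with the status TERMINATED.
    .filter((i) => i.status === 'TERMINATED')
    // Assumption: the opt-in label carries the value "true".
    .filter((i) => i.labels[WARM_UP_LABEL] === 'true')
    // Same guardrail as the cleanup engine: never touch unmanaged environments.
    .filter((i) => WARM_UP_ENVS.includes(i.labels['env'] ?? ''))
    .map((i) => i.name);
}
```

Each returned name would then be passed to the Compute client's start call. A Cloud Scheduler job with the unix-cron expression `0 7 * * 1` fires at 07:00 every Monday in the job's configured time zone.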

2. State Preservation vs. Cost Savings

Stopping a Compute Engine instance retains its boot disk and any attached persistent disks, but you still pay for the underlying storage (persistent disk space is billed whether the VM is running or not). Data on Local SSDs, by contrast, is not preserved across a stop by default. If you choose terminate (deletion) to eliminate the cost entirely, you lose the instance along with any disks set to auto-delete.

When reviewing pull requests for infrastructure-as-code (IaC), ensure your team defines clear boundaries for what is ephemeral. For tips on establishing these guardrails during team code reviews, see what senior engineers actually look for in code reviews.

3. Missing Metadata Safeguards

What happens if a developer deploys an unlabelled resource? If your cleanup script only targets resources with an expired ttl, an unlabelled resource will run forever, bypassing your cost-saving measures entirely.

The Recommended Policy: Implement a "Default to Expire" policy. If a resource in a development project is missing the ttl label, the engine should automatically assign it a default 24-hour TTL and notify the owner via Slack or email. If no owner is identified, the resource should be stopped automatically.
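A sketch of that policy as a pure function follows. It assumes ownership is recorded in an `owner` label, which is not one of the three core labels, and the names `applyDefaultTtl` and `FollowUp` are hypothetical. Writing the label back and sending the notification are left to the caller.

```typescript
const DEFAULT_TTL_SECONDS = 24 * 60 * 60; // the 24-hour default from the policy

// What the engine should do after inspecting a resource's labels.
type FollowUp = 'none' | 'notify-owner' | 'stop-now';

interface TtlResolution {
  labels: Record<string, string>; // labels to write back to the resource
  followUp: FollowUp;
}

function applyDefaultTtl(labels: Record<string, string>, nowMillis: number): TtlResolution {
  // A resource that already declares a ttl is left alone.
  if (labels['ttl']) return { labels, followUp: 'none' };
  // Assumption: ownership lives in an "owner" label. With nobody to notify, stop the resource.
  if (!labels['owner']) return { labels, followUp: 'stop-now' };
  // Otherwise assign a ttl 24 hours out (Unix seconds) and ask the caller to notify the owner.
  const expiresAtSeconds = Math.floor(nowMillis / 1000) + DEFAULT_TTL_SECONDS;
  return { labels: { ...labels, ttl: String(expiresAtSeconds) }, followUp: 'notify-owner' };
}
```

Returning a new label map rather than mutating the input keeps the function easy to test and makes the dry-run path trivial: log the result and skip the write.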

| Resource Type | Stop Cost Savings | Delete Cost Savings | Recovery Effort |
| :--- | :--- | :--- | :--- |
| Compute Engine (VM) | High (~75% savings, storage still billed) | 100% savings | High (Must rebuild from IaC/images) |
| Cloud SQL | High (~85% savings, storage still billed) | 100% savings | Extreme (Data loss if unbacked) |
| GKE Node Pools | High (Scale to 0 nodes) | N/A | Low (Kubernetes schedules pods on demand) |


Moving Beyond Scripts to Guardrails

Writing a cleanup script is only half the battle; you must build a culture where resources are born with an expiration date.

The most effective way to enforce this is to integrate labeling directly into your Terraform or Pulumi templates. Block any deployment to development environments that does not specify valid ttl and env labels.
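If your pipeline can extract the planned labels for each resource (for example from a JSON rendering of the plan), a check along these lines could gate the deployment. This is a sketch under assumptions: `validateLifecycleLabels` is a hypothetical name, the value pattern is a simplified ASCII subset of GCP's label rules, and the accepted ttl formats are Unix seconds or a date-only string.

```typescript
// Hypothetical CI check; names and rules are illustrative assumptions.
const ALLOWED_ENVS = ['development', 'staging', 'testing'];
// Simplified ASCII subset of GCP label value rules: lowercase letters, digits, "_" and "-", up to 63 chars.
const LABEL_VALUE = /^[a-z0-9_-]{0,63}$/;
// Accept Unix seconds or a date-only value such as 2025-06-30.
const TTL_FORMAT = /^(\d+|\d{4}-\d{2}-\d{2})$/;

function validateLifecycleLabels(labels: Record<string, string>): string[] {
  const problems: string[] = [];
  // Catch values the platform itself would refuse, such as full ISO timestamps.
  for (const [key, value] of Object.entries(labels)) {
    if (!LABEL_VALUE.test(value)) {
      problems.push(`label "${key}" has a value GCP would reject`);
    }
  }
  if (!ALLOWED_ENVS.includes(labels['env'] ?? '')) {
    problems.push('env must be development, staging, or testing');
  }
  if (!TTL_FORMAT.test(labels['ttl'] ?? '')) {
    problems.push('ttl must be Unix seconds or YYYY-MM-DD');
  }
  return problems;
}
```

An empty result means the resource may proceed. Anything else should fail the pipeline with the listed problems so the author can fix the template before the resource exists.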

By automating the detection and shutdown of idle resources, you transition from playing a reactive game of "cloud cost whack-a-mole" to running an optimized, self-cleaning cloud platform that only charges you for what you actually use.
