Dagster Storage Optimization

The Problem: Excessive Storage Buildup

During development, Dagster can accumulate massive amounts of storage data if not properly configured. This can lead to:

  • 212,000+ run directories in tmp/
  • 63GB+ of accumulated storage
  • System performance degradation
  • Disk space exhaustion
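
To gauge how bad the buildup is on a given machine, a quick look at the local storage tree is usually enough. The commands below are a minimal sketch that assumes the default local layout under tmp/; adjust the paths to match your instance.

# Total size of the local Dagster storage tree
du -sh tmp/

# Rough count of accumulated directories (each run creates at least one)
find tmp/ -type d | wc -l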

Root Causes

  1. No retention policies configured by default
  2. High concurrency (25+ simultaneous runs) generating many artifacts
  3. Indefinite storage of compute logs and run metadata
  4. No automatic cleanup of old runs
  5. Large temp directories without size limits

Solutions Implemented

1. Configuration Improvements

Updated dagster.yaml

retention:
  schedule:
    purge_after_days: 3  # Reduced from 7
  sensor:
    purge_after_days:
      skipped: 1  # Keep only 1 day of skipped runs
      failure: 3  # Keep 3 days of failures for debugging
      success: 1  # Keep only 1 day of successful runs
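
To confirm the instance actually picked up the new retention block, the dagster CLI can print the resolved instance settings. A sketch, assuming DAGSTER_HOME points at the directory containing dagster.yaml and the dagster CLI is on PATH; the grep filter is only a convenience and the exact output format varies by version.

# Point the CLI at the directory containing dagster.yaml
export DAGSTER_HOME=/path/to/dagster_home

# Print resolved instance settings and look for the retention block
dagster instance info | grep -iA 8 retention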

Updated dagster_docker.yaml

run_coordinator:
  config:
    max_concurrent_runs: 10  # Reduced from 25

# Added retention policies
retention:
  schedule:
    purge_after_days: 3
  sensor:
    purge_after_days:
      skipped: 1
      failure: 3
      success: 1

# Added run monitoring
run_monitoring:
  enabled: true
  start_timeout_seconds: 300
  cancel_timeout_seconds: 180
  max_runtime_seconds: 900
  poll_interval_seconds: 60

# Disabled telemetry to reduce disk writes
telemetry:
  enabled: false
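
For Docker deployments, the daemon and webserver containers need a restart before the new retention and monitoring settings take effect. A sketch, assuming a docker compose setup with services named dagster-daemon and dagster-webserver (rename to match your compose file).

# Restart the services that read dagster_docker.yaml
docker compose restart dagster-daemon dagster-webserver

# Tail the daemon logs to confirm it came back up cleanly
docker compose logs --tail 50 dagster-daemon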

2. Automated Cleanup Tools

Cleanup Script: scripts/utils/cleanup_dagster_storage.sh

Interactive Menu:

./scripts/utils/cleanup_dagster_storage.sh
# or
make dagster-cleanup-menu

Direct Commands:

# Check current status
make dagster-cleanup-status

# Safe cleanup (recommended)
make dagster-cleanup-minimal

# Remove old runs (30+ days)
make dagster-cleanup-standard

# Aggressive cleanup (7+ days)
make dagster-cleanup-aggressive

Cleanup Levels:

  1. Minimal (🔧): Remove old logs only - safe for production
  2. Standard (🧹): Remove runs older than 30 days
  3. Aggressive (🔥): Remove runs older than 7 days
  4. Nuclear (☢️): Remove all but last 24 hours
  5. CLI-based (🛠️): Use Dagster's built-in cleanup commands (see the sketch below)
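
For the CLI-based level, recent Dagster versions include run-management commands that can be used directly. A minimal sketch; the run ID is a placeholder, and run wipe removes all run history, so treat it like the nuclear level.

# List recent runs to decide what to keep
dagster run list --limit 20

# Delete a single run and its event log by ID
dagster run delete <RUN_ID>

# Remove all run history -- effectively the nuclear level
dagster run wipe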

3. Environment Variable Updates

Updated .example.env with lightweight defaults:

# Lightweight storage paths
ANOMSTACK_DAGSTER_LOCAL_ARTIFACT_STORAGE_DIR=tmp_light/artifacts
ANOMSTACK_DAGSTER_LOCAL_COMPUTE_LOG_MANAGER_DIRECTORY=tmp_light/compute_logs
ANOMSTACK_DAGSTER_SQLITE_STORAGE_BASE_DIR=tmp_light/storage

# Reduced concurrency
ANOMSTACK_DAGSTER_OVERALL_CONCURRENCY_LIMIT=5 # Reduced from 10
ANOMSTACK_DAGSTER_DEQUEUE_NUM_WORKERS=2 # Reduced from 4
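
One way to apply these defaults locally is to copy them into .env and pre-create the target directories before starting Dagster. A minimal sketch, assuming the values above live in a local .env file:

# Export everything defined in .env into the current shell
set -a; source .env; set +a

# Create the lightweight storage directories if they do not exist yet
mkdir -p "$ANOMSTACK_DAGSTER_LOCAL_ARTIFACT_STORAGE_DIR" \
         "$ANOMSTACK_DAGSTER_LOCAL_COMPUTE_LOG_MANAGER_DIRECTORY" \
         "$ANOMSTACK_DAGSTER_SQLITE_STORAGE_BASE_DIR"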

Best Practices

For Developers

  1. Regular Monitoring:

    make dagster-cleanup-status  # Check storage weekly
  2. Routine Cleanup:

    make dagster-cleanup-minimal  # Weekly log cleanup
    make dagster-cleanup-standard # Monthly run cleanup
  3. Emergency Cleanup:

    make dagster-cleanup-aggressive  # When disk space is low

For Production

  1. Configure retention policies in dagster_docker.yaml
  2. Limit concurrent runs to reasonable numbers (5-15)
  3. Enable run monitoring to detect stuck runs
  4. Set up automated cleanup using cron jobs:
    # Weekly cleanup cron job
    0 2 * * 0 /path/to/cleanup_dagster_storage.sh minimal

    # Monthly standard cleanup
    0 3 1 * * /path/to/cleanup_dagster_storage.sh standard

For CI/CD

  1. Use ephemeral storage when possible
  2. Clean up after tests:
    make dagster-cleanup-aggressive
  3. Monitor disk usage in build scripts (see the sketch below)
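
A minimal sketch of such a check, assuming Dagster storage lives under tmp/ and a 5GB budget is acceptable for the CI environment:

# Run aggressive cleanup when storage passes roughly 5GB
USED_KB=$(du -sk tmp | awk '{print $1}')
if [ "$USED_KB" -gt $((5 * 1024 * 1024)) ]; then
  echo "Dagster storage above 5GB, running cleanup" >&2
  make dagster-cleanup-aggressive
fi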

Storage Size Guidelines

Storage Level | Recommended Action
< 1GB         | ✅ Healthy - continue monitoring
1-5GB         | ⚠️ Consider weekly cleanup
5-20GB        | 🔄 Run standard cleanup monthly
20-50GB       | 🔥 Run aggressive cleanup
> 50GB        | ☢️ Emergency cleanup required

Troubleshooting

Issue: "212,000+ run directories"

Solution: Run nuclear cleanup, then configure retention policies

Issue: "Disk space full"

Solution:

  1. Run make dagster-cleanup-aggressive
  2. If still full, run make reset-nuclear
  3. Configure retention policies before restarting

Issue: "Slow Dagster performance"

Solution:

  1. Check storage with make dagster-cleanup-status
  2. Run appropriate cleanup level
  3. Reduce max_concurrent_runs

Issue: "Old runs not being cleaned up"

Solution:

  1. Verify retention policies in dagster_docker.yaml
  2. Ensure Dagster daemon is running
  3. Check database connectivity for PostgreSQL storage
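
A quick way to check item 3 from the host, assuming a local Postgres instance with a dagster database and user (substitute your own connection details); runs is the table where Dagster stores run records.

# Verify the Dagster Postgres database is reachable and count stored runs
psql -h localhost -U dagster -d dagster -c "SELECT count(*) FROM runs;"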

Prevention Checklist

  • Retention policies configured in dagster_docker.yaml
  • Run monitoring enabled
  • Concurrent runs limited (≤ 15)
  • Regular cleanup scheduled (weekly/monthly)
  • Storage monitoring in place
  • Telemetry disabled in production
  • Environment variables optimized

Additional Resources

  • Interactive Cleanup: make dagster-cleanup-menu
  • Makefile Documentation: Available in the project root Makefile.md#dagster-storage-cleanup
  • Reset Scripts: Available in scripts/utils/ directory
  • Dagster Retention Docs: Official Documentation

💡 Key Takeaway: Proactive storage management prevents the 63GB+ buildup problem. Regular monitoring and cleanup are essential for healthy Dagster deployments.