The Terraform Scaling Problem
Terraform works beautifully for a handful of resources. At 50+ microservices, a single state file becomes a bottleneck: plan times stretch to minutes, merge conflicts multiply, and the blast radius of a mistake grows unacceptably large. Here is how I structure Terraform for large-scale infrastructure.
State File Strategy
Split state by deployment boundary, not by resource type. Each microservice team should own the state for its own service. Shared infrastructure (VPC, IAM, DNS) lives in separate state files that other configurations read, either through the terraform_remote_state data source or through provider data sources that look the shared resources up directly.
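As a rough sketch of that split, assuming AWS with the S3 backend: one service keeps its state under its own key and reads what it needs from the shared network state. The bucket, lock table, state keys, tag value, and the private_subnet_ids output are all hypothetical names chosen for illustration.

```hcl
# services/checkout/main.tf (hypothetical path and names throughout)

terraform {
  backend "s3" {
    bucket         = "example-terraform-state"              # assumed bucket name
    key            = "services/checkout/terraform.tfstate"  # one key per service
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"              # assumed lock table
    encrypt        = true
  }
}

# Option 1: read outputs published by the shared network configuration.
# This only works if that configuration declares the outputs being read.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "shared/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Option 2: look the shared VPC up directly, without touching its state file.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-vpc" # assumed tag value
  }
}

locals {
  # Assumes the network state exposes an output named private_subnet_ids.
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  vpc_id             = data.aws_vpc.shared.id
}
```

The trade-off between the two options: terraform_remote_state requires read access to the whole shared state file, while a provider data source only needs permission to describe the resource itself.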
Module Design Principles
Good Terraform modules are like good APIs — they have clear contracts, sensible defaults, and hide complexity:
- Single responsibility: One module creates one logical unit (ECS service, RDS instance, S3 bucket)
- Versioned modules: Use git tags or a module registry, and never point at the main branch
- Output everything: Consumers need ARNs, endpoints, and security group IDs
- Validate inputs: Use validation blocks so bad inputs fail at validate or plan time, before anything is applied
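The principles above might look like this for a small ECS service module. This is a sketch, not a complete module: the module path, variable names, limits, git URL, and version tag are hypothetical, and the output assumes the module defines an aws_security_group.service resource elsewhere.

```hcl
# modules/ecs-service/variables.tf (hypothetical module)

variable "service_name" {
  type        = string
  description = "Name of the ECS service."

  validation {
    # 3 to 32 characters: lowercase letters, digits, hyphens; starts with a letter.
    condition     = can(regex("^[a-z][a-z0-9-]{2,31}$", var.service_name))
    error_message = "The service_name value must be 3-32 lowercase letters, digits, or hyphens, starting with a letter."
  }
}

variable "desired_count" {
  type        = number
  description = "Number of tasks to run."
  default     = 2 # sensible default; callers override only when needed

  validation {
    condition     = var.desired_count >= 1 && var.desired_count <= 50
    error_message = "The desired_count value must be between 1 and 50."
  }
}

# modules/ecs-service/outputs.tf
# Assumes aws_security_group.service is defined elsewhere in this module.
output "security_group_id" {
  description = "Security group attached to the service tasks."
  value       = aws_security_group.service.id
}

# Consumer configuration, pinned to a git tag rather than a branch.
module "checkout_service" {
  source = "git::https://example.com/platform/terraform-modules.git//ecs-service?ref=v1.4.0"

  service_name  = "checkout"
  desired_count = 3
}
```

With a module registry, the same pin would be expressed with a version argument (for example version = "~> 1.4") instead of the ref query parameter.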
CI/CD Integration
Every pull request should run terraform plan. Merges to main run terraform apply, with manual approval required for production. Use Atlantis or GitHub Actions with OIDC provider authentication, and avoid storing long-lived cloud credentials in CI secrets wherever you can.
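On the AWS side, the OIDC setup for GitHub Actions could be sketched as follows. It assumes GitHub's OIDC provider is already registered in the account (for instance, from the shared IAM state); the role name and the example-org/infrastructure repository are hypothetical.

```hcl
# Look up the existing GitHub OIDC provider in this AWS account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only workflows from one repository may assume the role.
data "aws_iam_policy_document" "github_assume_role" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:*"] # hypothetical repository
    }
  }
}

# Role the CI workflow assumes; attach plan or apply permissions separately.
resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.github_assume_role.json
}
```

One reasonable refinement is to use two roles: a read-only role for plan that any branch can assume, and an apply role whose sub condition is narrowed to repo:example-org/infrastructure:ref:refs/heads/main.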
Key Takeaways
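- Split state by deployment boundary and team ownership, not by resource type
- Use remote state with locking: S3 with a DynamoDB lock table is the long-standing AWS setup, and recent Terraform releases can also lock natively in S3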
- Version your modules and pin versions in consuming configurations
- Run plan on every PR, apply on merge with approval gates
- Use terraform import to bring existing resources under management instead of recreating them from scratch
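For the import takeaway, Terraform 1.5 and later also accept a declarative import block, so the import goes through plan and code review like any other change. A minimal sketch, with a hypothetical bucket name and resource address:

```hcl
# Requires Terraform 1.5 or later.
import {
  to = aws_s3_bucket.assets
  id = "example-assets-bucket" # hypothetical existing bucket
}

# The resource block the imported bucket is bound to.
resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket"
}
```

On older versions, the CLI equivalent is terraform import aws_s3_bucket.assets example-assets-bucket, which writes to state immediately and does not show up in a plan.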
Senior Software Engineer specializing in cloud architecture, real-time systems, and enterprise-scale applications.