Solving Multi-Account EC2 Disk Utilization Monitoring
A large enterprise, expanded through acquisitions, faces fragmented visibility into EC2 disk utilization across numerous AWS accounts. The CTO requires a comprehensive, scalable solution to proactively monitor disk space, leveraging the company's existing Ansible tooling to mitigate operational risks like outages and performance degradation.
Architecture Overview
This solution uses a hub-and-spoke model for security, scalability, and efficiency. The components below each play a distinct role in the monitoring framework.
Ansible Control Plane: EC2 instance in the Shared Services account
Cross-Account IAM Roles: secure, temporary access
Spoke Accounts (Workloads): Account A, Account B, Account C ..., each with EC2 instances running the SSM Agent
Central S3 Bucket: raw data storage
Data Processing & Viz: Lambda, CloudWatch, etc.
Implementation Deep Dive
This section provides the core technical assets for the solution: the dynamic inventory configuration and the main Ansible playbook used for collecting disk metrics.
Ansible Playbook: `collect_disk_metrics.yml`
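This playbook uses the `aws_ssm` module to run a single script on the target instances through SSM Run Command. The script is intended to detect the OS (Linux/Windows), run the appropriate disk utilization commands, and print a consistent JSON payload. Note that the `AWS-RunShellScript` document used in the playbook executes on Linux instances only; Windows instances need the `AWS-RunPowerShellScript` document and a PowerShell version of the script. The full script is located in `ansible/files/generic_disk_script.sh`.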
---
# NOTE: the module names and parameters below are kept as in the original.
# `aws_ssm` is best known as a connection plugin, so verify that your installed
# collection provides `aws_ssm` and `aws_ssm_info` as modules with this interface.
- name: Collect EC2 Disk Utilization
  hosts: all
  gather_facts: false
  vars:
    s3_bucket_name: "central-monitoring-s3-bucket"
    s3_region: "us-east-1"
    disk_collection_script_content: "{{ lookup('file', 'files/generic_disk_script.sh') }}"
  tasks:
    - name: Debug EC2 instance IDs being targeted (optional)
      ansible.builtin.debug:
        msg: "Targeting instance ID: {{ hostvars[item].ec2_instance_id | default('N/A') }}"
      loop: "{{ ansible_play_batch }}"
      when: hostvars[item].ec2_instance_id is defined
      run_once: true

    # Send one command for the whole batch from the control node,
    # instead of once per host in the play.
    - name: Execute generic disk utilization script via SSM
      amazon.aws.aws_ssm:
        state: "command"
        document_name: "AWS-RunShellScript"
        instance_ids: "{{ ansible_play_batch | map('extract', hostvars, 'ec2_instance_id') | list }}"
        parameters:
          commands:
            - "{{ disk_collection_script_content }}"
        region: "{{ s3_region }}"
      register: ssm_command_init_result
      delegate_to: localhost
      run_once: true

    # Poll each instance until its invocation reaches a terminal status.
    - name: Wait for SSM command to complete for each instance
      amazon.aws.aws_ssm_info:
        command_id: "{{ ssm_command_init_result.command.CommandId }}"
        instance_id: "{{ instance_id }}"  # matches loop_var below (was "item")
        region: "{{ s3_region }}"
      register: ssm_command_poll_result
      loop: "{{ ansible_play_batch | map('extract', hostvars, 'ec2_instance_id') | list }}"
      loop_control:
        loop_var: instance_id
      until: ssm_command_poll_result.command.InvocationResult.Status in ['Success', 'Failed', 'Cancelled', 'TimedOut']
      retries: 30
      delay: 10
      delegate_to: localhost
      run_once: true

    # Only successful invocations with non-empty output are uploaded.
    - name: Process and Upload output to S3
      ansible.builtin.include_tasks: process_and_upload_output.yml
      loop: "{{ ssm_command_poll_result.results }}"
      loop_control:
        loop_var: result_item
      when:
        - result_item.command.InvocationResult.Status == 'Success'
        - result_item.command.InvocationResult.StandardOutputContent is defined
        - result_item.command.InvocationResult.StandardOutputContent | length > 0
      run_once: true
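The playbook includes `process_and_upload_output.yml` but its contents are not shown. A minimal sketch of what that task file might contain follows. It assumes the script prints valid JSON, that each loop result carries the instance ID under the `instance_id` loop variable, and that objects are stored under a hypothetical `disk-metrics/` key prefix; none of these details are confirmed by the original.

```yaml
# process_and_upload_output.yml (hypothetical sketch, not the original file)

# Parse the script's stdout so malformed JSON fails here, before any upload.
- name: Parse the JSON payload emitted by the disk script
  ansible.builtin.set_fact:
    disk_payload: "{{ result_item.command.InvocationResult.StandardOutputContent | from_json }}"

# Upload one object per instance and run. The key layout is an assumption.
- name: Upload the payload to the central S3 bucket
  amazon.aws.s3_object:
    bucket: "{{ s3_bucket_name }}"
    object: "disk-metrics/{{ result_item.instance_id }}/{{ now(utc=true, fmt='%Y%m%dT%H%M%SZ') }}.json"
    # content_base64 avoids Ansible re-interpreting the JSON string as a dict.
    content_base64: "{{ disk_payload | to_nice_json | b64encode }}"
    mode: put
    region: "{{ s3_region }}"
  delegate_to: localhost
```

Keying objects by instance ID and timestamp keeps every run's raw output, which lets the downstream processing stage pick whichever window it needs.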
Simulated Monitoring Dashboard
The dashboard presents the final, aggregated data as a unified view of disk utilization across all accounts and instances, with filters for narrowing the view.
Scalability and Future-Proofing
The solution is designed to scale with enterprise growth. Three architectural choices automate the onboarding of new accounts and of thousands of instances.
AWS Organizations Integration
New AWS accounts are simply invited to the Organization, immediately placing them under the central governance and security model enforced by Service Control Policies (SCPs).
Automated IAM Deployment
Using Infrastructure as Code (CloudFormation StackSets or Terraform), the required `AnsibleDiskMonitorRole` is automatically and consistently deployed to new accounts.
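One way to get this automatic rollout with CloudFormation is a service-managed StackSet with auto-deployment enabled, which deploys the stack to accounts as they join the targeted organizational unit. The sketch below is illustrative only: the parameter names, the StackSet name, the template location, and the assumption that the role template accepts a `HubAccountId` parameter are all hypothetical.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a service-managed StackSet that rolls out AnsibleDiskMonitorRole

Parameters:
  TargetOuId:
    Type: String
    Description: Organizational unit whose accounts receive the role (assumed)
  RoleTemplateUrl:
    Type: String
    Description: S3 URL of the role template (assumed location)
  HubAccountId:
    Type: String
    Description: Account ID of the shared services account (assumed parameter)

Resources:
  DiskMonitorRoleStackSet:
    Type: AWS::CloudFormation::StackSet
    Properties:
      StackSetName: ansible-disk-monitor-role   # hypothetical name
      PermissionModel: SERVICE_MANAGED          # uses AWS Organizations trust
      AutoDeployment:
        Enabled: true                           # new accounts in the OU get the stack
        RetainStacksOnAccountRemoval: false
      Capabilities:
        - CAPABILITY_NAMED_IAM                  # the template creates a named role
      TemplateURL: !Ref RoleTemplateUrl
      Parameters:
        - ParameterKey: HubAccountId
          ParameterValue: !Ref HubAccountId
      StackInstancesGroup:
        - DeploymentTargets:
            OrganizationalUnitIds:
              - !Ref TargetOuId
          Regions:
            - us-east-1                         # IAM is global, so one region suffices
```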
Ansible Dynamic Inventory
The `aws_ec2` plugin automatically discovers new instances across all accounts, eliminating manual inventory management and enabling playbooks to scale effortlessly.
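A minimal inventory sketch for one spoke account is shown below. The file name, account ID, and grouping key are placeholders, and it assumes one inventory file per account, each assuming that account's `AnsibleDiskMonitorRole`. The `compose` entry exposes the plugin's `instance_id` host variable as `ec2_instance_id`, which is the name the playbook reads.

```yaml
# inventory/account_a.aws_ec2.yml (hypothetical file name; must end in aws_ec2.yml)
plugin: amazon.aws.aws_ec2

regions:
  - us-east-1

# Role assumed in the spoke account. Placeholder account ID.
# Newer collection releases also accept this option as assume_role_arn.
iam_role_arn: arn:aws:iam::111111111111:role/AnsibleDiskMonitorRole

# Only running instances can execute SSM commands.
filters:
  instance-state-name: running

# Use the instance ID as the inventory hostname.
hostnames:
  - instance-id

# Expose the instance ID under the variable name the playbook expects.
compose:
  ec2_instance_id: instance_id

# Optional grouping, here by availability zone.
keyed_groups:
  - key: placement.availability_zone
    prefix: az
```

Pointing `ansible-playbook -i inventory/` at a directory of such files merges the per-account inventories into a single host list.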
Security by Design
Security is a foundational pillar of this solution, with best practices embedded at every layer to protect data and infrastructure.
IAM Least Privilege
Roles are configured with the absolute minimum permissions required. The spoke account role, for example, is limited to executing SSM commands and cannot modify or delete resources.
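As an illustration of that scoping, here is a sketch of a spoke-account role template. The exact action list and the `HubAccountId` parameter are assumptions rather than the original policy, and the trust policy would normally be narrowed from the hub account root to the control plane's specific role.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of the spoke-account role assumed by the Ansible control plane

Parameters:
  HubAccountId:
    Type: String
    Description: Account ID of the shared services account (assumed parameter)

Resources:
  AnsibleDiskMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AnsibleDiskMonitorRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${HubAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: RunDiskCollection   # hypothetical policy name
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Discovery for the dynamic inventory (read-only).
              - Effect: Allow
                Action: ec2:DescribeInstances
                Resource: "*"
              # Sending is limited to one AWS-owned document on this account's instances.
              - Effect: Allow
                Action: ssm:SendCommand
                Resource:
                  - !Sub "arn:aws:ssm:*::document/AWS-RunShellScript"
                  - !Sub "arn:aws:ec2:*:${AWS::AccountId}:instance/*"
              # Reading command results.
              - Effect: Allow
                Action:
                  - ssm:GetCommandInvocation
                  - ssm:ListCommandInvocations
                Resource: "*"
```

Because no statement grants create, modify, or delete actions, the role can only discover instances and run the one document against them.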
Network Security
Using SSM for command execution eliminates the need for open inbound SSH ports on instances. All traffic to AWS services can be kept private within the AWS network using VPC Endpoints, preventing exposure to the public internet.
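The private path can be sketched with interface VPC endpoints for the Systems Manager services. The parameter names are placeholders, and the security group is assumed to allow inbound HTTPS (443) from the instances' subnets.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of interface endpoints that keep SSM traffic inside the VPC

Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  EndpointSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Must allow inbound 443 from the instances (assumed)

Resources:
  # Systems Manager API endpoint.
  SsmEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub "com.amazonaws.${AWS::Region}.ssm"
      VpcId: !Ref VpcId
      SubnetIds: !Ref SubnetIds
      SecurityGroupIds:
        - !Ref EndpointSecurityGroupId
      PrivateDnsEnabled: true

  # Messaging endpoints used by the SSM Agent.
  SsmMessagesEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub "com.amazonaws.${AWS::Region}.ssmmessages"
      VpcId: !Ref VpcId
      SubnetIds: !Ref SubnetIds
      SecurityGroupIds:
        - !Ref EndpointSecurityGroupId
      PrivateDnsEnabled: true

  Ec2MessagesEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub "com.amazonaws.${AWS::Region}.ec2messages"
      VpcId: !Ref VpcId
      SubnetIds: !Ref SubnetIds
      SecurityGroupIds:
        - !Ref EndpointSecurityGroupId
      PrivateDnsEnabled: true
```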
Data Encryption & Auditing
Data is encrypted in transit (TLS) and at rest (SSE-S3/KMS). All API calls and role assumptions are logged in AWS CloudTrail for a complete audit trail.
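For the central bucket, these two controls might look like the following sketch: default KMS encryption at rest, plus a bucket policy that rejects any request not made over TLS. The logical resource names are hypothetical; the bucket name is taken from the playbook variables.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of the central metrics bucket with encryption and TLS enforcement

Resources:
  MetricsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: central-monitoring-s3-bucket
      # Encrypt every object at rest with KMS by default.
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true

  MetricsBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref MetricsBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          # Deny any request that does not use TLS.
          - Sid: DenyInsecureTransport
            Effect: Deny
            Principal: "*"
            Action: "s3:*"
            Resource:
              - !GetAtt MetricsBucket.Arn
              - !Sub "${MetricsBucket.Arn}/*"
            Condition:
              Bool:
                "aws:SecureTransport": "false"
```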