Solving Multi-Account EC2 Disk Utilization Monitoring

A large enterprise, expanded through acquisitions, faces fragmented visibility into EC2 disk utilization across numerous AWS accounts. The CTO requires a comprehensive, scalable solution to proactively monitor disk space, leveraging the company's existing Ansible tooling to mitigate operational risks like outages and performance degradation.

Architecture Overview

This solution uses a hub-and-spoke model for security, scalability, and efficiency. The components of the monitoring framework are summarized below and covered in more detail in the sections that follow.

Ansible Control Plane: an EC2 instance in the shared services (hub) account that runs the Ansible playbooks.

Cross-Account IAM Roles: provide secure, temporary access from the hub into each spoke account.

Spoke Accounts (Workloads): Account A, Account B, Account C, and so on, each running EC2 instances with the SSM Agent installed.

Central S3 Bucket: raw data storage for the collected disk metrics.

Data Processing & Visualization: Lambda, CloudWatch, and related services that aggregate and visualize the data.

Implementation Deep Dive

This section provides the core technical assets for the solution: the dynamic inventory configuration and the main Ansible playbook used for collecting disk metrics.
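
Dynamic Inventory: `aws_ec2.yml`

The inventory is built with the `aws_ec2` dynamic inventory plugin (see Scalability and Future-Proofing below). The actual configuration file is not reproduced on this page, so the following is a minimal, illustrative sketch assuming one inventory file per spoke account and role assumption via the plugin's `iam_role_arn` option; the account ID and file name are placeholders, and the `compose` mapping exposes the instance ID under the `ec2_instance_id` host variable the playbook expects.

# aws_ec2.yml -- illustrative per-spoke-account inventory (account ID is a placeholder)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
# Assume the monitoring role deployed in the spoke account.
iam_role_arn: "arn:aws:iam::111111111111:role/AnsibleDiskMonitorRole"
filters:
  # Only inventory running instances.
  instance-state-name: running
keyed_groups:
  # Group hosts by owning account so plays can be scoped per account.
  - key: owner_id
    prefix: account
compose:
  # Expose the instance ID under the host variable name used by the playbook.
  ec2_instance_id: instance_id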

Ansible Playbook: `collect_disk_metrics.yml`

This playbook leverages the `aws_ssm` module to run a single, intelligent script on target instances. The script automatically detects the OS (Linux/Windows) and runs the appropriate disk utilization commands, outputting a consistent JSON payload. The full script content is located in the `ansible/files/generic_disk_script.sh` file.

---
- name: Collect EC2 Disk Utilization
  hosts: all
  gather_facts: no

  vars:
    s3_bucket_name: "central-monitoring-s3-bucket"
    s3_region: "us-east-1"
    disk_collection_script_content: "{{ lookup('file', 'files/generic_disk_script.sh') }}"

  tasks:
    - name: Debug EC2 instance IDs being targeted (optional)
      ansible.builtin.debug:
        msg: "Targeting instance ID: {{ hostvars[item].ec2_instance_id | default('N/A') }}"
      loop: "{{ ansible_play_batch }}"
      when: hostvars[item].ec2_instance_id is defined

    - name: Execute generic disk utilization script via SSM
      amazon.aws.aws_ssm:
        state: "command"
        document_name: "AWS-RunShellScript"
        instance_ids: "{{ ansible_play_batch | map('extract', hostvars, 'ec2_instance_id') | list }}"
        parameters:
          commands:
            - "{{ disk_collection_script_content }}"
        region: "{{ s3_region }}"
      register: ssm_command_init_result
      delegate_to: localhost

    - name: Wait for SSM command to complete for each instance
      amazon.aws.aws_ssm_info:
        command_id: "{{ ssm_command_init_result.command.CommandId }}"
        instance_id: "{{ instance_id }}"
        region: "{{ s3_region }}"
      register: ssm_command_poll_result
      loop: "{{ ansible_play_batch | map('extract', hostvars, 'ec2_instance_id') | list }}"
      loop_control:
        loop_var: instance_id
      until: ssm_command_poll_result.command.InvocationResult.Status in ['Success', 'Failed', 'Cancelled']
      retries: 30
      delay: 10
      delegate_to: localhost

    - name: Process and Upload output to S3
      ansible.builtin.include_tasks: process_and_upload_output.yml
      loop: "{{ ssm_command_poll_result.results }}"
      loop_control:
        loop_var: result_item
      when:
        - result_item.command.InvocationResult.Status == 'Success'
        - result_item.command.InvocationResult.StandardOutputContent is defined
        - result_item.command.InvocationResult.StandardOutputContent | length > 0
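
Included Tasks: `process_and_upload_output.yml`

The included task file is referenced by the playbook but not shown above. A minimal sketch of what it might contain, assuming the script's JSON payload arrives on the SSM command's standard output and is written to the central bucket with `amazon.aws.s3_object`, is below; the object key layout and timestamp format are illustrative, not taken from the original design.

# process_and_upload_output.yml -- illustrative tasks, included with loop_var: result_item
- name: Parse the JSON payload emitted by the disk script
  ansible.builtin.set_fact:
    disk_payload: "{{ result_item.command.InvocationResult.StandardOutputContent | from_json }}"

- name: Upload the payload for this instance to the central bucket
  amazon.aws.s3_object:
    bucket: "{{ s3_bucket_name }}"
    # The registered loop result carries the loop variable (instance_id) alongside the module return.
    object: "disk-metrics/{{ result_item.instance_id }}/{{ now(utc=true, fmt='%Y-%m-%dT%H-%M-%S') }}.json"
    content: "{{ disk_payload | to_nice_json }}"
    mode: put
    region: "{{ s3_region }}"
  delegate_to: localhost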

Monitoring Dashboard

The aggregated data feeds a dashboard that provides a unified view of disk utilization across all accounts and instances, with filters for narrowing the view to specific accounts or instances.


Scalability and Future-Proofing

The solution is designed to scale seamlessly with enterprise growth. Key architectural choices ensure that onboarding new accounts and thousands of instances is an automated and efficient process.

AWS Organizations Integration

New AWS accounts are simply invited to the Organization, immediately placing them under the central governance and security model enforced by Service Control Policies (SCPs).

Automated IAM Deployment

Using Infrastructure as Code (CloudFormation StackSets or Terraform), the required `AnsibleDiskMonitorRole` is automatically and consistently deployed to new accounts.
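
As an illustration, a StackSet template for the spoke-account role might look roughly like the following; the hub account parameter and the exact permission list are assumptions rather than values from the original design, but they reflect the SSM-only scope described under Security by Design.

# ansible-disk-monitor-role.yml -- illustrative CloudFormation StackSet template
AWSTemplateFormatVersion: "2010-09-09"
Description: Cross-account role assumed by the Ansible control plane for disk monitoring.
Parameters:
  HubAccountId:
    Type: String
    Description: Account ID of the shared services (hub) account.
Resources:
  AnsibleDiskMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AnsibleDiskMonitorRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${HubAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: AnsibleDiskMonitorMinimal
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Instance discovery plus SSM Run Command; no ability to modify or delete resources.
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ssm:SendCommand
                  - ssm:GetCommandInvocation
                  - ssm:ListCommandInvocations
                Resource: "*"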

Ansible Dynamic Inventory

The `aws_ec2` plugin automatically discovers new instances across all accounts, eliminating manual inventory management and enabling playbooks to scale effortlessly.

Security by Design

Security is a foundational pillar of this solution, with best practices embedded at every layer to protect data and infrastructure.

IAM Least Privilege

Roles are configured with the absolute minimum permissions required. The spoke account role, for example, is limited to executing SSM commands and cannot modify or delete resources.
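
The same principle applies on the hub side: the Ansible control plane's instance profile can be restricted to assuming only the dedicated monitoring role. A sketch of such a policy, with an illustrative resource pattern, follows.

# Illustrative managed policy for the control plane's instance profile in the hub account.
ControlPlaneAssumePolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action: sts:AssumeRole
          # Only the dedicated monitoring role, in any spoke account.
          Resource: "arn:aws:iam::*:role/AnsibleDiskMonitorRole"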

Network Security

Using SSM for command execution eliminates the need for open inbound SSH ports on instances. All traffic to AWS services can be kept private within the AWS network using VPC Endpoints, preventing exposure to the public internet.
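
Keeping that traffic private typically means provisioning interface endpoints for the SSM service family in each workload VPC. The CloudFormation fragment below is a minimal sketch; the VPC, subnet, and security group references are placeholder parameters, not part of the original design.

# Fragment of a Resources section -- illustrative interface endpoints so instances
# reach SSM without internet egress.
SsmEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcEndpointType: Interface
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.ssm"
    VpcId: !Ref WorkloadVpcId            # placeholder parameter
    SubnetIds: !Ref PrivateSubnetIds     # placeholder parameter
    SecurityGroupIds:
      - !Ref EndpointSecurityGroup       # placeholder parameter
    PrivateDnsEnabled: true
SsmMessagesEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcEndpointType: Interface
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.ssmmessages"
    VpcId: !Ref WorkloadVpcId
    SubnetIds: !Ref PrivateSubnetIds
    SecurityGroupIds:
      - !Ref EndpointSecurityGroup
    PrivateDnsEnabled: true
Ec2MessagesEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcEndpointType: Interface
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.ec2messages"
    VpcId: !Ref WorkloadVpcId
    SubnetIds: !Ref PrivateSubnetIds
    SecurityGroupIds:
      - !Ref EndpointSecurityGroup
    PrivateDnsEnabled: true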

Data Encryption & Auditing

Data is encrypted in transit (TLS) and at rest (SSE-S3/KMS). All API calls and role assumptions are logged in AWS CloudTrail for a complete audit trail.
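
For the central bucket, both controls can be expressed directly in its definition. The sketch below assumes SSE-KMS with an existing key plus a TLS-only bucket policy; the key parameter name is a placeholder, while the bucket name matches the playbook variable above.

# Fragment of a Resources section -- illustrative encryption and TLS enforcement.
CentralMonitoringBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: central-monitoring-s3-bucket
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms
            KMSMasterKeyID: !Ref MonitoringKmsKeyArn   # placeholder parameter
CentralMonitoringBucketPolicy:
  Type: AWS::S3::BucketPolicy
  Properties:
    Bucket: !Ref CentralMonitoringBucket
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        # Deny any request that is not made over TLS.
        - Effect: Deny
          Principal: "*"
          Action: "s3:*"
          Resource:
            - !GetAtt CentralMonitoringBucket.Arn
            - !Sub "${CentralMonitoringBucket.Arn}/*"
          Condition:
            Bool:
              aws:SecureTransport: "false"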