Architecture

Solution Overview

  1. BucketArchiver’s Step Functions State Machine initiates an execution.
  2. The State Machine provisions an ephemeral EC2 instance.
  3. The EC2 instance reads data from the specified S3 input bucket.
  4. The EC2 instance streams a gzip-compressed tar archive to the specified S3 output bucket.
  5. Upon completion, the State Machine terminates the ephemeral EC2 instance.
  6. The execution results are published to the BucketArchiver SNS topic.
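
The streaming pattern in step 4 can be illustrated locally. The following is a minimal sketch, not BucketArchiver's actual implementation: it uses plain gzip as a stand-in for the parallel gzip tool, a local directory in place of the S3 input bucket, and a local file in place of the S3 output bucket. The function name stream_archive and the demo file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Local stand-in for step 4: tar writes the archive to stdout and gzip
# compresses the stream, so no uncompressed archive is stored on disk.
# (Assumption: plain gzip replaces the parallel gzip tool for illustration.)
stream_archive() {
  local src_dir="$1" out_file="$2"
  tar -cf - -C "$src_dir" . | gzip -c > "$out_file"
}

# Small demonstration with throwaway data.
demo_dir="$(mktemp -d)"
printf 'hello\n' > "$demo_dir/a.txt"
printf 'world\n' > "$demo_dir/b.txt"
demo_archive="$(mktemp -d)/demo.tar.gz"
stream_archive "$demo_dir" "$demo_archive"

# List the members to confirm the archive is readable with standard tools.
tar -tzf "$demo_archive"
```

Because the archive is produced as a stream, the final stage could in principle write to any sink that accepts standard input instead of a local file.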

Bucket Archive Files

BucketArchiver utilizes industry-standard, POSIX.1-1988 compliant tar and RFC 1952/RFC 1951 compatible parallel gzip implementations. The generated archives are multi-platform compatible and can be accessed on:

CloudFormation Stack

BucketArchiver employs an AWS CloudFormation stack to manage its components. Interaction with BucketArchiver is performed through AWS CloudFormation interfaces (Console, CLI, API/SDK). Configurable parameters include:

Modifications to the CloudFormation stack can be made at any time post-deployment to adjust these parameters.

State Machine

BucketArchiver orchestrates the archival workflow using an AWS Step Functions State Machine. Users can interact with the State Machine through AWS Step Functions interfaces (Console, CLI, API/SDK) to:

Ephemeral EC2 Instances

The State Machine provisions and terminates ephemeral EC2 instances as part of the archival workflow. These instances are optimized for archival tasks, leveraging parallel compression techniques on multi-core CPUs. The instances are based on Amazon Linux 2023.

SNS Topic

An Amazon SNS topic is provisioned to publish BucketArchiver State Machine execution results, including execution statuses and metadata. The SNS topic also supports optional email notifications.

Networking

BucketArchiver requires a VPC for EC2 instance provisioning. Two VPC deployment options are available:

User-Supplied Existing VPC

The VPC must allow the EC2 instance to access public endpoints for the following AWS services:

Dedicated VPC

Alternatively, a VPC tailored for BucketArchiver can be provisioned. This VPC features private subnets and AWS PrivateLink endpoints for secure, internal AWS service access. The VPC contains two subnets, and each subnet includes the following VPC endpoints:

Security

BucketArchiver is deployed within your AWS account, offering full control over data and compliance. IAM roles are provisioned for:

BucketArchiver Security Overview

Monitoring and Logging

Amazon CloudWatch is used for:

Limitations

Operations

CloudFormation Template Acquisition

You can obtain the BucketArchiver CloudFormation template from the AWS Marketplace. There are two template variants available:

  1. Existing VPC Deployment: Use this template if you wish to deploy the BucketArchiver CloudFormation stack within your existing VPC.
  2. Dedicated VPC Deployment: This template allows you to deploy the BucketArchiver stack along with a dedicated VPC.

CloudFormation Stack Deployment Steps

  1. Navigate to the BucketArchiver product page on AWS Marketplace.
  2. Select and deploy your desired template variant.
  3. Configure the CloudFormation template to create a stack.
  4. Optionally adjust deployment parameters such as input bucket, output bucket, instance type, archive name, and scheduler settings.

We provide a dedicated, interactive deployment guide at https://www.bucketarchiver.com/deployment-guide/.

The deployment usually completes within 3-5 minutes.

CloudFormation Parameters

Archive Parameters

| Parameter Name | Type | Default Value | Min/Max | Constraints | Description and Examples |
|---|---|---|---|---|---|
| InputBucket | String | None | 3-63 | [a-z0-9\-.]+ | Input S3 bucket name. Must have 3-63 characters; can contain lowercase letters, numbers, hyphens, and periods. |
| OutputBucket | String | None | 3-63 | [a-z0-9\-.]+ | Output S3 bucket name. Same constraints as InputBucket. |
| ArchivePattern | String | '*' | N/A | Custom Pattern | Defines the file or directory pattern for archiving. E.g., '*' for all files, '*.txt' for all text files. |
| ArchiveName | String | 'archive' | 1-128 | Alphanumeric, -, _, . | Specifies the archive name. The resulting file is named <YourSpecifiedName>_<ISODate>.tar.gz. E.g., 'archive_2023-09-15T14-30-45Z.tar.gz' if the name is 'archive'. |
| MaxExecutionTime | Number | 3600 | 1-86400 | N/A | Maximum time for the archival process in seconds. Ranges from 1 second to 24 hours. |
| Scheduler | String | None | 1-256 | Specific Formats | Scheduling expression for triggering. E.g., a specific time: at(2023-09-15T14:30:45), daily at a specific time: cron(30 14 * * ? *), or every 5 days: rate(5 days). |
| EmailAddress | String | '' | N/A | Email Format | Optional. Receives notifications when archival completes. E.g., user@example.com, or leave empty. |
| KMSKeysRestrictionList | CommaDelimitedList | '*' | N/A | ARN Format or * | Comma-separated list of KMS key ARNs to restrict access to. Use * for no restrictions. E.g., arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID,arn:aws:kms:REGION:ACCOUNT_ID:key/ANOTHER_KEY_ID. |
| LogGroupRetentionDays | Number | 7 | N/A | 1, 3 | Number of days to retain log events in CloudWatch. Allowed values are 1 and 3. |
| InstanceType | String | 't2.micro' | N/A | List of EC2 Types | EC2 instance type, such as m5.large or t3.small. |
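
Before deploying or updating a stack, it can help to check parameter values locally against the constraints listed above. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the checks mirror only the constraints in this table (actual S3 naming rules are stricter), and the timestamp layout is inferred from the documented example file name.

```shell
#!/usr/bin/env bash

# Checks a bucket name against the table's constraint: 3-63 characters
# drawn from lowercase letters, digits, hyphens, and periods.
is_valid_bucket_param() {
  local LC_ALL=C  # keep character ranges ASCII-only
  [[ "$1" =~ ^[a-z0-9.-]{3,63}$ ]]
}

# Checks ArchiveName: 1-128 characters, alphanumeric plus '-', '_', '.'.
is_valid_archive_name() {
  local LC_ALL=C
  [[ "$1" =~ ^[A-Za-z0-9._-]{1,128}$ ]]
}

# Builds the expected archive file name for a given ArchiveName.
# Assumption: the timestamp layout follows the documented example
# 'archive_2023-09-15T14-30-45Z.tar.gz'.
expected_archive_file() {
  local name="$1" stamp
  stamp="$(date -u +%Y-%m-%dT%H-%M-%SZ)"
  printf '%s_%s.tar.gz\n' "$name" "$stamp"
}

is_valid_bucket_param "my-input-bucket" && echo "bucket name ok"
is_valid_archive_name "archive" && echo "archive name ok"
expected_archive_file "archive"
```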

Dedicated VPC Network Configuration Parameters

| Parameter Name | Type | Default Value | Min/Max | Constraints | Description and Examples |
|---|---|---|---|---|---|
| VpcCidrBlock | String | '10.10.10.0/24' | N/A | CIDR Format | VPC CIDR block for EC2 instances, e.g., 10.0.0.0/24. |
| SubnetACidrBlock | String | '10.10.10.0/25' | N/A | CIDR Format | CIDR block for Public Subnet A, e.g., 10.0.0.0/26. |
| SubnetBCidrBlock | String | '10.10.10.128/25' | N/A | CIDR Format | CIDR block for Public Subnet B, e.g., 10.0.0.128/26. |
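
When choosing custom CIDR blocks, both subnet blocks need to fall inside the VPC block and must not overlap each other. The following is a minimal sketch of such a check using bash integer arithmetic. The function names are hypothetical, and the check is purely local; it does not query AWS.

```shell
#!/usr/bin/env bash

# Converts a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds when the second CIDR lies entirely inside the first CIDR.
cidr_contains() {
  local outer="$1" inner="$2"
  local outer_ip="${outer%/*}" outer_len="${outer#*/}"
  local inner_ip="${inner%/*}" inner_len="${inner#*/}"
  (( inner_len >= outer_len )) || return 1
  local mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$inner_ip") & mask ) == ( $(ip_to_int "$outer_ip") & mask ) ))
}

# Two CIDR blocks overlap exactly when one of them contains the other.
cidrs_overlap() {
  cidr_contains "$1" "$2" || cidr_contains "$2" "$1"
}

# Check the default values from the table.
vpc="10.10.10.0/24"
for subnet in "10.10.10.0/25" "10.10.10.128/25"; do
  if cidr_contains "$vpc" "$subnet"; then
    echo "$subnet fits inside $vpc"
  else
    echo "$subnet is outside $vpc"
  fi
done
```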

(Re-)Configuring CloudFormation Stack(s)

Deployment and continuous configuration of BucketArchiver are executed through CloudFormation. After deploying the CloudFormation template with the initial settings, you can modify these parameters as needed.

To change parameters of a deployed CloudFormation stack, you adjust the stack's parameters using the AWS Management Console, AWS CLI, or SDKs and pass the revised parameter values to CloudFormation. The service then compares the existing stack with the new parameters and updates the stack to match your settings.

Updating Configuration Parameters

  1. Via AWS Console

    • Go to the BucketArchiver stack you want to update in the CloudFormation console
    • Select ‘Update Stack’
    • Under the ‘Parameters’ section, locate the desired parameter and change/input your desired configuration
  2. Via AWS CLI

    When updating the stack via the AWS CLI, provide the parameter key and value using the --parameters option:

        aws cloudformation update-stack \
          --stack-name BucketArchiverStack-XXXYYY \
          --use-previous-template \
          --parameters ParameterKey=<KEY>,ParameterValue="<VALUE>"
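
When only one parameter should change, the remaining parameters can be carried over explicitly with UsePreviousValue=true; parameters omitted from an update otherwise fall back to the template defaults, if any. The following is a minimal sketch that builds such a parameter list. The function name build_update_parameters is hypothetical, and the parameter names are taken from the tables in this document.

```shell
#!/usr/bin/env bash

# Prints one --parameters entry per parameter name: the changed key gets
# its new value, every other key is marked UsePreviousValue=true.
build_update_parameters() {
  local changed_key="$1" new_value="$2"
  shift 2
  local key
  for key in "$@"; do
    if [[ "$key" == "$changed_key" ]]; then
      printf 'ParameterKey=%s,ParameterValue=%s\n' "$key" "$new_value"
    else
      printf 'ParameterKey=%s,UsePreviousValue=true\n' "$key"
    fi
  done
}

# Example: change only InstanceType and keep the other values.
mapfile -t params < <(build_update_parameters InstanceType m5.large \
  InputBucket OutputBucket InstanceType)
printf '%s\n' "${params[@]}"

# The resulting array could then be passed to the CLI, for example:
#   aws cloudformation update-stack \
#     --stack-name BucketArchiverStack-XXXYYY \
#     --use-previous-template \
#     --parameters "${params[@]}"
```

In practice the list would include every parameter defined by the deployed template. Values containing commas need additional quoting in the CLI shorthand syntax, and stacks that manage IAM resources may also require the --capabilities option.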

Execution Modes

The Scheduler parameter in the BucketArchiver CloudFormation template determines when the archival process is triggered. Scheduling uses the cron, rate, and at expressions from Amazon EventBridge, giving you precise control over when archival starts. Depending on the expression type, you can schedule recurring or one-time executions.

Rate Expression

Cron Expression

At() Expression

The at() expression specifies a unique timestamp when an event should fire, ideal for one-off scheduled tasks.

Using an at() expression yields a one-time execution at a specific point in time. If you specify a point in time in the past, the execution will never be started automatically, but it can still be triggered through an on-demand execution of the BucketArchiver AWS Step Functions state machine. Please note that selecting a past date in the at expression will immediately trigger an execution.

If you intend to use BucketArchiver for on-demand operation only (using the AWS console, Step Functions CLI/API, etc.), we advise using the at expression with a date in the far future, such as: at(2099-12-31T00:00:00)
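
To catch malformed Scheduler values before passing them to CloudFormation, the overall shape of an expression can be checked locally. The following is a minimal sketch: the function name scheduler_kind is hypothetical, and the patterns only check the general form of rate, cron, and at expressions, not the valid range of every field.

```shell
#!/usr/bin/env bash

# Classifies a Scheduler value as rate, cron, or at; prints "invalid" otherwise.
# The patterns check the overall shape only, not every field's range.
scheduler_kind() {
  local expr="$1"
  local rate_re='^rate\([0-9]+ (minute|minutes|hour|hours|day|days)\)$'
  local cron_re='^cron\(([^ ]+ ){5}[^ ]+\)$'
  local at_re='^at\([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\)$'
  if [[ "$expr" =~ $rate_re ]]; then
    echo rate
  elif [[ "$expr" =~ $cron_re ]]; then
    echo cron
  elif [[ "$expr" =~ $at_re ]]; then
    echo at
  else
    echo invalid
  fi
}

# Classify the example expressions used in this document.
scheduler_kind 'rate(5 days)'
scheduler_kind 'cron(30 14 * * ? *)'
scheduler_kind 'at(2099-12-31T00:00:00)'
```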

On-demand Archive Workflow

The on-demand workflow allows you to manually trigger the BucketArchiver AWS Step Functions state machine as needed. Follow these steps to execute an on-demand archival using either the AWS Console or the AWS CLI.

AWS Console

  1. Open the AWS Management Console.
  2. Navigate to AWS Step Functions.
  3. Locate the BucketArchiver state machine and start a new execution.
  4. Monitor the state machine’s execution and review the output in the designated output bucket.

AWS CLI

Using Bash

    REGION=$(aws configure get region)
    ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    aws stepfunctions start-execution --state-machine-arn "arn:aws:states:$REGION:$ACCOUNT_ID:stateMachine:BucketArchiverStateMachine-<XXXYYY>" --name "Execution-1" --input '{}'

Using PowerShell

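    # Read the configured region through the AWS CLI, as in the Bash variant
    $REGION = aws configure get region
    $ACCOUNT_ID = (Get-STSCallerIdentity).Account
    # ${...} delimits the variable names so the following ':' is not parsed as part of them
    aws stepfunctions start-execution --state-machine-arn "arn:aws:states:${REGION}:${ACCOUNT_ID}:stateMachine:BucketArchiverStateMachine-<XXXYYY>" --name "Execution-1" --input '{}'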
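
Step Functions expects execution names to be unique per state machine, so reusing a fixed name such as "Execution-1" for repeated on-demand runs can be rejected. The following is a minimal sketch of helpers that build the state machine ARN and a timestamped execution name. The function names and the OnDemand- prefix are hypothetical, and the example region, account ID, and suffix are placeholders.

```shell
#!/usr/bin/env bash

# Builds the state machine ARN from its parts. The suffix is the
# stack-specific part of the state machine name (a placeholder here).
state_machine_arn() {
  local region="$1" account_id="$2" suffix="$3"
  printf 'arn:aws:states:%s:%s:stateMachine:BucketArchiverStateMachine-%s\n' \
    "$region" "$account_id" "$suffix"
}

# Generates an execution name with a UTC timestamp so repeated runs
# started in different seconds do not collide.
unique_execution_name() {
  printf 'OnDemand-%s\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Example with placeholder values.
state_machine_arn "eu-west-1" "123456789012" "EXAMPLE"
unique_execution_name

# The values could then be used with the CLI, for example:
#   aws stepfunctions start-execution \
#     --state-machine-arn "$(state_machine_arn "$REGION" "$ACCOUNT_ID" "<suffix>")" \
#     --name "$(unique_execution_name)" \
#     --input '{}'
```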

Monitoring and Reporting

BucketArchiver integrates with CloudWatch metrics to streamline operational monitoring. In CloudWatch, metrics are organized into namespaces, serving as containers for specific metric categories. The BucketArchiver namespace contains:

EC2 Machine Metrics

Dimensions: InstanceId & InstanceType

BucketArchiver Archive Generation Metrics

Dimensions: InputBucket (indicating the source S3 bucket) & RunTimestamp (marking the task’s conclusion timestamp).

Logging

BucketArchiver logs important events and information to Amazon CloudWatch Logs. You can access the logs in the CloudWatch console under the /BucketArchiver-<CloudFormationStackId> log group. This log group contains the execution log of the AWS Step Functions state machine and the output of the compression process on the EC2 instance. The following log streams are created as part of the log group and can be used for troubleshooting purposes:

| Log Stream Name | Use Case | Notes |
|---|---|---|
| states/BucketArchiverStateMachine/<date>/<execution-id> | Log events of the BucketArchiver state machine | |
| invoke/<instance-id>/log | Log events of the BucketArchiver EC2 invocation process | Use this to understand which objects were included in the archival process. |
| archive/<input-bucket-name>/<date> | Log events of the BucketArchiver archival and compression process | Use this to understand potential S3 bucket access permission issues. |
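
For command-line troubleshooting, the log group name and the log stream prefixes listed above can be assembled with small helpers. The following is a minimal sketch: the function names and topic keywords are hypothetical, and the stack identifier is a placeholder.

```shell
#!/usr/bin/env bash

# Maps a troubleshooting topic to the matching log stream prefix.
log_stream_prefix() {
  case "$1" in
    state-machine) echo "states/BucketArchiverStateMachine/" ;;
    invoke)        echo "invoke/" ;;
    archive)       echo "archive/" ;;
    *)             echo "unknown topic: $1" >&2; return 1 ;;
  esac
}

# Prints the log group name for a given stack identifier.
log_group_name() {
  printf '/BucketArchiver-%s\n' "$1"
}

# Example with a placeholder stack identifier.
log_group_name "EXAMPLE-STACK-ID"
log_stream_prefix archive
```

With AWS CLI v2, these values could be combined as `aws logs tail "$(log_group_name "<stack-id>")" --log-stream-name-prefix "$(log_stream_prefix archive)" --since 1h` to follow recent archival output.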

Performance

BucketArchiver internally uses a parallel compression tool capable of exploiting multiple CPU cores, providing faster and more scalable compression of your buckets than traditional approaches. Compression and archival performance depends on multiple factors, such as:

We advise the following procedure for optimizing efficiency:

Picking Instance Types

When selecting an EC2 instance type for your BucketArchiver workflows, consider factors such as CPU, memory, and network performance. In the CloudFormation template, we have preselected a wide range of EC2 instance types that should provide good price/performance for BucketArchiver operation.

Cross Region Workflows

BucketArchiver supports archival workflows across different AWS regions. Keep in mind that cross-region data transfers may impact performance and incur additional data transfer costs. The dedicated VPC deployment variant is limited to accessing S3 buckets in the same region as the CloudFormation stack.

Patching and Updates

Although we maintain a heavily trimmed-down, Amazon Linux based AMI for BucketArchiver and the instances deployed from this AMI are ephemeral, a patch update will occasionally be required. If there is a crucial update for the BucketArchiver AMI, primarily related to critical CVEs (Common Vulnerabilities and Exposures), AWS Marketplace will send out a notification. Once you are notified, we will provide a set of updated CloudFormation templates that use an updated AMI not affected by the CVE. Updating BucketArchiver involves updating the template of every BucketArchiver CloudFormation stack that you have deployed.

Changelog

Initial Release

Start using our solution with a 14-day free trial.

Pick from our deployment options on AWS Marketplace.