AWS RDS Anonymization

Updated on 2024-11-28

OS: Ubuntu
RDS (MySQL) used: 5.7
Requirements:
- AWS account with RDS MySQL instance
- GitLab account
- Terraform installation
- Docker installation
- Vault installation

Introduction:

The General Data Protection Regulation (GDPR) is a European regulation that governs data processing uniformly throughout the European Union. It has applied since May 25, 2018.

It was designed around 3 objectives:
- enforce people’s rights
- make data processors more accountable
- enhance the credibility of regulation through closer cooperation between data protection authorities

Process:

To summarize a classic need: testers want to work with Production data, but in a less critical environment that they can access, namely Staging. To comply with the GDPR, however, we need to anonymize this data before exporting it to another environment.

To do this, we want to create a backup of a database in the Production environment, anonymize this backup and then restore it to a database in the Staging environment.

To ensure that the backup is carried out without any impact on our employees, we’re going to launch it at 4:00 a.m. with a CloudWatch event.

Here’s a diagram illustrating the anonymization process:

Technical Part:

Without going into too much detail, I’ll just give a few examples of how I implemented the project technically.

When I declare the “aws” provider, I specify the role to assume, which gives me access to the AWS account in question:

provider "aws" {
  region = var.region
  assume_role {
    role_arn = var.devops_role_env_map[terraform.workspace]
  }
}

Then I create my Lambda function, which runs my Python code. Its environment variables are populated through data sources that retrieve 2 values from Vault: GITLAB_PIPELINE_TOKEN and GITLAB_ANONYMIZE_PIPELINE_URL.

resource "aws_lambda_function" "rds_snapshot" {
  filename      = "lambda_package.zip"
  function_name = "rds_snapshot"
  role          = aws_iam_role.anonymization.arn
  handler       = "rds_snapshot.lambda_handler"

  # filebase64sha256 hashes the file contents, so the function is redeployed
  # when the package changes; base64sha256 would only hash the file name.
  source_code_hash = filebase64sha256("lambda_package.zip")

  timeout = 900
  runtime = "python3.7"

  environment {
    variables = {
      # Both GitLab values are read from Vault through a data source.
      GITLAB_PIPELINE_TOKEN         = data.vault_generic_secret.gitlab-pipeline.data.token
      GITLAB_ANONYMIZE_PIPELINE_URL = data.vault_generic_secret.gitlab-pipeline.data.url
      LAMBDA_WORKSPACE              = terraform.workspace
    }
  }
}
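To make the link between this resource and the Python side more concrete, here is a minimal sketch of what the rds_snapshot.lambda_handler entry point could look like. It is not the author's actual script: the instance naming scheme, the snapshot naming scheme and the optional rds_client parameter (added so the function can be exercised without AWS access) are all assumptions. The action and lang fields match the payload sent by the event target configured later in this post.

```python
import os
from datetime import datetime, timezone


def snapshot_identifier(instance_id, now=None):
    """Build a dated snapshot name (hypothetical naming scheme)."""
    now = now or datetime.now(timezone.utc)
    return f"{instance_id}-anonymization-{now:%Y%m%d}"


def lambda_handler(event, context, rds_client=None):
    """Entry point referenced by handler = "rds_snapshot.lambda_handler"."""
    # LAMBDA_WORKSPACE is set by the Terraform environment block.
    workspace = os.environ.get("LAMBDA_WORKSPACE", "default")
    action = event.get("action")
    lang = event.get("lang")

    if rds_client is None:
        # boto3 ships with the AWS Lambda Python runtime.
        import boto3
        rds_client = boto3.client("rds")

    if action == "create-snapshot":
        # Assumption: one RDS instance per workspace and language.
        instance_id = f"{workspace}-{lang}-db"
        snapshot_id = snapshot_identifier(instance_id)
        rds_client.create_db_snapshot(
            DBSnapshotIdentifier=snapshot_id,
            DBInstanceIdentifier=instance_id,
        )
        return {"action": action, "snapshot": snapshot_id}

    raise ValueError(f"Unsupported action: {action!r}")
```

Keeping the AWS client injectable is only a convenience for local testing; in Lambda the function is called with event and context alone and creates its own client.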

I then configure my 3 CloudWatch events, each of which triggers a specific action. Here’s an example, in 3 parts, of how to configure an event:

## RDS SNAPSHOT EVENT
resource "aws_cloudwatch_event_rule" "event_create_snapshot" {
  name                = "create-snapshot"
  description         = "Create RDS Snapshot, triggers at 4am from monday to friday"
  schedule_expression = "cron(0 3 ? * MON-FRI *)" # caution: aws cron uses UTC time
}

# Allow the event rule to invoke the Lambda function.
resource "aws_lambda_permission" "allow_event_create_snapshot" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.rds_snapshot.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.event_create_snapshot.arn
}

# One target per language; target IDs must be unique within a rule,
# so the language key is appended to the ID.
resource "aws_cloudwatch_event_target" "event_create_snapshot_target" {
  for_each  = local.lang
  rule      = aws_cloudwatch_event_rule.event_create_snapshot.name
  target_id = "rds_snapshot_${each.key}"
  arn       = aws_lambda_function.rds_snapshot.arn
  input     = <<JSON
        {
            "action": "create-snapshot",
            "lang": "${each.value}"
        }
JSON
}

Above, in order:

- I create an event rule that fires at 4:00 a.m. (03:00 UTC) from Monday to Friday.
- I authorize this event to invoke my Lambda function rds_snapshot.
- I create my event target, which invokes the Lambda function rds_snapshot with action and lang as arguments.
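Because AWS evaluates cron schedules in UTC, the hour in the expression differs from the local 4:00 a.m. target. The small sketch below shows the arithmetic, assuming the local time zone is UTC+1 (the post does not state it, so this offset is an assumption); the helper name is hypothetical.

```python
def utc_cron_for_local_hour(local_hour, utc_offset_hours, days="MON-FRI"):
    """Return an AWS cron expression firing at the given local hour.

    Caveat: if the conversion crosses midnight, the day-of-week field
    would also need shifting; that case is not handled here.
    """
    utc_hour = (local_hour - utc_offset_hours) % 24
    # AWS format: cron(Minutes Hours Day-of-month Month Day-of-week Year)
    return f"cron(0 {utc_hour} ? * {days} *)"


print(utc_cron_for_local_hour(4, 1))  # cron(0 3 ? * MON-FRI *)
print(utc_cron_for_local_hour(4, 2))  # cron(0 2 ? * MON-FRI *)
```

A fixed expression such as cron(0 3 ? * MON-FRI *) therefore corresponds to 4:00 a.m. only while the local offset is UTC+1; under a UTC+2 summer offset the same rule fires at 5:00 a.m. local time.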

As you can see, this pattern of 3 Terraform resources is repeated 3 times, giving 3 events:

- A schedule-based (cron) event for snapshot creation (example above).
- An event that fires when the snapshot becomes available, to invoke the restoration of the snapshot (as an RDS instance) via my Python script.
- A final event that fires when the restoration is complete, to send a POST request to GitLab, which anonymizes the restored instance, dumps it and restores it to the Staging database.
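The last step, the POST request to GitLab, could be sketched as follows. This assumes that GITLAB_ANONYMIZE_PIPELINE_URL points at GitLab's pipeline trigger endpoint (which accepts token, ref and variables[KEY] form fields); the ref name "main" and the RESTORED_INSTANCE variable are hypothetical and not taken from the author's script.

```python
import os
import urllib.parse
import urllib.request


def build_pipeline_request(url, token, ref="main", variables=None):
    """Build (without sending) the POST request that triggers a pipeline."""
    fields = {"token": token, "ref": ref}
    for key, value in (variables or {}).items():
        # Pipeline variables are passed as variables[KEY]=value form fields.
        fields[f"variables[{key}]"] = value
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def trigger_anonymization(restored_instance_id):
    """Send the trigger request using the Lambda environment variables."""
    request = build_pipeline_request(
        url=os.environ["GITLAB_ANONYMIZE_PIPELINE_URL"],
        token=os.environ["GITLAB_PIPELINE_TOKEN"],
        # Hypothetical variable telling the pipeline which instance to anonymize.
        variables={"RESTORED_INSTANCE": restored_instance_id},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Separating request construction from sending makes it possible to check the payload locally without contacting GitLab.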

Here’s a diagram explaining how to automate the RDS Snapshot backup: