GDPR Obfuscator


Introduction

The purpose of this project is to provide a general-purpose Python package that processes data stored in an AWS S3 bucket, obfuscating any personally identifiable information (PII) the data may contain. The result is an exact copy of the original data, except that the specified fields are replaced with obfuscated values such as ***.

The package ingests data directly from a specified AWS S3 bucket. It returns a bytes object that can easily be stored back in S3 or processed further in your data pipeline. The package integrates easily with existing AWS services such as Lambda, Glue, Step Functions, and EC2, and is fully compatible with serverless environments.

It is written in Python, fully tested with pytest, PEP 8-compliant (linted and formatted with ruff), and follows security best practices (scanned with bandit).

Currently the package supports ingesting and processing CSV, JSON, and Parquet files.

Tech Stack

The GDPR Obfuscator package is built with modern, high-performance Python libraries:

  • Polars: Used to convert files into DataFrames and perform the data obfuscation. It is a lightning-fast DataFrame library and a modern alternative to pandas
  • Boto3: AWS SDK for Python, enabling S3 integration
  • types-boto3: Type stubs for boto3, providing full IDE autocomplete and type safety

These dependencies are automatically installed when you install the package.

Quick Start

  1. Install the gdpr-obfuscator package in your virtual environment:

    uv add git+https://github.com/theorib/gdpr-obfuscator.git

    or

    pip install git+https://github.com/theorib/gdpr-obfuscator.git
  2. Configure your AWS CLI with the correct S3 permissions to allow the package to read from an S3 bucket (s3:GetObject). See Configuring the AWS CLI for detailed setup instructions.

  3. Use the package in your Python script:

    from gdpr_obfuscator import gdpr_obfuscator
    
    result_bytes = gdpr_obfuscator("s3://bucket-name/file-name.csv", ["email", "name"])

New to Python or AWS? Check the detailed setup instructions below.

Requirements

  • Python 3.10+ (tested with v3.10 through v3.13)
  • AWS permissions: at a minimum, s3:GetObject for the bucket(s) you'll be accessing
📖 New to Python or AWS? Click for detailed setup instructions

Installing Python

This package requires Python version 3.10 or higher installed and configured on your computer.

If you are using uv, you can install Python by running:

uv python install

Or by following their docs on Installing Python.

Otherwise, you can follow standard instructions on the Python official website to install Python manually.

AWS Account Setup

You'll need an active AWS account with the right permissions and credentials configured. This is required for anyone using this package as it reads data directly from an S3 bucket.

The required AWS IAM permissions depend on your use case:

  • Minimum: S3 read permissions (s3:GetObject) for the bucket(s) you'll be accessing
  • Optional: Also include S3 write permissions (s3:PutObject) if you plan to save obfuscated results back to the same S3 bucket

How you configure credentials and permissions depends on where you're running the package:

  • Running locally: Configure the AWS CLI with your credentials (see Configuring the AWS CLI below)
  • Running on an AWS environment (Lambda, EC2, ECS, etc): Use IAM roles attached to your compute resource

Terminal Basics

You'll need to be comfortable with the basics of running terminal commands using a terminal emulator. A terminal emulator comes built-in on macOS and Linux, and can easily be installed on Windows using Windows Terminal.

Optional Requirements

  • We recommend uv as your project's package manager. It can install Python versions, create a virtual environment, and manage dependencies for you automatically.
  • AWS CLI must be installed and configured with your AWS credentials if you want to run this package locally or deploy the sample Lambda infrastructure included in this repository. Make sure your AWS CLI is configured with the necessary permissions.

Configuring the AWS CLI

To use the GDPR Obfuscator package locally, you need the AWS CLI installed and configured with credentials that have as a minimum S3 read permissions (s3:GetObject) for the bucket you will be accessing.

📖 AWS CLI Setup Guide (click to expand)

Creating an AWS Account

If you don't have an AWS account yet, you can follow Launch Goat's excellent tutorial on creating an AWS account. This guide will walk you through AWS account creation and security setup.

Installing the Latest Version of the AWS CLI

Follow the official documentation to install the latest version of the AWS CLI for your operating system.

AWS CLI Authentication Options

There are two main approaches for authenticating the AWS CLI (AWS CLI credentials):

Option 1: IAM User with Access Keys (quickest setup for development/testing, not recommended for production)

This is the quickest method for local development and testing. Follow this step-by-step guide for IAM user and AWS CLI setup. Despite mentioning Windows in the title, most steps are platform-agnostic.

Security Note: Using long-lived access keys with broad permissions is convenient for development but is not recommended for production environments. Access keys can be exposed or compromised. For production deployments, use IAM roles (when running on AWS services like Lambda, EC2, or ECS) or SSO authentication.

Option 2: AWS SSO / IAM Identity Center (most robust security-focused approach, recommended for Production/Teams)

For production environments or team-based development, AWS recommends using AWS Single Sign-On (SSO) authentication through the IAM Identity Center.

Follow the AWS Launch Goat SSO setup guide for a good step-by-step walkthrough of this approach. Note that it requires setting up AWS Organizations and is more involved than the access-key method.

SSO authentication on the AWS CLI signs you out periodically. To sign back in, run:

aws sso login

Official AWS Documentation

For reference, AWS also provides comprehensive official documentation on installing, configuring, and authenticating the AWS CLI.

Required IAM Permissions

Ensure your credentials have, at a minimum, s3:GetObject for the bucket(s) you will read from, plus s3:PutObject if you plan to write obfuscated results back to S3.
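
As a sketch, a minimal IAM policy granting these permissions might look like the following (the bucket name is a placeholder, and the s3:PutObject statement is only needed if you write obfuscated results back to S3):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket-name/*"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket-name/*"
        }
    ]
}
```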

Installing GDPR Obfuscator in your Python Project

Install with uv (recommended):

uv add git+https://github.com/theorib/gdpr-obfuscator.git

Or with pip:

pip install git+https://github.com/theorib/gdpr-obfuscator.git
📖 Setting up a new Python project? Click here for detailed instructions

Creating a New Project with uv

If you haven't already, install uv. Then navigate to the directory where you wish to create your Python project and run:

uv init

Follow the onscreen prompts to initialize the project, then install the GDPR Obfuscator package as a dependency:

uv add git+https://github.com/theorib/gdpr-obfuscator.git

Creating a New Project with pip

Navigate to the directory where you wish to create your Python project and initialize your virtual environment:

python -m venv venv
source venv/bin/activate
export PYTHONPATH=$(pwd)

Then install the GDPR Obfuscator package as a dependency:

pip install git+https://github.com/theorib/gdpr-obfuscator.git

API Reference

gdpr_obfuscator(file_to_obfuscate, pii_fields, masking_string="***", file_type="csv")

This is the main function that processes CSV, JSON, and Parquet files and obfuscates specified PII fields.

Parameters

  • file_to_obfuscate (str): S3 address to the file to be obfuscated. Formatted as s3://<bucket_name>/<file_key> (e.g., s3://my-bucket-name/some_file_to_obfuscate.csv)
  • pii_fields (list[str]): List of column names (or fields) that contain PII data to be obfuscated (e.g. ["full_name", "date_of_birth", "address", "phone"])
  • masking_string (str, optional): String used to replace PII data (default: "***")
  • file_type (Literal["csv", "json", "parquet"], optional): Type of file to obfuscate, one of csv, json, or parquet (default: "csv")

Raises

  • ValueError: If an empty file_to_obfuscate is passed
  • FileNotFoundError: If the specified file doesn't exist (invalid S3 path)
  • KeyError: If any of the specified pii_fields are not found in the file
  • RuntimeError: If an unexpected S3 response error occurs

Returns

  • bytes: Obfuscated file as a bytes object, ready for S3 upload or further processing

Examples

Obfuscating a CSV file

from gdpr_obfuscator import gdpr_obfuscator

# Process a file with multiple PII fields
result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.csv",
    ["email", "phone", "address"]
)

Obfuscating a Parquet file

from gdpr_obfuscator import gdpr_obfuscator

result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.parquet",
    ["email", "phone", "address"],
    file_type="parquet"
)

Obfuscating a JSON file with a custom masking string

from gdpr_obfuscator import gdpr_obfuscator

result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.json",
    ["email", "phone", "address"],
    masking_string="#######",
    file_type="json"
)

Saving back to S3

The result can easily be saved back to S3 using a library such as Boto3:

import boto3
from gdpr_obfuscator import gdpr_obfuscator

obfuscated_bytes = gdpr_obfuscator("s3://bucket-name/file_key.csv", ["name","email", "phone", "address"])

s3_client = boto3.client('s3')
 
response = s3_client.put_object(
    Bucket='another-bucket-name',
    Key='file_key_obfuscated.csv',
    Body=obfuscated_bytes,
    ContentType="text/csv", # specifying a MIME content type is optional
)

Notes

  • All PII field values are replaced with *** by default but can be customized using the masking_string parameter
  • Non-PII columns remain unchanged
  • Original file structure and formatting are preserved
  • Compatible with CSV, JSON, and Parquet files
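
Conceptually, the obfuscation of a CSV file is equivalent to the following pure-Python sketch (an illustration only; the package itself uses Polars and handles JSON and Parquet as well):

```python
import csv
import io


def obfuscate_csv(data: bytes, pii_fields: list[str], masking_string: str = "***") -> bytes:
    """Replace the values of pii_fields with masking_string, leaving other columns intact."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            row[field] = masking_string  # overwrite only the specified PII columns
        writer.writerow(row)
    return out.getvalue().encode("utf-8")


sample = b"id,name,email\n1,Jane Doe,jane@example.com\n"
print(obfuscate_csv(sample, ["name", "email"]))
```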

Error Handling

The GDPR Obfuscator raises descriptive exceptions for common issues (see Raises above). Error messages guide you to the solution.

📖 Common Issues and Solutions (click to expand)

Malformed S3 Path

  • Error: FileNotFoundError
  • Error Message: Invalid S3 path: Missing or malformed "s3://" prefix
  • Correct format: gdpr_obfuscator("s3://bucket-name/file.csv", ["field_1", "field_2"])
  • Incorrect: gdpr_obfuscator("bucket/file.csv", ["field_1", "field_2"])
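
As an illustration of why the prefix matters, a valid path splits cleanly into a bucket and a key. The sketch below is hypothetical and may not match the package's internal parsing:

```python
def parse_s3_path(path: str) -> tuple[str, str]:
    """Split an s3://bucket/key path into (bucket, key), raising on malformed input."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise FileNotFoundError('Invalid S3 path: Missing or malformed "s3://" prefix')
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket or not key:
        raise FileNotFoundError("Invalid S3 path: expected s3://<bucket_name>/<file_key>")
    return bucket, key


print(parse_s3_path("s3://bucket-name/file.csv"))  # ('bucket-name', 'file.csv')
```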

Invalid S3 Bucket

  • Error: FileNotFoundError
  • Error Message: The specified key does not exist
  • Solution: Ensure the bucket name is correct and exists

Invalid S3 Key

  • Error: FileNotFoundError
  • Error Message: The specified key does not exist
  • Solution: Ensure the file key is correct and exists

Missing PII Fields

  • Error: KeyError
  • Error Message: PII fields not found: ["Email"]
  • Solution: Check your CSV headers match the pii_fields exactly
  • Case-sensitive: "Email" ≠ "email"
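
You can check for this mismatch up front with a small helper (a hypothetical snippet, not part of the package's API):

```python
def find_missing_fields(header: list[str], pii_fields: list[str]) -> list[str]:
    """Return the pii_fields that do not appear in the file's header (case-sensitive)."""
    return [field for field in pii_fields if field not in header]


header = ["id", "name", "email", "phone"]
print(find_missing_fields(header, ["Email", "phone"]))  # ['Email'] -- "Email" != "email"
```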

File Format Issues

  • Only CSV, JSON, and Parquet files are currently supported
  • CSV Files must have proper CSV headers in the first row
  • JSON Files must have proper JSON structure
  • Parquet Files must have proper Parquet structure
  • File size: up to 1MB is recommended for optimal performance, though larger files are supported

Performance

The GDPR Obfuscator is designed to handle large files efficiently. It can easily process files larger than 1MB with thousands of rows and is optimized for performance.

During local testing, processing a large 1MB CSV file with 7,032 rows took 1.602619s.

99.8% of that time was spent on network overhead (connecting to S3 and retrieving the data).

The actual time spent processing the data was only 0.003911s (0.2% of total time).

You can run performance tests locally if you want.

Performance Summary

| Metric | Full (S3 + Processing) | Mocked (Processing Only) | Difference |
|---|---|---|---|
| Execution Time | 1.602619s | 0.003911s | 1.598708s (99.8%) |
| Throughput | 4,388 rows/s | 1,797,872 rows/s | 409.74x faster |
| Function Calls | 26,852 | 491 | +26,361 |
| Primitive Calls | 25,113 | 489 | +24,624 |

Data Processed

  • File size: 1.00 MB
  • Rows: 7,032
  • Fields obfuscated: 4
  • Total fields processed: 28,128

Key Insights

  • Network overhead: 1.598708s (99.8% of total time)
  • Processing time: 0.003911s (0.2% of total time)

AWS Lambda Deployment Example

This repository includes a complete, production-ready example of deploying an AWS Lambda function that uses the GDPR Obfuscator package. It uses Pulumi for infrastructure as code. Pulumi is a modern alternative to Terraform that allows you to write infrastructure code in your own language (Python in this case).

The sample infrastructure shows best practices for using this package on an AWS production environment, including:

  • Lambda Layers architecture: The GDPR Obfuscator package and its dependencies are deployed as separate Lambda Layers, following AWS best practices for code organization and deployment efficiency (changing the Lambda source code does not require redeploying the dependency layers)
  • Proper IAM configuration: Includes secure IAM roles and policies that follow the principle of least-privilege in order to give the Lambda access to an S3 bucket with minimal permissions
  • S3 integration: Sample S3 bucket with test data to help you get started quickly
  • CloudWatch logging: Configured with JSON formatted logs for monitoring and debugging
  • Infrastructure as Code: Everything is defined in Pulumi, making it easy to deploy, modify, and tear down

The Lambda function can be triggered manually via the included make script, or integrated with EventBridge, Step Functions, or other AWS services to create automated data obfuscation pipelines.

You can find detailed step-by-step instructions for deploying and testing the sample infrastructure in the Deploying sample infrastructure into AWS section below.

It serves as a reference implementation if you're planning to use the GDPR Obfuscator in your own AWS environment.

Local Development and Testing

📖 Getting started with local development (click to expand)

Development Environment Setup

You'll need a terminal emulator to run this project in a local development environment.

The commands you'll see below can be copy/pasted into your terminal. After pasting each command, press Enter to execute it.

Required Tools

  • uv - Package manager used in this project (see Installing uv)
  • git - To clone this repository
  • make - To run makefile commands

Optional Tools

  • ruff - For code formatting and linting
  • AWS CLI - For running this package locally or deploying sample infrastructure (see Configuring the AWS CLI)
  • Pulumi - For deploying sample infrastructure to AWS Lambda

Installing uv

📖 How to install uv (click to expand)

uv is an extremely fast Python package and project manager, written in Rust. It is used to build and run this project.

If Python is already installed on your system, uv will detect and use it without configuration. However, if you don't have Python installed, uv will automatically install missing Python versions as needed β€” you don't need to install Python to get started.

On macOS or Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

On Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

For other setups or troubleshooting, please refer to the official uv documentation.

Cloning this repository

In your terminal, navigate to the directory where you want this project to live and clone the repository with the following command:

git clone https://github.com/theorib/gdpr-obfuscator

Makefile commands

Run make help to see all available commands.

📖 About Makefile commands (click to expand)

We provide a series of Makefile commands to help you navigate this project. You can get a complete list of commands with descriptions by running:

make help

This will display all available commands including setup, testing, linting, formatting, and deployment operations.

Project Setup

Make sure uv and make are installed on your system and then from the root of your cloned project directory run:

make setup

This will install all dependencies and run checks to ensure everything is set up correctly. You should see all tests passing, as well as security and test coverage reports, in your terminal.

Deploying sample infrastructure into AWS

This project contains Infrastructure as Code (IaC) scripts using Pulumi, which is a modern Python-native alternative to Terraform.

It demonstrates a complete AWS deployment workflow using the GDPR Obfuscator package. It's a real-world example that follows AWS best practices and can serve as a reference for your own deployments.

The current Pulumi setup is ready to:

  • Create a sample S3 bucket
  • Load the S3 bucket with test data
  • Create a sample Lambda function that reads from that S3 bucket, obfuscates the stored data using the GDPR Obfuscator package, and saves the processed data back to the same bucket
  • Configure the necessary IAM roles and policies following the principle of least privilege
  • Set up CloudWatch logging for monitoring and debugging

Required IAM Permissions for Sample Infrastructure Deployment

To deploy the sample infrastructure using Pulumi, your AWS CLI credentials must have permissions to create and manage the following AWS resources:

  • aws:s3:Bucket - Create and manage S3 buckets
  • aws:s3:BucketObject - Upload and manage objects in S3 buckets
  • aws:iam:Role - Create IAM roles for Lambda execution
  • aws:iam:Policy - Create IAM policies for resource access
  • aws:iam:RolePolicyAttachment - Attach policies to IAM roles
  • aws:lambda:LayerVersion - Create Lambda layers for dependencies
  • aws:lambda:Function - Create and configure Lambda functions
  • aws:cloudwatch:LogGroup - Create CloudWatch log groups for Lambda logging

Note: In production environments, you should always follow the principle of least privilege and grant only the specific permissions needed. For testing and development, you may use an IAM user or role with a broader permission set, but ensure you understand the security implications.

To deploy the sample infrastructure, follow these steps:

Run all of the commands below from the root of your cloned repository.

  1. Make sure you have followed the Project Setup steps above

  2. Make sure you have the Pulumi CLI installed and set up

  3. Make sure you have the AWS CLI installed and configured with your AWS credentials (see Configuring the AWS CLI)

  4. Make sure that the credentials configured in the AWS CLI have the necessary permissions to create and manage resources in your AWS account

  5. Setup pulumi for this project:

    make sample-infrastructure-setup
  6. Deploy sample infrastructure into your AWS account:

    make sample-infrastructure-deploy

    This will create a sample S3 bucket, load test data, and create a sample Lambda function.

You can now log in to your AWS Management Console and inspect the Lambda and S3 buckets that were created, as well as the sample data that was loaded into the bucket.

Running live tests with the sample infrastructure and GDPR Obfuscator

For convenience, we provide a set of make scripts that run the Lambda using the sample test files.

The following command will send an event to the Lambda using a small sample test file.

make sample-infrastructure-run-test

The following script will send an event with a large, 1MB CSV test file containing 7,031 data rows.

make sample-infrastructure-run-test-large

After running these scripts, you can check the output files saved to the same test bucket in the AWS Management Console. The newly created file keys are suffixed with _obfuscated before the extension (for example, large_pii_data.csv becomes large_pii_data_obfuscated.csv).
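
The suffixing rule can be sketched as follows (a hypothetical helper mirroring the behaviour described above; the sample Lambda's actual implementation may differ):

```python
from pathlib import PurePosixPath


def obfuscated_key(file_key: str, suffix: str = "_obfuscated") -> str:
    """Insert suffix before the file extension of an S3 key."""
    path = PurePosixPath(file_key)
    return str(path.with_name(path.stem + suffix + path.suffix))


print(obfuscated_key("large_pii_data.csv"))  # large_pii_data_obfuscated.csv
```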

Once you are done testing and want to clean up the AWS resources that were created, you can run:

make sample-infrastructure-destroy

Running manual tests on the sample infrastructure using the AWS management console

You can also test the Lambda manually by giving it a test event of your choosing. Make sure it references an existing bucket, existing file keys, and valid pii fields. If you want to use the sample bucket and test data provided, you can get their values by running:

make sample-infrastructure-get-output

The CSV columns included in those files are the following. You can pick and choose any combination of them to obfuscate:

["id", "name", "email_address", "phone_number", "date_of_birth", "address", "salary", "department", "hire_date", "project_code", "status", "region"]

The sample Lambda expects an event object as its first argument, with the following shape:

{
    "file_to_obfuscate": "s3://<bucket-name>/<file-key>",
    "pii_fields": [
        "field_1",
        "field_2",
        "field_3"
    ],
    "destination_bucket": "<bucket-name>"
}

Replace <bucket-name>, <file-key>, and the pii_fields list with the values you want (such as those from the make sample-infrastructure-get-output command), or with values from any other bucket you may have that contains test data.
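
Before invoking the obfuscator, an event of this shape can be sanity-checked with plain Python. This is an illustrative sketch, not the sample Lambda's actual code:

```python
def validate_event(event: dict) -> None:
    """Raise ValueError if the event is missing required keys or has malformed values."""
    for key in ("file_to_obfuscate", "pii_fields", "destination_bucket"):
        if key not in event:
            raise ValueError(f"Missing required event key: {key}")
    if not event["file_to_obfuscate"].startswith("s3://"):
        raise ValueError('file_to_obfuscate must start with "s3://"')
    if not isinstance(event["pii_fields"], list) or not event["pii_fields"]:
        raise ValueError("pii_fields must be a non-empty list of column names")


validate_event({
    "file_to_obfuscate": "s3://my-bucket/data.csv",
    "pii_fields": ["name", "email_address"],
    "destination_bucket": "my-bucket",
})  # passes silently
```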

Running performance tests locally

This repository includes a performance profiling script which you can run locally by using the following make command:

make profile-gdpr-obfuscator

Before you do so, make sure the sample infrastructure is deployed.

Alternatively, you can import the gdpr_obfuscator_profiling function and run it directly with any suitable files you may have hosted on S3:

from src.gdpr_obfuscator_profiling.gdpr_obfuscator_profiling import gdpr_obfuscator_profiling

gdpr_obfuscator_profiling(
    file_to_obfuscate="s3://some-bucket/some_file.parquet",
    pii_fields=["name", "email_address", "phone_number", "address"],
    profiling_data_output_dir="/profile-report",
    file_type="parquet"
)
