# GDPR Obfuscator

## Table of Contents
- Introduction
- Tech Stack
- Quick Start
- Requirements
- Optional Requirements
- Configuring the AWS CLI
- Installing GDPR Obfuscator in your Python Project
- API Reference
- Error Handling
- Performance
- AWS Lambda Deployment Example
- Local Development and Testing
## Introduction
The purpose of this project is to create a general-purpose Python package that can process data stored on an AWS S3 bucket, obfuscating any personally identifiable information (PII) the data may contain. The generated result is an exact copy of the original data except for the specified data fields that are replaced with obfuscated values such as ***.
The package is designed to ingest data directly from a specified AWS S3 bucket. It returns a `bytes` object that can easily be stored back into S3 or processed further in the data pipeline. The package integrates easily with existing AWS services such as Lambda, Glue, Step Functions, and EC2, and is fully compatible with serverless environments.
It is written in Python, fully tested with pytest, PEP-8 compliant (linted and formatted with ruff), and follows best practices for security (scanned with bandit) and performance.
Currently the package supports ingesting and processing CSV, JSON, and Parquet files.
## Tech Stack
The GDPR Obfuscator package is built with modern, high-performance Python libraries:
- Polars: Used to convert files into DataFrames and perform the data obfuscation. Polars is a lightning-fast DataFrame library and a modern alternative to pandas
- Boto3: AWS SDK for Python, enabling S3 integration
- types-boto3: Type stubs for boto3, providing full IDE autocomplete and type safety
These dependencies are automatically installed when you install the package.
## Quick Start

1. Install the `gdpr-obfuscator` package in your virtual environment:

   ```shell
   uv add git+https://github.com/theorib/gdpr-obfuscator.git
   ```

   or

   ```shell
   pip install git+https://github.com/theorib/gdpr-obfuscator.git
   ```

2. Configure your AWS CLI with the correct S3 permissions (`s3:GetObject`) so the package can read from an S3 bucket. See Configuring the AWS CLI for detailed setup instructions.

3. Use the package in your Python script:

   ```python
   from gdpr_obfuscator import gdpr_obfuscator

   result_bytes = gdpr_obfuscator("s3://bucket-name/file-name.csv", ["email", "name"])
   ```
New to Python or AWS? Check the detailed setup instructions below.
## Requirements
- Python 3.10+ (tested with v3.10 through v3.13)
- AWS permissions: as a minimum, `s3:GetObject` for the bucket you will be accessing
  - When running locally: Configure the AWS CLI
  - When running on an AWS environment (Lambda, EC2, ECS): set up the correct IAM roles, policies, and permissions for the environment you are using
**New to Python or AWS? Detailed setup instructions**
This package requires Python version 3.10 or higher installed and configured on your computer.
If you are using uv, you can install Python by running:

```shell
uv python install
```

Or follow their docs on Installing Python.
Otherwise, you can follow standard instructions on the Python official website to install Python manually.
You'll need an active AWS account with the right permissions and credentials configured. This is required for anyone using this package as it reads data directly from an S3 bucket.
The required AWS IAM permissions depend on your use case:
- Minimum: S3 read permission (`s3:GetObject`) for the bucket(s) you'll be accessing
- Optional: also include S3 write permission (`s3:PutObject`) if you plan to save obfuscated results back to the same S3 bucket
How you configure credentials and permissions depends on where you're running the package:
- Running locally: Configure the AWS CLI with your credentials (see Configuring the AWS CLI below)
- Running on an AWS environment (Lambda, EC2, ECS, etc): Use IAM roles attached to your compute resource
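As an illustration, a least-privilege IAM policy granting only the read permission described above might look like the following sketch. The bucket name `my-bucket-name` is a placeholder; adapt the resource ARN to your own bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket-name/*"
    }
  ]
}
```

Add a matching `s3:PutObject` statement if you plan to write obfuscated results back to S3.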
You'll need to be comfortable running basic terminal commands. A terminal emulator comes built in on macOS and Linux, and can be easily installed on Windows using Windows Terminal.
## Optional Requirements
- We recommend uv as your project's package manager. It can install Python versions, create a virtual environment, and manage dependencies for you automatically.
- AWS CLI must be installed and configured with your AWS credentials if you want to run this package locally and for deploying the sample Lambda infrastructure included in this repository. Make sure your AWS CLI is configured with the necessary permissions.
## Configuring the AWS CLI
To use the GDPR Obfuscator package locally, you need the AWS CLI installed and configured with credentials that have as a minimum S3 read permissions (s3:GetObject) for the bucket you will be accessing.
**AWS CLI Setup Guide**
If you don't have an AWS account yet, you can follow Launch Goat's excellent tutorial on creating an AWS account. This guide will walk you through AWS account creation and security setup.
Follow the official documentation to install the latest version of the AWS CLI for your operating system.
There are two main approaches for authenticating the AWS CLI (AWS CLI credentials):
Option 1: IAM User with Access Keys (quickest setup for development/testing, not recommended for production)
This is the quickest method for local development and testing. Follow this step-by-step guide for IAM user and AWS CLI setup. Despite mentioning Windows in the title, most steps are platform-agnostic.
Security Note: Using long-lived access keys with broad permissions is convenient for development but is not recommended for production environments. Access keys can be exposed or compromised. For production deployments, use IAM roles (when running on AWS services like Lambda, EC2, or ECS) or SSO authentication.
Option 2: AWS SSO / IAM Identity Center (most robust security-focused approach, recommended for Production/Teams)
For production environments or team-based development, AWS recommends using AWS Single Sign-On (SSO) authentication through the IAM Identity Center.
Follow the AWS Launch Goat SSO setup guide for a step-by-step walkthrough of this approach. Note that this requires setting up AWS Organizations and is more involved than the access key method.
SSO authentication on the AWS CLI signs you out periodically. To sign back in, run:

```shell
aws sso login
```

For reference, AWS also provides comprehensive official documentation on CLI authentication.
Ensure your credentials have these permissions:

- Minimum: `s3:GetObject` for reading files from the S3 bucket you will be accessing
- Optional: `s3:PutObject` if you plan to save obfuscated data back to S3
- For deploying sample infrastructure: see the required IAM Permissions for Sample Infrastructure Deployment
## Installing GDPR Obfuscator in your Python Project

Install with uv (recommended):

```shell
uv add git+https://github.com/theorib/gdpr-obfuscator.git
```

Or with pip:

```shell
pip install git+https://github.com/theorib/gdpr-obfuscator.git
```

**Setting up a new Python project? Detailed instructions**
If you haven't already, install uv. Then navigate to the directory where you wish to create your Python project and run:

```shell
uv init
```

Follow the onscreen prompts to initialize the project, then install the GDPR Obfuscator package as a dependency:

```shell
uv add git+https://github.com/theorib/gdpr-obfuscator.git
```

Alternatively, using pip: navigate to the directory where you wish to create your Python project and initialize your virtual environment:

```shell
python -m venv venv
source venv/bin/activate
export PYTHONPATH=$(pwd)
```

Then install the GDPR Obfuscator package as a dependency:

```shell
pip install git+https://github.com/theorib/gdpr-obfuscator.git
```

## API Reference

`gdpr_obfuscator` is the main function. It processes CSV, JSON, and Parquet files and obfuscates the specified PII fields.
Parameters:

- `file_to_obfuscate` (`str`): S3 address of the file to be obfuscated, formatted as `s3://<bucket_name>/<file_key>` (e.g., `s3://my-bucket-name/some_file_to_obfuscate.csv`)
- `pii_fields` (`list[str]`): list of column names (or fields) that contain PII data to be obfuscated (e.g. `["full_name", "date_of_birth", "address", "phone"]`)
- `masking_string` (`str`): string used to replace PII data (default is `"***"`)
- `file_type` (`Literal["csv", "json", "parquet"]`): type of file to obfuscate, one of `csv`, `json`, or `parquet` (default is `"csv"`)
Raises:

- `ValueError`: if an empty `file_to_obfuscate` is passed
- `FileNotFoundError`: if the specified file doesn't exist (invalid S3 path)
- `KeyError`: if any of the specified `pii_fields` are not found in the file
- `RuntimeError`: if an unexpected S3 response error occurs
Returns:

- `bytes`: the obfuscated file as a `bytes` object, ready for S3 upload or further processing
Basic CSV example:

```python
from gdpr_obfuscator import gdpr_obfuscator

# Process a file with multiple PII fields
result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.csv",
    ["email", "phone", "address"]
)
```

Parquet example:

```python
from gdpr_obfuscator import gdpr_obfuscator

result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.parquet",
    ["email", "phone", "address"],
    file_type="parquet"
)
```

JSON example with a custom masking string:

```python
from gdpr_obfuscator import gdpr_obfuscator

result = gdpr_obfuscator(
    "s3://my-bucket/customer-data.json",
    ["email", "phone", "address"],
    masking_string="#######",
    file_type="json"
)
```

The result can easily be saved back to S3 using a library such as Boto3:
```python
import boto3
from gdpr_obfuscator import gdpr_obfuscator

obfuscated_bytes = gdpr_obfuscator(
    "s3://bucket-name/file_key.csv",
    ["name", "email", "phone", "address"],
)

s3_client = boto3.client("s3")
response = s3_client.put_object(
    Bucket="another-bucket-name",
    Key="file_key_obfuscated.csv",
    Body=obfuscated_bytes,
    ContentType="text/csv",  # specifying a MIME content type is optional
)
```

Notes:

- All PII field values are replaced with `***` by default; this can be customized using the `masking_string` parameter
- Non-PII columns remain unchanged
- Original file structure and formatting are preserved
- Compatible with CSV, JSON, and Parquet files
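To make these masking semantics concrete, here is a minimal, illustrative sketch using only the standard library. The real package uses Polars internally; the `mask_csv` function below is hypothetical and shown only to demonstrate the behavior described above:

```python
import csv
import io


def mask_csv(data: bytes, pii_fields: list[str], masking_string: str = "***") -> bytes:
    """Replace every value in the given columns with the masking string."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            row[field] = masking_string  # non-PII columns pass through unchanged
        writer.writerow(row)
    return out.getvalue().encode("utf-8")


original = b"name,email,age\nAlice,alice@example.com,30\n"
masked = mask_csv(original, ["name", "email"])
```

After masking, the header row and the non-PII `age` column are preserved, while `name` and `email` become `***`.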
## Error Handling

The GDPR Obfuscator raises descriptive exceptions for common issues (see Raises above). Error messages guide you to the solution.
**Common Issues and Solutions**
Malformed S3 path:

- Error: `FileNotFoundError`
- Error message: `Invalid S3 path: Missing or malformed "s3://" prefix`
- Correct format: `gdpr_obfuscator("s3://bucket-name/file.csv", ["field_1", "field_2"])`
- Incorrect: `gdpr_obfuscator("bucket/file.csv", ["field_1", "field_2"])`

Bucket does not exist:

- Error: `FileNotFoundError`
- Error message: `The specified key does not exist`
- Solution: ensure the bucket name is correct and the bucket exists

File key does not exist:

- Error: `FileNotFoundError`
- Error message: `The specified key does not exist`
- Solution: ensure the file key is correct and the file exists

PII field not found:

- Error: `KeyError`
- Error message: `PII fields not found: ["Email"]`
- Solution: check that your CSV headers match the `pii_fields` exactly
- Field names are case-sensitive: `"Email"` ≠ `"email"`
- Only CSV, JSON, and Parquet files are currently supported
- CSV files must have proper CSV headers in the first row
- JSON files must be well-formed JSON
- Parquet files must be valid Parquet
- Recommended maximum file size: 1MB for optimal performance
## Performance

The GDPR Obfuscator is designed to handle large files efficiently. It can easily process files larger than 1MB with thousands of rows and is optimized for performance.

During local testing, processing a 1MB CSV file with 7,032 rows took 1.602619s. 99.8% of that time was network overhead (connecting to S3 and retrieving the data); the actual data processing took only 0.003911s (0.2% of the total).
You can run performance tests locally if you want.
| Metric | Full (S3 + Processing) | Mocked (Processing Only) | Difference |
|---|---|---|---|
| Execution Time | 1.602619s | 0.003911s | 1.598708s (99.8%) |
| Throughput | 4,388 rows/s | 1,797,872 rows/s | 409.74x faster |
| Function Calls | 26,852 | 491 | +26,361 |
| Primitive Calls | 25,113 | 489 | +24,624 |
- File size: 1.00 MB
- Rows: 7,032
- Fields obfuscated: 4
- Total fields processed: 28,128
- Network overhead: 1.598708s (99.8% of total time)
- Processing time: 0.003911s (0.2% of total time)
## AWS Lambda Deployment Example
This repository includes a complete, production-ready example of deploying an AWS Lambda function that uses the GDPR Obfuscator package. It uses Pulumi for infrastructure as code. Pulumi is a modern alternative to Terraform that allows you to write infrastructure code in your own language (Python in this case).
The sample infrastructure shows best practices for using this package on an AWS production environment, including:
- Lambda Layers architecture: the GDPR Obfuscator package and its dependencies are deployed as separate Lambda Layers, following AWS best practices for code organization and deployment efficiency (updating the Lambda source code does not require redeploying the dependency layers)
- Proper IAM configuration: Includes secure IAM roles and policies that follow the principle of least-privilege in order to give the Lambda access to an S3 bucket with minimal permissions
- S3 integration: Sample S3 bucket with test data to help you get started quickly
- CloudWatch logging: Configured with JSON formatted logs for monitoring and debugging
- Infrastructure as Code: Everything is defined in Pulumi, making it easy to deploy, modify, and tear down
The Lambda function can be triggered manually via the included make script, or integrated with EventBridge, Step Functions, or other AWS services to create automated data obfuscation pipelines.
You can find detailed step-by-step instructions for deploying and testing the sample infrastructure in the Deploying sample infrastructure into AWS section below.
It serves as a reference implementation if you're planning to use the GDPR Obfuscator in your own AWS environment.
## Local Development and Testing
**Getting started with local development**
You'll need a terminal emulator to run this project in a local development environment.
The commands you'll see below can be copy/pasted into your terminal. After pasting each command, press Enter to execute it.
- uv - Package manager used in this project (see Installing uv)
- git - To clone this repository
- make - To run makefile commands
- ruff - For code formatting and linting
- AWS CLI - For running this package locally or deploying sample infrastructure (see Configuring the AWS CLI)
- Pulumi - For deploying sample infrastructure to AWS Lambda
**How to install uv**
uv is an extremely fast Python package and project manager, written in Rust. It is used to build and run this project.
If Python is already installed on your system, uv will detect and use it without configuration. However, if you don't have Python installed, uv will automatically install missing Python versions as needed; you don't need to install Python to get started.
On macOS or Linux:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

On Windows:

```shell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

For other setups or troubleshooting, please refer to the official uv documentation.
On your terminal, navigate to the directory you want to add this project to and clone this repository using the following command:

```shell
git clone https://github.com/theorib/gdpr-obfuscator
```

Run `make help` to see all available commands.
**About Makefile commands**
We provide a series of Makefile commands to help you navigate this project. You can get a complete list of commands with descriptions by running:

```shell
make help
```

This will display all available commands, including setup, testing, linting, formatting, and deployment operations.
Make sure uv and make are installed on your system, then run the following from the root of your cloned project directory:

```shell
make setup
```

This will install all dependencies and run checks to ensure everything is set up correctly. You should see all tests passing, along with security and test coverage reports, in your terminal.
This project contains Infrastructure as Code (IaC) scripts using Pulumi, which is a modern Python-native alternative to Terraform.
It demonstrates a complete AWS deployment workflow using the GDPR Obfuscator package. It's a real-world example that follows AWS best practices and can serve as a reference for your own deployments.
The current Pulumi setup is ready to:
- Create a sample S3 bucket
- Load the S3 bucket with test data
- Create a sample Lambda function that uses the GDPR Obfuscator package to read data from that S3 bucket, obfuscate it, and save the processed data back to the same bucket
- Configure the necessary IAM roles and policies following the principle of least privilege
- Set up CloudWatch logging for monitoring and debugging
To deploy the sample infrastructure using Pulumi, your AWS CLI credentials must have permissions to create and manage the following AWS resources:
- `aws:s3:Bucket` - create and manage S3 buckets
- `aws:s3:BucketObject` - upload and manage objects in S3 buckets
- `aws:iam:Role` - create IAM roles for Lambda execution
- `aws:iam:Policy` - create IAM policies for resource access
- `aws:iam:RolePolicyAttachment` - attach policies to IAM roles
- `aws:lambda:LayerVersion` - create Lambda layers for dependencies
- `aws:lambda:Function` - create and configure Lambda functions
- `aws:cloudwatch:LogGroup` - create CloudWatch log groups for Lambda logging
Note: In production environments, you should always follow the principle of least privilege and grant only the specific permissions needed. For testing and development, you may use an IAM user or role with a broader permission set, but ensure you understand the security implications.
Run all of the commands below from the root of your cloned repository.
1. Make sure you have followed the setup steps above
2. Make sure you have the Pulumi CLI installed and set up
3. Make sure you have the AWS CLI installed and configured with your AWS credentials (see Configuring the AWS CLI)
4. Make sure the credentials configured in the AWS CLI have the necessary permissions to create and manage resources in your AWS account
5. Set up Pulumi for this project:

   ```shell
   make sample-infrastructure-setup
   ```

6. Deploy the sample infrastructure into your AWS account:

   ```shell
   make sample-infrastructure-deploy
   ```
This will create a sample S3 bucket, load test data, and create a sample Lambda function.
You can now log in to your AWS Management Console and inspect the Lambda and S3 bucket that were created, as well as the sample data that was loaded into the bucket.
For convenience, we have provided a set of make scripts that will run the lambda using the sample test files.
The following command will send an event to the Lambda using a small sample test file:

```shell
make sample-infrastructure-run-test
```

The following command will send an event with a large, 1MB CSV test file containing 7,031 data rows:

```shell
make sample-infrastructure-run-test-large
```

After running these scripts, you can check the output files saved to the same test bucket in the AWS Management Console. The newly created file keys are suffixed with `_obfuscated` before the extension (for example, `large_pii_data.csv` becomes `large_pii_data_obfuscated.csv`).
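The `_obfuscated` naming convention can be reproduced with a small helper, sketched here for illustration. The function name `obfuscated_key` is hypothetical and not part of the package:

```python
from pathlib import PurePosixPath


def obfuscated_key(file_key: str) -> str:
    """Insert '_obfuscated' before the file extension of an S3 key."""
    path = PurePosixPath(file_key)
    return str(path.with_name(f"{path.stem}_obfuscated{path.suffix}"))


print(obfuscated_key("large_pii_data.csv"))        # large_pii_data_obfuscated.csv
print(obfuscated_key("nested/prefix/users.parquet"))  # nested/prefix/users_obfuscated.parquet
```

Using `PurePosixPath` keeps any S3 key prefix ("folders") intact while only renaming the final component.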
Once you are done testing and want to clean up the AWS resources that were created, run:

```shell
make sample-infrastructure-destroy
```

You can also manually test the Lambda by giving it a test event of your choosing. Make sure it references an existing bucket, as well as existing file keys and PII fields. If you want to use the sample bucket and test data provided, you can get their values by running:

```shell
make sample-infrastructure-get-output
```

The CSV columns included in those files are listed below. You can pick any combination of them to obfuscate:

```python
["id", "name", "email_address", "phone_number", "date_of_birth", "address", "salary", "department", "hire_date", "project_code", "status", "region"]
```

The sample Lambda expects an event object as its first argument, with the following shape:
```json
{
  "file_to_obfuscate": "s3://<bucket-name>/<file-key>",
  "pii_fields": [
    "field_1",
    "field_2",
    "field_3"
  ],
  "destination_bucket": "<bucket-name>"
}
```

Replace `<bucket-name>`, `<file-key>`, and the `pii_fields` list with the values you want (such as those from the `make sample-infrastructure-get-output` command), or with values from any other bucket you may have that contains test data.
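Before sending a hand-written event, it can help to sanity-check its shape. This is an illustrative sketch; the function name `validate_event` is hypothetical and not part of the repository:

```python
def validate_event(event: dict) -> None:
    """Check that a Lambda test event has the expected keys and value types."""
    required = {"file_to_obfuscate": str, "pii_fields": list, "destination_bucket": str}
    for key, expected_type in required.items():
        if key not in event:
            raise KeyError(f"missing required key: {key}")
        if not isinstance(event[key], expected_type):
            raise TypeError(f"{key} must be of type {expected_type.__name__}")
    if not event["file_to_obfuscate"].startswith("s3://"):
        raise ValueError('file_to_obfuscate must start with "s3://"')


validate_event({
    "file_to_obfuscate": "s3://my-bucket/large_pii_data.csv",
    "pii_fields": ["name", "email_address"],
    "destination_bucket": "my-bucket",
})  # passes silently
```

Catching a malformed event locally is quicker than decoding a `FileNotFoundError` from the Lambda's CloudWatch logs.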
This repository includes a performance profiling script, which you can run locally using the following make command:

```shell
make profile-gdpr-obfuscator
```

Before you do so, make sure you have the sample infrastructure deployed.
Alternatively, you can import the `gdpr_obfuscator_profiling` function and run it directly with any suitable files you may have hosted on S3:

```python
from src.gdpr_obfuscator_profiling.gdpr_obfuscator_profiling import gdpr_obfuscator_profiling

gdpr_obfuscator_profiling(
    file_to_obfuscate="s3://some-bucket/some_file.parquet",
    pii_fields=["name", "email_address", "phone_number", "address"],
    profiling_data_output_dir="/profile-report",
    file_type="parquet",
)
```