This project analyses 13,867 documented cybersecurity attack scenarios spanning 63 attack categories — from SQL Injection and Malware to AI/LLM Exploits, Quantum Cryptography threats, Physical Hardware attacks, and everything in between.
The goal was to apply a full end-to-end data analytics pipeline to a real-world cybersecurity dataset: starting from raw data, identifying and fixing quality issues, loading into a relational database, running structured SQL analysis, and building Python visualisations to surface actionable intelligence.
This project demonstrates skills directly applicable to Security Operations Analyst, Risk Analyst, and Data Analyst roles — specifically the ability to analyse incident and threat data, identify risk patterns, and communicate findings clearly through dashboards and reports.
cybersecurity-attack-analysis/
│
├── data/
│ ├── Attack_Dataset.csv # Original raw dataset
│ └── Attack_Dataset_Cleaned.csv # Cleaned dataset (post-QA)
│
├── sql/
│ └── Cybersecurity_Attack_Analysis.sql # Full MySQL analysis script
│
├── python/
│ └── Cybersecurity_Attack_Analysis.py # Full Python analysis & visuals
│
├── notebooks/
│ └── cyber_security_project_full_python.ipynb # Jupyter Notebook version
│
├── visuals/
│ ├── 00_full_dashboard.png # Complete multi-panel dashboard
│ ├── 01_attacks_by_category.png
│ ├── 02_attack_type_families.png
│ ├── 03_most_targeted_systems.png
│ ├── 04_high_impact_by_category.png
│ ├── 05_mitre_techniques.png
│ ├── 06_source_distribution.png
│ ├── 07_common_impacts.png
│ └── 08_detection_coverage.png
│
└── README.md
| Layer | Tools Used |
|---|---|
| Data Cleaning | Python, Pandas |
| Database & SQL | MySQL (SQLite for local execution) |
| Analysis & Visuals | Python, Matplotlib, NumPy |
| Notebook | Jupyter Notebook |
| Version Control | Git, GitHub |
Before any analysis could begin, the raw dataset required thorough QA. The following issues were identified and fixed using Python:
| # | Issue | Fix | Rows Affected |
|---|---|---|---|
| 1 | Trailing unnamed blank column with stray data | Removed column, preserved values in Notes column |
46 |
| 2 | "Attack Steps " column had trailing space in header |
Renamed to "Attack Steps" |
All |
| 3 | Category casing inconsistency ("Network security" vs "Network Security") |
Standardised to "Network Security" |
2 |
| 4 | MITRE Technique field mixed em dash – and hyphen - formats |
Standardised all to hyphen | 5,232 |
| 5 | Duplicate rows (same title + same content) | Removed true duplicates, kept same-title/different-content rows | 266 removed |
| 6 | Data bleed in Target Type column (attack step text overflowing) |
Truncated and flagged with [DATA BLEED DETECTED] |
53 |
| 7 | Missing Source values |
Filled with "Unknown" |
159 |
Original row count: 14,133 → Cleaned row count: 13,867
The SQL script (Cybersecurity_Attack_Analysis.sql) follows a structured workflow:
- Database setup — creates
cybersecurity_dband theattackstable - Exploration —
SELECT *,COUNT,DISTINCTqueries to understand the data - Staging table — creates
attacks_stagingas a working copy to preserve the raw table - Cleaning in SQL — fixes remaining casing issues, trims whitespace, adds and populates
attack_familycolumn using aCASEgrouping statement - Analysis queries (Q1–Q10) — structured business questions answered with SQL
- Bonus queries — unique attack types per category, critical impact analysis, tool frequency, and cross-tab category vs family breakdown
| Query | Question |
|---|---|
| Q1 | How many attacks are recorded per category? |
| Q2 | What are the top 15 most common specific attack types? |
| Q3 | Which system types are most frequently targeted? |
| Q4 | Which MITRE ATT&CK techniques are most referenced? |
| Q5 | What are the most common attack impacts? |
| Q6 | Which categories have attacks with no documented solution? |
| Q7 | What is the detection method coverage per category? |
| Q8 | Which categories contain the most high-impact attacks? |
| Q9 | How are attack sources distributed across known security frameworks? |
| Q10 | How do 8,833 unique attack types group into broader attack families? |
The Python script (Cybersecurity_Attack_Analysis.py) replicates all SQL analysis in Python using Pandas, adds feature engineering, and produces 8 individual charts plus a full multi-panel dashboard.
attack_family— maps 8,833+ raw attack type values into 19 meaningful attack families using keyword-matching logic (mirrors the SQLCASEgrouping)is_high_impact— boolean flag for records mentioning data theft, privilege escalation, ransomware, account takeover, or remote code execution in the impact fieldhas_detection— binary flag indicating whether a detection method is documentedsource_group— groups raw source strings into known security framework buckets (OWASP, MITRE ATT&CK, CVE/NVD, NIST, Other)
| File | Chart | Description |
|---|---|---|
01_attacks_by_category.png |
Horizontal bar | Attack count per category, top 15 |
02_attack_type_families.png |
Vertical bar | Grouped attack families excl. "Other" |
03_most_targeted_systems.png |
Horizontal bar | Top 10 most targeted system types |
04_high_impact_by_category.png |
Horizontal bar | Categories ranked by high-impact attack count |
05_mitre_techniques.png |
Horizontal bar | Top 10 MITRE ATT&CK techniques referenced |
06_source_distribution.png |
Donut chart | Share of records by known security framework |
07_common_impacts.png |
Horizontal bar | Most common documented attack impacts |
08_detection_coverage.png |
Stacked bar | Detection coverage (has vs. no detection method) |
00_full_dashboard.png |
Multi-panel | All charts combined in one dashboard |
-
Physical / Hardware Attacks is the largest single category with 548 records, reflecting growing concern over hardware-level threats including JTAG exploitation, fault injection, and side-channel attacks.
-
Windows is the most targeted system — appearing 291 times, more than 2.6× the next entry (Workstations at 109). Windows remains the dominant attack surface globally.
-
Web Application Security leads all categories for high-impact attacks (150 records), followed closely by AI/ML Security (114) — signalling that AI systems are now a primary threat vector alongside traditional web apps.
-
MITRE technique T1203 (Exploitation for Client Execution) is the most referenced at 343 occurrences, indicating client-side exploitation remains a dominant attacker strategy.
-
Credential Theft is the most common attack outcome (65 combined records), followed by Remote Code Execution (45). Stolen credentials remain the primary entry point for most modern attacks.
-
Detection coverage is 99.9% — only 3 records across the entire dataset lack a documented detection method, confirming strong dataset quality after cleaning.
-
OWASP is the most cited known security framework (2,522 records), followed by MITRE ATT&CK (1,633). Together they account for ~30% of all sourced records.
-
Dependency Confusion stands out as a top specific attack type (88 records), reflecting the explosion of supply chain attacks targeting package ecosystems like npm, PyPI, and Maven.
-
The dataset is evenly distributed across all 63 categories (~3.5–4% each), suggesting it was deliberately balanced to provide comprehensive coverage of every attack domain.
-
After cleaning, 266 duplicate rows were removed (1.9% of original data) — confirming that even well-structured public datasets require thorough quality assurance before analysis.
-- 1. Open MySQL Workbench or your preferred MySQL client
-- 2. Import Attack_Dataset_Cleaned.csv into the attacks table
-- 3. Run Cybersecurity_Attack_Analysis.sql step by step# Install dependencies
pip install pandas matplotlib numpy
# Run the script
python Cybersecurity_Attack_Analysis.pypip install jupyter pandas matplotlib numpy
jupyter notebook cyber_security_project_full_python.ipynb- Source: Kaggle — Cybersecurity Attack Dataset
- Original size: 14,133 records × 16 columns
- Cleaned size: 13,867 records × 16 columns +
Notescolumn - Columns: ID, Title, Category, Attack Type, Scenario Description, Tools Used, Attack Steps, Target Type, Vulnerability, MITRE Technique, Impact, Detection Method, Solution, Tags, Source, Notes
Sibusiso D. Mbuyane Data Analyst | Python · SQL · Power BI · Tableau
🔗 GitHub · LinkedIn · Personal Website
