-
Notifications
You must be signed in to change notification settings - Fork 1k
Added Error Handling & Defensive Programming Guide #3539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,240 @@ | ||
| # Error Handling & Defensive Programming Guide | ||
|
|
||
| Balancing defensiveness with the extreme constraints of microcontrollers (where | ||
| every byte of Flash/RAM and every clock cycle matters) is a core design tension | ||
| in TFLM. Since TFLM targets environments that often lack memory protection (no | ||
| MMU), don't support C++ exceptions, and run on bare metal or simple RTOSs, a | ||
| "standard" defensive posture would result in unacceptable binary bloat and | ||
| performance degradation. | ||
|
|
||
| This guide provides an architectural framework for writing safe kernels and | ||
| offers recommendations for code reviews for contributors. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 1. Trust Boundaries: Where Data Comes From | ||
|
|
||
| To achieve high performance without sacrificing critical stability, TFLM | ||
| distinguishes between trusted and untrusted data sources. | ||
|
|
||
| ### The C++ API (Trusted) | ||
|
|
||
| When a firmware developer uses the public TFLM C++ API (e.g., instantiating a | ||
| `MicroInterpreter` or calling `Invoke()`), we treat the developer as trusted. | ||
|
|
||
| * **Recommendation:** The core API should not spend runtime cycles or binary | ||
| space checking for invalid parameters (e.g., passing `nullptr`) or incorrect | ||
| state-machine sequences. | ||
| * **Mechanism:** Rely entirely on `TFLITE_DCHECK`. The firmware engineer gets | ||
| an immediate assertion failure during local debugging if they misuse the | ||
| API. In release builds, these compile out, leaving zero overhead. | ||
|
|
||
| ### The FlatBuffer Model (Partially Trusted) | ||
|
|
||
| When dealing with the `.tflite` FlatBuffer model, we distinguish between two | ||
| types of malformation: | ||
|
|
||
| * **Corrupted FlatBuffer Files (Structural Malformation):** The core | ||
| `MicroInterpreter` assumes the FlatBuffer is structurally valid. We *do not* | ||
| perform standard bounds checking on FlatBuffer schema offsets or operator | ||
| tensor indices (e.g., if an operator requests tensor index `999` but the | ||
| model only has `10` tensors, a production TFLM build will read out of | ||
| bounds). To aid local debugging, structural checks are considered a | ||
| "good-to-have" feature but **should exclusively use `TFLITE_DCHECK`**. If | ||
| models are provided via an untrusted channel (e.g., OTA), the application | ||
| layer is entirely responsible for validating the integrity of the model | ||
| before passing it to TFLM. | ||
| * **Invalid Model Topologies (Semantic Malformation):** The runtime *should* | ||
| defend against TFLite Converter bugs or unsupported configurations (e.g., | ||
| wrong tensor shapes, unsupported types). This validation happens entirely in | ||
| the Setup phase. Because developers can use `TF_LITE_STRIP_ERROR_STRINGS` to | ||
| remove the string bloat in production, checking topologies in `Prepare` | ||
|
Comment on lines
+50
to
+51
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (ex. a RELEASE build) |
||
| provides safe fallback (e.g., rejecting a bad OTA update) without the severe | ||
| ROM penalty of embedded strings. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 2. Execution Phases: When Code Runs | ||
|
|
||
| TFLM distinguishes between setup (which runs once) and execution (which runs | ||
| continuously). | ||
|
|
||
| ### Phase 1: Setup & Initialization (`Prepare`) | ||
|
|
||
| During the `Prepare` phase, kernels should validate their inputs and parameters | ||
| to ensure `Eval` can run blindly and safely. We prioritize clear error messages | ||
| here (relying on `TF_LITE_STRIP_ERROR_STRINGS` to mitigate the ROM cost in | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (ex. a RELEASE build) |
||
| production) and we can afford to spend CPU cycles on validation. | ||
|
|
||
| **What to Validate (Use `TF_LITE_ENSURE`):** | ||
|
|
||
| * **Model-provided parameters:** Check the number of inputs/outputs, tensor | ||
| types, and tensor shapes. | ||
| * **Quantization parameters:** Explicitly check for invalid quantization | ||
| parameters (e.g., a `scale` of `0.0` or a `zero_point` out of bounds) that | ||
| could cause a divide-by-zero or overflow later in `Eval`. | ||
| * **Resource allocations:** Check if memory allocations (like | ||
| `AllocateTempInputTensor`) return `nullptr`. | ||
|
|
||
| ### Phase 2: Execution (`Eval`) | ||
|
|
||
| During the `Eval` phase, the runtime is vulnerable to data that causes hardware | ||
| traps or memory corruption. We should avoid spending cycles or ROM on redundant | ||
| checks. | ||
|
|
||
| **What to Defend Against (Recommended Validation):** | ||
|
|
||
| Kernels should actively defend against three specific runtime threats during | ||
| `Eval`: | ||
|
|
||
| 1. **Out-of-bounds Memory Access:** If an input tensor contains indices or | ||
| offsets generated at runtime (e.g., in `GATHER`, `STRIDED_SLICE`), these | ||
| **should** be bounds-checked at runtime using a raw `if (!valid) return | ||
| kTfLiteError;`. | ||
| 2. **Hardware Faults (Divide by Zero):** Kernels should mathematically protect | ||
| against divide-by-zero (e.g., if a divisor comes from an input tensor) or | ||
| explicitly check and return an error. | ||
| 3. **Infinite Loops:** All loops inside `Eval` *should* have a guaranteed | ||
| maximum bound to prevent data-dependent infinite hangs. | ||
|
|
||
| **What to Ignore (GIGO for Signal Data):** | ||
|
|
||
| * TFLM kernels are **strongly discouraged** from explicitly scanning general | ||
| input signal data (like images or audio) for `NaN`/`Inf`. For standard math | ||
| operations, garbage data simply results in garbage output (GIGO). | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 3. Macro Selection Guide | ||
|
|
||
| To strike the right balance between debuggability, code size (ROM), and | ||
| performance, follow this decision tree. | ||
|
|
||
| ### The Decision Tree | ||
|
|
||
| 1. **Are you writing a Unit Test?** | ||
| * Use GoogleTest-style macros like `EXPECT_EQ`, `EXPECT_NEAR`, etc. for | ||
| verifying test conditions. | ||
| 2. **Are you in `Init` or `Prepare`?** | ||
| * Use `TF_LITE_ENSURE`. We want to fail early with a clear log message if | ||
| the model is incompatible. | ||
| 3. **Are you in `Eval` (or a helper called by `Eval`)?** | ||
| * *Is it an invariant guaranteed by `Prepare`?* (e.g., "this pointer | ||
| cannot be null"). Use `TFLITE_DCHECK`. It costs nothing in production. | ||
| * *Is it validating control data from an input tensor?* (e.g., a dynamic | ||
| index). Use a raw `if (!condition) return kTfLiteError;`. This safely | ||
| prevents memory corruption without paying the ROM cost of a string | ||
| literal. | ||
| * *Are you propagating an error from a helper function?* Use | ||
| `TF_LITE_ENSURE_STATUS` or `TF_LITE_ENSURE_OK`. | ||
|
|
||
| ### Macro Cheat Sheet | ||
|
|
||
| Macro | Cost in Release | When to Use | ||
| :--------------------------------------------- | :------------------------------------------------------- | :---------- | ||
| `TFLITE_DCHECK`<br>`TFLITE_DCHECK_EQ` | **Zero** (Optimized out) | **Default for `Eval` invariants and FlatBuffer structural bounds checking.** | ||
| `if (!cond) return kTfLiteError;` | **Low** (Branch only) | **Default for validating control data in `Eval`.** | ||
| `TF_LITE_ENSURE_OK`<br>`TF_LITE_ENSURE_STATUS` | **Low** (Branch only) | **Default for propagating errors from helper functions.** | ||
| `TF_LITE_ENSURE`<br>`TF_LITE_ENSURE_EQ` | **High** (Branch + logs `__FILE__`, `__LINE__`, `#cond`) | **Default for `Prepare` / Setup.** Can be used in `Eval` *only* if the failure indicates an unrecoverable state-machine corruption; avoid for normal signal data out-of-bounds. | ||
| `TF_LITE_ENSURE_MSG` | **Highest** (Branch + custom string) | **Use Sparingly.** Only when the default `TF_LITE_ENSURE` error is too cryptic. | ||
| `TFLITE_CHECK` | **Fatal** (Calls `Abort()`) | **Avoid in core TFLM.** Halts the microcontroller. | ||
|
|
||
| > **The Hidden Cost of `TF_LITE_ENSURE`:** This macro expands to an `if` branch | ||
| > that calls `TF_LITE_KERNEL_LOG`. By default, this embeds `__FILE__`, | ||
| > `__LINE__`, and the stringified condition directly into the `.rodata` section | ||
| > of the binary. Every single invocation permanently consumes precious ROM. If | ||
| > you have multiple preconditions, combine them into a single | ||
| > `TF_LITE_ENSURE(context, a != nullptr && b != nullptr)`. **Note:** For | ||
| > production builds where ROM is severely constrained, firmware developers | ||
| > should define the `TF_LITE_STRIP_ERROR_STRINGS` macro to compile out these | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or use a RELEASE build for production systems |
||
| > strings, reducing `TF_LITE_ENSURE` to a simple low-cost branch. | ||
|
|
||
| -------------------------------------------------------------------------------- | ||
|
|
||
| ## 4. Code Review Guidelines & Concrete Examples | ||
|
|
||
| How to address common scenarios in PRs: | ||
|
|
||
| ### Validating Framework Pointers | ||
|
|
||
| **Discouraged.** Please avoid wasting ROM checking the validity of the TFLM | ||
| framework itself. The `context` and `node` pointers passed by the runtime are | ||
| typically guaranteed to be valid. Notably, `TF_LITE_ENSURE(context, context != | ||
| nullptr)` is logically flawed: if `context` is actually `nullptr`, the macro | ||
| will dereference it to log the error, causing an immediate hardware fault. | ||
|
|
||
| ### Validating Public API Parameters | ||
|
|
||
| **Discouraged.** This penalizes production deployments for bugs that the | ||
| firmware developer should have caught during local testing. We recommend asking | ||
| the contributor to use `TFLITE_DCHECK(op_resolver != nullptr);` instead of | ||
| adding `if (op_resolver == nullptr) return kTfLiteError;` to the | ||
| `MicroInterpreter` API. | ||
|
|
||
| ### Preventing Null Pointer Dereferences | ||
|
|
||
| * **If in `Prepare` (Recommended):** It is standard practice to check for | ||
| nulls when pulling tensors from the context. | ||
|
|
||
| ```cpp | ||
| TfLiteTensor* operand = micro_context->AllocateTempInputTensor(node, kOperandTensor); | ||
| TF_LITE_ENSURE(context, operand != nullptr); | ||
| ``` | ||
|
|
||
| * **If in `Eval` (Discouraged):** Production builds shouldn't pay the cycle | ||
| cost for checks that can't fail at runtime if `Prepare` succeeded. Ask the | ||
| author to change it to `TFLITE_DCHECK`. | ||
|
|
||
| ```cpp | ||
| const TfLiteEvalTensor* input_id = tflite::micro::GetEvalInput(context, node, 0); | ||
| TFLITE_DCHECK(input_id != nullptr); | ||
| ``` | ||
|
|
||
| ### Preventing Buffer Overflows | ||
|
|
||
| * **If in `Prepare` (Recommended):** Use `TF_LITE_ENSURE` to validate that | ||
| tensor sizes are compatible. | ||
| * **If in `Eval`:** If the overflow comes from invalid *control data* (like an | ||
| input tensor providing indices), it *should* be validated in `Eval` to | ||
| prevent memory corruption. Use raw returns: | ||
|
|
||
| ```cpp | ||
| // e.g., third_party/tflite_micro/tensorflow/lite/micro/kernels/gather_nd.cc | ||
| // Note: use subtraction to prevent integer overflow! | ||
| if (from_pos < 0 || from_pos > params_flat_size - slice_size) { | ||
| return kTfLiteError; // Halts execution to prevent out-of-bounds memory read | ||
| } | ||
| ``` | ||
|
|
||
| ``` | ||
| If it's just an invariant guaranteed by `Prepare`, ask to hoist the check or | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hoist -> discard ? |
||
| use `TFLITE_DCHECK`. | ||
| ``` | ||
|
|
||
| ### Fixing Fuzzer Crashes | ||
|
|
||
| Fuzzer fixes should align with the core philosophy. We evaluate them based on | ||
| the type of crash: | ||
|
|
||
| 1. **Corrupted FlatBuffer Files (Working As Intended):** If the fuzzer mutates | ||
| the FlatBuffer byte array so a `Tensor` offset points beyond the end of the | ||
| file (causing a segfault), **request changes if the PR uses `TF_LITE_ENSURE` | ||
| or raw `if` statements**. To keep the binary size as small as possible, we | ||
| prefer to omit the FlatBuffer verifier. However, **it is recommended to | ||
| accept the PR if it exclusively uses `TFLITE_DCHECK`** to catch the | ||
| out-of-bounds index. This treats the structural check purely as a zero-cost | ||
| developer aid during debugging. | ||
| 2. **Invalid Model Topologies (Fix in Prepare):** If the fuzzer generates a | ||
| valid FlatBuffer but modifies a `CONV_2D` operator to have 0 inputs instead | ||
| of the expected 3, **this is a good fix in `Prepare`** using | ||
| `TF_LITE_ENSURE_EQ(context, NumInputs(node), 3);`. | ||
| 3. **Static Math Crashes (Fix in Prepare):** If the fuzzer sets a quantization | ||
| `scale` parameter to `0.0` (causing a hardware trap during inference), | ||
| **this is a good fix in `Prepare`** using `TF_LITE_ENSURE(context, scale != | ||
| 0.0);`. | ||
| 4. **Dynamic Math/Data Crashes (Fix in Eval):** If the fuzzer provides input | ||
| data to a `GATHER` op with an index of `999` (causing out-of-bounds | ||
| corruption), or a divisor tensor evaluates to `0` (causing a divide-by-zero | ||
| trap), **this is a good fix in `Eval`**. Prefer a raw `if (index >= 10) | ||
| return kTfLiteError;` to save ROM (avoid using `TF_LITE_ENSURE` here just to | ||
| log the error; save the ROM). | ||
|
Comment on lines
+214
to
+240
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no fuzzer |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 The opening paragraph 👍