OSIDB-4678: Improve promote performance #1156
Conversation
- refs = pghistory.models.Events.objects.tracks(self).all()
+ refs = pghistory.models.Events.objects.references(self).all()
  for ref in refs:
-     model_audit = apps.get_model(ref.pgh_model).objects.filter(
+     db, model_name = ref.pgh_model.split(".")
+     # Skip snippet audit events - snippets should always remain internal
+     if model_name == "SnippetAudit":
+         continue
+     model_audit = apps.get_model(db, model_name).objects.filter(
          pgh_id=ref.pgh_id
      )
-     with pgtrigger.ignore(f"{ref.pgh_model}:append_only"):
+     with pgtrigger.ignore(f"{db}.{model_name}:append_only"):
From my search it looks like references goes through the history of all of the related models, while tracks processes only the model it is called on. Since set_history_public is called on each model, there might be a lot of duplicate history ACL adjustments when using references. The call also ends up attempting to grab the Snippet, which shouldn't happen.
I think it would make sense to sync set_history_public with what exists in the master branch (with tracks).
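To make the distinction concrete, here is a toy illustration (plain Python, not the pghistory API; all event data is made up): "tracks" matches only events recorded for the object itself, while "references" also matches events of related models that merely point at it, which is why the per-object calls duplicate work and unexpectedly pull in Snippet audit events.

```python
# Simulated pghistory event table: each event belongs to one tracked object
# but may reference several objects (e.g. via foreign keys).
events = [
    {"pgh_model": "osidb.FlawAudit",    "tracked": "flaw-1",    "refs": {"flaw-1"}},
    {"pgh_model": "osidb.AffectAudit",  "tracked": "affect-1",  "refs": {"affect-1", "flaw-1"}},
    {"pgh_model": "osidb.SnippetAudit", "tracked": "snippet-1", "refs": {"snippet-1", "flaw-1"}},
]

def tracks(obj_id):
    # only events recorded for the object itself
    return [e for e in events if e["tracked"] == obj_id]

def references(obj_id):
    # events of any model that reference the object
    return [e for e in events if obj_id in e["refs"]]
```

Calling the tracks-style lookup on "flaw-1" yields one event; the references-style lookup yields all three, including the SnippetAudit one that the code then has to skip.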
I understand the logic the same way as @Jincxz so if I am not missing something I would also rather see the original code here instead of the new one. Can you explain the change?
Hi, sorry for the delay (vacations, etc 🙈).
I will roll back this method and check how many calls it makes.
OK, I rolled back the method.
self.set_public()
self.set_history_public()
self.save(**kwargs)
if self.is_internal:
Slight optimization: I think you can keep the original short-circuit logic
if not self.is_internal:
    return
If the object is not internal, the function will still search through the related objects without the early return.
You should apply Jin's suggestion, and while we're at it you can also short-circuit if self is a Flaw, as we obviously only want non-Flaw related objects.
You can then get rid of visited as I mentioned earlier.
if issubclass(type(self), AlertMixin):
    # suppress the validation errors as we expect that during
    # the update the parent and child ACLs will not equal
    kwargs["raise_validation_error"] = False
if issubclass(type(self), TrackingMixin):
    # do not auto-update the updated_dt timestamp as the
    # followup update would fail on a mid-air collision
    kwargs["auto_timestamps"] = False
The commit removes these adjustments, but they might need to be covered in the bulk update. I think it might be a good idea to double-check that the removal of these lines won't cause issues.
set_public_nested() uses QuerySet.update(...), which does not call save() and does not run model validation, so there's nothing to suppress. I also explicitly set updated_dt when bulk-updating, so TrackingMixin models behave as they would with the trigger.
osoukup
left a comment
LGTM. I have only some minor comments and a question.
def _collect_objects_for_public_update(self, objects_to_update, visited):
    """
    Recursively collect all related objects that need to have their ACLs
    updated to public. The Flaw itself is not collected as it's saved separately.
nitpick: The ACLMixin class is a general mixin so self is not necessarily a flaw.
| """ | ||
| from osidb.models import Flaw | ||
|
|
||
| # Create a unique key for this object to track if we've visited it |
nitpick: The following line does not create anything. I would more naturally put the comment three lines below.
Elkasitu
left a comment
On top of the existing comments, I think you should provide performance regression tests that check the number of queries expected for a base case for both of the performance optimizations introduced; look at osidb/tests/test_query_regresion.py.
It is not clear from your benchmark how much each patch improves performance and where; having these regression tests helps determine which patch did what.
self.assert_audit_acls(model, internal_read_groups, internal_write_groups)

@pytest.mark.vcr
def test_promote_flaw_with_multiple_affects(
You don't need a VCR test nor a test this big to test that promoting a flaw with multiple affects correctly sets its own and its related objects' ACLs. You can either:
- create a unit test which creates a flaw with related affects, trackers and/or comments/CVSS/etc. and then calls promote() on said flaw
- piggyback off of existing promote() tests to check that related object ACLs are updated as expected
In their current state these two tests depend on so many things that they can fail for multiple reasons not directly related to ACLs: the tests are brittle.
I will see about changing it. The original promote test did have a VCR, and I created a new one to avoid messing up the original test.
# Each object needs its history set to public after ACL update
for obj in objects:
    if hasattr(obj, "set_history_public"):
        obj.set_history_public()
I think this whole approach is not using the correct tools provided by Django:
First and foremost, there's a non-negligible memory cost to storing all these objects locally -- wouldn't it be best to simply keep track of ids instead of model instance objects?
This brings me to my second point, which is that bulk_update is intended for cases where you want to update the same fields for multiple records with different values; here we're updating the same fields for multiple records with the same values, so QuerySet.update() can be used instead:
for model_class, object_ids in objects_by_model.items():
    model_class.objects.filter(pk__in=object_ids).update(acl_read=..., acl_write=..., updated_dt=...)
This will generate one UPDATE query per model, whereas bulk_update will generate more than one UPDATE query, which correlates with the increase of write queries in your test. In fact, I'm not convinced that the current approach is any better than simply calling save() (other than for the side-effects such as signals and save() overrides).
Lastly, I think the collection can be done with a defaultdict(set) directly: you initialize it and pass it to the collection function, which will simply do something like:
if self.pk and self.pk in objects_to_update[type(self)]:
    # this works as visited
    return
...
objects_to_update[type(self)].add(self.pk)
...
It's a good point; I will try to just use the UUID.
OK, I applied the changes you asked me about and now it is even more performant.
There was a risk of the UPDATE's IN clause being too big, so I set a batch limit of 5k. This means more updates, but it avoids an issue if there are too many rows; e.g. for 10k rows it will do 2 updates of 5k each, and so on.
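The batching described above can be sketched in a few lines of plain Python (the chunked helper name is hypothetical; only the 5k limit comes from the PR):

```python
from itertools import islice

def chunked(ids, size=5000):
    """Yield successive chunks so each UPDATE's IN clause stays bounded."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk

# 10k ids -> 2 updates of 5k each, matching the example above
batches = list(chunked(range(10_000), size=5000))
```

Each yielded chunk would then feed one `.filter(pk__in=chunk).update(...)` call instead of a single unbounded query.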
    self.acl_write = acls
    return acls

def set_public(self):
I'm not sure I agree with this -- set_public is a low-level method that does one job: change an ACL-enabled object to public ACLs regardless of its current state, said state can even already be public for all it cares.
It's the caller's responsibility to ensure that only objects that should be public are passed to this method, this method is unaware of business rules and should stay unaware IMO.
It's true, but making this change in set_public_nested I felt a lot of responsibility not to change anything that is not internal, and even more so for a possible embargo. I do agree that it breaks the main point, but in this case the benefit of a safety net that avoids setting an embargoed object public outweighs the pattern.
OK, I removed it, but I will add a test to be sure that a case like this does not happen.
# Force initial classification because signals are muted in queryset tests
flaw.classification = {
    "workflow": "DEFAULT",
    "state": WorkflowModel.WorkflowState.NEW,
}
# NEW -> TRIAGE requires owner
flaw.owner = "Alice"
flaw.save(raise_validation_error=False)
Is the @pytest.mark.enable_signals decorator not enough?
I added the signal; it wasn't working like I thought.
    "HTTP_JIRA_API_KEY": jira_token,
    "HTTP_BUGZILLA_API_KEY": bugzilla_token,
}
with assertNumQueriesLessThan(92):
Just nitpicking: normally when I write new regression tests, I write them before doing the changes to see the actual change, because now we don't know if 92 is an improvement or not...
Notice the # initial value -> X in some tests.
Also, did you try to use assertNumQueries first? We should use it whenever possible; assertNumQueriesLessThan is less accurate, but some tests are a bit flaky due to the environment.
Yeah, I just changed it to assertNumQueries, and I also added varying quantities of affects/trackers to check that the query count doesn't grow depending on the quantity.
PS: there seems to be a bug for the first iteration; I couldn't find what it is.
@MrMarble, @Elkasitu I created this PR (no merge intention) with the regression test in master to compare. In this new one there is an increase, but of about 30 queries per affect (here is the table).
It's not perfect, but it's much better than before. There are things I still need to verify.
Jincxz
left a comment
Thanks for reverting the changes to set_history_public!
Elkasitu
left a comment
Please clean up the AI comments on set_public_nested -- they don't really add anything (comments should explain why we're doing something, not what is being done) -- and merge the regression test commit into the ACL bulk update commit; tests are seldom a change worth their own commit.
update_kwargs = {"acl_read": public_acl_read, "acl_write": public_acl_write}
if hasattr(model_class, "updated_dt"):
    update_kwargs["updated_dt"] = now
Sorry, I missed that this is being defined in ACLMixin; you don't need the subclass check for ACLMixin, but the TrackingMixin one is looking good.
# Keep batching sane; historically <=0 behaved like "chunk by 1".
max_chunk_size = max(1, max_chunk_size)
# Non-Flaw public objects don't need nested ACL promotion; avoid traversing.
if not isinstance(self, Flaw) and not self.is_internal:
    return
This is not true; affects are an example (they contain trackers).
When I said to short-circuit Flaw I meant in the context of not having a visited set, which we agreed is necessary.
Simply add if not self.is_internal: return, but I would add this above even computing the visited key.
OK, I changed it to what I now think is correct.
- def assertNumQueriesLessThan(max_queries, using="default"):
+ def assertNumQueriesLessThan(max_queries, max_time=0, using="default"):
Measuring time is:
- complicated
- not in scope for this function
- in this particular case it doesn't provide an accurate representation of what's good or bad

Let's remove it.
OK, I will remove it. I still think it's a better measure for queries, but I agree that it's OOS.
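For reference, the query-only shape the reviewers converged on can be sketched without Django. In this sketch the query count comes from a caller-supplied log list (a stand-in for Django's query logging), so the helper name and the log parameter are hypothetical:

```python
from contextlib import contextmanager

@contextmanager
def assert_num_queries_less_than(max_queries, query_log):
    """Fail if max_queries or more entries are appended inside the block."""
    start = len(query_log)
    yield
    executed = len(query_log) - start
    assert executed < max_queries, (
        f"{executed} queries executed, expected fewer than {max_queries}"
    )

# usage: the block passes as long as fewer than 3 queries are logged
log = []
with assert_num_queries_less_than(3, log):
    log.append("SELECT 1")
    log.append("SELECT 2")
```

Keeping the helper to a single concern (query count) is what makes the exact assertNumQueries variant a drop-in replacement when the environment is stable enough.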
[
    (True, 1, 76, 1),
    (True, 10, 76, 1),
    (True, 50, 76, 1),
We should only test orders of magnitude (10^n): if you test for 100 you don't need to test 50, and if you test for 10 you don't need to test for 5. This is the principle upon which Big O notation is built.
Improves the performance for the promote endpoint.
These are numbers taken from Alvaro's PR to get the results.
There are 2 main changes.
Important: this changes a lot of how the update works. If during the review you feel something is wrong or don't like something, please say so; I can create a new PR with just the prefetch change that increases the performance.
I also added a new safety net for set_public in case there is an embargo.
How the new bulk update works.
I took both reports and asked the AI to write a report 🤖
The naming convention is test_promote_megaflaw__, so test_promote_megaflaw_5_2 means 5 affects with 2 trackers.
Performance Comparison: Before (report 1) vs this MR (report 4)
📊 Executive Summary Comparison
🔍 Key Findings
✅ Massive Performance Improvements in Report 4
Report 4 (review_0) shows dramatic improvements compared to Report 1 (before_prefetch):
Execution Time Reduction ✅
test_promote_megaflaw_20_0 went from 10,641 ms to 430 ms (-96%)

Query Count Reduction ✅
test_promote_megaflaw_20_0 went from 4,011 to 311 queries (-92.2%)

Database Time Reduction ✅
test_promote_megaflaw_20_0 went from 1,749 ms to 47 ms (-97.3%)

N+1 Query Reduction ✅

Write Operations ⚠️

Table Access Increase ⚠️
Mostly audit tables (*audit), which are necessary for history tracking

Write Operations for Small Tests ⚠️
📊 Average Metrics Comparison
🎯 Performance Analysis
What's Working Well ✅
prefetch_related() and select_related()
💡 Key Insights
Performance Evolution
Scale Impact
The improvements are more dramatic for larger datasets:
This suggests the optimizations scale very well with data size.
📈 Conclusion
Report 4 shows excellent performance improvements compared to the baseline (Report 1):
✅ 91.7% faster overall execution time
✅ 85.7% fewer database queries
✅ 96.2% less database time
✅ 58% fewer N+1 queries
While there are some areas for improvement (write operations for small tests, audit table access), the overall performance is dramatically better than the original baseline. The optimizations are working very well, especially for larger datasets.
Closes OSIDB-4678