VE2 drm scheduler enable #1276
Signed-off-by: Saif <saifuddin.kaijar@amd.com>
Pull request overview
This PR enables the DRM GPU scheduler for the VE2 (amdxdna) driver path, replacing the previous custom FIFO/workqueue-based scheduling with a drm_gpu_scheduler plus a per-context drm_sched_entity to manage partition context switching and job execution.
Changes:
- Add DRM scheduler state to VE2 hwctx (`drm_sched_entity`) and VE2 mgmtctx (`drm_gpu_scheduler`).
- Implement VE2 DRM scheduler backend callbacks (`run_job`, `free_job`, `timedout_job`) and initialize/finalize the scheduler per partition.
- Route VE2 command submission through `drm_sched_job_init`/`arm` + `drm_sched_entity_push_job`, and destroy entities on hwctx teardown.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/driver/amdxdna/ve2_of.h | Adds DRM scheduler fields to hwctx/mgmtctx structs and exports `ve2_job_put()`. |
| src/driver/amdxdna/ve2_mgmt.c | Adds DRM scheduler backend ops, scheduler init/fini, and removes the old FIFO/workqueue scheduler logic. |
| src/driver/amdxdna/ve2_hwctx.c | Integrates job submission with DRM scheduler jobs/entities and adds entity init/destroy in hwctx lifecycle. |
```c
	if (!mgmtctx)
		goto cleanup_xrs;
```
mgmtctx is computed as the address of an array element and can’t be NULL, so this check is dead code. It also jumps to cleanup_xrs without setting ret, which could mask the real error path. Remove the check (or replace it with a bounds/initialization check that can actually fail and sets an error code).
Suggested change:
```diff
-	if (!mgmtctx)
-		goto cleanup_xrs;
```
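If the goal was to guard the array lookup rather than delete the check, a test that can actually fail and that sets an error code would look roughly like this sketch (the index and bound names `part_idx` and `VE2_MAX_PARTITIONS` are hypothetical, not taken from the driver):

```c
	/* Sketch only: part_idx and VE2_MAX_PARTITIONS are hypothetical names */
	if (part_idx >= VE2_MAX_PARTITIONS) {
		ret = -EINVAL;
		goto cleanup_xrs;
	}
	mgmtctx = &mgmtctx_array[part_idx];
```

Unlike the original `!mgmtctx` test, this can fail at runtime and leaves `ret` set so the `cleanup_xrs` path reports a real error.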
```c
	/* DRM scheduler components */
	struct drm_sched_entity entity;		/* Scheduling entity for this hwctx */
	struct drm_gpu_scheduler *sched;	/* Pointer to mgmtctx scheduler */
	u64 seq;				/* Job sequence counter */
```
struct amdxdna_ctx_priv now embeds struct drm_sched_entity/struct drm_gpu_scheduler, but this header doesn’t include <drm/gpu_scheduler.h>. This can break compilation for translation units that include ve2_of.h without first including amdxdna_ctx.h (e.g., ve2_of.c). Include the required DRM scheduler header (or add local forward declarations and move the full types out of the header).
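A minimal fix, assuming the header only needs the two scheduler types shown above, is to make `ve2_of.h` self-contained (sketch, not the exact driver layout):

```c
/* ve2_of.h (sketch): include the full drm_sched_entity definition, since
 * the struct is embedded by value. The drm_gpu_scheduler pointer alone
 * could instead get by with a forward declaration. */
#include <drm/gpu_scheduler.h>

struct amdxdna_ctx_priv {
	/* ... existing fields ... */
	struct drm_sched_entity entity;		/* embedded: full type required */
	struct drm_gpu_scheduler *sched;	/* pointer: forward decl would suffice */
	u64 seq;
};
```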
```c
	ret = ve2_request_context_switch(xdna, mgmtctx);
	if (ret) {
		XDNA_ERR(xdna, "Context switch request failed: ret=%d", ret);
		dma_fence_set_error(fence, ret);
```
On context-switch request failure, the fence error is set but the fence is never signaled. That leaves the DRM scheduler waiting until timeout even though submission already failed. Consider signaling the fence (and completing the sched job) immediately on unrecoverable errors so the scheduler can retire the job and propagate the error.
Suggested change:
```diff
 		dma_fence_set_error(fence, ret);
+		dma_fence_signal(fence);
```
```c
	/* Notify firmware that command is ready */
	ret = notify_fw_cmd_ready(hwctx);
	if (ret < 0)
		XDNA_ERR(xdna, "notify_fw_cmd_ready failed: ret=%d", ret);

unlock:
	mutex_unlock(&mgmtctx->ctx_lock);
	return fence;
```
notify_fw_cmd_ready() failure is logged but the job is still treated as successfully submitted (a fence is returned with no error/signaling). If notifying firmware fails, the scheduler will likely hang until timeout. Set the fence error and complete the job immediately (or return an error fence) so the scheduler doesn’t block indefinitely.
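One way to avoid the stall, sketched under the assumption that `fence` is the fence this function returns to the scheduler, is to fail and complete the fence before unlocking:

```c
	/* Sketch of the suggested error handling, not the exact driver code */
	ret = notify_fw_cmd_ready(hwctx);
	if (ret < 0) {
		XDNA_ERR(xdna, "notify_fw_cmd_ready failed: ret=%d", ret);
		dma_fence_set_error(fence, ret);
		dma_fence_signal(fence);	/* let drm_sched retire the job now */
	}
```

Note that `dma_fence_set_error()` must be called before `dma_fence_signal()`; the error is ignored once the fence is signaled.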
```c
	u64 read_idx = 0, write_idx = 0, reserved_wr = 0;

	/* Read the actual values from the queue header */
	read_idx = queue->hsa_queue_p->hq_header.read_index;
	write_idx = queue->hsa_queue_p->hq_header.write_index;
	reserved_wr = queue->reserved_write_index;
```
read_idx, write_idx, and reserved_wr are assigned but never used. This will trigger -Wunused-but-set-variable warnings in many builds. Either log these values (since the comment says they’re for inspection) or remove the variables/reads.
Suggested change:
```diff
-	u64 read_idx = 0, write_idx = 0, reserved_wr = 0;
-
-	/* Read the actual values from the queue header */
-	read_idx = queue->hsa_queue_p->hq_header.read_index;
-	write_idx = queue->hsa_queue_p->hq_header.write_index;
-	reserved_wr = queue->reserved_write_index;
+	u64 read_idx = queue->hsa_queue_p->hq_header.read_index;
+	u64 write_idx = queue->hsa_queue_p->hq_header.write_index;
+	u64 reserved_wr = queue->reserved_write_index;
+
+	XDNA_ERR(xdna,
+		 "[TS] !!!TIMEOUT!!! HSA queue=%p read_idx=%llu write_idx=%llu reserved_wr=%llu",
+		 queue->hsa_queue_p,
+		 (unsigned long long)read_idx,
+		 (unsigned long long)write_idx,
+		 (unsigned long long)reserved_wr);
```
```c
	// Reset the pending list
	priv_ctx->pending[get_job_idx(job->seq)] = NULL;
	ve2_job_put(job);

	/* Signal DRM scheduler fence */
	if (job->fence)
		dma_fence_signal(job->fence);
```
The job completion fence is signaled here, but this helper is only called from the synchronous wait path (ve2_cmd_wait) and from cleanup paths, not from the IRQ handler. With the DRM scheduler enabled, jobs may therefore never complete from the scheduler's perspective unless userspace calls cmd_wait, leading to scheduler stalls and timeouts. Fence signaling (and any job-done notification required by drm_sched) should happen from the actual completion event path (IRQ / polling work), independent of userspace waiting.
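Conceptually, the completion-interrupt path should own fence signaling, roughly along these lines (function and field names here are hypothetical, not from the driver):

```c
/* Hypothetical completion handler invoked from the IRQ/polling path */
static void ve2_job_complete(struct amdxdna_sched_job *job, int status)
{
	if (status)
		dma_fence_set_error(job->fence, status);
	dma_fence_signal(job->fence);	/* unblocks drm_sched; no cmd_wait needed */
	ve2_job_put(job);
}
```

With this split, ve2_cmd_wait can simply wait on the fence, and the scheduler retires jobs whether or not userspace ever waits.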
```c
	/* DRM scheduler - initialize and push job */
	ret = drm_sched_job_init(&job->base, &hwctx->priv->entity, 1, NULL);
	if (ret) {
		return ret;
	}
```
drm_sched_job_init() is called after the command has already been written/committed to the HSA queue and the job is added to the pending list. If drm_sched_job_init() fails, this function returns an error even though the job may still execute in hardware, and the driver has no rollback/cleanup. Initialize/arm the sched job before making the hardware-visible submission (or add a rollback path that invalidates the queued packet and removes the pending entry).
```diff
@@ -1411,19 +1429,15 @@ int ve2_hwctx_init(struct amdxdna_ctx *hwctx)

 	hwctx->priv = priv;

-	/* Allocate unique ID using XArray (thread-safe, supports ID recycling) */
-	{
-		u32 hwctx_id;
-
-		ret = xa_alloc_cyclic(&xdna->dev_handle->hwctx_ids, &hwctx_id, priv,
-				      XA_LIMIT(1, U32_MAX),
-				      &xdna->dev_handle->next_hwctx_id, GFP_KERNEL);
-		if (ret < 0) {
-			XDNA_ERR(xdna, "Failed to allocate hwctx ID, ret=%d", ret);
-			goto cleanup_priv;
-		}
-		priv->id = hwctx_id;
-	}
+	/* Allocate unique ID using XArray (thread-safe, supports ID recycling) */
+	ret = xa_alloc_cyclic(&xdna->dev_handle->hwctx_ids, &hwctx_id, priv,
+			      XA_LIMIT(1, U32_MAX),
+			      &xdna->dev_handle->next_hwctx_id, GFP_KERNEL);
+	if (ret < 0) {
+		XDNA_ERR(xdna, "Failed to allocate hwctx ID, ret=%d", ret);
+		goto cleanup_priv;
+	}
+	priv->id = hwctx_id;
```
Kernel style: these new lines use spaces instead of tabs and have inconsistent indentation, which will fail checkpatch.pl and makes the block hard to read. Please align indentation with surrounding code (tabs, consistent continuation indentation).