Skip to content

VE2 drm scheduler enable#1276

Open
saifuddin-xilinx wants to merge 1 commit intoamd:mainfrom
saifuddin-xilinx:ve2_drm_sched
Open

VE2 drm scheduler enable#1276
saifuddin-xilinx wants to merge 1 commit intoamd:mainfrom
saifuddin-xilinx:ve2_drm_sched

Conversation

@saifuddin-xilinx
Copy link
Copy Markdown
Contributor

No description provided.

Signed-off-by: Saif <saifuddin.kaijar@amd.com>
Copilot AI review requested due to automatic review settings April 26, 2026 23:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables the DRM GPU scheduler for the VE2 (amdxdna) driver path by replacing the prior custom FIFO/workqueue-based scheduling with drm_gpu_scheduler + per-context drm_sched_entity usage to manage partition context switching and job execution.

Changes:

  • Add DRM scheduler state to VE2 hwctx (drm_sched_entity) and VE2 mgmtctx (drm_gpu_scheduler).
  • Implement VE2 DRM scheduler backend callbacks (run_job, free_job, timedout_job) and initialize/finalize the scheduler per partition.
  • Route VE2 command submission through drm_sched_job_init/arm + drm_sched_entity_push_job, and destroy entities on hwctx teardown.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/driver/amdxdna/ve2_of.h Adds DRM scheduler fields to hwctx/mgmtctx structs and exports ve2_job_put().
src/driver/amdxdna/ve2_mgmt.c Adds DRM scheduler backend ops, scheduler init/fini, and removes the old FIFO/workqueue scheduler logic.
src/driver/amdxdna/ve2_hwctx.c Integrates job submission with DRM scheduler jobs/entities and adds entity init/destroy in hwctx lifecycle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1481 to +1483
if (!mgmtctx)
goto cleanup_xrs;

Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mgmtctx is computed as the address of an array element and can’t be NULL, so this check is dead code. It also jumps to cleanup_xrs without setting ret, which could mask the real error path. Remove the check (or replace it with a bounds/initialization check that can actually fail and sets an error code).

Suggested change
if (!mgmtctx)
goto cleanup_xrs;

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +92
/* DRM scheduler components */
struct drm_sched_entity entity; /* Scheduling entity for this hwctx */
struct drm_gpu_scheduler *sched; /* Pointer to mgmtctx scheduler */
u64 seq; /* Job sequence counter */
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct amdxdna_ctx_priv now embeds struct drm_sched_entity/struct drm_gpu_scheduler, but this header doesn’t include <drm/gpu_scheduler.h>. This can break compilation for translation units that include ve2_of.h without first including amdxdna_ctx.h (e.g., ve2_of.c). Include the required DRM scheduler header (or add local forward declarations and move the full types out of the header).

Copilot uses AI. Check for mistakes.
ret = ve2_request_context_switch(xdna, mgmtctx);
if (ret) {
XDNA_ERR(xdna, "Context switch request failed: ret=%d", ret);
dma_fence_set_error(fence, ret);
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On context-switch request failure, the fence error is set but the fence is never signaled. That leaves the DRM scheduler waiting until timeout even though submission already failed. Consider signaling the fence (and completing the sched job) immediately on unrecoverable errors so the scheduler can retire the job and propagate the error.

Suggested change
dma_fence_set_error(fence, ret);
dma_fence_set_error(fence, ret);
dma_fence_signal(fence);

Copilot uses AI. Check for mistakes.
Comment on lines +136 to +143
/* Notify firmware that command is ready */
ret = notify_fw_cmd_ready(hwctx);
if (ret < 0)
XDNA_ERR(xdna, "notify_fw_cmd_ready failed: ret=%d", ret);

unlock:
mutex_unlock(&mgmtctx->ctx_lock);
return fence;
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify_fw_cmd_ready() failure is logged but the job is still treated as successfully submitted (a fence is returned with no error/signaling). If notifying firmware fails, the scheduler will likely hang until timeout. Set the fence error and complete the job immediately (or return an error fence) so the scheduler doesn’t block indefinitely.

Copilot uses AI. Check for mistakes.
Comment on lines +191 to +196
u64 read_idx = 0, write_idx = 0, reserved_wr = 0;

/* Read the actual values from the queue header */
read_idx = queue->hsa_queue_p->hq_header.read_index;
write_idx = queue->hsa_queue_p->hq_header.write_index;
reserved_wr = queue->reserved_write_index;
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_idx, write_idx, and reserved_wr are assigned but never used. This will trigger -Wunused-but-set-variable warnings in many builds. Either log these values (since the comment says they’re for inspection) or remove the variables/reads.

Suggested change
u64 read_idx = 0, write_idx = 0, reserved_wr = 0;
/* Read the actual values from the queue header */
read_idx = queue->hsa_queue_p->hq_header.read_index;
write_idx = queue->hsa_queue_p->hq_header.write_index;
reserved_wr = queue->reserved_write_index;
u64 read_idx = queue->hsa_queue_p->hq_header.read_index;
u64 write_idx = queue->hsa_queue_p->hq_header.write_index;
u64 reserved_wr = queue->reserved_write_index;
XDNA_ERR(xdna,
"[TS] !!!TIMEOUT!!! HSA queue=%p read_idx=%llu write_idx=%llu reserved_wr=%llu",
queue->hsa_queue_p,
(unsigned long long)read_idx,
(unsigned long long)write_idx,
(unsigned long long)reserved_wr);

Copilot uses AI. Check for mistakes.
Comment on lines 344 to +350
// Reset the pending list
priv_ctx->pending[get_job_idx(job->seq)] = NULL;
ve2_job_put(job);

/* Signal DRM scheduler fence */
if (job->fence)
dma_fence_signal(job->fence);

Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job completion fence is being signaled here, but this helper is only called from the synchronous wait path (ve2_cmd_wait) and cleanup paths—not from the IRQ handler. With DRM scheduler enabled, this means jobs may never complete from the scheduler’s perspective unless userspace calls cmd_wait, leading to scheduler stalls/timeouts. Fence signaling (and any job-done notification required by drm_sched) should happen from the actual completion event path (IRQ / polling work) independent of userspace waiting.

Copilot uses AI. Check for mistakes.
Comment on lines +983 to +987
/* DRM scheduler - initialize and push job */
ret = drm_sched_job_init(&job->base, &hwctx->priv->entity, 1, NULL);
if (ret) {
return ret;
}
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drm_sched_job_init() is called after the command has already been written/committed to the HSA queue and the job is added to the pending list. If drm_sched_job_init() fails, this function returns an error even though the job may still execute in hardware, and the driver has no rollback/cleanup. Initialize/arm the sched job before making the hardware-visible submission (or add a rollback path that invalidates the queued packet and removes the pending entry).

Copilot uses AI. Check for mistakes.
Comment on lines 1423 to +1440
@@ -1411,19 +1429,15 @@ int ve2_hwctx_init(struct amdxdna_ctx *hwctx)

hwctx->priv = priv;

/* Allocate unique ID using XArray (thread-safe, supports ID recycling) */
{
u32 hwctx_id;

ret = xa_alloc_cyclic(&xdna->dev_handle->hwctx_ids, &hwctx_id, priv,
XA_LIMIT(1, U32_MAX),
&xdna->dev_handle->next_hwctx_id, GFP_KERNEL);
if (ret < 0) {
XDNA_ERR(xdna, "Failed to allocate hwctx ID, ret=%d", ret);
goto cleanup_priv;
}
priv->id = hwctx_id;
}
/* Allocate unique ID using XArray (thread-safe, supports ID recycling) */
ret = xa_alloc_cyclic(&xdna->dev_handle->hwctx_ids, &hwctx_id, priv,
XA_LIMIT(1, U32_MAX),
&xdna->dev_handle->next_hwctx_id, GFP_KERNEL);
if (ret < 0) {
XDNA_ERR(xdna, "Failed to allocate hwctx ID, ret=%d", ret);
goto cleanup_priv;
}
priv->id = hwctx_id;
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kernel style: these new lines use spaces instead of tabs and have inconsistent indentation, which will fail checkpatch.pl and makes the block hard to read. Please align indentation with surrounding code (tabs, consistent continuation indentation).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants