- Oct 11, 2024
-
-
Matthew Brost authored
A fragile micro optimization in xe_sched_add_pending_job relied on both the GPU scheduler being stopped and fence signaling stopped to safely add a job to the pending list without the job list lock in xe_sched_add_pending_job. Remove this optimization and just take the job list lock. Fixes: 7ddb9403 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Signed-off-by:
Matthew Brost <matthew.brost@intel.com> Reviewed-by:
Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003001657.3517883-2-matthew.brost@intel.com
-
Imre Deak authored
Atm the display HPD interrupts that got disabled during runtime suspend, are re-enabled only if d3cold is enabled. Fix things by also re-enabling the interrupts if d3cold is disabled. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by:
Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by:
Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-5-imre.deak@intel.com
-
Imre Deak authored
For clarity separate the d3cold and non-d3cold runtime PM handling. The only change in behavior is disabling polling later during runtime resume. This shouldn't make a difference, since the poll disabling is handled from a work, which could run at any point wrt. the runtime resume handler. The work will also require a runtime PM reference, syncing it with the resume handler. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by:
Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by:
Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-4-imre.deak@intel.com
-
- Oct 10, 2024
-
-
Currently the check to see if snapshot->copy has been allocated is inverted and ends up dereferencing snapshot->copy when free'ing objects in the array when it is null or not free'ing the objects when snapshot->copy is allocated. Fix this by using the correct non-null pointer check logic. Fixes: d8ce1a97 ("drm/xe/guc: Use a two stage dump for GuC logs and add more info") Signed-off-by:
Colin Ian King <colin.i.king@gmail.com> Reviewed-by:
John Harrison <John.C.Harrison@Intel.com> Signed-off-by:
Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009160510.372195-1-colin.i.king@gmail.com
-
Matthew Auld authored
Technically the or_reset() means we call the action on failure, however that would lead to unbalanced rpm put(). Move the get() earlier to fix this. It should be extremely unlikely to ever trigger this in practice. Fixes: 452bca0e ("drm/xe: Don't suspend device upon wedge") Signed-off-by:
Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by:
Matthew Brost <matthew.brost@intel.com> Reviewed-by:
Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-4-matthew.auld@intel.com
-
Matthew Auld authored
Currently we can call fence_fini() twice if something goes wrong when sending the GuC CT for the tlb request, since we signal the fence and return an error, leading to the caller also calling fini() on the error path in the case of stack version of the flow, which leads to an extra rpm put() which might later cause device to enter suspend when it shouldn't. It looks like we can just drop the fini() call since the fence signaller side will already call this for us. There are known mysterious splats with device going to sleep even with an rpm ref, and this could be one candidate. v2 (Matt B): - Prefer warning if we detect double fini() Fixes: 0a382f9b ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight") Signed-off-by:
Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by:
Matthew Brost <matthew.brost@intel.com> Reviewed-by:
Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-3-matthew.auld@intel.com
-
- Oct 09, 2024
-
-
Add workaround (wa) 15016589081 which applies to Xe2_v3_LPG_MD. Xe2_v3_LPG_MD is a Lunar Lake platform with GFX version: 20.04. This wa is type: permanent, and hence is applicable on all steppings. Signed-off-by:
Aradhya Bhatia <aradhya.bhatia@intel.com> Reviewed-by:
Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241009065542.283151-1-aradhya.bhatia@intel.com
-
Implement the initial set of workarounds for Xe3 IPs. Signed-off-by:
Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008204626.55802-2-matthew.s.atwood@intel.com
-
Thomas Hellström authored
The xe_bo_shrink_kunit test has an uninitialized value and illegal integer size conversions on 32-bit. Fix. v2: - Use div64_u64 to ensure the u64 division compiles everywhere. (Matt Auld) Reported-by:
Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/20240913195649.GA61514@thelio-3990X/ Fixes: 5a90b60d ("drm/xe: Add a xe_bo subtest for shrinking / swapping") Cc: dri-devel@lists.freedesktop.org Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> #v1 Signed-off-by:
Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by:
Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004141121.186177-1-thomas.hellstrom@linux.intel.com
-
Matthew Auld authored
The BSpec says that EN_L3_RW_CCS_CACHE_FLUSH must be toggled on for manual global invalidation to take effect and actually flush device cache, however this also turns on flushing for things like pipecontrol, which occurs between submissions for compute/render. This sounds like massive overkill for our needs, where we already have the manual flushing on the display side with the global invalidation. Some observations on BMG: 1. Disabling l2 caching for host writes and stubbing out the driver global invalidation but keeping EN_L3_RW_CCS_CACHE_FLUSH enabled, has no impact on wb-transient-vs-display IGT, which makes sense since the pipecontrol is now flushing the device cache after the render copy. Without EN_L3_RW_CCS_CACHE_FLUSH the test then fails, which is also expected since device cache is now dirty and display engine can't see the writes. 2. Disabling EN_L3_RW_CCS_CACHE_FLUSH, but keeping the driver global invalidation also has no impact on wb-transient-vs-display. This suggests that the global invalidation still works as expected and is flushing the device cache without EN_L3_RW_CCS_CACHE_FLUSH turned on. With that drop EN_L3_RW_CCS_CACHE_FLUSH. This helps some workloads since we no longer flush the device cache between submissions as part of pipecontrol. Edit: We now also have clarification from HW side that BSpec was indeed wrong here. v2: - Rebase and update commit message. BSpec: 71718 Signed-off-by:
Matthew Auld <matthew.auld@intel.com> Cc: Vitasta Wattal <vitasta.wattal@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by:
Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241007074541.33937-2-matthew.auld@intel.com
-
- Oct 08, 2024
-
-
Save manual engine capture into capture list. This removes duplicate register definitions across manual-capture vs guc-err-capture. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-7-zhanjun.dong@intel.com
-
When we decide to kill a job, (from guc_exec_queue_timedout_job), we could end up with 4 possible scenarios at this starting point of this decision: 1. the guc-captured register-dump is already there. 2. the driver is wedged.mode > 1, so GuC-engine-reset / GuC-err-capture will not happen. 3. the user has started the driver in execlist-submission mode. 4. the guc-captured register-dump is not ready yet so we force GuC to kill that context now, but: A. we don't know yet if GuC will be successful on the engine-reset and get the guc-err-capture, else kmd will do a manual reset later OR B. guc will be successful and we will get a guc-err-capture shortly. So to accomdate the scenarios of 2 and 4A, we will need to do a manual KMD capture first(which is not be reliable in guc-submission mode) and decide later if we need to use that for the cases of 2 or 4A. So this flow is part of the implementation for this patch. Provide xe_guc_capture_get_reg_desc_list to get the register dscriptor list. Add manual capture by read from hw engine if GuC capture is not ready. If it becomes ready at later time, GuC sourced data will be used. Although there may only be a small delay between (1) the check for whether guc-err-capture is available at the start of guc_exec_queue_timedout_job and (2) the decision on using a valid guc-err-capture or manual-capture, lets not take any chances and lock the matching node down so it doesn't get re-claimed if GuC-Err-Capture subsystem is running out of pre-cached nodes. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-6-zhanjun.dong@intel.com
-
Upon the G2H Notify-Err-Capture event, parse through the GuC Log Buffer (error-capture-subregion) and generate one or more capture-nodes. A single node represents a single "engine- instance-capture-dump" and contains at least 3 register lists: global, engine-class and engine-instance. An internal link list is maintained to store one or more nodes. Because the link-list node generation happen before the call to devcoredump, duplicate global and engine-class register lists for each engine-instance register dump if we find dependent-engine resets in a engine-capture-group. To avoid dynamically allocate the output nodes during gt reset, pre-allocate a fixed number of empty nodes up front (at the time of ADS registration) that we can consume from or return to an internal cached list of nodes. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-5-zhanjun.dong@intel.com
-
Capture-nodes generated by GuC are placed in the GuC capture ring buffer which is a sub-region of the larger Guc-Log-buffer. Add capture output size check before allocating the shared buffer. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-4-zhanjun.dong@intel.com
-
Add the ability for runtime allocation and freeing of steered register list extentions that depend on the detected HW config fuses. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-3-zhanjun.dong@intel.com
-
Add referenced registers defines and list of registers. Update GuC ADS size allocation to include space for the lists of error state capture register descriptors. Then, populate GuC ADS with the lists of registers we want GuC to report back to host on engine reset events. This list should include global, engine-class and engine-instance registers for every engine-class type on the current hardware. Ensure we allocate a persistent storage for the register lists that are populated into ADS so that we don't need to allocate memory during GT resets when GuC is reloaded and ADS population happens again. Signed-off-by:
Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by:
Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-2-zhanjun.dong@intel.com
-
Matt Roper authored
MCR steering on Xe3 media IP is almost the same as it was on Xe2, except for one new range (0x38D0D0 - 0x38D0FF) which has changed to an MCR "MEDIAINF" range on Xe3. Since we can always steer to grpid / instanceid 0 for MEDIAINF ranges, define a new "INSTANCE0" steering table for Xe3 media. Xe3 can continue to use the same OADDRM/GPMXMT table as Xe2. v2: Merge continuous entries 38D0D0 - 38F0FF Bspec: 74298 Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-7-matthew.s.atwood@intel.com
-
PTL is an integrated GPU based on the Xe3 architecture. v2: explicitly turn off display until display patches land. Bspec: 72574 Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-6-matthew.s.atwood@intel.com
-
PTL is Xe3 architecture but there is no difference between LNL and PTL in MOCS table. So, PTL uses the same MOCS table as LNL. Bspec: 71582 Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by:
Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-5-matthew.s.atwood@intel.com
-
Define a common set of Xe3 feature flags and definitions that will be used for all platforms in this family. The feature flags are inherited unchanged from the Xe2 (XE2_FEATURES) platform. Following B-spec details inherited from Xe2 feature flag definition commit. v2: reuse graphics_xe2 definition Bspec: 58695 - dma_mask_size remains 46 (not documented in bspec) - supports_usm=1 (Bspec 59651) - has_flatccs=1 (Bspec 58797) - has_4tile=1 (Bspec 58788) - has_asid=1 (Bspec 59654, 59265, 60288) - has_range_tlb_invalidate=1 (Bspec 71126) - five-level page table (Bspec 59505) - 1 VD + 1 VE + 1 SFC (Bspec 67103, 70819) - platform engine mask (Bspec 60149) Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Haridhar Kalvala <haridhar.kalvala@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-3-matthew.s.atwood@intel.com
-
Matt Roper authored
Xe3 platforms use the same PAT tables as Xe2. Bspec: 71582 Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by:
Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241008013509.61233-2-matthew.s.atwood@intel.com
-
On PTL platforms with media version 30.00, the fuse registers for reporting L3 bank availability to the GT just read out as ~0 and do not provide proper values. Xe does not use the L3 bank mask for anything internally; it only passes the mask through to userspace via the GT topology query. Since we don't have any way to get the real L3 bank mask, we don't want to pass garbage to userspace. Passing a zeroed mask or a copy of the primary GT's L3 bank mask would also be inaccurate and likely to cause confusion for userspace. The best approach is to simply not include L3 in the list of masks returned by the topology query in cases where we aren't able to provide a meaningful value. This won't change the behavior for any existing platforms (where we can always obtain L3 masks successfully for all GTs), it will only prevent us from mis-reporting bad information on upcoming platform(s). There's a good chance this will become a formal workaround in the future, but for now we don't have a lineage number so "no_media_l3" is used in place of a lineage as the OOB workaround descriptor. v2: - Re-calculate query size to properly match data returned. (Gustavo) - Update kerneldoc to clarify that the L3bank mask may not be included in the query results if the hardware doesn't make it available. (Gustavo) Cc: Matt Atwood <matthew.s.atwood@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by:
Shekhar Chauhan <shekhar.chauhan@intel.com> Co-developed-by:
Matt Roper <matthew.d.roper@intel.com> Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Reviewed-by:
Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by:
Gustavo Sousa <gustavo.sousa@intel.com> Acked-by:
Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241007154143.2021124-2-matthew.d.roper@intel.com
-
John Harrison authored
Create a helper function that can be used to dump the GuC log to dmesg in a manner that is reliable for extraction and decode. The intention is that calls to this can be added by developers when debugging specific issues that require a GuC log but do not allow easy capture of the log - e.g. failures in selftests and failues that lead to kernel hangs. Also note that this is really a temporary stop-gap. The aim is to allow on demand creation and dumping of devcoredump captures (which includes the GuC log and much more). Currently this is not possible as much of the devcoredump code requires a 'struct xe_sched_job' and those are not available at many places that might want to do the dump. v2: Add kerneldoc - review feedback from Michal W. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-12-John.C.Harrison@Intel.com
-
John Harrison authored
Include the GuC log in devcoredump captures because they can be useful with debugging certain types of bug. v2: Fix kerneldoc v3: Drop module parameter as now using more compact ascii85 encoding rather than hexdump (although still not compressed) (review feedback from Matthew B). Rebase onto recent refactoring of devcoredump code. v4: Don't move the submission snapshot inside the GuC internals structure 'cos it really doesn't belong there. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-11-John.C.Harrison@Intel.com
-
John Harrison authored
The dump of the CT buffers was only showing the unprocessed data which is not generally useful for saying why a hang occurred - because it was probably caused by the commands that were just processed. So save and dump the entire buffer but in a more compact dump format. Also zero fill it on allocation to avoid confusion over uninitialised data in the dump. v2: Add kerneldoc - review feedback from Michal W. v3: Fix kerneldoc. v4: Use ascii85 instead of hexdump (review feedback from Matthew B). v5: Dump the entire CTB object rather than separately dumping just the H2G and G2H sections. That way it includes the full header info. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-10-John.C.Harrison@Intel.com
-
John Harrison authored
Add a worker function helper for asynchronously dumping state when an internal/fatal error is detected in CT processing. Being asynchronous is required to avoid deadlocks and scheduling-while-atomic or process-stalled-for-too-long issues. Also check for a bunch more error conditions and improve the handling of some existing checks. v2: Use compile time CONFIG check for new (but not directly CT_DEAD related) checks and use unsigned int for a bitmask, rename CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments, rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W. Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the entire mask to zero. v3: Fix kerneldoc v4: Nullify some floating pointers after free. v5: Add section headings and device info to make the state dump look more like a devcoredump to allow parsing by the same tools (eventual aim is to just call the devcoredump code itself, but that currently requires an xe_sched_job, which is not available in the CT code). v6: Fix potential for leaking snapshots with concurrent error conditions (review feedback from Julia F). v7: Don't complain about unexpected G2H messages yet because there is a known issue causing them. Fix bit shift bug with v6 change. Add GT id to fake coredump headers and use puts instead of printf. v8: Disable the head mis-match check in g2h_read because it is failing on various discrete platforms due to unknown reasons. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-9-John.C.Harrison@Intel.com
-
This drm printer wrapper can be used to increase the robustness of the captured output generated by any other drm_printer to make sure we didn't lost any intermediate lines of the output by adding line numbers to each output line. Helpful for capturing some crash data. v2: Extended short int counters to full int (JohnH) Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: dri-devel@lists.freedesktop.org Reviewed-by:
Jani Nikula <jani.nikula@intel.com> Acked-by:
Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-8-John.C.Harrison@Intel.com
-
John Harrison authored
Split the GuC log dump into a two stage snapshot and print mechanism. This allows the log to be captured at the point of an error (which may be in a restricted context) and then dump it out later (from a regular context such as a worker function or a sysfs file handler). Also add a bunch of other useful pieces of information that can help (or are fundamentally required!) to decode and parse the log. v2: Add kerneldoc and fix a couple of comment typos - review feedback from Michal W. v3: Move chunking code to this patch as it makes the deltas simpler. Fix a bunch of kerneldoc issues. v4: Move the CS frequency out of the coredump snapshot function into the debugfs only code (as that info is already part of the main devcoredump). Add a header to the debugfs log to match the one in the devcoredump to aid processing by a unified tool. Add forcewake to the GuC timestamp read so it actually works. v6: Add colon to GuC version string (review feedback by Julia F). Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-7-John.C.Harrison@Intel.com
-
John Harrison authored
Add an extra stage to the GuC log print to copy the log buffer into regular host memory first, rather than printing the live GPU buffer object directly. Doing so helps prevent inconsistencies due to the log being updated as it is being dumped. It also allows the use of the ASCII85 helper function for printing the log in a more compact form than a straight hex dump. v2: Use %zx instead of %lx for size_t prints. v3: Replace hexdump code with ascii85 call (review feedback from Matthew B). Move chunking code into next patch as that reduces the deltas of both. v4: Add a prefix to the ASCII85 output to aid tool parsing. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-6-John.C.Harrison@Intel.com
-
John Harrison authored
There is a need to include the GuC log and other large binary objects in core dumps and via dmesg. So add a helper for dumping to a printer function via conversion to ASCII85 encoding. Another issue with dumping such a large buffer is that it can be slow, especially if dumping to dmesg over a serial port. So add a yield to prevent the 'task has been stuck for 120s' kernel hang check feature from firing. v2: Add a prefix to the output string. Fix memory allocation bug. v3: Correct a string size calculation and clean up a define (review feedback from Julia F). Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-5-John.C.Harrison@Intel.com
-
John Harrison authored
The xe_guc_exec_queue_snapshot is not really a GuC internal thing and is definitely not a GuC CT thing. So give it its own section heading. The snapshot itself is really a capture of the submission backend's internal state. Although all it currently prints out is the submission contexts. So label it as 'Contexts'. If more general state is added later then it could be change to 'Submission backend' or some such. Further, everything from the GuC CT section onwards is GT specific but there was no indication of which GT it was related to (and that is impossible to work out from the other fields that are given). So add a GT section heading. Also include the tile id of the GT, because again significant information. Lastly, drop a couple of unnecessary line feeds within sections. v2: Add GT section heading, add tile id to device section. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-4-John.C.Harrison@Intel.com
-
John Harrison authored
There are a bunch of calls to drm_printf with static strings. Switch them to drm_puts instead. There are also a bunch of 'coredump->snapshot.XXX' references when 'coredump->snapshot' has alread been cached locally as 'ss'. So use 'ss->XXX' instead. Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-3-John.C.Harrison@Intel.com
-
John Harrison authored
Including line feeds at the start of a debug print messes up the output when sent to dmesg. The break appears between all the useful prefix information and the actual string being printed. In this case, each block of data has a very clear start line and an extra delimeter is really not necessary. So don't do it. v2: Fix typo in commit message (review feedback from Michal W.) Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Reviewed-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241003004611.2323493-2-John.C.Harrison@Intel.com
-
- Oct 07, 2024
-
-
mwajdecz authored
For feature enabling and testing purposes, allow to capture and replace full VF configuration using debugfs blob file, but only under strict debug config. Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-6-michal.wajdeczko@intel.com
-
mwajdecz authored
We already have support to save and restore GuC VF state, but that will only work when the target VF configuration (provisioning) will be exactly the same as the source VF configuration. To help with assuring that both configurations match, allow to encode whole VF configuration that can be saved as blob and restored later. In the future we may want to use such captured configuration blobs as templates to make sure we provision VFs with exactly the same configuration that was previously tested or recommended, or when debugfs knobs are not be available. Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-5-michal.wajdeczko@intel.com
-
mwajdecz authored
We want to reuse format of the GuC KLVs while saving and restoring VF configuration by the PF driver, but some of those KLVs (like doorbell begin index or GGTT starting offset) are not strictly needed to correctly restore VF configuration. Modify functions to omit encoding of some of the KLVs with GuC only details. Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-4-michal.wajdeczko@intel.com
-
mwajdecz authored
This function may return negative error codes on invalid or incomplete VF configuration, but unlike other int functions, it was returning 1 instead of 0 on success, which might be little inconvinient if we would like to use it directly in other functions. Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-3-michal.wajdeczko@intel.com
-
mwajdecz authored
We already have and use MAKE_GUC_KLV_VF_CFG_THRESHOLD_KEY, but in upcoming patch we want to generate the key for the length. Add MAKE_GUC_KLV_VF_CFG_THRESHOLD_LEN for this purpose. Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by:
Michał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240919171528.1451-2-michal.wajdeczko@intel.com
-
- Oct 06, 2024
-
-
mwajdecz authored
Both xe_memirq_{source,status}_ptr functions are now strictly for obtaining an address of the memory based interrupt report pages used by the HW engines. When initializing the GuC that does not require special per instance page preparation, we don't need to abuse these public functions and pass a NULL instead of valid hwe pointer. Also, without further fixes, this actually may lead to NPD crash once the hw_reports_to_instance_zero() will be true. Add internal helpers that will provide report page addresses based solely on the instance number, which will be always 0 for both GuCs. Fixes: ef6103d2 ("drm/xe: memirq infra changes for MSI-X") Signed-off-by:
Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Ilia Levi <ilia.levi@intel.com> Reviewed-by:
Ilia Levi <ilia.levi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240925084445.1495-1-michal.wajdeczko@intel.com
-
- Oct 04, 2024
-
-
Matt Roper authored
The intent of this debugfs entry is to allow modification of wedging behavior, either from IGT tests or during manual debug; it should be marked as writable to properly reflect this. In practice this hasn't caused a problem because we always access wedged_mode as root, which ignores file permissions, but it's still misleading to have the entry incorrectly marked as RO. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Fixes: 6b8ef44c ("drm/xe: Introduce the wedged_mode debugfs") Signed-off-by:
Matt Roper <matthew.d.roper@intel.com> Reviewed-by:
Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241002230620.1249258-2-matthew.d.roper@intel.com
-