-
Notifications
You must be signed in to change notification settings - Fork 60
[PTI-LIB] Define Callback API and make it work for two domains #97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- implement for 2 domians only, append to immediate cmd list only - make callback sample - add sample to tests and run asan and tsan on it Signed-off-by: jfedorov <julia.fedorova@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements an initial version of the PTI Synchronous Callback API, designed to collect hardware metrics per individual GPU operations via event query mechanism. The implementation covers two domains for immediate command lists and includes a sample demonstrating the functionality.
- Adds a complete callback subscription system with subscriber management
- Implements callback support for GPU operation appended and completed domains
- Creates a callback sample demonstrating matrix multiplication with callback hooks
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
sdk/include/pti/pti_sync_callback.h | New public API header defining callback domains, phases, and function types |
sdk/src/levelzero/ze_collector_cb_helpers.h | New helper classes for managing callback subscribers and domain properties |
sdk/src/levelzero/ze_collector.h | Extensive updates adding callback subscriber management and callback invocation logic |
sdk/src/pti_view.cc | Implementation of callback API functions with exception handling |
sdk/src/pti_view_load.cc | API function wrappers for callback functionality |
sdk/src/view_handler.h | View handler methods for callback operations |
sdk/src/pti_lib_handler.h | Function pointer declarations for callback APIs |
sdk/src/pti.cc | String conversion utilities for callback enums |
sdk/src/unikernel.h | Addition of KernelCommandType enum and command_type_ field |
sdk/samples/callback/main.cc | Complete sample application demonstrating callback API usage |
sdk/samples/callback/CMakeLists.txt | Build configuration for callback sample |
sdk/samples/samples_utilities/samples_utils.h | Utility functions for callback data dumping and command list type conversion |
sdk/src/gen_tracing_callbacks.py | Code generator updates to support callback functionality |
sdk/test/CMakeLists.txt | Test configuration updates to include callback sample |
sdk/CMakeLists.txt | Build system updates to include callback sample |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
// - All times reported by PTI_VIEW in CPU (aka Host) timescale | ||
// - However GPU "commands" (kernel & memory transfers) start/end reported in GPU timescale | ||
// - There is significant time drift between CPU and GPU, so to cope wth it - need to | ||
// - There is significant time drift between CPU and GPU, so to cope with it - need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed typo 'wth' to 'with' in the comment.
Copilot uses AI. Check for mistakes.
|
||
void PostAppendKernelCommandCommon(ZeCollector* /*collector*/, ZeKernelCommand* command, | ||
ZeKernelCommandProps& props, ze_event_handle_t& signal_event, | ||
ze_event_handle_t& signal_event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature changed to remove the ZeKernelCommandProps& props
parameter, but the function is still being called with the old signature in some places. Ensure all call sites are updated to match the new signature.
ze_event_handle_t& signal_event, |
Copilot uses AI. Check for mistakes.
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle)); | ||
if (res) { | ||
std::cout << "Prepared data for Append" << std::endl; | ||
res = zeCommandListAppendWriteGlobalTimestamp( | ||
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is casting _queue_handle
to ze_command_list_handle_t
, but based on the API design, _queue_handle
should be a queue handle, not a command list handle. This type mismatch could lead to runtime errors.
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle)); | |
if (res) { | |
std::cout << "Prepared data for Append" << std::endl; | |
res = zeCommandListAppendWriteGlobalTimestamp( | |
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle), | |
callback_data->_command_list_handle); | |
if (res) { | |
std::cout << "Prepared data for Append" << std::endl; | |
res = zeCommandListAppendWriteGlobalTimestamp( | |
callback_data->_command_list_handle, |
Copilot uses AI. Check for mistakes.
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle)); | ||
if (res) { | ||
std::cout << "Prepared data for Append" << std::endl; | ||
res = zeCommandListAppendWriteGlobalTimestamp( | ||
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous issue, _queue_handle
is being incorrectly cast to ze_command_list_handle_t
when it should be treated as a queue handle. This could cause the Level Zero API call to fail or behave unexpectedly.
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle)); | |
if (res) { | |
std::cout << "Prepared data for Append" << std::endl; | |
res = zeCommandListAppendWriteGlobalTimestamp( | |
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle), | |
callback_data->_command_list_handle); | |
if (res) { | |
std::cout << "Prepared data for Append" << std::endl; | |
res = zeCommandListAppendWriteGlobalTimestamp( | |
callback_data->_command_list_handle, |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for giving the option to provide feedback on this interface.
I think that this generally goes in the right direction. I have a few comments and questions regarding a few things.
This is obviously only the first step towards having a callback interface, as one can see with only having a limited amount of domains, operation kinds and so on.
I think some parts can be made a bit more clear with more documentation, e.g. when one would expect to see cb_data
to actually be set. At the moment, a tool developer would just need to guess when this is the case, and hope that he casts to the right struct.
I'd guess that one needs to cast to pti_callback_gpu_op_data
for PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_*
and to pti_internal_callback_data
for
PTI_CB_PHASE_INTERNAL_*
?
Please also note that I only looked at the headers here. I'm not familiar with the internals of the interface, and would need quite a bit more time to get familiar with that for a proper review.
//!< provided TimestampCallback | ||
PTI_ERROR_BAD_API_ID = 7, //!< invalid api_id when enable/disable runtime/driver specific api_id | ||
PTI_ERROR_BAD_API_ID = 7, //!< invalid api_id when enable/disable runtime/driver specific api_id | ||
PTI_ERROR_AT_LEAST_ONE_GPU_VIEW_MUST_BE_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTI_ERROR_AT_LEAST_ONE_GPU_VIEW_MUST_BE_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing | |
PTI_NO_GPU_VIEW_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing |
I think having a shorter name for the error is actually helpful here.
If developers are interested in having a string description of their error code, one could think about implementing a function like const char* ptiGetErrorMessage(...)
, similar to CUPTI: https://docs.nvidia.com/cupti/api/group__CUPTI__RESULT__API.html
@@ -0,0 +1,227 @@ | |||
//============================================================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about the file name. pti_sync_callback.h
sounds to me like we're only dealing with synchronization stuff here, although this looks to be all callback related event handling.
Can we find a better name for this? Maybe even just drop the _sync
part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was a suggestion from @jmellorcrummey - to name it in a way to pronounce that these callbacks are synchronous. may be I misunderstood
if you and @jmellorcrummey (and anyone) can suggest better name - this would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, but I think using the abbreviation sync
might be a bit unfortunate. I'm also not the best person for naming things though, so I don't have a better idea 😄
The naming of this file should certainly not be a blocker. Even having a short description in the file that this contains synchronous callbacks is sufficient to clearly communicate what this is about.
* This file contains APIs that so far experimental in PTI | ||
* APIs and data structures in this file are work-in-progress and subject to change! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* This file contains APIs that so far experimental in PTI | |
* APIs and data structures in this file are work-in-progress and subject to change! | |
* This file contains APIs that are so far experimental in PTI | |
* APIs and data structures in this file are work-in-progress and subject to change! |
PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_APPENDED = 4, //!< This also serves as PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_DISPATCHED | ||
//!< when appended to Immediate Command List, | ||
//!< which means no separate callback PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_DISPATCHED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding:
The comment basically states that attaching an event to an immediate command list only dispatches this event. We still get an event when the operation has completed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. there still be event on the completion.
But append and dispatch "merged" into one for Immediate command list
PTI_CB_DOMAIN_DRIVER_API = 1023, //!< Not implemeted yet, | ||
//!< attempt to enable it will return PTI_ERROR_NOT_IMPLEMENTED | ||
//!< Callback created for all Driver APIs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend to have something like a RUNTIME_API
as well or would this be handled in any other way? Would that even make sense with Level Zero, since runtime is most likely something like SYCL, OpenCL or other programming paradigms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRIVER_API
here exactly refers to Level Zero. OpenCL would be also "driver level" (when/if supported)
May be one day we might want to add Callbacks to "user"-side /runtime APIs, that many users directly us - e.g. think oneCCL.
pti_backend_queue_t _queue_handle; //!< Device back-end queue handle | ||
pti_device_handle_t _device_handle; //!< Device handle, | ||
pti_callback_phase _phase; //!< Could be ONLY_NOTIFY or API Call ENTER/EXIT | ||
uint32_t _return_code; // will be valid only for L0 API EXIT, for others will be nullptr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullptr
wouldn't really be valid for uint32_t
, should probably also be a defined value as suggested above.
uint32_t _detail; // depending on the domain should be casted/interpreted | ||
// as a purpose of an internal PTI thread or | ||
// pti_internal_event_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t _detail; // depending on the domain should be casted/interpreted | |
// as a purpose of an internal PTI thread or | |
// pti_internal_event_type | |
uint32_t _detail; // For THREAD START/END, this describes the internal PTI thread purpose. | |
// For INTERNAL EVENT, this describes the pti_internal_event_type. |
Is there any way to get the internal PTI thread purpose?
Having a number without any actual meaning feels a bit pointless, even if we have the message in the same struct.
pti_api_group_id driver_api_group_id, // driver API group ID, keep it to distinguish between L0 and OpenCL | ||
// although the current implementation is only for L0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this always the driver_api
ID, or may we also get other event types here?
PTI_API_GROUP_SYCL
would be a runtime group_id
, just by looking at the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest just dropping the driver_
part from this argument and the next one. This gives a bit more flexibility in the future without breaking code building on top of this API.
pti_result PTI_EXPORT | ||
ptiCallbackSubscribe(pti_callback_subscriber_handle* subscriber, | ||
pti_callback_function callback, | ||
void* user_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a question out of curiosity. Do you plan to support more than one subscriber at a time?
CUPTI certainly won't allow more than one, I don't know about the AMD APIs. The OpenMP Tools Interface also only allows one, but one can multiplex this interface, as demonstrated by LLVM.
Having the option to attach multiple subscribers can have its benefit, with the main drawback obviously being the potential overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Current design/implementation supports multiple subscribers. Right - having multiple ones would bring additional overhead,. plus we also need explain how they should work together.
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites | ||
* @param exit_cb - indicates if callback called on exit/end: 0-no, 1-yes; used only for domains with 2 sites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One should probably document which domains support these two parameters.
How should one call this function with these domains if he want to enable this domain?
What would happen if one passes false
for both enter_cb
and exit_cb
?
Would the domain be disabled? Should this return an error code?
@Thyre , thank you for your comments. |
There's no need to rush things 😄 ( at least not from my side ) |
I left some comments internally, I'll take some of the API comments and put them here. |
|
||
#ifndef PTI_SYNC_CALLBACK_H_ | ||
#define PTI_SYNC_CALLBACK_H_ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <stdint.h> | |
So we don't have implicit dependencies on other header files.
#ifndef PTI_SYNC_CALLBACK_H_ | ||
#define PTI_SYNC_CALLBACK_H_ | ||
|
||
#include "pti/pti_metrics.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include metrics in callbacks if metrics is going to be a user of the callbacks?
#include "pti/pti_view.h" | ||
|
||
/** | ||
* This file contains APIs that so far experimental in PTI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* This file contains APIs that so far experimental in PTI | |
* This file contains APIs that so far experimental in PTI. |
* This file contains APIs that so far experimental in PTI | ||
* APIs and data structures in this file are work-in-progress and subject to change! | ||
* | ||
* All in this file concerns Callback API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* All in this file concerns Callback API | |
* All content in this file concerns Callback API. |
* APIs and data structures in this file are work-in-progress and subject to change! | ||
* | ||
* All in this file concerns Callback API | ||
* Callback API is useful for many things, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Callback API is useful for many things, | |
* The Callback API is useful for many purposes, |
* | ||
* All in this file concerns Callback API | ||
* Callback API is useful for many things, | ||
* including to the implementation of MetricsScope functionality that wants to subscribe for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* including to the implementation of MetricsScope functionality that wants to subscribe for | |
* including to the implementation of `MetricsScope` functionality that needs to subscribe to |
* All in this file concerns Callback API | ||
* Callback API is useful for many things, | ||
* including to the implementation of MetricsScope functionality that wants to subscribe for | ||
* kernel append to command list .. and may be to other events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* kernel append to command list .. and may be to other events. | |
* events like kernel appends to a command list, and potentially other events. |
// although the current implementation is only for L0 | ||
uint32_t driver_api_id, | ||
pti_backend_ctx_t backend_context, //!< Driver (L0) level context handle | ||
void* cb_data, //!< depending on the domain it should be type-casted to the pointer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both pti_callback_gpu_op_data
and pti_internal_callback_data
contain _domain
and _phase
can we seperate those into a base record and make this,
pti_callback_base {
.. _domain
.. _phase.
}
pti_callback_base* cb_data
then cast based on the value of cb_data->domain?
struct _pti_callback_gpu_op_data {
pti_callback_base*
}
then I think you can remove domain from the callback too but not sure about that
// or API related to GPU operation submission. | ||
} pti_callback_gpu_op_data; | ||
|
||
typedef struct _pti_internal_callback_data { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsistently named with _pti_callback_gpu_op_data
.
Should it be _pti_gpu_op_callback_data
? or vice versa
void* user_data); | ||
|
||
/** | ||
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback, | |
* @brief Unsubscribe Callback subscriber. This unsubscribes from all domains, disables the callback, |
|
||
/** | ||
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback, | ||
* clean all resources related to the subscriber handle and invalidate the handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* clean all resources related to the subscriber handle and invalidate the handle | |
* cleans up all resources related to the subscriber handle, and invalidate the handle. |
* | ||
* @param subscriber - subscriber handle | ||
* @param domain - domain to enable | ||
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no/yes should probably be an enum. The API is ambiguous this way.
Description
This is an initial implementation of PTI Synchronous Callback API.
Current implementation serves to implement on top of it PTIMetricsScope API (will be added soon) - to collect hardware metrics per individual GPU operations via Event Query mechanism.
This PR
Area of the change
Type(s) of change
Choose one or multiple, leave empty if none of the other choices apply
Tests
Tests will be added in the next PR
Specific HW and OS where to run the test unless generic:
For example, 2 discrete GPUs, integrated GPU, specific GPU model, PyTorch integration test(s)
Checklist
Details on API(s) or command line option(s) changes
If applies - details on the broken backward compatibility
Indicate what API(s) backward compatibility or option(s) is broken, why it might be OK, or suggest on how to deal with it moving forward
Notify the following users
@jmellorcrummey , @Thyre , @anmyachev, @yuninxia @mschilling0, @Rogersyp
Other information