Implement get_uncompressed_data_size_in_bytes() (#92)
rapids-bot[bot] merged 3 commits into NVIDIA:main
Conversation
joosthooz
left a comment
I'm in favor of this. Do we want to change the existing get_data_size_in_bytes() to return the full memory footprint, including the padding due to the allocation blocks and the metadata for the packed representation? Like a cudf table has alloc_size().
/ok to test 5248bab
9prady9
left a comment
Does get_data_size always return the uncompressed size? I don't see it in this PR; can we rename that as well to reflect what size it returns?
In reply to @joosthooz: I think it would be nice if we had one API which was for the
@9prady9 I believe that the existing
…umption estimation (#473)

Implements #472. Requires NVIDIA/cuCascade#92. If the above change in cuCascade is not desired, I can make a couple of small changes to explicitly account for the host_parquet_representation.

This PR adds the following functions/concepts:
- local_state::get_task_consumption_basis(): for most tasks this is the input number of bytes, but it can mean different things for things like scans.
- local_state::_bytes_to_materialize_input: how many bytes are needed to put the input on the GPU (0 if the data is already on the GPU).
- global_state::get_memory_history(): allows you to `record()` the input task consumption basis and the peak memory consumption.
- get_estimated_reservation_size() is updated to use pipeline_memory_history::estimate_peak_memory(), which uses a weighted average of historical peak/estimated ratios and also accounts for needing to pull data onto the GPU from host or disk.

Design considerations:
- The history is currently per pipeline, for simplicity's sake, as opposed to per operator. The issue with this is that if a task fails in the middle of a pipeline, the estimation logic is not as clean. This is ameliorated by the fact that if a task fails, we capture the initial task estimation basis so that it can be used in the task retry; the memory estimation/reservation logic then applies the same as if the retry were starting from the beginning of the pipeline.
- If a task fails, we capture the peak consumption during the failure, so that retries should reserve at least as much memory as was being consumed during the OOM. If the same task fails multiple times, the history records the failure with the highest peak consumption, so that the next attempt reserves more and ideally succeeds.
- If the data resides off the GPU, the memory estimation adds to the reservation how much it needs to hold the data on the GPU (_bytes_to_materialize_input). This value also gets subtracted from the peak consumption recorded for a task, since it is not really part of the operators themselves; if an operator receives a mixture of data that came from GPU and host, we want the operator history to represent the actual operator consumption.
- If materializing data onto the GPU requires any overhead (such as decompression overhead, as for host_parquet_representation), that overhead is not explicitly accounted for. But if the peak consumption in a pipeline is actually due to this overhead rather than to an operator, it will be captured by the operator history. This is not accurate, but until a better way of accounting for it is implemented, it does help with memory estimation and tracking. If we later introduce a compressed format that we can use during downgrading, we don't have a good way to account for that. I have some ideas which can be layered on top of this implementation: one option is to have converters accurately represent their memory needs; another is to create historical tracking of conversions.

A couple of things missing for follow-up PRs:
- This does not properly capture historicals for scan tasks. We need to figure out what to do about Parquet scans in general. Right now they can use the GPU but then leave results on the CPU, which is not what we want in the long run.
- We need to add a way to request downgrades from the downgrade executor.

---------
Co-authored-by: William Malpica <[email protected]>
Co-authored-by: Copilot <[email protected]>
I changed the API name from
dhruv9vats
left a comment
Thanks @wmalpica! uncompressed is much less ambiguous.
/ok to test 77d7590
…sses

The pure virtual method was added to idata_representation, but the test stubs in test_representation_converter.cpp were not updated, causing abstract-class instantiation errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
/ok to test 6951406
/merge
…ze_in_bytes() (#539)

This is a small change that I forgot to add to #473. It needs to be merged with NVIDIA/cuCascade#92.

---------
Co-authored-by: William Malpica <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>
This PR introduces the concept of get_logical_data_size_in_bytes().
This is necessary because the existing get_data_size_in_bytes() returns how much space the representation currently occupies; if the representation is compressed, we need a different API that returns how much space the data will occupy once we materialize it on the GPU.