
Implement get_uncompressed_data_size_in_bytes()#92

Merged
rapids-bot[bot] merged 3 commits into NVIDIA:main from wmalpica:logical_data_size
Apr 2, 2026

Conversation

@wmalpica
Contributor

This PR introduces the concept of get_logical_data_size_in_bytes().
This is necessary because the existing get_data_size_in_bytes() returns how much space the representation currently occupies; if the representation is compressed, we need a different API to return how much space the data would occupy if we materialize it on the GPU.

@copy-pr-bot

copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@joosthooz joosthooz left a comment


I'm in favor of this. Do we want to change the existing get_data_size_in_bytes() to return the full memory footprint, including the padding due to the allocation blocks and the metadata for the packed representation? Similar to how a cudf table has alloc_size().

@mbrobbel
Member

/ok to test 5248bab

Contributor

@9prady9 9prady9 left a comment


Does get_data_size always return the uncompressed size? I don't see it in this PR; can we rename that as well to reflect what size it returns?

@wmalpica
Contributor Author

In reply to @joosthooz:

I think it would be nice if we had a pair of APIs, get_logical_data_size_in_bytes() vs get_physical_data_size_in_bytes(), where get_data_size_in_bytes() is renamed to the physical one and includes the padding.
However, that would be a very breaking change, and it's not super simple to capture the true physical allocations, so I would suggest that as a follow-up PR.

@9prady9 I believe that the existing get_data_size_in_bytes() does indeed always return uncompressed bytes, because (in cuCascade) all the representations are uncompressed. But it does not include padding.

github-merge-queue bot pushed a commit to sirius-db/sirius that referenced this pull request Apr 1, 2026
…umption estimation (#473)

Implements #472 

Requires NVIDIA/cuCascade#92
If the above change in cuCascade is not desired, I can make a couple
small changes to explicitly account for the host_parquet_representation.

This PR adds the following functions/concepts:
- local_state::get_task_consumption_basis(), which for most tasks is the number of input bytes, but can mean different things for tasks like scans.
- local_state::_bytes_to_materialize_input, which contains how many bytes are needed to put the input on the GPU (0 if the data is already on the GPU).
- global_state::get_memory_history(): this will allow you to `record()` the input task consumption basis and the peak memory consumption.
- get_estimated_reservation_size() is updated to use pipeline_memory_history::estimate_peak_memory(), which uses a weighted average of historical peak/estimated ratios, and it also accounts for needing to pull data onto the GPU from host or disk.

Design considerations:
- The history is currently kept per pipeline, for simplicity's sake, as opposed to per operator. The issue with this is that if a task fails in the middle of a pipeline, the estimation logic is not as clean. This is ameliorated by the fact that if a task fails, we capture the initial task estimation basis so that it can be used in the task retry; the memory estimation/reservation logic then applies the same as if the retry were starting from the beginning of the pipeline.
- If a task fails, we capture the peak consumption during failure, so
that retries should then at least reserve as much memory as was being
consumed during the OOM. If the same task fails multiple times, it will
record the history of the failure with the highest peak consumption,
such that the next attempt will reserve more and ideally succeed.
- If the data resides off the GPU, the memory estimation will add to the memory reservation how much it needs to hold the data on the GPU (_bytes_to_materialize_input). This value also gets subtracted from the peak consumption recorded for a task, since it's not really part of the operators themselves; if an operator receives a mixture of data that came from GPU and host, we want the operator history to represent the actual operator consumption.
- If materializing data onto the GPU requires any overhead (such as decompression overhead, as for host_parquet_representation), that overhead will not be explicitly accounted for. If the peak consumption in a pipeline is actually due to this overhead rather than to an operator's peak, it will still be captured by the operator history, which is not accurate; but until a better way of accounting for this is implemented, it does help with memory estimation and tracking. If we later introduce a compressed format that we can use during downgrading, we don't yet have a good way to account for that. I have some ideas which can be layered on top of this implementation: one option is to have converters accurately represent their memory needs; another is to create historical tracking of conversions.

A couple of things missing for follow-up PRs:
- This does not properly capture history for scan tasks. We need to figure out what to do about parquet scans in general: right now they can use the GPU but then leave results on the CPU, which is not what we want to do in the long run.
- We need to add a way to request downgrades from the downgrade executor

---------

Co-authored-by: William Malpica <[email protected]>
Co-authored-by: Copilot <[email protected]>
@wmalpica wmalpica changed the title Implement get_logical_data_size_in_bytes() Implement get_uncompressed_data_size_in_bytes() Apr 2, 2026
@wmalpica
Contributor Author

wmalpica commented Apr 2, 2026

I changed the API name from get_logical_data_size_in_bytes() to get_uncompressed_data_size_in_bytes() based on feedback.

Member

@dhruv9vats dhruv9vats left a comment


Thanks @wmalpica! uncompressed is much less ambiguous.

@dhruv9vats
Member

/ok to test 77d7590

…sses

The pure virtual method was added to idata_representation but the test
stubs in test_representation_converter.cpp were not updated, causing
abstract class instantiation errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@mbrobbel
Member

mbrobbel commented Apr 2, 2026

/ok to test 6951406

@mbrobbel
Member

mbrobbel commented Apr 2, 2026

/merge

@rapids-bot rapids-bot bot merged commit 942c0bf into NVIDIA:main Apr 2, 2026
11 checks passed
github-merge-queue bot pushed a commit to sirius-db/sirius that referenced this pull request Apr 2, 2026
…ze_in_bytes() (#539)

This is a small change that I forgot to add to
#473

It needs to be merged with NVIDIA/cuCascade#92

---------

Co-authored-by: William Malpica <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>

Labels

feature request New feature or request non-breaking
