I've been looking at how record types could be integrated into rust-numpy, and here's an unsorted collection of thoughts for discussion.
Let's look at `Element`:

```rust
pub unsafe trait Element: Clone + Send {
    const DATA_TYPE: DataType;
    fn is_same_type(dtype: &PyArrayDescr) -> bool;
    fn npy_type() -> NPY_TYPES { ... }
    fn get_dtype(py: Python) -> &PyArrayDescr { ... }
}
```
- `npy_type()` is used in `PyArray::new()` and the like. Instead, one should use `PyArray_NewFromDescr()` to make use of the custom descriptor. Should all the places where `npy_type()` is used be split between "simple type, use `New`" and "user type, use `NewFromDescr`"? Or, alternatively, should arrays always be constructed from a descriptor? (In which case `npy_type()` becomes redundant and should be removed; see the small illustration below.)
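A minimal illustration of the "always construct from a descriptor" idea, done at the Python level only since that's easy to show: the simple-type path is just a special case of the descriptor path (the builtin descriptor is always recoverable from the type), so a single `PyArray_NewFromDescr()`-based route inside `PyArray::new()` would lose nothing. The code below is illustrative, not rust-numpy API:

```rust
use pyo3::prelude::*;
use pyo3::types::IntoPyDict;

fn demo(py: Python<'_>) -> PyResult<()> {
    let np = py.import("numpy")?;
    // The builtin descriptor is recoverable from the type itself...
    let descr = np.getattr("dtype")?.call1(("int32",))?;
    // ...and construction can always go through an explicit descriptor.
    let kwargs = [("dtype", descr)].into_py_dict(py);
    let arr = np.getattr("zeros")?.call((3,), Some(kwargs))?;
    assert!(arr.getattr("dtype")?.eq(descr)?);
    Ok(())
}
```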
- Why is `is_same_type()` needed at all? It is only used in `FromPyObject::extract()`, where one could simply use `PyArray_EquivTypes()` (like it's done in pybind11). Isn't it largely redundant? (Or does it exist for optimization purposes? If so, is the difference even noticeable performance-wise?) A sketch of the equivalence-based check follows below.
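A hedged sketch of what the extract-time check could look like without `is_same_type()`; `has_expected_dtype` is a hypothetical helper, and the Python-level dtype `__eq__` used here is implemented by numpy in terms of `PyArray_EquivTypes`, i.e. the same check pybind11 performs via the C API:

```rust
use numpy::Element;
use pyo3::prelude::*;

// Hypothetical helper: compare the array's dtype against T::get_dtype() instead
// of calling T::is_same_type(). dtype __eq__ is equivalence-based in numpy.
fn has_expected_dtype<T: Element>(array: &PyAny) -> PyResult<bool> {
    let py = array.py();
    array.getattr("dtype")?.eq(T::get_dtype(py))
}
```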
- The `DATA_TYPE` constant is really only used in two places, to check whether the type is an object, like this:

  ```rust
  if T::DATA_TYPE != DataType::Object
  ```

  Isn't this redundant as well? Given that one can always do

  ```rust
  T::get_dtype(py).get_datatype() != Some(DataType::Object)
  // or, one could add something like: T::get_dtype(py).is_object()
  ```
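The `is_object()` helper hinted at above could be as small as this (an assumed addition inside rust-numpy's dtype module, where an inherent impl on `PyArrayDescr` is possible):

```rust
// Hypothetical convenience method; relies only on the existing get_datatype().
impl PyArrayDescr {
    pub fn is_object(&self) -> bool {
        self.get_datatype() == Some(DataType::Object)
    }
}
```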
- With all the notes above, `Element` essentially becomes just:

  ```rust
  pub unsafe trait Element: Clone + Send {
      fn get_dtype(py: Python) -> &PyArrayDescr;
  }
  ```
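Under that reduced trait, an impl for a builtin scalar could look roughly like this; a minimal sketch, uncached and built via numpy's Python-level `dtype` constructor purely for illustration (a real implementation would go through the C API and cache the descriptor):

```rust
use numpy::PyArrayDescr;
use pyo3::prelude::*;

// Assumes the reduced Element trait sketched above (get_dtype only).
unsafe impl Element for f64 {
    fn get_dtype(py: Python) -> &PyArrayDescr {
        py.import("numpy").unwrap()
            .getattr("dtype").unwrap()
            .call1(("float64",)).unwrap()
            .downcast::<PyArrayDescr>().unwrap()
    }
}
```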
- For structured types, do we want to stick the type descriptor into `DataType`? E.g.:

  ```rust
  enum DataType { ..., Record(RecordType) }
  ```

  Or, alternatively, just keep it as `DataType::Void`? In which case, how does one recover the record type descriptor? (It can always be done through the numpy C API, of course, via `PyArrayDescr`.)
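If the descriptor does go into `DataType`, a `RecordType` would have to capture roughly the same information as numpy's fields dict. A hypothetical shape (all names made up):

```rust
// One possible layout for a record type descriptor: per-field name, dtype and
// byte offset, plus the total itemsize of the record.
pub struct RecordField {
    pub name: String,
    pub dtype: DataType,
    pub offset: usize,
}

pub struct RecordType {
    pub fields: Vec<RecordField>,
    pub itemsize: usize,
}
```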
- In order to enable user-defined record dtypes, having to return `&PyArrayDescr` would probably require:
  - Maintaining a global static thread-safe registry of registered dtypes (kind of like it's done in pybind11; a sketch follows this list)
  - Initializing this registry somewhere
  - Any other options?
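A minimal sketch of what such a registry could look like, assuming `once_cell` for the static; none of these names exist in rust-numpy:

```rust
use std::any::TypeId;
use std::collections::HashMap;
use std::sync::Mutex;

use numpy::PyArrayDescr;
use once_cell::sync::Lazy;
use pyo3::prelude::*;

// Global thread-safe registry of user-defined dtypes, keyed by the Rust type.
static DTYPE_REGISTRY: Lazy<Mutex<HashMap<TypeId, Py<PyArrayDescr>>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

/// Called once per type (e.g. by the derive machinery or a module initializer).
pub fn register_dtype<T: 'static>(descr: &PyArrayDescr) {
    DTYPE_REGISTRY
        .lock()
        .unwrap()
        .insert(TypeId::of::<T>(), descr.into());
}

/// Looked up from Element::get_dtype(). Returns an owned handle; handing out a
/// &PyArrayDescr bound to the GIL lifetime from behind a Mutex needs more care.
pub fn registered_dtype<T: 'static>(py: Python<'_>) -> Option<Py<PyArrayDescr>> {
    DTYPE_REGISTRY
        .lock()
        .unwrap()
        .get(&TypeId::of::<T>())
        .map(|descr| descr.clone_ref(py))
}
```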
- `Element` should probably be implemented for tuples and fixed-size arrays (a sketch of how a tuple descriptor could be built follows below).
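For instance, an impl for `(i32, f64)` could build its descriptor along these lines; the `f0`/`f1` names follow numpy's convention for unnamed fields, and `tuple_dtype` is purely illustrative (again going through the Python-level constructor):

```rust
use numpy::PyArrayDescr;
use pyo3::prelude::*;

// Hypothetical: what a tuple impl of Element::get_dtype() could delegate to.
fn tuple_dtype<'py>(py: Python<'py>) -> PyResult<&'py PyArrayDescr> {
    let spec = vec![("f0", "<i4"), ("f1", "<f8")];
    let descr = py.import("numpy")?.getattr("dtype")?.call1((spec,))?;
    Ok(descr.downcast::<PyArrayDescr>()?)
}
```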
- In order to implement structured dtypes, we'll inevitably have to resort to proc-macros. A few random thoughts and examples of how it could be done (any suggestions?); a sketch of what such a derive might expand to follows this list:

  ```rust
  #[numpy(record)]
  #[derive(Clone, Copy)]
  #[repr(packed)]
  struct Foo { x: i32, u: Bar } // where Bar is a registered numpy dtype as well
  // dtype = [('x', '<i4'), ('u', ...)]
  ```
- We probably have to require one of `#[repr(C)]`, `#[repr(packed)]`, or `#[repr(transparent)]`.
- If a repr is required, it can be an argument of the macro, e.g. `#[numpy(record, repr = "C")]`. (Or not.)
- Do we also have to require `Copy`? (Or not? Technically, you could have object-type fields inside.)
- For wrapper types, we can allow something like this:

  ```rust
  #[numpy(transparent)]
  #[repr(transparent)]
  struct Wrapper(pub i32);
  // dtype = '<i4'
  ```
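To make the proc-macro idea more concrete, here's a rough sketch of what `#[numpy(record)]` might expand to for `Foo` under the reduced single-method trait. Everything here is hypothetical (no caching, descriptor assembled via the Python-level constructor); the point is only that each field delegates to its own `Element::get_dtype()`:

```rust
use numpy::PyArrayDescr;
use pyo3::prelude::*;
use pyo3::ToPyObject;

// Stand-in for another registered Element type (could itself be a record).
type Bar = f32;

#[derive(Clone, Copy)]
#[repr(packed)]
struct Foo {
    x: i32,
    u: Bar,
}

// Generated impl (sketch, assuming the reduced Element trait): build the
// descriptor [('x', '<i4'), ('u', ...)] by asking each field for its dtype.
unsafe impl Element for Foo {
    fn get_dtype(py: Python) -> &PyArrayDescr {
        let spec = vec![
            ("x", <i32 as Element>::get_dtype(py).to_object(py)),
            ("u", <Bar as Element>::get_dtype(py).to_object(py)),
        ];
        py.import("numpy").unwrap()
            .getattr("dtype").unwrap()
            .call1((spec,)).unwrap()
            .downcast::<PyArrayDescr>().unwrap()
    }
}
```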
- For object types, the current suggestion in the docs is to implement a wrapper type and then impl `Element` for it manually. This seems largely redundant, given that the `DATA_TYPE` will always be `Object`. It would be nice if any `#[pyclass]`-wrapped type could automatically implement `Element`, but that's impossible due to the orphan rule. An alternative would be something like this (a sketch of what it would replace follows below):

  ```rust
  #[pyclass]
  #[numpy] // i.e., #[numpy(object)]
  struct Foo {}
  ```
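For comparison, the manual wrapper impl that this attribute would spare users from writing looks roughly like this under the current trait; a sketch only, assuming `DATA_TYPE` and `is_same_type()` are the required items:

```rust
use numpy::{DataType, Element, PyArrayDescr};
use pyo3::prelude::*;

#[pyclass]
struct Foo {}

// The wrapper stores a Python reference, so the array holds object pointers.
#[derive(Clone)]
struct FooWrapper(Py<Foo>);

unsafe impl Element for FooWrapper {
    const DATA_TYPE: DataType = DataType::Object;

    fn is_same_type(dtype: &PyArrayDescr) -> bool {
        dtype.get_datatype() == Some(DataType::Object)
    }
}
```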
- How does one register dtypes for foreign (remote) types, e.g. `OrderedFloat<f32>`, `Wrapping<u64>`, or some `PyClassFromOtherCrate`? We could try doing something like what serde does for remote types (a strawman sketch follows below).
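Purely as a strawman, the serde-style "remote derive" could translate to a local mirror type that carries the attribute and names the foreign type; nothing below exists, the syntax is made up:

```rust
// Hypothetical syntax, mirroring serde's #[serde(remote = "...")]: the local
// definition supplies the layout, and the generated Element impl would be
// attached to the foreign OrderedFloat<f32> via this mirror type.
#[numpy(transparent, remote = "OrderedFloat<f32>")]
#[repr(transparent)]
struct OrderedF32Def(f32);
// dtype = '<f4'
```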