[Core] NumPy String support in NumSharp

## Overview

Add NumPy-compatible string dtype support to NumSharp, enabling text data processing with full NumPy 2.x behavioral parity.

## Problem

NumSharp cannot store strings in arrays, perform vectorized string operations, or interoperate with NumPy string data. This blocks text-heavy ML/data science workflows, pandas-like DataFrames, and full .npy file compatibility.

## Reference

`docs/plans/NUMPY_STRING_TYPES.md` - a 1959-line specification derived from analysis of the NumPy v2.4.2 source.

## Proposal

### Core Type System

- [ ] Add `NPTypeCode.Bytes` for fixed-width byte strings (`S`, NPY_STRING = 18)
- [ ] Add `NPTypeCode.Unicode` for fixed-width Unicode strings (`U`, NPY_UNICODE = 19)
- [ ] Add `NPTypeCode.VString` for variable-length UTF-8 strings (`T`, NPY_VSTRING = 2056)
- [ ] Keep `NPTypeCode.Char` as its own type (maps to `U1` or `S1` internally for string operations)
- [ ] Add `np.bytes_` scalar type (wraps `byte[]`)
- [ ] Add `np.str_` scalar type (wraps `string` or `int[]` for UTF-32)
- [ ] Add `np.character` base class for string scalar hierarchy
- [ ] Add `np.flexible` base class (parent of `character` and `void`)

### Dtype Parsing and Construction

- [ ] Parse `S`, `S5`, `S10`, `|S10` to bytes_ dtype
- [ ] Parse `U`, `U5`, `U10` to str_ dtype
- [ ] Parse `<U10` (little-endian), `>U10` (big-endian), `=U10` (native) byte orders
- [ ] Parse `T` to StringDType
- [ ] Parse `a` as deprecated alias for `S`
- [ ] Parse `c` dtype - map to `NPTypeCode.Char` or treat as `S1`
- [ ] Support `np.dtype(np.bytes_)`, `np.dtype(np.str_)` constructor forms
- [ ] Support tuple form: `np.dtype((np.str_, 5))` for 5-char unicode
- [ ] Implement `dtype.char` property (`S`, `U`, `T`)
- [ ] Implement `dtype.kind` property (`S`, `U`, `T`)
- [ ] Implement `dtype.type` property (scalar type class)
- [ ] Implement `dtype.itemsize` (bytes per element)
- [ ] Implement `dtype.name` (e.g., `bytes80`, `str320`, `StringDType128`)
- [ ] Implement `dtype.str` (e.g., `|S10`, `<U10`, `|T16`)
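
The parsing rules above can be sketched in a few lines. This is a minimal model in plain Python, not NumSharp or NumPy code: `ParsedDType` and `parse_string_dtype` are hypothetical names, the `T` itemsize assumes a 64-bit build, and treating `|U10` as native order is an assumption of the sketch.

```python
import re
import sys
from typing import NamedTuple


class ParsedDType(NamedTuple):
    byteorder: str  # "<", ">" or "|" (not applicable)
    kind: str       # "S", "U" or "T"
    count: int      # bytes for S, code points for U, 0 if unsized
    itemsize: int   # bytes per element


# Optional byte-order prefix, a type character, optional width.
_SPEC = re.compile(r"([<>=|]?)([SUa])(\d*)")


def parse_string_dtype(spec: str) -> ParsedDType:
    if spec == "T":
        # Assumption: 64-bit build, where each element is a 16-byte header.
        return ParsedDType("|", "T", 0, 16)
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a string dtype spec: {spec!r}")
    order, kind, digits = match.groups()
    count = int(digits) if digits else 0
    if kind in ("S", "a"):  # "a" is the deprecated alias for "S"
        # Byte strings have no byte order and use one byte per character.
        return ParsedDType("|", "S", count, count)
    if order in ("", "=", "|"):
        # Resolve native order to the concrete order of this machine.
        order = "<" if sys.byteorder == "little" else ">"
    # Unicode strings use four bytes (UCS-4) per code point.
    return ParsedDType(order, "U", count, count * 4)
```

For example, `parse_string_dtype(">U10")` yields a big-endian `U` dtype with an itemsize of 40 bytes, and `parse_string_dtype("a5")` is treated the same as `S5`.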

### Memory Layout - bytes_ (S)

- [ ] Store as contiguous byte buffer, 1 byte per character
- [ ] Itemsize = exact byte count per element
- [ ] Null-pad shorter strings to itemsize
- [ ] Truncate longer strings to itemsize
- [ ] No byte order (always `|` prefix)
- [ ] Strip trailing nulls on scalar access
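
A minimal sketch of the pad, truncate, and strip rules for one `S` element, in plain Python. `pack_bytes` and `unpack_bytes` are hypothetical helper names used only for illustration.

```python
def pack_bytes(value: bytes, itemsize: int) -> bytes:
    """Store one element: truncate to itemsize, then null-pad on the right."""
    return value[:itemsize].ljust(itemsize, b"\x00")


def unpack_bytes(raw: bytes) -> bytes:
    """Read one element back: only trailing nulls are stripped."""
    return raw.rstrip(b"\x00")
```

With these helpers, `pack_bytes(b"hi", 5)` occupies exactly 5 bytes, `pack_bytes(b"hello world", 5)` keeps only `b"hello"`, and a value with an embedded null such as `b"a\x00b"` round-trips unchanged because only the trailing padding is removed.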

### Memory Layout - str_ (U)

- [ ] Store as UTF-32/UCS-4, 4 bytes per code point (`npy_ucs4` = `uint`)
- [ ] Itemsize = character count × 4
- [ ] Support little-endian (`<`) and big-endian (`>`) byte orders
- [ ] Null-pad shorter strings (null = `0x00000000`)
- [ ] Truncate longer strings to character count
- [ ] Implement byte swapping for cross-platform .npy files
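
The `U` layout can be modeled with Python's UTF-32 codecs. The sketch below is illustrative only: `pack_unicode`, `unpack_unicode`, and `byteswap` are hypothetical names, and the model ignores lone surrogates, which the Python codecs reject.

```python
def _codec(byteorder: str) -> str:
    # The explicit -le / -be codecs do not emit a byte-order mark.
    return "utf-32-le" if byteorder == "<" else "utf-32-be"


def pack_unicode(value: str, count: int, byteorder: str = "<") -> bytes:
    """Truncate to `count` code points, encode as UCS-4, null-pad to count * 4."""
    encoded = value[:count].encode(_codec(byteorder))
    return encoded.ljust(count * 4, b"\x00")


def unpack_unicode(raw: bytes, byteorder: str = "<") -> str:
    """Decode one element and drop the trailing null code points."""
    return raw.decode(_codec(byteorder)).rstrip("\x00")


def byteswap(raw: bytes) -> bytes:
    """Reverse every 4-byte code unit, converting between < and > layouts."""
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))
```

Swapping a little-endian buffer produces the big-endian buffer for the same text, which is the operation needed when loading a `.npy` file written on a machine with the opposite byte order.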

### Memory Layout - StringDType (T)

- [ ] 16-byte header per element (64-bit) or 8-byte (32-bit)
- [ ] Implement small-string optimization (15 bytes or less inline on 64-bit, 7 or less on 32-bit)
- [ ] Implement arena allocator for medium strings (255 bytes or less)
- [ ] Implement heap allocation for large/mutated strings
- [ ] Store UTF-8 encoded content
- [ ] Implement flag byte: `MISSING` (0x80), `INITIALIZED` (0x40), `OUTSIDE_ARENA` (0x20), `LONG` (0x10)
- [ ] Thread-safe allocator access (mutex/lock)
- [ ] Preserve embedded null bytes (unlike legacy types)
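
As a rough illustration of the size thresholds and flag bits listed above, the following plain-Python sketch classifies a string by its UTF-8 length and decodes a flag byte. `storage_class` and `flag_names` are hypothetical helpers, and the labels `small`, `medium`, and `large` simply mirror the wording of this checklist. The sketch does not model the actual header packing.

```python
# Flag bits as listed in the checklist above.
FLAGS = {
    "MISSING": 0x80,
    "INITIALIZED": 0x40,
    "OUTSIDE_ARENA": 0x20,
    "LONG": 0x10,
}


def storage_class(text: str, pointer_bits: int = 64) -> str:
    """Classify by UTF-8 byte length, not by code point count."""
    size = len(text.encode("utf-8"))
    inline_max = 15 if pointer_bits == 64 else 7
    if size <= inline_max:
        return "small"   # stored inline in the element header
    if size <= 255:
        return "medium"  # candidate for the arena allocator
    return "large"


def flag_names(flag_byte: int) -> list[str]:
    """Return the names of all flag bits set in `flag_byte`."""
    return [name for name, bit in FLAGS.items() if flag_byte & bit]
```

Note that the thresholds apply to encoded bytes: eight copies of `é` are only 8 code points but 16 UTF-8 bytes, so they no longer fit inline on a 64-bit build.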

### StringDType NA/Missing Support

- [ ] Support `StringDType(na_object=None)` - None as NA
- [ ] Support `StringDType(na_object=np.nan)` - NaN as NA
- [ ] Support `StringDType(coerce=True/False)` - coercion control
- [ ] NA values sort to end
- [ ] `np.isnan()` identifies NaN-like NA
- [ ] Raise on comparison with non-NaN NA (e.g., None)

### Array Creation

- [ ] `np.array(["hello", "world"], dtype="S")` - auto-size to longest
- [ ] `np.array(["hello", "world"], dtype="U")` - auto-size to longest
- [ ] `np.array(["hello", "world"], dtype="T")` - variable length
- [ ] `np.array([b"hello"], dtype="S")` - from Python bytes
- [ ] `np.array(["hello"], dtype="S")` - encode string as ASCII
- [ ] Truncation when explicit size < content: `dtype="U3"` results in `"hel"`
- [ ] Null-padding when explicit size > content
- [ ] `np.zeros((3,), dtype="U10")` - empty strings
- [ ] `np.empty((3,), dtype="U10")` - uninitialized (may contain garbage)
- [ ] `np.full((3,), "x", dtype="U10")` - fill with string
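
The auto-sizing rule for `S` and `U` can be sketched as follows. `infer_width` and `fixed_width_dtype` are hypothetical names; the minimum width of 1 follows the `np.array([""], dtype="S")` edge case listed later in this issue.

```python
def infer_width(values: list[str], kind: str) -> int:
    """Width of the longest element; bytes for S, code points for U."""
    if kind == "S":
        # Encoding as ASCII raises UnicodeEncodeError for non-ASCII input.
        lengths = [len(v.encode("ascii")) for v in values]
    else:
        lengths = [len(v) for v in values]
    # An all-empty (or empty) input still gets a width of 1.
    return max(lengths + [1])


def fixed_width_dtype(values: list[str], kind: str) -> str:
    """Build the dtype string that an unsized 'S' or 'U' request resolves to."""
    return f"{kind}{infer_width(values, kind)}"
```

For instance, `fixed_width_dtype(["hello", "world"], "S")` resolves to `S5`, and `fixed_width_dtype([""], "S")` resolves to `S1`.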

### Type Conversions and Casting

- [ ] `bytes_.astype("U")` - decode as ASCII to Unicode
- [ ] `str_.astype("S")` - encode as ASCII (fail on non-ASCII)
- [ ] `str_.astype("T")` - full Unicode preserved
- [ ] `StringDType.astype("U10")` - may truncate to width
- [ ] `StringDType.astype("S")` - ASCII only
- [ ] Width truncation: `"U10".astype("U5")` truncates
- [ ] `int/float.astype("U")` results in `["1", "1.5"]`
- [ ] `["1", "2"].astype(int)` results in `[1, 2]`
- [ ] `bool.astype("U")` results in `["True", "False"]`
- [ ] `string.astype(bool)` - empty = False, non-empty = True
- [ ] `datetime64.astype("T")` results in ISO string
- [ ] `StringDType.astype("datetime64")` - parse ISO, `"NaT"` becomes NaT
- [ ] `StringDType.astype("V5")` - raw UTF-8 bytes to void
- [ ] Casting safety: `casting="safe"` vs `"unsafe"`
- [ ] `NPTypeCode.Char` conversion to/from string types

### Comparison Operations

- [ ] `==` element-wise equality (returns bool array)
- [ ] `!=` element-wise inequality
- [ ] `<`, `<=`, `>`, `>=` lexicographic comparison
- [ ] `np.equal`, `np.not_equal`, `np.less`, `np.less_equal`, `np.greater`, `np.greater_equal`
- [ ] Comparison broadcasts correctly
- [ ] bytes_ and str_ cannot be compared directly (TypeError)
- [ ] StringDType can compare with str_ and object (type promotion)

### String Concatenation and Repetition

- [ ] `np.strings.add(a, b)` - concatenate strings
- [ ] `np.strings.multiply(a, n)` - repeat string n times
- [ ] `multiply(s, 0)` or `multiply(s, -1)` results in empty string
- [ ] `multiply(s, sys.maxsize)` results in OverflowError
- [ ] Broadcasting: `["a", "b"] + "x"` results in `["ax", "bx"]`
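
The repetition and broadcasting rules above, modeled on single Python strings and lists. `multiply` and `add` here are stand-ins for the element-wise kernels, not the real `np.strings` functions, and the overflow check is one possible way to detect the condition before allocating.

```python
import sys


def multiply(text: str, times: int) -> str:
    """Repeat `text`; zero or negative counts give the empty string."""
    if times <= 0:
        return ""
    # Check before allocating: would len(text) * times exceed sys.maxsize?
    if len(text) > sys.maxsize // times:
        raise OverflowError("repeated string is too long")
    return text * times


def add(values: list[str], suffix: str) -> list[str]:
    """Broadcast a scalar suffix across every element."""
    return [value + suffix for value in values]
```

Here `add(["a", "b"], "x")` gives `["ax", "bx"]`, and `multiply("ab", -1)` gives the empty string rather than an error.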

### String Length

- [ ] `np.strings.str_len(arr)` - returns int64 array
- [ ] For str_ and StringDType, length = code points, not bytes (a single-code-point emoji = 1); for bytes_, length = bytes
- [ ] Empty string results in 0
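
The distinction matters for a .NET implementation because `string.Length` counts UTF-16 code units, not code points. The plain-Python sketch below contrasts the three possible counts; the helper names are hypothetical.

```python
def code_points(text: str) -> int:
    """What str_len should report for U and T arrays."""
    return len(text)


def utf8_bytes(text: str) -> int:
    """Storage size of the same text inside a StringDType element."""
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    """What a UTF-16 based string type reports as its length."""
    return len(text.encode("utf-16-le")) // 2
```

For the thumbs-up emoji U+1F44D the three counts are 1 code point, 4 UTF-8 bytes, and 2 UTF-16 code units, so a NumSharp `str_len` would need to count code points explicitly rather than rely on the UTF-16 length.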

### String Searching

- [ ] `np.strings.find(a, sub, start=0, end=None)` - first occurrence, -1 if not found
- [ ] `np.strings.rfind(a, sub, start=0, end=None)` - last occurrence
- [ ] `np.strings.index(a, sub, ...)` - like find but raises ValueError
- [ ] `np.strings.rindex(a, sub, ...)` - like rfind but raises ValueError
- [ ] `np.strings.count(a, sub, start=0, end=None)` - count occurrences
- [ ] `np.strings.startswith(a, prefix, start=0, end=None)` - bool array
- [ ] `np.strings.endswith(a, suffix, start=0, end=None)` - bool array

### Case Operations

- [ ] `np.strings.upper(arr)` - uppercase
- [ ] `np.strings.lower(arr)` - lowercase
- [ ] `np.strings.capitalize(arr)` - first char upper, rest lower
- [ ] `np.strings.title(arr)` - titlecase each word
- [ ] `np.strings.swapcase(arr)` - swap case

### Character Classification

- [ ] `np.strings.isalpha(arr)` - all alphabetic
- [ ] `np.strings.isdigit(arr)` - all digits
- [ ] `np.strings.isalnum(arr)` - all alphanumeric
- [ ] `np.strings.isspace(arr)` - all whitespace
- [ ] `np.strings.islower(arr)` - all lowercase (cased chars)
- [ ] `np.strings.isupper(arr)` - all uppercase (cased chars)
- [ ] `np.strings.istitle(arr)` - titlecase format
- [ ] `np.strings.isnumeric(arr)` - numeric (Unicode-aware, str_/StringDType only)
- [ ] `np.strings.isdecimal(arr)` - decimal digits (Unicode-aware, str_/StringDType only)
- [ ] Empty strings return False for all classification functions

### Whitespace Operations

- [ ] `np.strings.strip(arr, chars=None)` - strip both ends
- [ ] `np.strings.lstrip(arr, chars=None)` - strip left
- [ ] `np.strings.rstrip(arr, chars=None)` - strip right
- [ ] `np.strings.center(arr, width, fillchar=" ")` - center-align
- [ ] `np.strings.ljust(arr, width, fillchar=" ")` - left-align
- [ ] `np.strings.rjust(arr, width, fillchar=" ")` - right-align
- [ ] `np.strings.zfill(arr, width)` - zero-fill numeric strings
- [ ] `np.strings.expandtabs(arr, tabsize=8)` - expand tabs

### Splitting and Partitioning

- [ ] `np.strings.partition(arr, sep)` - returns tuple of 3 arrays (before, sep, after)
- [ ] `np.strings.rpartition(arr, sep)` - partition from right
- [ ] `np.char.split(arr, sep, maxsplit)` - returns object array of lists
- [ ] `np.char.rsplit(arr, sep, maxsplit)` - split from right
- [ ] `np.char.splitlines(arr, keepends=False)` - split on line boundaries
- [ ] `np.char.join(sep, seq)` - join sequences

### Replacement and Slicing

- [ ] `np.strings.replace(arr, old, new, count=-1)` - replace occurrences
- [ ] `np.strings.slice(arr, start=None, stop=None, step=None)` - substring extraction (Python slice semantics)
- [ ] `np.strings.slice(arr, None, None, -1)` - reverse string

### Encoding/Decoding

- [ ] `np.strings.encode(arr, encoding="utf-8", errors="strict")` - str to bytes
- [ ] `np.strings.decode(arr, encoding="utf-8", errors="strict")` - bytes to str
- [ ] Error modes: `"strict"`, `"ignore"`, `"replace"`

### Formatting and Translation

- [ ] `np.strings.mod(format_arr, values)` - printf-style formatting
- [ ] `np.strings.translate(arr, table, deletechars=None)` - character translation

### Array Functions with String Support

- [ ] `np.sort(arr)` - lexicographic sort
- [ ] `np.argsort(arr)` - sort indices
- [ ] `np.argmax(arr)` - index of lexicographic max
- [ ] `np.argmin(arr)` - index of lexicographic min
- [ ] `np.searchsorted(arr, value)` - binary search
- [ ] `np.unique(arr)` - unique strings
- [ ] `np.concatenate([arr1, arr2])` - join arrays
- [ ] `np.stack`, `np.vstack`, `np.hstack`, `np.dstack` - stacking
- [ ] `np.where(cond, x, y)` - conditional selection
- [ ] `np.take(arr, indices)` - index selection
- [ ] `np.nonzero(arr)` - indices of non-empty strings
- [ ] `np.any(arr)` - True if any non-empty
- [ ] `np.all(arr)` - True if all non-empty
- [ ] `np.copy(arr)` - copy array
- [ ] `np.resize(arr, new_shape)` - resize with repetition
- [ ] `np.maximum(a, b)`, `np.minimum(a, b)` - lexicographic element-wise

### Structured Arrays

- [ ] String fields in structured dtypes: `[("name", "U20"), ("id", "i4")]`
- [ ] StringDType in structured arrays
- [ ] Nested structured types with string fields
- [ ] Record arrays (`np.rec.array`) with attribute access

### File I/O

- [ ] `np.save()`/`np.load()` for bytes_, str_, StringDType (.npy format)
- [ ] `np.savez()`/`np.load()` for string arrays (.npz archives)
- [ ] `np.loadtxt(dtype="U")` - load text as strings
- [ ] `np.savetxt(fmt="%s")` - save strings to text
- [ ] `np.genfromtxt(dtype="U")` - flexible text loading
- [ ] `arr.tofile()`/`np.fromfile(dtype="S10")` - binary I/O (fixed-width only)
- [ ] Memory mapping for fixed-width strings: `np.memmap(dtype="U100")`
- [ ] StringDType cannot be memory-mapped (variable length)
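
For fixed-width strings, the `.npy` header only needs to carry the dtype string (for example `<U5`) and the shape. The sketch below writes and parses a version 1.0 header with the standard library, as an illustration of what the NumSharp reader and writer must handle. `build_npy_header` and `parse_npy_header` are hypothetical names, and the sketch covers the header only, not the array data.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def build_npy_header(descr: str, shape: tuple) -> bytes:
    """Build a version 1.0 .npy header for a C-ordered array."""
    header = repr({"descr": descr, "fortran_order": False, "shape": shape})
    # Magic (6) + version (2) + length field (2) + header + newline,
    # space-padded so the total size is a multiple of 64 bytes.
    unpadded = len(MAGIC) + 2 + 2 + len(header) + 1
    padding = -unpadded % 64
    body = (header + " " * padding + "\n").encode("latin1")
    return MAGIC + bytes([1, 0]) + struct.pack("<H", len(body)) + body


def parse_npy_header(blob: bytes) -> dict:
    """Read the header dict back from the start of a .npy file."""
    if blob[:6] != MAGIC:
        raise ValueError("not a .npy file")
    (header_len,) = struct.unpack("<H", blob[8:10])
    return ast.literal_eval(blob[10:10 + header_len].decode("latin1"))
```

A `<U5` array of shape `(3,)` would be followed by 3 × 5 × 4 = 60 bytes of UCS-4 data after this header.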

### Boolean Conversion and Truthiness

- [ ] `arr.astype(bool)` - empty = False, non-empty = True
- [ ] `np.nonzero(arr)` returns indices of non-empty strings
- [ ] StringDType NA is truthy for nonzero (but identifiable via `np.isnan`)

### Edge Cases

- [ ] Empty string arrays: `np.array([""], dtype="S")` results in `dtype="|S1"`
- [ ] Zero-size dtype: `np.dtype("S0").itemsize` results in 0
- [ ] Embedded null bytes preserved in storage
- [ ] Only trailing null bytes are stripped on scalar access (bytes_); embedded nulls followed by other data are returned
- [ ] Unicode surrogates: valid in UTF-32, may fail converting to UTF-8
- [ ] Combining characters: `"e\u0301"` is 2 code points (displays as 1)
- [ ] No automatic Unicode normalization
- [ ] Very large strings: respect `NPY_MAX_STRING_SIZE`
- [ ] Overflow detection in `multiply` and `add`

### Legacy Support

- [ ] `np.char.chararray` - deprecated subclass (minimal support if needed)
- [ ] chararray automatic whitespace stripping behavior
- [ ] chararray instance methods (upper, lower, etc.)
- [ ] `np.char.*` functions as aliases to `np.strings.*`

### Type Checking

- [ ] `np.issubdtype(arr.dtype, np.bytes_)` - True for S
- [ ] `np.issubdtype(arr.dtype, np.str_)` - True for U
- [ ] `np.issubdtype(arr.dtype, np.character)` - True for S or U
- [ ] `np.issubdtype(arr.dtype, np.flexible)` - True for S, U, V
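
The checks above follow from the scalar class hierarchy described under Core Type System. A minimal model in plain Python, using stand-in classes rather than the real NumPy types (which also inherit from Python's `bytes` and `str`); `issubdtype` here is a simplified sketch keyed by dtype kind character.

```python
class generic: ...
class flexible(generic): ...
class character(flexible): ...
class bytes_(character): ...
class str_(character): ...
class void(flexible): ...


# Dtype kind character mapped to its scalar type.
_SCALAR_FOR_KIND = {"S": bytes_, "U": str_, "V": void}


def issubdtype(kind: str, parent: type) -> bool:
    """True if the scalar type for `kind` derives from `parent`."""
    return issubclass(_SCALAR_FOR_KIND[kind], parent)
```

With this hierarchy, `issubdtype("V", flexible)` is true while `issubdtype("V", character)` is false, matching the last two checklist items.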

## Evidence

- Reference document: `docs/plans/NUMPY_STRING_TYPES.md`
- NumPy source: `src/numpy/` (v2.4.2)
- String dtypes required for: pandas interop, scikit-learn text vectorizers, NLP pipelines, CSV/text file processing

## Scope / Non-goals

**In scope**: Core string dtypes, string operations API, file I/O, structured arrays, type conversions

**Out of scope**:
- Regular expressions (use .NET `Regex`)
- Locale-aware collation/sorting
- NLP-specific operations (tokenization, stemming)
- Full chararray compatibility (deprecated in NumPy)

