Overview
Add NumPy-compatible string dtype support to NumSharp, enabling text data processing with full NumPy 2.x behavioral parity.
Problem
NumSharp cannot store strings in arrays, perform vectorized string operations, or interoperate with NumPy string data. This blocks text-heavy ML/data science workflows, pandas-like DataFrames, and full .npy file compatibility.
Reference
docs/plans/NUMPY_STRING_TYPES.md - 1959-line specification from NumPy v2.4.2 source analysis.
Proposal
Core Type System
Add NPTypeCode.Bytes for fixed-width byte strings (S, NPY_STRING = 18)
Add NPTypeCode.Unicode for fixed-width Unicode strings (U, NPY_UNICODE = 19)
Add NPTypeCode.VString for variable-length UTF-8 strings (T, NPY_VSTRING = 2056)
Keep NPTypeCode.Char as its own type (maps to U1 or S1 internally for string operations)
Add np.bytes_ scalar type (wraps byte[])
Add np.str_ scalar type (wraps string or int[] for UTF-32)
Add np.character base class for string scalar hierarchy
Add np.flexible base class (parent of character and void)
Dtype Parsing and Construction
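As a reference point for the parsing rules in this section, the sketch below prints what NumPy itself reports for a few of the dtype strings. It assumes a NumPy installation and only illustrates the target behavior; it is not NumSharp code, and the variable names are illustrative.

```python
import numpy as np

# Fixed-width dtypes built from the string codes listed in this section.
s10 = np.dtype("S10")    # 10 bytes per element, no byte order
u10 = np.dtype("<U10")   # 10 UTF-32 code points, explicit little-endian

for dt in (s10, u10):
    # kind/char identify the family; itemsize is bytes per element.
    print(dt.kind, dt.char, dt.itemsize, dt.name, dt.str)

# The (scalar type, length) tuple form is equivalent to the string form.
u5 = np.dtype((np.str_, 5))
print(u5 == np.dtype("U5"), s10.type is np.bytes_, u10.type is np.str_)
```

The U itemsize is four times the character count because each element stores UTF-32 code points.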
S, S5, S10, |S10 to bytes_ dtype
U, U5, U10 to str_ dtype
<U10 (little-endian), >U10 (big-endian), =U10 (native) byte orders
T to StringDType
a as deprecated alias for S
c dtype - map to NPTypeCode.Char or treat as S1
np.dtype(np.bytes_), np.dtype(np.str_) constructor forms
np.dtype((np.str_, 5)) for 5-char unicode
dtype.char property (S, U, T)
dtype.kind property (S, U, T)
dtype.type property (scalar type class)
dtype.itemsize (bytes per element)
dtype.name (e.g., bytes80, str320, StringDType128)
dtype.str (e.g., |S10, <U10, |T16)
Memory Layout - bytes_ (S)
[...] | prefix)
Memory Layout - str_ (U)
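A minimal pure-Python sketch of the fixed-width str_ layout this section describes: 4 bytes per code point, a chosen byte order, and zero padding up to the field width. The helper names `pack_u` and `unpack_u` are hypothetical and exist only for illustration.

```python
def pack_u(text: str, width: int, byteorder: str = "<") -> bytes:
    """Pack text as one fixed-width UTF-32 element of `width` code points."""
    codec = "utf-32-le" if byteorder == "<" else "utf-32-be"
    truncated = text[:width]              # longer inputs are cut to the field width
    raw = truncated.encode(codec)         # 4 bytes per code point, no BOM
    return raw.ljust(4 * width, b"\x00")  # pad the remainder with null code points


def unpack_u(raw: bytes, byteorder: str = "<") -> str:
    """Decode one fixed-width element and drop the trailing null padding."""
    codec = "utf-32-le" if byteorder == "<" else "utf-32-be"
    return raw.decode(codec).rstrip("\x00")


element = pack_u("hi", 5)
print(len(element), unpack_u(element))
```

With NumPy available, `np.array(["hi"], dtype="<U5").tobytes()` should produce the same 20 bytes, which makes the helper a convenient cross-check for a .npy reader or writer.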
[...] npy_ucs4 = uint)
[...] <) and big-endian (>) byte orders
[...] 0x00000000)
Memory Layout - StringDType (T)
MISSING (0x80), INITIALIZED (0x40), OUTSIDE_ARENA (0x20), LONG (0x10)
StringDType NA/Missing Support
StringDType(na_object=None) - None as NA
StringDType(na_object=np.nan) - NaN as NA
StringDType(coerce=True/False) - coercion control
np.isnan() identifies NaN-like NA
Array Creation
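The creation forms in this section can be checked against NumPy directly. The sketch below assumes a NumPy installation and covers only the fixed-width S and U cases; the variable names are illustrative.

```python
import numpy as np

# Width is inferred from the longest element when only the kind is given.
auto_u = np.array(["hello", "world!"], dtype="U")   # becomes U6
auto_s = np.array(["hello"], dtype="S")             # ASCII-encoded, becomes S5

# An explicit width truncates longer inputs.
cut = np.array(["hello"], dtype="U3")

# zeros gives empty strings; full repeats the fill value.
blank = np.zeros((3,), dtype="U10")
filled = np.full((3,), "x", dtype="U10")

print(auto_u.dtype, auto_s.dtype, cut[0], blank.tolist(), filled.tolist())
```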
np.array(["hello", "world"], dtype="S") - auto-size to longest
np.array(["hello", "world"], dtype="U") - auto-size to longest
np.array(["hello", "world"], dtype="T") - variable length
np.array([b"hello"], dtype="S") - from Python bytes
np.array(["hello"], dtype="S") - encode string as ASCII
dtype="U3" results in "hel"
np.zeros((3,), dtype="U10") - empty strings
np.empty((3,), dtype="U10") - uninitialized (may contain garbage)
np.full((3,), "x", dtype="U10") - fill with string
Type Conversions and Casting
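A short sketch of several casts from this section, run against NumPy as the reference behavior. It assumes a NumPy installation, skips the StringDType and datetime64 cases, and uses illustrative variable names.

```python
import numpy as np

ints_as_text = np.array([1, 2]).astype("U")            # numbers to text
text_as_ints = np.array(["1", "2"]).astype(int)        # text parsed as integers
bools_as_text = np.array([True, False]).astype("U")

decoded = np.array([b"abc"], dtype="S3").astype("U")   # bytes decoded as ASCII
encoded = np.array(["abc"]).astype("S")                # text encoded as ASCII
truncated = np.array(["hello world"]).astype("U5")     # narrower width truncates

# Encoding non-ASCII text to S fails; UnicodeEncodeError subclasses ValueError.
try:
    np.array(["caf\u00e9"]).astype("S")
    non_ascii_failed = False
except ValueError:
    non_ascii_failed = True

print(ints_as_text, text_as_ints, bools_as_text, decoded, encoded, truncated)
```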
bytes_.astype("U") - decode as ASCII to Unicode
str_.astype("S") - encode as ASCII (fail on non-ASCII)
str_.astype("T") - full Unicode preserved
StringDType.astype("U10") - may truncate to width
StringDType.astype("S") - ASCII only
"U10".astype("U5") truncates
int/float.astype("U") results in ["1", "1.5"]
["1", "2"].astype(int) results in [1, 2]
bool.astype("U") results in ["True", "False"]
string.astype(bool) - empty = False, non-empty = True
datetime64.astype("T") results in ISO string
StringDType.astype("datetime64") - parse ISO, "NaT" becomes NaT
StringDType.astype("V5") - raw UTF-8 bytes to void
casting="safe" vs "unsafe"
NPTypeCode.Char conversion to/from string types
Comparison Operations
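The comparison operators in this section work element by element and order strings by code point, so uppercase letters sort before lowercase ones. A small sketch, assuming a NumPy installation and illustrative sample data:

```python
import numpy as np

a = np.array(["apple", "banana", "cherry"])
b = np.array(["apple", "blueberry", "Cherry"])

eq = a == b   # element-wise equality, returns a bool array
ne = a != b
lt = a < b    # lexicographic, code point by code point
ge = a >= b

print(eq, ne, lt, ge)
```

Here `"cherry" < "Cherry"` is False because `c` (U+0063) is greater than `C` (U+0043).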
== element-wise equality (returns bool array)
!= element-wise inequality
<, <=, >, >= lexicographic comparison
np.equal, np.not_equal, np.less, np.less_equal, np.greater, np.greater_equal
String Concatenation and Repetition
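A hedged sketch of the concatenation and repetition rules in this section. It assumes NumPy 2.x for `np.strings`; the `strings` alias that falls back to `np.char` on older releases is an addition made here for portability, not part of the proposal.

```python
import numpy as np

# Assumption: use np.strings when present (NumPy 2.x), else the legacy np.char.
strings = getattr(np, "strings", np.char)

left = np.array(["a", "b"])
right = np.array(["x", "y"])

joined = strings.add(left, right)        # element-wise concatenation
tripled = strings.multiply(left, 3)      # repeat each element three times
emptied = strings.multiply(left, 0)      # zero repeats give empty strings
negative = strings.multiply(left, -1)    # negative counts are treated as zero

print(joined, tripled, emptied, negative)
```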
np.strings.add(a, b) - concatenate strings
np.strings.multiply(a, n) - repeat string n times
multiply(s, 0) or multiply(s, -1) results in empty string
multiply(s, sys.maxsize) results in OverflowError
["a", "b"] + "x" results in ["ax", "bx"]
String Length
np.strings.str_len(arr) - returns int64 array
String Searching
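The search functions in this section follow Python's `str` semantics applied per element. The sketch below assumes NumPy 2.x; the `strings` fallback to `np.char` is an assumption added for older releases, and the sample words are illustrative.

```python
import numpy as np

# Assumption: use np.strings when present (NumPy 2.x), else the legacy np.char.
strings = getattr(np, "strings", np.char)

words = np.array(["banana", "apple"])

first = strings.find(words, "an")        # lowest index, -1 when absent
last = strings.rfind(words, "an")        # highest index, -1 when absent
hits = strings.count(words, "an")        # non-overlapping occurrences
starts = strings.startswith(words, "ba")
ends = strings.endswith(words, "e")

print(first, last, hits, starts, ends)
```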
np.strings.find(a, sub, start=0, end=None) - first occurrence, -1 if not found
np.strings.rfind(a, sub, start=0, end=None) - last occurrence
np.strings.index(a, sub, ...) - like find but raises ValueError
np.strings.rindex(a, sub, ...) - like rfind but raises ValueError
np.strings.count(a, sub, start=0, end=None) - count occurrences
np.strings.startswith(a, prefix, start=0, end=None) - bool array
np.strings.endswith(a, suffix, start=0, end=None) - bool array
Case Operations
np.strings.upper(arr) - uppercase
np.strings.lower(arr) - lowercase
np.strings.capitalize(arr) - first char upper, rest lower
np.strings.title(arr) - titlecase each word
np.strings.swapcase(arr) - swap case
Character Classification
np.strings.isalpha(arr) - all alphabetic
np.strings.isdigit(arr) - all digits
np.strings.isalnum(arr) - all alphanumeric
np.strings.isspace(arr) - all whitespace
np.strings.islower(arr) - all lowercase (cased chars)
np.strings.isupper(arr) - all uppercase (cased chars)
np.strings.istitle(arr) - titlecase format
np.strings.isnumeric(arr) - numeric (Unicode-aware, str_/StringDType only)
np.strings.isdecimal(arr) - decimal digits (Unicode-aware, str_/StringDType only)
Whitespace Operations
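A small sketch of the stripping and padding functions in this section. It assumes NumPy 2.x; the `strings` fallback to `np.char` is an assumption for older releases, and the inputs are chosen so that centering needs an even amount of padding.

```python
import numpy as np

# Assumption: use np.strings when present (NumPy 2.x), else the legacy np.char.
strings = getattr(np, "strings", np.char)

stripped = strings.strip(np.array(["  hi  ", "\tyo\n"]))    # whitespace by default
centered = strings.center(np.array(["ab"]), 6, "*")         # pad both sides
left = strings.ljust(np.array(["7"]), 3, ".")               # pad on the right
right = strings.rjust(np.array(["7"]), 3)                   # pad on the left with spaces
zero_filled = strings.zfill(np.array(["-42", "7"]), 5)      # sign stays in front

print(stripped, centered, left, right, zero_filled)
```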
np.strings.strip(arr, chars=None) - strip both ends
np.strings.lstrip(arr, chars=None) - strip left
np.strings.rstrip(arr, chars=None) - strip right
np.strings.center(arr, width, fillchar=" ") - center-align
np.strings.ljust(arr, width, fillchar=" ") - left-align
np.strings.rjust(arr, width, fillchar=" ") - right-align
np.strings.zfill(arr, width) - zero-fill numeric strings
np.strings.expandtabs(arr, tabsize=8) - expand tabs
Splitting and Partitioning
np.strings.partition(arr, sep) - returns tuple of 3 arrays (before, sep, after)
np.strings.rpartition(arr, sep) - partition from right
np.char.split(arr, sep, maxsplit) - returns object array of lists
np.char.rsplit(arr, sep, maxsplit) - split from right
np.char.splitlines(arr, keepends=False) - split on line boundaries
np.char.join(sep, seq) - join sequences
Replacement and Slicing
np.strings.replace(arr, old, new, count=-1) - replace occurrences
np.strings.slice(arr, start=0, stop=None, step=1) - substring extraction
np.strings.slice(arr, None, None, -1) - reverse string
Encoding/Decoding
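The encode and decode functions in this section convert between str_ and bytes_ arrays. The sketch below assumes NumPy 2.x, with a fallback to `np.char` added here as an assumption for older releases; the sample text is illustrative.

```python
import numpy as np

# Assumption: use np.strings when present (NumPy 2.x), else the legacy np.char.
strings = getattr(np, "strings", np.char)

text = np.array(["caf\u00e9", "na\u00efve"])

utf8 = strings.encode(text, "utf-8")               # str_ to bytes_
round_trip = strings.decode(utf8, "utf-8")         # bytes_ back to str_
lossy = strings.encode(text, "ascii", "replace")   # unencodable chars become "?"

print(utf8, round_trip, lossy)
```

The encoded array needs a wider S dtype than the character count suggests, because each accented character takes two bytes in UTF-8.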
np.strings.encode(arr, encoding="utf-8", errors="strict") - str to bytes
np.strings.decode(arr, encoding="utf-8", errors="strict") - bytes to str
"strict", "ignore", "replace"
Formatting and Translation
np.strings.mod(format_arr, values) - printf-style formatting
np.strings.translate(arr, table, deletechars=None) - character translation
Array Functions with String Support
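Several of the general array functions in this section can be exercised on string data as follows. This is a sketch of NumPy's reference behavior, assuming a NumPy installation; the sample arrays are illustrative.

```python
import numpy as np

fruits = np.array(["pear", "apple", "fig"])

ordered = np.sort(fruits)                  # lexicographic order
order = np.argsort(fruits)                 # indices that would sort the array
position = np.searchsorted(ordered, "fig") # binary search in the sorted array
distinct = np.unique(np.array(["b", "a", "b"]))

# Concatenation promotes to the widest fixed width among the inputs.
combined = np.concatenate([np.array(["ab"]), np.array(["hello"])])

picked = np.where(np.array([True, False, True]), fruits, "none")

print(ordered, order, position, distinct, combined.dtype, picked)
```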
np.sort(arr) - lexicographic sort
np.argsort(arr) - sort indices
np.argmax(arr) - index of lexicographic max
np.argmin(arr) - index of lexicographic min
np.searchsorted(arr, value) - binary search
np.unique(arr) - unique strings
np.concatenate([arr1, arr2]) - join arrays
np.stack, np.vstack, np.hstack, np.dstack - stacking
np.where(cond, x, y) - conditional selection
np.take(arr, indices) - index selection
np.nonzero(arr) - indices of non-empty strings
np.any(arr) - True if any non-empty
np.all(arr) - True if all non-empty
np.copy(arr) - copy array
np.resize(arr, new_shape) - resize with repetition
np.maximum(a, b), np.minimum(a, b) - lexicographic element-wise
Structured Arrays
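A minimal sketch of a structured array with a string field, using the field list shown in this section. It assumes a NumPy installation; the record values are made up for illustration.

```python
import numpy as np

# A U20 field occupies 80 bytes and the i4 field 4 bytes, so each record is 84 bytes.
people = np.array(
    [("Ada", 1), ("Bob", 2)],
    dtype=[("name", "U20"), ("id", "i4")],
)

names = people["name"]        # field access by key
rec = np.rec.array(people)    # record array adds attribute access

print(people.dtype.itemsize, names.tolist(), rec.name.tolist(), rec.id.tolist())
```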
[("name", "U20"), ("id", "i4")]
[...] np.rec.array) with attribute access
File I/O
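The .npy round trip and the raw binary path from this section can be sketched in memory. The example assumes a NumPy installation and uses `io.BytesIO` in place of a file on disk, which is a choice made here to keep it self-contained.

```python
import io

import numpy as np

names = np.array(["alpha", "beta"], dtype="U10")

# .npy round trip: the header records the dtype string, so no pickling is needed.
buffer = io.BytesIO()
np.save(buffer, names)
buffer.seek(0)
loaded = np.load(buffer)

# Raw binary I/O only works for fixed-width dtypes, where every element has the same size.
raw = np.array([b"abc", b"de"], dtype="S10").tobytes()
restored = np.frombuffer(raw, dtype="S10")

print(loaded.dtype, loaded.tolist(), len(raw), restored.tolist())
```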
np.save() / np.load() for bytes_, str_, StringDType (.npy format)
np.savez() / np.load() for string arrays (.npz archives)
np.loadtxt(dtype="U") - load text as strings
np.savetxt(fmt="%s") - save strings to text
np.genfromtxt(dtype="U") - flexible text loading
arr.tofile() / np.fromfile(dtype="S10") - binary I/O (fixed-width only)
np.memmap(dtype="U100")
Boolean Conversion and Truthiness
arr.astype(bool) - empty = False, non-empty = True
np.nonzero(arr) returns indices of non-empty strings
[...] np.isnan)
Edge Cases
np.array([""], dtype="S") results in dtype="|S1"
np.dtype("S0").itemsize results in 0
"e\u0301" is 2 code points (displays as 1)
NPY_MAX_STRING_SIZE
[...] multiply and add
Legacy Support
np.char.chararray - deprecated subclass (minimal support if needed)
np.char.* functions as aliases to np.strings.*
Type Checking
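The scalar hierarchy checks in this section can be verified against NumPy as follows. This is a sketch assuming a NumPy installation; the dtype widths are arbitrary.

```python
import numpy as np

s5, u5, v4 = np.dtype("S5"), np.dtype("U5"), np.dtype("V4")

checks = {
    "S is bytes_": np.issubdtype(s5, np.bytes_),
    "U is str_": np.issubdtype(u5, np.str_),
    "S is character": np.issubdtype(s5, np.character),
    "U is character": np.issubdtype(u5, np.character),
    "V is flexible": np.issubdtype(v4, np.flexible),
    "V is character": np.issubdtype(v4, np.character),  # void is flexible but not character
}

for label, result in checks.items():
    print(label, result)
```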
np.issubdtype(arr.dtype, np.bytes_) - True for S
np.issubdtype(arr.dtype, np.str_) - True for U
np.issubdtype(arr.dtype, np.character) - True for S or U
np.issubdtype(arr.dtype, np.flexible) - True for S, U, V
Evidence
docs/plans/NUMPY_STRING_TYPES.md
src/numpy/ (v2.4.2)
Scope / Non-goals
In scope: Core string dtypes, string operations API, file I/O, structured arrays, type conversions
Out of scope:
[...] Regex)