Description
Three edge-case crashes in pythainlp.tokenize public functions:
- word_detokenize(["สวัสดี", "", "ครับ"]) → IndexError, because it accesses w[0] on an empty string token (line 71)
- word_detokenize([]) → IndexError, because it accesses segments[0] on an empty list (line 55)
- sent_tokenize(["สวัสดี", 123]) → uncaught TypeError, because the code catches ValueError but str.join() raises TypeError (line 491)
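The third crash comes down to which exception str.join() raises for a non-string item. The snippet below is a minimal, standalone illustration that does not use pythainlp; the helper name join_or_none is hypothetical and exists only to compare the two except clauses.

```python
def join_or_none(parts, handled):
    """Join parts into one string; return None if an exception of type `handled` occurs."""
    try:
        return "".join(parts)
    except handled:
        return None


tokens = ["สวัสดี", 123]  # the second item is an int, not a str

# Catching TypeError handles the bad input.
caught = join_or_none(tokens, TypeError)

# Catching ValueError does not: the TypeError from str.join() escapes.
try:
    join_or_none(tokens, ValueError)
    escaped = False
except TypeError:
    escaped = True
```

Here caught is None and escaped is True, which matches the behavior reported for line 491.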
Expected results
word_detokenize(["สวัสดี", "", "ครับ"]) # → "สวัสดีครับ"
word_detokenize([]) # → ""
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline") # → []
Current results
word_detokenize(["สวัสดี", "", "ครับ"]) # IndexError: string index out of range
word_detokenize([]) # IndexError: list index out of range
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline") # TypeError (uncaught)
Steps to reproduce
from pythainlp.tokenize import word_detokenize, sent_tokenize
word_detokenize(["สวัสดี", "", "ครับ"]) # crash
word_detokenize([]) # crash
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline") # crash
PyThaiNLP version
5.3.3
Python version
3.13
Operating system and version
macOS
Possible solution
- Line 55: Add an if not segments: return "" guard
- Line 71: Add if not w: continue before w[0]
- Line 491: Change except ValueError to except TypeError (restores original from commit 2a95070, broken by 5bbf410)
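The two word_detokenize guards can be sketched on a simplified stand-in. This is not the code in pythainlp/tokenize/core.py: detokenize_sketch and is_thai_char are hypothetical names, and the spacing rule (a space only between adjacent non-Thai tokens) is an assumption made to give w[0] something to do.

```python
def is_thai_char(ch):
    """Return True if ch falls in the Thai Unicode block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def detokenize_sketch(segments):
    """Hypothetical, simplified detokenizer showing where the two guards go."""
    if not segments:  # guard for the empty-list case (proposed for line 55)
        return ""
    out = []
    for w in segments:
        if not w:  # guard for empty-string tokens (proposed for line 71)
            continue
        # Assumed rule: insert a space only between two adjacent non-Thai tokens.
        if out and not is_thai_char(w[0]) and not is_thai_char(out[-1][-1]):
            out.append(" ")
        out.append(w)
    return "".join(out)


print(detokenize_sketch(["สวัสดี", "", "ครับ"]))  # สวัสดีครับ
print(repr(detokenize_sketch([])))  # ''
```

With both guards in place, neither w[0] nor the first-element access is reached on empty input, so the two IndexError cases return the expected strings.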
Files
pythainlp/tokenize/core.py (lines 55, 71, 491)