cmd/cue: lexer fails on _ followed by | without whitespace #4289

@4ad

Description

The lexer rejects _| when followed by any character other than _. It appears the lexer greedily enters the _|_ (bottom) path upon seeing _| without first checking that the third character is _. This means _ cannot be used as the left operand of | (disjunction) or || (logical OR) without inserting whitespace.

Affected: _|x for any x that is not _, including _||, _|1, _|b, _| 1.

Not affected: _ || x (space separates tokens), __|| (not bare _), _|_|| x (complete bottom before ||).
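To illustrate the expected behaviour, here is a minimal sketch in Go of a scanner step that commits to bottom only after confirming the full three-character sequence. It is not the actual CUE scanner code: `firstToken` and `isIdentPart` are hypothetical names, and identifiers are simplified to ASCII letters, digits, and underscores.

```go
package main

import "fmt"

// isIdentPart reports whether c may continue an identifier.
// Assumption: ASCII only, which is simpler than the real CUE rules.
func isIdentPart(c byte) bool {
	return c == '_' ||
		(c >= 'a' && c <= 'z') ||
		(c >= 'A' && c <= 'Z') ||
		(c >= '0' && c <= '9')
}

// firstToken is a hypothetical helper. For input that starts with '_',
// it returns the text of the first token. It returns "" for any other input.
func firstToken(src string) string {
	if len(src) == 0 || src[0] != '_' {
		return ""
	}
	// Commit to bottom only when both following characters match.
	if len(src) >= 3 && src[1] == '|' && src[2] == '_' {
		return "_|_"
	}
	// Otherwise scan an identifier and leave any '|' for the
	// operator scanner to handle as '|' or '||'.
	i := 1
	for i < len(src) && isIdentPart(src[i]) {
		i++
	}
	return src[:i]
}

func main() {
	for _, src := range []string{"_|| true", "_| 1", "_|_ || true", "__|| x"} {
		fmt.Printf("%q -> %q\n", src, firstToken(src))
	}
}
```

With this two-character lookahead, `_|| true` and `_| 1` yield the token `_`, while `_|_ || true` still yields `_|_`.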

Reproducer (cmd/testscript):

testscript <<'EOD'
exec cue fmt ok_space.cue
exec cue fmt ok_bottom.cue
exec cue fmt ok_ident.cue
exec cue fmt fail_or.cue
exec cue fmt fail_disj.cue

-- ok_space.cue --
a: _ || true
-- ok_bottom.cue --
a: _|_ || true
-- ok_ident.cue --
a: b_|| true
-- fail_or.cue --
a: _|| true
-- fail_disj.cue --
a: _| 1
EOD

ok_space.cue, ok_bottom.cue, and ok_ident.cue pass. fail_or.cue and fail_disj.cue fail with:

illegal token '_|'; expected '_'

cue version v0.15.4.

Of course, all of these programs are expected to fail evaluation, but the failure observed here occurs much earlier, during tokenization.

This bug was found by fuzzing with a BNFGen generative grammar. 87 out of 1,000,000 test programs (0.0087%) hit the bug. The mislexing can cause CUE programs to fail differently than in the example above, but the failures share the same root cause:

testscript <<'EOD'
exec cue fmt fail.cue

-- fail.cue --
""

: ( (
  _||	0XAF_eF)),
"""

""":	1
EOD

This fails with an additional spurious error:

illegal token '_|'; expected '_':
    ./fail.cue:4:3
expected ')', found 'EOF':
    ./fail.cue:7:7
