Description
An unnest query from an mdtest has been observed to trigger intermittent panics in CI, such as in this job.
panic: context canceled
goroutine 56 [running]:
github.com/brimdata/super/runtime/sam/op/subquery.(*Subquery).Eval(0xc0005a0de0, {{0x242f200?, 0xc00059eb70?}, 0xc000a8a003?, 0x1ef7800?})
/home/runner/work/super/super/runtime/sam/op/subquery/subquery.go:121 +0x7e5
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).unnest(0xc0000bac80, {{0x242f200?, 0xc00059eb70?}, 0xc000a8a003?, 0x0?})
/home/runner/work/super/super/runtime/sam/op/unnest/unnest.go:57 +0x3d
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).Pull(0xc0000bac80, 0x0?)
/home/runner/work/super/super/runtime/sam/op/unnest/unnest.go:46 +0x159
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc0000bacd0)
/home/runner/work/super/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2
created by github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).Pull.func1 in goroutine 55
/home/runner/work/super/super/runtime/sam/op/aggregate/aggregate.go:163 +0x66
Details
Repro is with super commit 66dc4e5.
When running the query outside of CI, I can't seem to repro it on my MacBook. However, I can reliably repro it on an ubuntu-22.04 hosted GitHub Actions runner. To get a faster repro rate, I start four concurrent instances of:
cat /dev/urandom > /dev/null &
Tying up all the cores in this way seems to aggravate the likely timing issue that's causing the panic. I need to be in the book/src/tutorials directory of my checkout of the super repo to access the input data. Then I start to see panics within a minute:
$ super -version
Version: v0.1.0-11-g66dc4e5d8
$ while true; do super -S -c '
unnest {user:user.login,reviewer:requested_reviewers} into (
reviewers:=union(reviewer.login) by user
)
| groups:=union(reviewers) by user
| sort user,len(groups)
' prs.bsup; done > /dev/null
panic: context canceled
goroutine 45 [running]:
github.com/brimdata/super/runtime/sam/op/subquery.(*Subquery).Eval(0xc00044b260, {{0x2430220?, 0xc0003a5860?}, 0xc000892408?, 0x2430220?})
/home/runner/super/runtime/sam/op/subquery/subquery.go:121 +0x7e5
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).unnest(0xc000540c80, {{0x2430220?, 0xc0003a5860?}, 0xc000892408?, 0xc00041e110?})
/home/runner/super/runtime/sam/op/unnest/unnest.go:57 +0x3d
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).Pull(0xc000540c80, 0x0?)
/home/runner/super/runtime/sam/op/unnest/unnest.go:46 +0x159
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc000540cd0)
/home/runner/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2
created by github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).Pull.func1 in goroutine 44
/home/runner/super/runtime/sam/op/aggregate/aggregate.go:163 +0x66
panic: context canceled
goroutine 52 [running]:
github.com/brimdata/super/runtime/sam/op/subquery.(*Subquery).Eval(0xc000391560, {{0x2430220?, 0xc00054a120?}, 0xc0009074c7?, 0x0?})
/home/runner/super/runtime/sam/op/subquery/subquery.go:121 +0x7e5
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).unnest(0xc000574d20, {{0x2430220?, 0xc00054a120?}, 0xc0009074c7?, 0xc000054070?})
/home/runner/super/runtime/sam/op/unnest/unnest.go:57 +0x3d
github.com/brimdata/super/runtime/sam/op/unnest.(*Unnest).Pull(0xc000574d20, 0x0?)
/home/runner/super/runtime/sam/op/unnest/unnest.go:46 +0x159
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc000574d70)
/home/runner/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2
created by github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).Pull.func1 in goroutine 51
/home/runner/super/runtime/sam/op/aggregate/aggregate.go:163 +0x66
...
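For convenience, the load-generation step can be scripted. This sketch covers only the CPU-loading part (the super loop above would run alongside it) and assumes a POSIX shell; it is a helper I'm suggesting, not part of the original repro:

```shell
#!/bin/sh
# Tie up four cores with busy readers, as described above, then clean up.
pids=""
for i in 1 2 3 4; do
  cat /dev/urandom > /dev/null &
  pids="$pids $!"
done
set -- $pids
echo "started $# load generators"
kill $pids
wait 2>/dev/null
echo "done"
```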
@mattnibs has looked at this one and said it started happening when we changed the compiler such that unnest ... into compiles into a nested unnest (unnest [unnest]):
$ super compile -dag -C 'unnest [1,2,3] into ( sum(this) )'
null
| unnest (
unnest [1,2,3]
| aggregate
sum:=sum(this)
| values sum
| aggregate
collect:=collect(this)
| values collect)
| output main
Regarding the cause of the panic, he observed:
The problem is we don't have a way of handling the error in the inner flowgraph of the subquery. Subquery panics when it encounters an error:
b, err := s.body.Pull(false)
if err != nil {
	panic(err)
}
Cached subquery returns the error as a value:
batch, err := c.body.Pull(false)
if err != nil {
	return c.rctx.Sctx.NewError(err)
}