Conversation
If you have a stacktrace for the swish failure, I can look into it.
It's very odd; Buildkite also fails, but I'm not sure whether that run is multi-threaded. For swish I got wrong answers, but only in the cases with nonzero bias.
On ...
```
Test Summary:                | Pass  Fail  Total   Time
gradient with elu            |    7            7   6.0s
gradient with gelu           |    7            7   6.1s
gradient with swish          |    5     2      7   8.3s
gradient with hardswish      |    7            7   6.3s
gradient with selu           |    7            7   6.8s
gradient with celu           |    7            7   5.6s
gradient with softplus       |    7            7   6.2s
gradient with softsign       |    7            7   6.3s
gradient with logσ           |    7            7   5.7s
gradient with logcosh        |    7            7   6.9s
gradient with mish           |    7            7   6.3s
gradient with tanhshrink     |    5     2      7   6.6s
gradient with softshrink     |    7            7   6.4s
gradient with trelu          |    7            7   6.6s
...
gradient for fast_broadcast! |    4     1      5  45.9s

ERROR: Some tests did not pass: 206 passed, 4 failed, 0 errored, 1 broken.
```
```julia
julia> x = randn(3,4)
3×4 Matrix{Float64}:
 -1.22369   0.0921121  -0.941871  -1.19349
  1.31897   1.07247    -0.981244  -0.363552
 -0.707036  0.328161    0.252119   0.0805549

julia> b = randn(3)
3-element Vector{Float64}:
 -0.8524132980503979
  1.4314247570126006
 -0.2652038000170137

julia> fun = swish
swish (generic function with 2 methods)

julia> gx = ForwardDiff.gradient(x -> sum(bias_act!(fun, copy(x), b)), x)
3×4 Matrix{Float64}:
 -0.094139   0.153529  -0.076764  -0.0929148
  1.09521    1.09937    0.717713   0.947483
  0.0808417  0.531458   0.493458   0.408198

julia> Zygote.gradient(x -> sum(bias_act!(fun, copy(x), b)), x)[1]
σ = NNlib.swish
b = [-0.8524132980503979, 1.4314247570126006, -0.2652038000170137]
3×4 Matrix{Float64}:
 -0.893114   -0.229442  -0.814908  -0.886241
  1.08842     1.09301    1.06381    1.07222
 -0.0874094   0.479686   0.434657   0.332037

julia> gb = ForwardDiff.gradient(b -> sum(bias_act!(fun, copy(x), b)), b)
3-element Vector{Float64}:
 -0.11028850033542084
  3.8597769560080724
  1.5139547892071894

julia> Zygote.gradient(b -> sum(bias_act!(fun, copy(x), b)), b)[1]
σ = NNlib.swish
b = [-0.8524132980503979, 1.4314247570126006, -0.2652038000170137]
3-element Vector{Float64}:
 -2.8237039339595764
  4.317457358561092
  1.158969872333099
```
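As a sanity check, the ForwardDiff numbers match the analytic swish derivative evaluated at the biased input `x .+ b`, so it is the Zygote result that is wrong here. A hedged sketch (the helper `swish′` below is hand-written from swish(z) = z·σ(z), not part of the PR):

```julia
using NNlib: sigmoid_fast

# Hand-written derivative of swish(z) = z * σ(z):
# swish′(z) = σ(z) * (1 + z * (1 - σ(z)))
swish′(z) = sigmoid_fast(z) * (1 + z * (1 - sigmoid_fast(z)))

swish′.(x .+ b) ≈ gx  # true: agrees with ForwardDiff, not with Zygote
```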
```julia
# Probe the gradient at x and at x ± eps(); if they disagree, the activation
# has a kink here and the exact gradient comparisons are skipped.
gx2 = ForwardDiff.gradient(x -> sum(bias_act!(fun, copy(x), false)), x)
gx2plus = ForwardDiff.gradient(x -> sum(bias_act!(fun, copy(x), false)), x .+ eps())
gx2minus = ForwardDiff.gradient(x -> sum(bias_act!(fun, copy(x), false)), x .- eps())
if !(gx2 ≈ gx2plus ≈ gx2minus)
    @warn "skipping gradient tests due to discontinuity" fun x
    continue
```
This slightly elaborate thing is avoiding my best guess as to why there were failures on CI: hardσ has points where its derivative is discontinuous, and if x lands on one of them, the two gradients may not agree.
But it doesn't seem to work:
```
gradient with hardσ: Test Failed at /home/runner/work/NNlib.jl/NNlib.jl/test/bias_act.jl:73
  Expression: gb ≈ (Zygote.gradient((b->(sum(bias_act!(fun, copy(x), b));)), b))[1]
   Evaluated: [0.5, 0.6666666666666666, 0.6666666666666666] ≈ [1.5000000000000002, 0.6666666666666666, 0.6666666666666666]
```
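One plausible reason the guard misses this (an assumption on my part, not verified against the CI run): hardσ, i.e. `max(0, min(1, (x + 3) / 6))` in the current definition, is piecewise linear with kinks at x = ±3, and the probe's step of `eps()` is one ulp at 1.0, which is below the float spacing near ±3, so the shifted copies can round back to the same values and the three gradients agree trivially. Note also that the guard probes the no-bias case, while the failing comparison is the gradient with respect to `b`.

```julia
julia> eps() == eps(1.0)   # the probe's step is one ulp at 1.0
true

julia> 3.0 - eps() == 3.0  # at the hardσ kink the step rounds away entirely
true
```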
The above benchmark, on the same computer, gives much slower times, and a much larger speedup.

Details

```julia
julia> w, b = rand(Float32, 100, 10000), rand(Float32, 100);
julia> @btime bias_act!(relu, $w, $b);
min 141.250 μs, mean 145.076 μs (0 allocations)
julia> @btime relu.($w .+ $b);
min 107.667 μs, mean 443.560 μs (2 allocations, 3.81 MiB)
julia> @btime bias_act!(tanh, $w, $b);
min 418.125 μs, mean 425.345 μs (0 allocations)
julia> @btime tanh_fast.($w .+ $b);
min 404.042 μs, mean 772.522 μs (2 allocations, 3.81 MiB)
julia> using Zygote
julia> @btime gradient((w,b) -> sum(bias_act!(relu, w, b)), $w, $b);
min 424.875 μs, mean 818.428 μs (28 allocations, 3.82 MiB)
julia> @btime gradient((w,b) -> sum(relu.(w .+ b)), $w, $b);
min 969.541 μs, mean 1.591 ms (32 allocations, 11.45 MiB)
julia> @btime gradient((w,b) -> sum(bias_act!(tanh, w, b)), $w, $b);
min 700.292 μs, mean 1.037 ms (28 allocations, 3.82 MiB)
julia> @btime gradient((w,b) -> sum(tanh_fast.(w .+ b)), $w, $b);
min 1.217 ms, mean 1.898 ms (32 allocations, 11.45 MiB)
julia> versioninfo() # results look similar on 1.10 + 1.11
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1
```
Some Buildkite jobs are not happy either. Can we constrain the inputs for the …
This was part of #346, but the conv part got complicated.
Intended as a better alternative to part of FluxML/Flux.jl#2137: using this in layers will remove all `identity.(x .+ false)` broadcasts, with less repetition of the idea.

Dismayed how long the `rrule` code is here. I couldn't see what's wrong with the second case (it fails on `swish`), so I commented it out for now. There's room to improve this once JuliaDiff/ChainRulesCore.jl#592 works.
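For context, a sketch of the kind of call site this enables (hypothetical and simplified; the real Flux `Dense` forward pass differs in details):

```julia
using NNlib

# Hypothetical simplified Dense-style layer. `bias_act!(σ, x, b)` overwrites x
# with σ.(x .+ b); passing `bias = false` means "no bias" and avoids the old
# `identity.(x .+ false)` broadcast without a special case in the layer code.
struct SimpleDense{M,B,F}
    weight::M
    bias::B    # a vector, or `false` for no bias
    σ::F
end

# `d.weight * x` is a fresh array, so overwriting it in place is safe.
(d::SimpleDense)(x::AbstractVecOrMat) = NNlib.bias_act!(d.σ, d.weight * x, d.bias)
```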
Benchmarks

Some min times are slower, but mean times show the effect of saving allocations.
Details
Flux:
So a 50% saving on the forward pass, as you'd expect.
If I'm thinking right, then JuliaDiff/ChainRulesCore.jl#592 should get the gradient down to 4.35 MB, saving about 1/3.
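For scale (my own accounting, not from the PR): one `100 × 10000` `Float32` array is 100 · 10000 · 4 bytes ≈ 3.81 MiB, which matches the fused forward pass allocating nothing extra, the broadcast forward pass allocating one such array, and the gradients above differing by roughly two of them (3.82 MiB vs 11.45 MiB):

```julia
julia> Base.format_bytes(100 * 10_000 * sizeof(Float32))
"3.815 MiB"
```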