fix: improve error handling and retry logic in pull operations#795
fix: improve error handling and retry logic in pull operations#795ilopezluna wants to merge 6 commits intomainfrom
Conversation
ilopezluna
commented
Mar 26, 2026
- No more unnecessary retries for deterministic errors. Only 502/503/504 are retried. 500 and 4xx fail immediately.
- Unsupported model versions return HTTP 422 instead of 500. Correctly classified as a client-side issue.
- Actual error shown first, not buried. Format changed from "failed to download after N retries: " to " (failed after N retries)".
- HTML error pages from proxies/CDNs are surfaced.
- Actionable "What's next" guidance for upgrade-needed errors.
- Dynamic unsupported media type message showing actual vs supported types.
There was a problem hiding this comment.
Code Review
This pull request refactors error handling and retry logic for model pulling operations. It introduces more specific retry conditions, focusing on transient gateway/proxy errors (502, 503, 504) rather than general server errors. Error messages are improved, particularly for unsupported model configuration types and when encountering unparseable progress stream data, with new tests covering these scenarios. Feedback includes addressing an ignored error from io.ReadAll(resp.Body) that could lead to incomplete error messages, and refactoring duplicated logic for collecting and surfacing unparseable lines in progress.go into helper functions for better maintainability.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request significantly enhances error handling and user feedback across model pull and push operations. Key changes include refining the ErrUnsupportedMediaType to provide more informative messages, including upgrade suggestions, and handling it with a 422 status. The Pull operation's retry mechanism has been made more robust by limiting retries to specific transient HTTP errors (502, 503, 504), a change supported by new unit tests. Additionally, the progress stream handling now gracefully manages non-JSON content, such as proxy error pages, preventing silent failures. A notable improvement opportunity is to align the Push operation's retry logic with the more specific strategy adopted for Pull to ensure consistent and targeted error recovery.
There was a problem hiding this comment.
Hey - I've found 5 issues, and left some high level feedback:
- In DisplayProgress, the new early-return path after unexpectedProgressDataError skips display.Wait(), which can leave the progress display goroutine running; consider ensuring Wait is always called (e.g., via defer) before any return after the display is started.
- The new upgrade guidance in handleClientError relies on substring matching "try upgrading" in the error message, which is brittle; it would be more robust to key off a sentinel error such as distribution.ErrUnsupportedMediaType (e.g., via errors.Is) or a structured error type.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In DisplayProgress, the new early-return path after unexpectedProgressDataError skips display.Wait(), which can leave the progress display goroutine running; consider ensuring Wait is always called (e.g., via defer) before any return after the display is started.
- The new upgrade guidance in handleClientError relies on substring matching "try upgrading" in the error message, which is brittle; it would be more robust to key off a sentinel error such as distribution.ErrUnsupportedMediaType (e.g., via errors.Is) or a structured error type.
## Individual Comments
### Comment 1
<location path="cmd/cli/commands/utils.go" line_range="56-62" />
<code_context>
var buf bytes.Buffer
printNextSteps(&buf, []string{enableVLLM})
return fmt.Errorf("%w\n%s", err, strings.TrimRight(buf.String(), "\n"))
+ } else if strings.Contains(err.Error(), "try upgrading") {
+ // The model uses a newer config format than this client supports.
+ var buf bytes.Buffer
+ printNextSteps(&buf, []string{
+ "Upgrade Docker Desktop to the latest version to support this model",
+ })
+ return fmt.Errorf("%s: %w\n%s", message, err, strings.TrimRight(buf.String(), "\n"))
}
return fmt.Errorf("%s: %w", message, err)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Use a sentinel error or `errors.Is` instead of substring matching on the error message
This `strings.Contains(err.Error(), "try upgrading")` check is brittle because it depends on a specific English message. Since this path comes from `ErrUnsupportedMediaType` being wrapped in `checkCompat`, prefer checking the underlying cause via a sentinel like `distribution.ErrUnsupportedMediaType` with `errors.Is`, or by introducing a dedicated typed error. That way behavior remains stable even if the error text changes.
Suggested implementation:
```golang
var buf bytes.Buffer
printNextSteps(&buf, []string{enableVLLM})
return fmt.Errorf("%w\n%s", err, strings.TrimRight(buf.String(), "\n"))
} else if errors.Is(err, distribution.ErrUnsupportedMediaType) {
// The model uses a newer config format than this client supports.
var buf bytes.Buffer
printNextSteps(&buf, []string{
"Upgrade Docker Desktop to the latest version to support this model",
})
return fmt.Errorf("%s: %w\n%s", message, err, strings.TrimRight(buf.String(), "\n"))
}
return fmt.Errorf("%s: %w", message, err)
}
```
1. Ensure the file imports the `errors` package:
- Add `errors` to the existing import block if it is not already present.
2. Import the correct `distribution` package that defines `ErrUnsupportedMediaType`. Based on your comment, this is likely:
- `github.com/distribution/distribution/v3` or the equivalent package already used in the project.
Example:
```go
import (
"errors"
"github.com/distribution/distribution/v3"
// other imports...
)
```
3. If your project uses a different path or alias for the distribution package (e.g. `github.com/docker/distribution` or an aliased import like `distribution "github.com/docker/distribution"`), adjust the import and the `distribution.ErrUnsupportedMediaType` reference accordingly to match existing conventions in this codebase.
</issue_to_address>
### Comment 2
<location path="cmd/cli/desktop/desktop.go" line_range="248" />
<code_context>
}
- return "", progressShown, fmt.Errorf("failed to %s after %d retries: %w", operationName, maxRetries, lastErr)
+ return "", progressShown, fmt.Errorf("%w (failed after %d retries)", lastErr, maxRetries)
}
</code_context>
<issue_to_address>
**suggestion:** Preserve operation context in the final retry error message
The new error message drops `operationName`, which was useful for scanning logs (e.g. "failed to pull after N retries"). Please include it in the wrapped message so the high-level operation is preserved, e.g. `fmt.Errorf("%s failed after %d retries: %w", operationName, maxRetries, lastErr)`.
```suggestion
return "", progressShown, fmt.Errorf("%s failed after %d retries: %w", operationName, maxRetries, lastErr)
```
</issue_to_address>
### Comment 3
<location path="cmd/cli/desktop/progress.go" line_range="293-302" />
<code_context>
+
+// appendNonJSONLine appends line (with a newline separator) to dst, enforcing
+// a hard cap of maxNonJSONBytes total to avoid large allocations.
+func appendNonJSONLine(dst []byte, line string) []byte {
+ if len(dst) >= maxNonJSONBytes {
+ return dst
+ }
+ if len(dst) > 0 {
+ dst = append(dst, '\n')
+ }
+ remaining := maxNonJSONBytes - len(dst)
+ if len(line) > remaining {
+ line = line[:remaining]
+ }
+ return append(dst, line...)
+}
+
</code_context>
<issue_to_address>
**suggestion:** Indicate when non-JSON progress data has been truncated at `maxNonJSONBytes`
Truncation via `maxNonJSONBytes` is fine for avoiding large allocations, but the error text from `unexpectedProgressDataError` will then silently show only a prefix of the original content. Please add a clear marker (e.g. `"\n...[truncated]"`) when the cap is hit so users can tell the response was cut off, especially when debugging long HTML error pages from proxies/CDNs.
Suggested implementation:
```golang
// maxNonJSONBytes is the maximum number of bytes collected from unparseable
// non-JSON lines in the progress stream before truncation.
const maxNonJSONBytes = 4096
const nonJSONTruncationMarker = "\n...[truncated]"
```
```golang
// appendNonJSONLine appends line (with a newline separator) to dst, enforcing
// a hard cap of maxNonJSONBytes total to avoid large allocations.
func appendNonJSONLine(dst []byte, line string) []byte {
// If we've already hit the cap or already marked truncation, don't append more.
if len(dst) >= maxNonJSONBytes {
return dst
}
if bytes.HasSuffix(dst, []byte(nonJSONTruncationMarker)) {
return dst
}
// Separate lines with a newline if we already have content.
if len(dst) > 0 {
dst = append(dst, '\n')
}
remaining := maxNonJSONBytes - len(dst)
// If appending the full line plus the truncation marker would exceed the cap,
// reserve space for the marker so we can clearly indicate truncation.
if len(line)+len(nonJSONTruncationMarker) > remaining {
// If there's not even enough room for the full marker, append as much
// of the marker as fits.
if remaining <= len(nonJSONTruncationMarker) {
dst = append(dst, nonJSONTruncationMarker[:remaining]...)
return dst
}
// Otherwise, append as much of the line as we can while still leaving
// room for the full marker, then append the marker.
lineBytesCanFit := remaining - len(nonJSONTruncationMarker)
dst = append(dst, line[:lineBytesCanFit]...)
dst = append(dst, nonJSONTruncationMarker...)
return dst
}
// We have room for the full line without the marker; just append it.
return append(dst, line...)
}
for scanner.Scan() {
```
1. The `appendNonJSONLine` implementation uses `bytes.HasSuffix`, so you need to add `bytes` to the import list at the top of `cmd/cli/desktop/progress.go`:
<<<<<<< SEARCH
import (
"encoding/json"
=======
import (
"bytes"
"encoding/json"
>>>>>>> REPLACE
(Adjust the surrounding imports to match the actual file content.)
2. If `maxNonJSONBytes` was previously a package-level constant, keep it that way and place `nonJSONTruncationMarker` alongside it at the same scope; adjust the added `const` block accordingly.
</issue_to_address>
### Comment 4
<location path="cmd/cli/desktop/desktop_test.go" line_range="107" />
<code_context>
+ assert.Contains(t, err.Error(), "try upgrading")
+}
+
+func TestPullRetryOn502Error(t *testing.T) {
+ ctrl := gomock.NewController(t)
+ defer ctrl.Finish()
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding coverage for 503 and 504 retry behaviour to fully exercise the new retry policy.
The retry logic currently allows retries only for 502, 503, and 504. Right now we only assert the 502 case. A small parametrized test (or a couple of focused tests) for the remaining statuses would better validate the policy and protect against regressions in the allowed-retry set.
Suggested implementation:
```golang
assert.Contains(t, err.Error(), "try upgrading")
}
func TestPullRetriesOnTransientGatewayErrors(t *testing.T) {
testCases := []struct {
name string
statusCode int
}{
{name: "502 Bad Gateway", statusCode: http.StatusBadGateway},
{name: "503 Service Unavailable", statusCode: http.StatusServiceUnavailable},
{name: "504 Gateway Timeout", statusCode: http.StatusGatewayTimeout},
}
for _, tc := range testCases {
tc := tc
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
defer ctrl.Finish()
mockClient := NewMockHTTPClient(ctrl)
mockContext := NewContextForMock(mockClient)
client := New(mockContext)
gomock.InOrder(
// First attempt: transient, retryable gateway error.
mockClient.EXPECT().Do(gomock.Any()).Return(&http.Response{
StatusCode: tc.statusCode,
Body: io.NopCloser(bytes.NewBufferString("transient gateway error")),
}, nil),
// Second attempt: non-retryable 500 to terminate the loop.
mockClient.EXPECT().Do(gomock.Any()).Return(&http.Response{
StatusCode: http.StatusInternalServerError,
Body: io.NopCloser(bytes.NewBufferString("terminal error after retry")),
}, nil),
)
printer := NewSimplePrinter(func(string) {})
_, _, err := client.Pull(modelName, printer)
assert.Error(t, err)
})
}
}
```
1. Ensure the mock type name `NewMockHTTPClient` matches the actual gomock-generated constructor in this package (e.g. it might be `NewMockDoer`, `NewMockHTTPDoer`, etc.). Update it accordingly in the new test.
2. If `t.Parallel()` is not desired in this test suite (or interferes with shared global state in other tests), you can remove the `t.Parallel()` line.
3. If `modelName` is not a package-level constant/variable already used by the other pull tests, replace `modelName` with whatever identifier the existing pull tests are using.
4. No new imports should be required beyond what is already in this test file (`testing`, `net/http`, `bytes`, `io`, `github.com/golang/mock/gomock`, `github.com/stretchr/testify/assert`), but if some are missing, add them to the import block.
</issue_to_address>
### Comment 5
<location path="cmd/cli/desktop/desktop_test.go" line_range="63" />
<code_context>
}
-func TestPullRetryOn5xxError(t *testing.T) {
+func TestPullNoRetryOn500Error(t *testing.T) {
ctrl := gomock.NewController(t)
defer ctrl.Finish()
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test for the case where reading the error response body fails, to pin down the new error-message formatting.
Since we now emit `failed to read response body: ...` when `io.ReadAll(resp.Body)` fails, please add a test that uses a body whose `Read` returns an error. Assert that the error string includes `failed to read response body` and that we still do not retry on a 500 response, so the new error-handling branch is covered and its behavior is verified.
Suggested implementation:
```golang
"fmt"
"io"
"net/http"
"strings"
"testing"
```
```golang
mockdesktop "github.com/docker/model-runner/cmd/cli/mocks"
assert.Contains(t, err.Error(), "Model not found")
}
// errorReadCloser is a ReadCloser whose Read always fails, used to exercise
// the error-handling path when reading the response body fails.
type errorReadCloser struct{}
func (e *errorReadCloser) Read(p []byte) (int, error) {
return 0, fmt.Errorf("boom: read error")
}
func (e *errorReadCloser) Close() error {
return nil
}
func TestPullNoRetryOn500Error(t *testing.T) {
```
```golang
// 500 is not retried (deterministic server error), so only 1 call.
mockClient.EXPECT().Do(gomock.Any()).Return(&http.Response{
StatusCode: http.StatusInternalServerError,
Body: io.NopCloser(bytes.NewBufferString("Internal server error")),
}, nil).Times(1)
// TODO: invoke the pull operation under test and assert the non-retry behavior.
// For example (adjust to actual API):
// err := client.Pull(/* args... */)
// assert.Error(t, err)
}
func TestPullNoRetryOn500ErrorBodyReadFailure(t *testing.T) {
ctrl := gomock.NewController(t)
defer ctrl.Finish()
// TODO: construct the same mockClient type as in TestPullNoRetryOn500Error.
// For example, if the existing test does:
// mockClient := mockdesktop.NewMockHTTPClient(ctrl)
// use the same here.
mockClient := mockdesktop.NewMockHTTPClient(ctrl)
mockContext := NewContextForMock(mockClient)
client := New(mockContext)
// 500 is not retried even when reading the body fails, so only 1 call.
mockClient.EXPECT().Do(gomock.Any()).Return(&http.Response{
StatusCode: http.StatusInternalServerError,
Body: &errorReadCloser{},
}, nil).Times(1)
// TODO: invoke the same pull operation as in TestPullNoRetryOn500Error.
// For example (adjust to actual API and arguments):
// err := client.Pull(/* args... */)
// assert.Error(t, err)
// assert.Contains(t, err.Error(), "failed to read response body")
```
To fully integrate these changes:
1. In `TestPullNoRetryOn500Error`, call the same function under test that the file’s other pull tests use (e.g. `client.Pull(...)`) and assert the expected error/non-retry behavior, mirroring the existing test pattern in this file.
2. In `TestPullNoRetryOn500ErrorBodyReadFailure`, construct `mockClient` exactly as in `TestPullNoRetryOn500Error` (or whichever helper is used in this file) instead of the placeholder `mockdesktop.NewMockHTTPClient(ctrl)` if a different constructor is actually used.
3. Replace the `TODO` call sites with the real pull invocation and assertions, ensuring:
- The error is checked (e.g. `assert.Error(t, err)`).
- `assert.Contains(t, err.Error(), "failed to read response body")` is used to pin down the error message text.
- No additional retry-specific assertions are needed beyond the `.Times(1)` expectation, which already verifies that a retry does not occur.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
3ae86a9 to
e79e402
Compare
# Conflicts: # pkg/distribution/distribution/client.go
e79e402 to
b897489
Compare