Skip to content

[FLINK-39366][connector-http] Add retry support with exponential backoff for HTTP Sink#33

Open
featzhang wants to merge 3 commits into
apache:mainfrom
featzhang:feature/add-sink-retry-support
Open

[FLINK-39366][connector-http] Add retry support with exponential backoff for HTTP Sink#33
featzhang wants to merge 3 commits into
apache:mainfrom
featzhang:feature/add-sink-retry-support

Conversation

@featzhang
Copy link
Copy Markdown
Member

What is the purpose of the change

This PR adds configurable retry support for HTTP Sink requests. When an HTTP Sink request fails due to a network error (IOException), the connector will automatically retry the request up to a configurable number of times using an exponential backoff strategy.

Brief change log

  • Add SINK_HTTP_RETRY_TIMES constant to HttpConnectorConfigConstants
  • Add RETRY_TIMES ConfigOption (default: 3) to HttpDynamicSinkConnectorOptions
  • Register RETRY_TIMES as an optional option in HttpDynamicTableSinkFactory
  • Pass retry configuration via Properties in HttpDynamicSink
  • Implement submitWithRetry() method in HttpSinkWriter with exponential backoff (1s, 2s, 4s, ...)
  • Add unit test testRetryOnError to verify retry behavior
  • Update documentation (English and Chinese) with new http.sink.retry.times option

Verifying this change

This change adds tests and can be verified as follows:

  • HttpSinkWriterTest#testRetryOnError: Verifies that the writer retries the configured number of times on IOException
  • HttpSinkWriterTest#testErrorMetric: Verifies that error metrics are recorded correctly when retries are exhausted

Does this pull request potentially affect one of the following areas

  • Dependencies (does it add or upgrade a dependency): No
  • The public API, i.e., is any changed class annotated with @Public(Evolving): No
  • Build infrastructure: No
  • HTTP Sink behavior: Yes — requests failing with IOException will now be retried (configurable, default 3 times)

Documentation

  • Does this pull request introduce a new feature? Yes
  • If yes, how is the feature documented? Documented in connector table options (.md files)

Usage Example

CREATE TABLE http_sink (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'http-sink',
  'url' = 'http://localhost:8080/api',
  'format' = 'json',
  'http.sink.retry.times' = '5'  -- retry up to 5 times, set to 0 to disable
);

@featzhang featzhang changed the title [FLINK-HTTP-4][connector-http] Add retry support for HTTP Sink [connector-http] Add retry support for HTTP Sink Mar 12, 2026
@davidradl
Copy link
Copy Markdown
Contributor

@featzhang I will look at this tomorrow

@featzhang featzhang changed the title [connector-http] Add retry support for HTTP Sink [FLINK-39366][connector-http] Add retry support with exponential backoff for HTTP Sink Mar 31, 2026
@featzhang featzhang force-pushed the feature/add-sink-retry-support branch from d0052d0 to 2b9ecf6 Compare March 31, 2026 09:41
Comment thread docs/content.zh/docs/connectors/table/http.md Outdated
@featzhang
Copy link
Copy Markdown
Member Author

You're absolutely right — the lookup source has a comprehensive retry configuration system with retry-strategy.type, retry-codes, success-codes, and separate config for fixed-delay vs exponential-delay strategies. The sink should have the same level of control for consistency.

The current PR provides a minimal viable retry implementation with just http.sink.retry.times and a simple exponential backoff (1s, 2s, 4s, ...). This is a significant improvement over no retry support at all, but it's clearly not feature-complete compared to the lookup source.

I see two paths forward:

Option 1: Extend this PR to match the lookup source pattern:

  • Add http.sink.retry-strategy.type (fixed-delay | exponential-delay)
  • Add http.sink.retry-codes (which HTTP status codes trigger retry)
  • Add http.sink.success-codes (define success conditions)
  • Add all the delay configuration options (initial-backoff, max-backoff, multiplier)
  • This would make the sink and lookup source retry configurations consistent

Option 2: Merge basic retry now, extend in follow-up:

  • Current PR provides immediate value (basic retry is better than none)
  • Follow-up PR adds full retry strategy configuration to match lookup source
  • Allows incremental delivery

Which approach would you prefer? I'm happy to extend this PR to include the full retry configuration system if that's the direction you'd like to go. It would take a bit more time to implement and test, but would give us feature parity with the lookup source in one shot.

@davidradl
Copy link
Copy Markdown
Contributor

You're absolutely right — the lookup source has a comprehensive retry configuration system with retry-strategy.type, retry-codes, success-codes, and separate config for fixed-delay vs exponential-delay strategies. The sink should have the same level of control for consistency.

The current PR provides a minimal viable retry implementation with just http.sink.retry.times and a simple exponential backoff (1s, 2s, 4s, ...). This is a significant improvement over no retry support at all, but it's clearly not feature-complete compared to the lookup source.

I see two paths forward:

Option 1: Extend this PR to match the lookup source pattern:

  • Add http.sink.retry-strategy.type (fixed-delay | exponential-delay)
  • Add http.sink.retry-codes (which HTTP status codes trigger retry)
  • Add http.sink.success-codes (define success conditions)
  • Add all the delay configuration options (initial-backoff, max-backoff, multiplier)
  • This would make the sink and lookup source retry configurations consistent

Option 2: Merge basic retry now, extend in follow-up:

  • Current PR provides immediate value (basic retry is better than none)
  • Follow-up PR adds full retry strategy configuration to match lookup source
  • Allows incremental delivery

Which approach would you prefer? I'm happy to extend this PR to include the full retry configuration system if that's the direction you'd like to go. It would take a bit more time to implement and test, but would give us feature parity with the lookup source in one shot.

Option 1 looks good - as it is then consistent with the lookup

featzhang added 3 commits May 8, 2026 15:18
- Add SINK_HTTP_RETRY_TIMES constant to HttpConnectorConfigConstants
- Add RETRY_TIMES ConfigOption (default: 3) to HttpDynamicSinkConnectorOptions
- Register RETRY_TIMES as optional option in HttpDynamicTableSinkFactory
- Pass retry times via Properties in HttpDynamicSink
- Implement exponential backoff retry logic in HttpSinkWriter
  - Only retries on IOException (network errors)
  - Uses exponential backoff with initial delay of 1 second
  - Set retry.times=0 to disable retries
- Update unit tests to cover retry behavior
- Update documentation (English and Chinese) with new option description
- Remove premature numRecordsSendErrorsCounter.inc() before retry attempts;
  only count errors after all retries are exhausted
- Add retry support for HTTP-level failures (getFailedRequests() non-empty),
  consistent with IOException retry behavior
- Extract scheduleRetry() helper to reduce code duplication
- Add test cases: testRetryOnHttpFailedRequests, testNoRetryWhenDisabled
…kup source

Extend the HTTP sink retry model so it matches the lookup source feature
set (Option 1 in PR review):

- Add http.sink.retry-strategy.type (fixed-delay | exponential-delay).
- Add http.sink.retry-codes and http.sink.success-codes so users can
  precisely define which HTTP status codes are transient vs fatal.
- Add fixed-delay / exponential-delay tuning keys (initial-backoff,
  max-backoff, backoff-multiplier, delay) mirroring the ones the lookup
  source already exposes.
- Reuse the existing retry utilities: HttpResponseChecker drives the
  status-code classification, resilience4j's IntervalFunction computes
  the backoff, and RetryConfigProvider is generalised through a
  RetryOptionKeys bag so sink and lookup share the same strategy logic.
- SinkHttpClientResponse now distinguishes retryable (transient) and
  fatal failures; HttpSinkWriter retries only the retryable ones and
  counts fatal failures against numRecordsSendErrorsCounter
  immediately.
- The legacy http.sink.error.code* options remain untouched: when the
  new retry-codes / success-codes options are not set the client falls
  back to the previous one-bucket behaviour for backwards compatibility.
- Update both the English and the Chinese connector docs with the new
  options and a dedicated 'Retries and handling errors (Sink)' section.
- Add SinkRetryConfigTest and a new HttpSinkWriterTest case for fatal
  failures.
@featzhang featzhang force-pushed the feature/add-sink-retry-support branch from 2b9ecf6 to 82b9450 Compare May 8, 2026 07:35
@featzhang
Copy link
Copy Markdown
Member Author

Addressed the review comment — went with Option 1 and aligned the sink retry configuration with the lookup source. Latest commit: 82b9450.

What changed

New sink options (mirroring http.source.lookup.*):

Option Default Description
http.sink.retry-strategy.type exponential-delay fixed-delay or exponential-delay
http.sink.retry-codes 500,503,504 Status codes treated as transient (retried)
http.sink.success-codes 2XX Status codes treated as success
http.sink.retry-strategy.fixed-delay.delay 1s Delay for fixed-delay
http.sink.retry-strategy.exponential-delay.initial-backoff 1s Initial backoff for exponential-delay
http.sink.retry-strategy.exponential-delay.max-backoff 1min Max backoff cap
http.sink.retry-strategy.exponential-delay.backoff-multiplier 2.0 Multiplier per attempt

http.sink.retry.times keeps its original meaning (max retry attempts).

Implementation notes

  • Reused HttpResponseChecker + HttpCodesParser for the status-code classification — same machinery the lookup source uses.
  • Generalised RetryConfigProvider through a RetryOptionKeys bag so both source and sink share the strategy logic (fixed-delay / exponential-delay with max retries).
  • Sink retries use resilience4j's IntervalFunction, same as the lookup source, so behaviour is consistent.
  • SinkHttpClientResponse now distinguishes retryable vs fatal failures: only retryable ones are retried; fatal status codes count against numRecordsSendErrorsCounter immediately and do not block the pipeline.
  • The legacy http.sink.error.code / http.sink.error.code.exclude options remain untouched for backwards compatibility — if neither retry-codes nor success-codes is set, behaviour is unchanged.
  • Both English and Chinese docs updated with the new options and a dedicated Retries and handling errors (Sink) section.

Tests

  • New SinkRetryConfigTest covering: defaults, fixed-delay, exponential-delay, ISO-8601 + Flink duration parsing, custom retry/success codes, and error cases.
  • New HttpSinkWriterTest#testFatalFailuresAreNotRetried asserting fatal status codes are not retried even when retry.times > 0.
  • All previously-passing sink tests still green:
    • HttpSinkWriterTest (5), SinkRetryConfigTest (7), HttpDynamicTableSinkFactoryTest (4), JavaNetSinkHttpClientConnectionTest (13), BatchRequestHttpDynamicSinkInsertTest (13), PerRequestHttpDynamicSinkInsertTest (5), HttpPostRequestCallbackFactoryTest (3).

Ready for another look 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants