Skip to content

feat(index): Add Indexer abstraction and refactor metadata table init…#18348

Open
cshuo wants to merge 5 commits intoapache:masterfrom
cshuo:indexer_abstraction
Open

feat(index): Add Indexer abstraction and refactor metadata table init…#18348
cshuo wants to merge 5 commits intoapache:masterfrom
cshuo:indexer_abstraction

Conversation

@cshuo
Copy link
Copy Markdown
Collaborator

@cshuo cshuo commented Mar 19, 2026

…ialization logic

Describe the issue this Pull Request addresses

The metadata table initialization path had index-specific logic scattered across multiple metadata writer/util classes, which made the code hard to extend and maintain as new index types are added.

This PR addresses that by introducing a dedicated Indexer abstraction and refactoring initialization flows to use it consistently across engine-specific implementations.

This PR is based on #12983

Summary and Changelog

Commit 1: Add Indexer abstraction and related POJOs

  • Introduced a new metadata indexing abstraction layer:
    • Indexer, BaseIndexer, IndexerFactory
    • record-level indexers (RecordIndexer, PartitionedRecordIndexer, BaseRecordIndexer)
    • concrete indexers for files, bloom filters, column stats, partition stats, expression, and secondary index
  • Added expression index record generators (including unsupported/spark variants).
  • Added supporting model POJOs for index partition initialization/data flow.
  • Added focused unit tests for factory and each major indexer implementation.

Commit 2: Refactor metadata table initialization to use the abstraction

  • Refactored metadata table initialization logic in metadata writer paths to rely on the new Indexer abstraction.
  • Simplified and consolidated previously duplicated logic in:
    • HoodieBackedTableMetadataWriter
    • engine-specific metadata writer implementations (Spark/Flink/Java)
    • related metadata utility classes
  • Updated related tests to align with the new abstraction-based flow.
  • Net effect: significant code reduction and cleaner separation of responsibilities.

Impact

  • Functional impact: No intended behavior change in index outcomes; this is primarily an architectural refactor plus new abstraction scaffolding.
  • Maintainability: Major improvement via centralized index initialization contract and reduced duplication.
  • Extensibility: Easier to add new metadata index types with minimal touch points.

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

* including file-group initialization, commit, and partition state update.
*/
@Slf4j
public abstract class BaseIndexer implements Indexer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To break the PR down, could we first add these abstraction classes to ease the reviews?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I split changes in the PR into 2 commits, since the indexer abstraction and its implementations involve some refactor of util class, which also affects metadata writer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

@cshuo cshuo force-pushed the indexer_abstraction branch from d2b4e7e to 116a487 Compare March 19, 2026 03:50
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Mar 19, 2026
@cshuo cshuo force-pushed the indexer_abstraction branch 2 times, most recently from 42e6f52 to ec26eb4 Compare March 20, 2026 02:25
@cshuo cshuo changed the title [WIP] feat(index): Add Indexer abstraction and refactor metadata table init… feat(index): Add Indexer abstraction and refactor metadata table init… Mar 20, 2026
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocked on my review due to critical logic

@cshuo cshuo force-pushed the indexer_abstraction branch 3 times, most recently from bdfa722 to 3237e4d Compare March 23, 2026 02:09
@cshuo cshuo force-pushed the indexer_abstraction branch 3 times, most recently from 08eecef to db53699 Compare March 25, 2026 09:03
@cshuo cshuo force-pushed the indexer_abstraction branch from db53699 to 8a03ea2 Compare March 26, 2026 03:42
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review
Style & readability pass — a few naming and duplication nits below, otherwise the code is well-structured and readable. Style & readability pass — a few small naming and simplification nits below, nothing blocking. Style & readability pass — a handful of naming and readability suggestions below, mostly around the new Indexer abstraction and the refactored initialization flow. Style & readability pass — a few naming nits below, primarily around boolean accessor conventions (flag()isDeleted()), a misleading name() accessor that returns a full path, an unused @Slf4j, and a comment that drifted away from the logic it describes after the refactor. Style & readability pass — a couple of minor naming and simplification opportunities in the test code, plus a comment that looks like it has a stray word and may not match its adjacent assertion. Style & readability pass — a couple of small naming and readability nits across the new test files; production code changes look clean. Style & readability pass — a few naming and consistency nits below, mostly around a stale Javadoc reference, an opaque boolean field name, a parameter/field name mismatch, and a mutability inconsistency across the new data-holder classes.

reader.close();
}
} catch (IOException e) {
throw new HoodieIOException("Error reading total records from file " + path, e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: small typo in the Javadoc — "context ot use" should be "context to use".

*/
@Slf4j
public class SecondaryIndexer extends BaseIndexer {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: RECORD_INDEX_AVERAGE_RECORD_SIZE = 48 is also defined in BaseRecordIndexer. Since SecondaryIndexer can't inherit it, could this constant live in a shared place (e.g. HoodieTableMetadataUtil or a RecordIndexConstants class) to avoid the duplication?

import static org.apache.hudi.metadata.HoodieTableMetadataUtil.createRecordIndexDefinition;
import static org.apache.hudi.metadata.MetadataPartitionType.RECORD_INDEX;

/**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the Javadoc says "Implementation of the global RECORD_INDEX index" — but this class is the partitioned variant. RecordIndexer is the global one. Might be worth updating to avoid confusion.

Map<String, DataPartitionAndRecords> fileGroupCountAndRecordsPairMap = new HashMap<>(partitionFileSlicePairsMap.size());
int maxParallelismPerHudiPartition = partitionFileSlicePairsMap.isEmpty()
? 1 : Math.max(1, dataTableWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism() / partitionFileSlicePairsMap.size());
int totalFileGroupCount = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: iterating over keySet() and then calling partitionFileSlicePairsMap.get(partition) inside is a classic two-lookup pattern. The second loop in this same method already uses entrySet() — could you align the first loop to do the same?

DataPartitionAndRecords init = new DataPartitionAndRecords(2, Option.empty(), records);
ExposedRecordIndexer indexer = new ExposedRecordIndexer(engineContext, writeConfig, metaClient, init);

try (MockedStatic<org.apache.hudi.metadata.HoodieTableMetadataUtil> mockedUtil = mockStatic(org.apache.hudi.metadata.HoodieTableMetadataUtil.class)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: same fully-qualified org.apache.hudi.metadata.HoodieTableMetadataUtil usage as in TestPartitionedRecordIndexer — an import would clean this up and it repeats on lines 63, 111, and 116 too.

* A class which represents a directory and the files and directories inside it.
* <p>
* A {@code PartitionFileInfo} object saves the name of the partition and various properties requires of each file
* required for initializing the metadata table. Saving limited properties reduces the total memory footprint when
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the Javadoc references PartitionFileInfo but the class is DirectoryInfo — looks like a stale copy-paste. Could you update it to match?

private final String partitionPath;
private final String fileName;
private final boolean flag;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: flag is pretty opaque — it's unclear what this boolean represents at the call site. Something like isDeleted or isAdded (or whatever semantic it carries) would read more clearly.

}

public static FileInfoAndPartition of(String partition, String path, long size) {
return new FileInfoAndPartition(partition, path, size);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the factory parameter is named path but the backing field is name — could you align them so readers don't have to mentally translate at the call site?

*/
@AllArgsConstructor
@Getter
@Setter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: @Setter makes this mutable, which is inconsistent with FileInfo and FileInfoAndPartition in the same package — both use final fields and no setters. If mutation isn't needed here, have you considered dropping @Setter and making the fields final?

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work on the Indexer abstraction refactor! This architectural improvement significantly simplifies the metadata initialization path and enhances test coverage, though there are several functional issues regarding initialization edge cases, unmodifiable maps, and serialization that require attention.

String instantTime) {
if (metaClient.getTableConfig().getTableVersion().lesserThan(HoodieTableVersion.EIGHT)) {
throw new HoodieNotSupportedException("Table version 6 and below does not support expression index");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The error message says "Table version 6 and below" but the check is lesserThan(HoodieTableVersion.EIGHT), which also covers version 7 (0.16.0). Should this say "Table version 7 and below" or "versions before 8"?

if (partitionToAllFilesMap.isEmpty()) {
this.columnsToIndex = Lazy.lazily(Collections::emptyList);
return Collections.singletonList(IndexPartitionInitialization.of(fileGroupCount, COLUMN_STATS.getPartitionPath(), engineContext.emptyHoodieData()));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Overwriting this.columnsToIndex with an empty list when partitionToAllFilesMap is empty means postInitialization will register the column_stats index definition with zero source fields. Is that intentional? If this indexer instance is reused for a subsequent call with non-empty partitions, the columns would also be lost.

Map<String, Long> fileNameToSizeMap = partitionInfo.getValue().stream()
.collect(Collectors.toMap(FileInfo::fileName, FileInfo::size));
return HoodieMetadataPayload.createPartitionFilesRecord(
partitionInfo.getKey(), fileNameToSizeMap, Collections.emptyList());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Calling fileListRecords.count() triggers a full Spark action just for validation. For tables with many partitions, this doubles the initialization cost of this step. Could you validate using partitionToAllFilesMap.size() against partitions.size() instead (which is always true by construction), or remove this check?

Copy link
Copy Markdown
Collaborator Author

@cshuo cshuo Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitions is the key set of partitionToAllFilesMap, so partitions.size() is always equals to partitionToAllFilesMap.size(). It seems the validation here is unnecessary for the current code base, and it should be safe to be removed.

if (partitionFileSlicePairs.isEmpty()) {
return Collections.singletonList(IndexPartitionInitialization.of(fileGroupCount, indexName, engineContext.emptyHoodieData()));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The Lazy wrapper here is created and immediately .get()-ed on the next line, so it doesn't actually provide any lazy evaluation benefit. Could simplify to a direct call: HoodieSchema tableSchema = HoodieTableMetadataUtil.tryResolveSchemaForTable(dataTableMetaClient).orElseThrow(...);

List<String> filesToDelete = filesDeleted.getOrDefault(partition, Collections.emptyList());
fileDeleteCount[0] += filesToDelete.size();
HoodieRecord record = HoodieMetadataPayload.createPartitionFilesRecord(partition, filesToAdd, filesToDelete);
Map<String, Long> filesToAddMap = filesToAdd.stream().collect(Collectors.toMap(FileInfo::fileName, FileInfo::size));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The old signature used Map<String, Long> which inherently prevented duplicate file names. The new List<FileInfo> can potentially contain duplicates, and Collectors.toMap(FileInfo::fileName, FileInfo::size) will throw IllegalStateException on duplicate keys. Could you add a merge function (e.g., Long::max) to handle this defensively, or document why duplicates can't occur?

@EnumSource(classOf[HoodieTableType])
override def testRLIUpsertAndDropIndex(tableType: HoodieTableType): Unit = {
val hudiOpts = commonOpts ++ Map(DataSourceWriteOptions.TABLE_TYPE.key -> tableType.name(),
HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The comment mentions that partition stats should be present, but the assertion checks for a size of 2. If only files and col stats are expected, the count is correct, but the comment might be misleading.

// FILES partition should always be initialized first if enabled
if (!filesPartitionAvailable) {
initializeMetadataPartition(FILES, indexerMapForPartitionsToInit.get(FILES),
dataTableInstantTime, partitionIdToAllFilesMap, lazyLatestMergedPartitionFileSliceList);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This will NPE if the metadata table directory exists but the FILES partition is not marked as available.

this.enabledIndexerMap.remove(RECORD_INDEX);
indexerMapForPartitionsToInit.remove(RECORD_INDEX);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This will throw UnsupportedOperationException because enabledIndexerMap is unmodifiable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, now RECORD_INDEX partition will be filtered during initializing enabledIndexerMap.

protected final transient HoodieEngineContext engineContext;
@Getter
protected final List<MetadataPartitionType> enabledPartitionTypes;
protected final transient Map<MetadataPartitionType, Indexer> enabledIndexerMap;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Making these fields transient breaks the Serializable contract of HoodieTableMetadataWriter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking these fields as transient is intentional. HoodieBackedTableMetadataWriter can be captured in Spark execution closures, so any field reachable from it must either be serializable or excluded from serialization. enabledIndexerMap / expressionIndexRecordGenerator and the existing field engineContext are runtime-only collaborators and are not serializable as part of the Spark closure. So the goal here is to avoid closure-serialization failures, not to make the writer meaningfully deserializable for later reuse.

}
return Collections.unmodifiableMap(Arrays.stream(MetadataPartitionType.getValidValues(dataTableMetaClient.getTableConfig().getTableVersion()))
.filter(partitionType ->
(partitionType.isMetadataPartitionEnabled(dataTableWriteConfig.getMetadataConfig(), dataTableMetaClient.getTableConfig())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Is there a versioned getValidValues method added to MetadataPartitionType? This might cause a compilation error.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a getValidValues method accepting a parameter of type HoodieTableVersion.

@cshuo cshuo force-pushed the indexer_abstraction branch 2 times, most recently from 87975f4 to 397cf29 Compare April 7, 2026 10:03
@cshuo cshuo force-pushed the indexer_abstraction branch from 397cf29 to e4ee525 Compare April 7, 2026 12:28
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — The refactoring is clean and well-organized overall. One minor readability concern: the parameter order of the factory method doesn't match the constructor's field order, which could confuse developers. code looks clean overall, with one minor style inconsistency in variable naming within a stream pipeline.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Good progress on the Indexer abstraction — the decomposition into per-index-type indexers is clean and should make adding new index types much easier. A few items need attention before merging: FileAndPartitionFlag is missing Serializable (which will cause runtime failures on Spark), and there are edge cases to verify around empty-partition handling, HashMap iteration ordering for partition initialization, and ensuring hasPartitionsStateChanged is set correctly when initialization is skipped.

dataTableWriteConfig.getMetadataConfig().getMaxReaderBufferSize(),
columnsToIndex.get());
return Collections.singletonList(IndexPartitionInitialization.of(fileGroupCount, COLUMN_STATS.getPartitionPath(), records));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 When columnsToIndex is empty and buildInitialization returns empty data, postInitialization will still be called and will register an index definition with empty source fields. Is that intentional? It seems like registering a column stats index with no source fields could confuse downstream readers.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the logic of the current master branch, the behavior is same as this pr. Not sure if it's intentional cc @yihua

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an comment explain this case:

if (columnsToIndex.isEmpty()) {
// this can only happen if meta fields are disabled and cols to index is not explicitly overridden.
return Pair.of(columnsToIndex, Pair.of(fileGroupCount, engineContext.emptyHoodieData()));
}

log.info("Initializing partitioned record index from data partition {}", partition);
DataPartitionAndRecords dataPartitionAndRecords = initializeRecordIndexPartition(partition, fileSliceAndPartitions, maxParallelismPerHudiPartition);
fileGroupCountAndRecordsPairMap.put(partition, dataPartitionAndRecords);
totalFileGroupCount += dataPartitionAndRecords.numFileGroups();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 When partitionFileSlicePairsMap is empty (e.g., empty table), totalFileGroupCount will be 0 and initializationList will be empty, but the code still returns an IndexPartitionInitialization with 0 file groups. Could this cause issues downstream? The non-partitioned RecordIndexer path goes through estimateFileGroupCount which guarantees at least 1 file group via Math.max(1, ...). Is an empty-table scenario expected to be gated earlier in the call chain?

Copy link
Copy Markdown
Collaborator Author

@cshuo cshuo Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same logic exists in the master branch, and I think it's indeed an issue. If partitionFileSlicePairsMap is empty, it still returns a non-empty buildInitialization() result with totalFileGroupCount = 0 and no DataPartitionAndRecords. That means the writer will not skip initialization; instead it can mark the record index partition as available without creating any file groups.
I think we should early return a empty initialization result if partitionFileSlicePairsMap is empty, then metadata writer will skip initializing the partitioned RLI. WDYT? @yihua

@Override
public void postInitialization(HoodieTableMetaClient metadataMetaClient, HoodieData<HoodieRecord> records, int fileGroupCount, String relativePartitionPath) {
super.postInitialization(metadataMetaClient, records, fileGroupCount, relativePartitionPath);
// Validate record index after commit if validation is enabled
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Worth noting that since records passed to postInitialization is the union of all per-partition records (built in initializeMetadataPartition), calling records.unpersist() on a Spark union RDD does not automatically unpersist the parent RDDs that were individually persisted inside estimateFileGroupCount. For non-partitioned RLI this is fine (single source), but for partitioned RLI, the individual per-partition records that were persisted may leak. The old code had the same pattern though, so it's not a regression.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.26441% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (fd20018) to head (e018541).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
.../hudi/metadata/index/record/BaseRecordIndexer.java 71.84% 25 Missing and 4 partials ⚠️
...hudi/metadata/HoodieBackedTableMetadataWriter.java 92.42% 3 Missing and 2 partials ⚠️
...dex/UnsupportedExpressionIndexRecordGenerator.java 42.85% 4 Missing ⚠️
...i/metadata/index/expression/ExpressionIndexer.java 87.87% 1 Missing and 3 partials ⚠️
...org/apache/hudi/metadata/index/IndexerFactory.java 87.50% 2 Missing and 1 partial ⚠️
...ata/index/SparkExpressionIndexRecordGenerator.java 78.57% 2 Missing and 1 partial ⚠️
...udi/metadata/index/secondary/SecondaryIndexer.java 90.47% 0 Missing and 2 partials ⚠️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 93.33% 2 Missing ⚠️
...metadata/index/columnstats/ColumnStatsIndexer.java 96.96% 0 Missing and 1 partial ⚠️
...etadata/index/record/PartitionedRecordIndexer.java 96.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18348      +/-   ##
============================================
+ Coverage     68.76%   68.81%   +0.04%     
- Complexity    28086    28200     +114     
============================================
  Files          2451     2471      +20     
  Lines        134839   135110     +271     
  Branches      16328    16367      +39     
============================================
+ Hits          92718    92970     +252     
- Misses        34793    34795       +2     
- Partials       7328     7345      +17     
Flag Coverage Δ
common-and-other-modules 44.54% <59.24%> (+0.09%) ⬆️
hadoop-mr-java-client 44.92% <54.64%> (-0.01%) ⬇️
spark-client-hadoop-common 48.41% <44.80%> (-0.08%) ⬇️
spark-java-tests 48.92% <76.64%> (+0.10%) ⬆️
spark-scala-tests 45.51% <69.26%> (+0.01%) ⬆️
utilities 38.29% <66.66%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...oodieBackedTableMetadataWriterTableVersionSix.java 76.41% <100.00%> (+0.73%) ⬆️
...apache/hudi/metadata/HoodieMetadataWriteUtils.java 83.44% <100.00%> (+0.84%) ⬆️
.../metadata/SecondaryIndexRecordGenerationUtils.java 88.27% <100.00%> (ø)
...va/org/apache/hudi/metadata/index/BaseIndexer.java 100.00% <100.00%> (ø)
...tadata/index/bloomfilters/BloomFiltersIndexer.java 100.00% <100.00%> (ø)
...apache/hudi/metadata/index/files/FilesIndexer.java 100.00% <100.00%> (ø)
...data/index/model/IndexPartitionInitialization.java 100.00% <100.00%> (ø)
...ta/index/partitionstats/PartitionStatsIndexer.java 100.00% <100.00%> (ø)
...ache/hudi/metadata/index/record/RecordIndexer.java 100.00% <100.00%> (ø)
...c/main/java/org/apache/hudi/table/HoodieTable.java 89.49% <100.00%> (ø)
... and 23 more

... and 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Apr 8, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@cshuo
Copy link
Copy Markdown
Collaborator Author

cshuo commented Apr 9, 2026

@yihua all comments are resolved or replied. Can you take another look.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice updates — all the feedback from the previous review has been addressed cleanly:

  • FileAndPartitionFlag Serializable (BUG): Fixed. All model POJOs (FileAndPartitionFlag, FileInfo, FileInfoAndPartition, FileSliceAndPartition, DirectoryInfo) now implement Serializable with explicit serialVersionUID.
  • hasPartitionsStateChanged set unconditionally (BUG): Fixed. The flag is now set at the end of initializeMetadataPartition, after the early-return guard for empty initialization lists, so it's only set when work actually happens.
  • Empty filesToIndex causing zero parallelism (BUG): Fixed. The emptiness check in ExpressionIndexer now correctly guards on filesToIndex.isEmpty() instead of partitionFileSlicePairs.isEmpty().
  • records.unpersist() not recursive (QUESTION): Fixed. Changed to unpersistWithDependencies() to clean up per-partition persisted datasets.
  • Empty columnsToIndex registering empty source fields (QUESTION): Fixed. postInitialization now checks records.isEmpty() and registers with an empty list when there are no records.
  • SecondaryIndexer partition type and HashMap ordering questions were answered by the author with reasonable justifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants