SOLR-18194: fix nested docs detection false positive#4279
SOLR-18194: fix nested docs detection false positive#4279kotman12 wants to merge 6 commits intoapache:mainfrom
Conversation
|
Btw @dsmiley I got Claude to factor this with an EmbeddedSolrServerTest here (as you suggested on the initial PR) but I am not sure what to think about it. Part of me thinks the collection-centric language, i.e. |
dsmiley
left a comment
There was a problem hiding this comment.
Part of me thinks the collection-centric language, i.e. .newCollection() is a bit weird to put in a test of a feature that is explicitly not supported in cloud mode. That being said it is just a naming thing and the test does work. So its perhaps more of a problem with the solrTestRule. Perhaps I have some trouble grasping the finer details as I've only ever interacted with Solr in Cloud mode.
The SolrTestRule is an abstraction that will grow to SolrCloud. So I think there should be one terminology instead of bifurcated. On the other hand, I could see adding an alias for non-cloud ones when we're sure the logic of the test is very fundamentally about cores.
| } | ||
| } | ||
|
|
||
| // --- Child docs detection tests --- |
There was a problem hiding this comment.
[0] Wow lots of tests... and they seem only slightly tweaked amongst each. My observation is that Lucene/Solr old-timers (like me) prefer to write a minimal number of randomized test instead of writing many repeated tests that tweak something. I noticed LLMs prefer the latter. But it's fine, really!
There was a problem hiding this comment.
These are indeed LLM generated but I was very intentional about what scenarios to test, i.e. hasChildDocs X hasUpdates X hasDeletes. These variables have different effects on the distinct count vs docCount of both id and _root_ and would like any hypothetical future refactoring to take into account each since the logic is a function of these variables. I am not convinced a randomized test would have been effective here without knowing what edge cases you are looking for a priori. I suppose in theory it could have caught other things I hadn't considered although seems unlikely given the relatively simple logic and few moving parts (famous last words). Thus, there is advantage to deterministically validating these known edge cases with each build. That being said, it feels like overkill to test these in UpgradeCoreIndex as this check is a rather small component of the overall operation. Perhaps this check should live in some utility function? Then we could also call it from, say, the LukeHandler and expose a hasNested/ChildDocs in the index info. The benefit of such thorough testing would be more obvious in a targeted utility. What do you think?
As an aside, is your reflexive urge to use randomized testing here motivated by aesthetic reasons? If so, I could consider parameterizing them. I think it should be possible.
There was a problem hiding this comment.
Thanks for explaining why the tests are what they are. Seems perfect then :-)
No; not aesthetic reasons. Well okay slightly but whatever.
I love the idea of a utility function somewhere! SolrCore? I hate to suggest that place given SolrCore's size but it would at least have some logical sense to its placement there, and discoverability, which matters.
…reIndex.java Co-authored-by: David Smiley <[email protected]>
| // in segments (since NoMergePolicy prevents segment merges from purging them). | ||
|
|
||
| @Test | ||
| public void testChildDocsDetection_noChildDocsJustAdd() throws Exception { |
There was a problem hiding this comment.
I understand the intent that we are trying to be explicit about the edge cases of the child docs check. In certain cases however, I'd argue that certain tests like this one are effectively testing the happy path which is already getting covered inherently in existing tests like testNoUpgradeNeededWhenAllSegmentsCurrent().
I appreciate the thorough tests, but I think we have an opportunity to keep the coverage tight here by relying on certain existing tests like above. Another reasoning is that every test is "just another test" by itself, but can gradually add to the build times.
| } | ||
|
|
||
| @Test | ||
| public void testChildDocsDetection_withChildDocsJustAdd() throws Exception { |
There was a problem hiding this comment.
Same argument on duplicate coverage as above through existing test testUpgradeCoreIndexFailsWithNestedDocuments() (now renamed to testUpgradeCoreIndexFailsWithChildDocuments() in this PR)
| } | ||
|
|
||
| @Test | ||
| public void testChildDocsDetection_withChildDocsWithWithinCommitUpdates() throws Exception { |
There was a problem hiding this comment.
I'd argue that all tests of the pattern testChildDocsDetection_withChildDocs* are getting covered through the existing test testUpgradeCoreIndexFailsWithChildDocuments() since the with vs within(CommitUpdates) condition is not a real differentiator in that case.
testChildDocsDetection_noChildDocsWithWithinCommitUpdates() and testChildDocsDetection_noChildDocsWithWithinCommitDeletesAndUpdates() is all we need?
There was a problem hiding this comment.
In the current algorithm they are equivalent but things can change so this is more for regression prevention. I think it's possible to imagine some future code path that only gets exercised when unique(id) == idTerms.size() (which is the case only if all deletes are via update) or vice versa. We can also randomize the counts of docs added vs deleted vs updated if we are really worried about the incremental build time here. All that being said, it feels like overkill for the UpgradeCoreIndex tests because this isn't even the main point of this class.
That's why I proposed moving the child doc check to a central location (where they can also be exposed by some info endpoint as well). I know @dsmiley expressed interest in exposing the child doc check via some info endpoint in the original PR (luke or maybe in core/collection status?). If we were to go that direction then we'd probably want to move this check to some centralized utility.
Description
The upgrade core index endpoint doesn't currently support child docs and so it guards against running on an index with such documents. The logic to check for the existence of child docs contingent upon the
_root_field existing could be improved. If you just compare the doc count of_root_vs the unique count (current behavior), it will falsely flag any segment with updated docs (where the update happened within one segment) as containing child docs when in actuality it may have no child docs. I actually ran into such a scenario devising my own version of this check.Solution
Compare terms size of id field with the terms size of the
_root_field to ensure no documents share a root. We do this for each segment separately as was done for the previous check.Tests