SOLR-18187: Document enrichment with LLMs #4259
nicolo-rinaldi wants to merge 14 commits into apache:main
…tUpdateProcessorFactory
- multivalued outputField - outputField different from Str/Text, with numeric, boolean and date
…t with LLMs' module
...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java
...uagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java
solr/modules/language-models/src/test-files/modelChatExamples/dummy-chat-model-ambiguous.json
solr/modules/language-models/src/test-files/solr/collection1/conf/schema-language-models.xml
...files/solr/collection1/conf/solrconfig-document-enrichment-update-request-processor-only.xml
...ules/language-models/src/test-files/solr/collection1/conf/solrconfig-document-enrichment.xml
...models/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactoryTest.java
...anguagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorTest.java
| restTestHarness.delete(ManagedChatModelStore.REST_END_POINT + "/model1");
| }
|
| private UpdateRequestProcessor createUpdateProcessor(
Can't this always be generalised and used for all the tests? In some of them, you are now repeating this code with small changes.

This is the same as createUpdateProcessor, apart from the creation of the request and the getInstance() call. Maybe we can leave out the Solr request and getInstance() and use that method here as well, calling it something like "initializeUpdateProcessorFactory"? What do you think?

I created a function, initializeUpdateProcessorFactory, that is used inside createUpdateProcessor. This way, the code in the former can be reused.

Why can't some tests use these new functions? E.g. init_multipleInputFields_shouldInitAllFields.

I kept them unrelated to the model creation, just to check the proper initialization of the factory. I can see whether this can be changed, if you want.
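To make the refactoring discussed above concrete, here is a minimal, self-contained sketch of the helper split. Only the two helper names (`initializeUpdateProcessorFactory`, `createUpdateProcessor`) come from this thread; `SketchFactory`, `SketchProcessor`, the `model` argument, and the string standing in for the Solr request are hypothetical stand-ins, not the classes or signatures used in the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the update processor under test.
class SketchProcessor {
  final String inputField;
  final String outputField;
  final String request;

  SketchProcessor(String inputField, String outputField, String request) {
    this.inputField = inputField;
    this.outputField = outputField;
    this.request = request;
  }
}

// Hypothetical stand-in for the factory: it only stores its init arguments.
class SketchFactory {
  private final Map<String, String> args = new LinkedHashMap<>();

  void init(Map<String, String> initArgs) {
    args.putAll(initArgs);
  }

  String arg(String name) {
    return args.get(name);
  }

  SketchProcessor getInstance(String request) {
    return new SketchProcessor(args.get("inputField"), args.get("outputField"), request);
  }
}

class SketchTestHelpers {
  // Shared setup: build the arguments and initialise the factory, nothing else.
  static SketchFactory initializeUpdateProcessorFactory(
      String inputField, String outputField, String model) {
    Map<String, String> args = new LinkedHashMap<>();
    args.put("inputField", inputField);
    args.put("outputField", outputField);
    args.put("model", model); // assumed parameter name
    SketchFactory factory = new SketchFactory();
    factory.init(args);
    return factory;
  }

  // Full setup: reuse the shared step, then create the request and the processor.
  static SketchProcessor createUpdateProcessor(
      String inputField, String outputField, String model) {
    SketchFactory factory = initializeUpdateProcessorFactory(inputField, outputField, model);
    String request = "stand-in request"; // placeholder for the real Solr request
    return factory.getInstance(request);
  }
}
```

With this split, tests that only check factory initialization can call the first helper directly, while tests that need a processor call the second.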
...uagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java
| @Test
| public void init_promptFileWithMissingPlaceholder_shouldThrowExceptionInInform() {
| NamedList<String> args = new NamedList<>();
This is the same as createUpdateProcessor, apart from the creation of the request and the getInstance() call. Maybe we can leave out the Solr request and getInstance() and use that method here as well, calling it something like "initializeUpdateProcessorFactory"? What do you think?

Changed, and fixed the tests.
...java/org/apache/solr/languagemodels/documentenrichment/store/rest/ManagedChatModelStore.java
...anguagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorTest.java
solr/modules/language-models/src/test-files/solr/collection1/conf/schema-language-models.xml
| This module brings the power of *Large Language Models* to Solr.
|
| More specifically, it provides the capability, at indexing time, given a prompt and a set of input fields, of calling an
| LLM through https://github.com/langchain4j/langchain4j[LangChain4j] for each document and store the result of the call
| in an `outputField`, that can be of multiple types and even multivalued.
More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted data. Given a prompt and a set of input fields, for each document, the LLM is invoked through https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support multiple types and may also be multivalued.
| _Without_ this module, the LLM calls must be done _outside_ Solr, before indexing.
Without this module, the LLM calls to enrich documents must be done outside Solr, before indexing.
| ====
|
| At the moment a subset of LLM providers supported by LangChain4j is supported by Solr.
At the moment, Solr supports a subset of the LLM providers available in LangChain4j.
| ----
| [NOTE]
| ====
| If no component is configured in `solrconfig.xml`, the `ChatModel` store will not be registered and requests to
| `/schema/chat-model-store` will return an error.
| ====
|
| == Chat Model Configuration
Hmm, maybe "Chat Model Setup"?
| Another important feature of this module is that one (or more) `inputField` needs to be injected in the prompt. This is
| done by some special tokens, that are the `fieldName` surrounded by curly brackets (e.g., `{string_field}`, in the
| example above). These tokens are _mandatory_ for this module to work properly. Solr will throw an error if the
| parameters are not properly defined.
| For example, both the prompt and the content of the file prompt.txt, must contain the text '{string_field}', which
| will be substituted with the content of the `string_field` field for each document. An example of a valid prompt with
I think the part up to this point could be explained more schematically and in a way that is easier to understand.
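As a rough illustration of the placeholder mechanism in the quoted documentation, the sketch below checks that every input field has a matching `{fieldName}` token and then substitutes field values into the prompt. The class and method names are hypothetical, and the module's actual validation and substitution logic may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
class PromptPlaceholderSketch {
  // Matches tokens of the form {fieldName}.
  private static final Pattern TOKEN = Pattern.compile("\\{([^{}]+)\\}");

  // Fails fast if any configured input field has no {fieldName} token in the prompt.
  static void validate(String prompt, List<String> inputFields) {
    for (String field : inputFields) {
      if (!prompt.contains("{" + field + "}")) {
        throw new IllegalArgumentException("prompt is missing placeholder {" + field + "}");
      }
    }
  }

  // Replaces each known {fieldName} token in a single pass, so that text
  // coming from one field is never re-scanned for further tokens.
  static String render(String prompt, Map<String, String> fieldValues) {
    Matcher matcher = TOKEN.matcher(prompt);
    StringBuilder out = new StringBuilder();
    while (matcher.find()) {
      String value = fieldValues.get(matcher.group(1));
      // Unknown tokens are left untouched (an assumption of this sketch).
      String replacement = value != null ? value : matcher.group();
      matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(out);
    return out.toString();
  }
}
```

Running `validate` once at configuration time and `render` once per document would mirror the behaviour described above, where a missing token is reported as an error before any document is processed.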
| </updateRequestProcessorChain>
| ----
|
| Another way of using more than one `inputField` is by using the following notation, instead of more than one parameter
Multiple inputField could also be defined by using the following notation:
| </arr>
| ----
|
| The LLM response is mapped to the specified `outputField`. Note that this module only supports a subset of Solr's
Maybe we can also specify that only one outputField is supported
| ====
|
| === Index first and enrich your documents on a second pass
| LLM calls are usually quite slow, so, depending on your use case it could be a good idea to index first your documents
LLM calls are typically slow, so depending on your use case, it may be preferable to first index your documents and enrich them with LLM-generated fields at a later stage.
https://issues.apache.org/jira/browse/SOLR-18187
Description
The goal of this PR is to integrate LLMs directly into Solr at index time, so that they can fill in fields that might be useful (e.g., categories or tags).
Solution
This PR adds LLM-based document enrichment capabilities to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor allows users to enrich documents at index time by calling an LLM (via LangChain4j, https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response in an output field. The output field can be of different types (string, text, int, long, float, double, boolean, and date) and can be single-valued or multi-valued. Structured output is used to adapt the model's response to the output field type.
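For illustration only, the sketch below coerces a raw textual model response into a Java value for each of the output types listed above. It is plain string parsing, not the structured-output mechanism this PR relies on, and `OutputValueSketch`, `Kind`, `convert`, and `convertAll` are hypothetical names.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR.
class OutputValueSketch {
  // One entry per kind of output value; string and text fields both map to STRING.
  enum Kind { STRING, INT, LONG, FLOAT, DOUBLE, BOOLEAN, DATE }

  // Converts one raw response value; throws if the text does not fit the target type.
  static Object convert(String raw, Kind kind) {
    String s = raw.trim();
    switch (kind) {
      case INT:
        return Integer.valueOf(s);
      case LONG:
        return Long.valueOf(s);
      case FLOAT:
        return Float.valueOf(s);
      case DOUBLE:
        return Double.valueOf(s);
      case BOOLEAN:
        // Be strict: Boolean.valueOf would silently map any other text to false.
        if (s.equalsIgnoreCase("true")) return Boolean.TRUE;
        if (s.equalsIgnoreCase("false")) return Boolean.FALSE;
        throw new IllegalArgumentException("not a boolean: " + raw);
      case DATE:
        // Assumes an ISO-8601 instant such as 2024-01-01T00:00:00Z.
        return Instant.parse(s);
      default:
        return raw;
    }
  }

  // Multi-valued case: convert every element with the same rule.
  static List<Object> convertAll(List<String> rawValues, Kind kind) {
    List<Object> converted = new ArrayList<>();
    for (String raw : rawValues) {
      converted.add(convert(raw, kind));
    }
    return converted;
  }
}
```

Failing loudly on a value that does not parse, rather than storing a default, keeps malformed model output from silently ending up in a typed field.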
The implementation takes inspiration from the text-to-vector feature in the same module, in order to stay consistent with the conventions already used in the language-models module.
Note: this PR was developed with assistance from Claude Code (Anthropic).
Tests
Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches), and processor initialization.
Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.