This guide provides step-by-step instructions for setting up the machine learning inference and ingest pipeline pieces in your Elasticsearch cluster to support the Semantic Code Indexer.
This setup enables two powerful features:
- Semantic Search: Using Elastic's ELSER model via an Elasticsearch inference endpoint (used by the `semantic_text` field).
- Code Similarity Search: Using Microsoft's CodeBERT model to generate dense vectors for future KNN (k-Nearest Neighbor) features, such as a `find_similar_code` tool.
You will need:

- An Elasticsearch cluster (v8.0 or later).
- Access to the Kibana Dev Tools console.
- Python and `pip` installed on your local machine.
- The `eland` Python library installed (`pip install eland`).
ELSER (Elastic Learned Sparse EncodeR) powers the primary semantic search functionality.
The index mapping uses Elasticsearch's `semantic_text` field type with an `inference_id`. Configure that via `SCS_IDXR_ELASTICSEARCH_INFERENCE_ID` in the indexer.
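For orientation only, a `semantic_text` mapping of roughly this shape is what the indexer produces (this sketch is illustrative: the index name `code-chunks` is a placeholder, and `elser-inference-test` is the example endpoint ID used for local testing later in this guide — the application creates the real index and mapping itself):

```
PUT code-chunks
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "elser-inference-test"
      }
    }
  }
}
```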
- Navigate to the Dev Tools console in your Kibana instance.
- Create (or reuse) an inference endpoint ID that resolves to an ELSER sparse embedding endpoint, then set `SCS_IDXR_ELASTICSEARCH_INFERENCE_ID` to that ID.
The exact setup depends on your Elasticsearch deployment and version. The indexer repository includes an example of creating an inference endpoint for local testing in `scripts/setup-integration-tests.sh` (see how it configures an endpoint like `elser-inference-test` via `/_inference/sparse_embedding/...`).
- Verify the inference endpoint is available and healthy before proceeding.
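One way to verify from Dev Tools (assuming the example endpoint ID `elser-inference-test` from above; substitute your own ID) is to fetch the endpoint and run a quick test inference:

```
GET _inference/sparse_embedding/elser-inference-test

POST _inference/sparse_embedding/elser-inference-test
{
  "input": "quick smoke test"
}
```

A successful response to the `POST` contains sparse token/weight pairs, confirming the ELSER model is allocated and serving requests.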
For the planned find_similar_code feature, we need a dense vector model that is specifically trained on source code. We will use microsoft/codebert-base from the Hugging Face Hub.
- Install Eland: If you haven't already, install the `eland` client with `pip install eland`.
- Troubleshooting: Upgrade PyTorch: The `eland` tool depends on PyTorch. Due to a recent security vulnerability, you may need to upgrade `torch` to version 2.6 or higher. If you encounter an error during the model import, run `pip install --upgrade torch` first.
- Import the Model: Run the following command in your terminal. Replace the environment variables with your actual Elasticsearch credentials. This command downloads the CodeBERT model, converts it to a TorchScript representation, and uploads it to your cluster.

  ```shell
  eland_import_hub_model \
    --cloud-id $ELASTICSEARCH_CLOUD_ID \
    --hub-model-id microsoft/codebert-base \
    --task-type text_embedding \
    --es-api-key $ELASTICSEARCH_API_KEY \
    --start
  ```
This process will create a new model in Elasticsearch with the ID `microsoft__codebert-base`.
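You can confirm the model was uploaded and deployed from Dev Tools using the standard trained-models stats API (the exact response shape varies by Elasticsearch version):

```
GET _ml/trained_models/microsoft__codebert-base/_stats
```

In the response, look for a deployment with state `started` before moving on to the ingest pipeline.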
This ingest pipeline will selectively use the CodeBERT model to generate dense vector embeddings for the most valuable code chunks, saving significant computational resources.
- Navigate to the Dev Tools console in Kibana.
- Run the following command to create the pipeline:
```
PUT _ingest/pipeline/code-similarity-pipeline
{
  "description": "Pipeline to selectively generate dense vector embeddings for substantive code chunks.",
  "processors": [
    {
      "grok": {
        "field": "kind",
        "patterns": ["^(call_expression|import_statement|lexical_declaration)$"],
        "on_failure": [
          {
            "inference": {
              "model_id": "microsoft__codebert-base",
              "target_field": "code_vector",
              "field_map": {
                "content": "text_field"
              }
            }
          },
          {
            "set": {
              "field": "code_vector",
              "copy_from": "code_vector.predicted_value"
            }
          }
        ]
      }
    }
  ]
}
```

This pipeline inspects the `kind` of each code chunk. The `grok` processor matches the low-value kinds (function calls, imports, and lexical declarations); when it matches, the pipeline ends there and the expensive embedding step is skipped. When the pattern does not match — high-value chunks such as functions and classes — the `grok` processor fails, and the `on_failure` handlers generate the dense vector and store it in the `code_vector` field.
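To sanity-check the gating behavior, you can simulate the pipeline against one low-value and one high-value chunk. Note this runs the real `inference` processor, so the CodeBERT model must already be deployed; the `function_declaration` kind and the `_source` fields here are illustrative test inputs, not values mandated by the indexer:

```
POST _ingest/pipeline/code-similarity-pipeline/_simulate
{
  "docs": [
    { "_source": { "kind": "import_statement", "content": "import fs from 'fs';" } },
    { "_source": { "kind": "function_declaration", "content": "function add(a, b) { return a + b; }" } }
  ]
}
```

Only the second document should come back with a populated `code_vector` field; the first matches the grok pattern and is passed through unchanged.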
After completing these steps, your Elasticsearch cluster is fully configured. You have:

- An ELSER-backed inference endpoint configured for `semantic_text` semantic search.
- The `microsoft__codebert-base` model deployed for dense vector code similarity.
- The optimized `code-similarity-pipeline` ready to selectively and automatically generate dense vectors during indexing.
The application will automatically create the index with the correct mapping on its first run.