This guide outlines how to deploy the semantic code indexer on a Google Cloud Platform (GCP) VM as a periodic, multi-repository service.
The setup consists of a single main component managed by cron:
- A periodic cron job: This job triggers the indexer which, for each configured repository, scans for changes, enqueues them, and then immediately processes the queue to update the search index.
You can create a new VM instance using the gcloud command-line tool. Here is a sample command to create a powerful Ubuntu 22.04 VM suitable for running the indexer:
```bash
gcloud compute instances create simainhacker-code-search-indexer \
  --project=$GCP_PROJECT_ID \
  --zone=us-central1-a \
  --machine-type=n2-standard-8 \
  --create-disk=auto-delete=yes,boot=yes,device-name=your-vm-name-boot-disk,disk-resource-policy=projects/your-project/regions/your-region/diskResourcePolicies/your-policy \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --description="Indexer for Semantic Code Search."
```

Note: You will need to replace the placeholder values for `--project` and the disk resource policy path in `--create-disk` with your specific GCP project details.
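To confirm the instance is up before connecting, you can query its status (the instance name matches the create command above):

```bash
# Prints RUNNING once the VM has finished provisioning
gcloud compute instances describe simainhacker-code-search-indexer \
  --project=$GCP_PROJECT_ID --zone=us-central1-a --format="value(status)"
```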
Once the instance is running, you can SSH into it using:

```bash
gcloud compute ssh simainhacker-code-search-indexer --project="your-gcp-project-id" --zone="us-central1-a"
```

After connecting, you need to install the necessary dependencies.
First, update your package manager's list of available packages:

```bash
sudo apt-get update
```

Install `git` for version control, `tmux` for managing persistent sessions, and `build-essential` for compiling native npm packages. The `flock` utility, used later to prevent overlapping cron runs, is part of `util-linux` and comes preinstalled on Ubuntu.

```bash
sudo apt-get install -y git tmux build-essential
```

The indexer requires Node.js v20 or higher. The following commands will add the official NodeSource repository and install the latest Node.js v20.
```bash
# Download and execute the NodeSource setup script
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -

# Install Node.js
sudo apt-get install -y nodejs
```

You can verify that everything was installed correctly by checking the versions:
```bash
node -v
npm -v
git --version
```

Before proceeding, you should have:

- A GCP project with a running VM instance (e.g., Ubuntu 22.04).
- `git`, Node.js (v20+), `npm`, and `flock` installed on the VM.
- The indexer project code cloned to a directory (e.g., `/opt/semantic-code-search-indexer`).
- An Elasticsearch instance that the VM can connect to.
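If you have not cloned the project yet, the setup might look like the following sketch; the repository URL is a placeholder, so substitute your actual origin:

```bash
# Placeholder URL; replace with the real repository origin
sudo git clone https://github.com/your-org/semantic-code-search-indexer.git /opt/semantic-code-search-indexer
sudo chown -R "$USER" /opt/semantic-code-search-indexer
cd /opt/semantic-code-search-indexer
npm install
npm run build
```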
The application's configuration is managed by a `.env` file. Create this file in the root of the project directory (`/opt/semantic-code-search-indexer/.env`).
```bash
# /opt/semantic-code-search-indexer/.env
# Elasticsearch Configuration
ELASTICSEARCH_ENDPOINT="https://your-es-endpoint.elastic-cloud.com:9243"
ELASTICSEARCH_API_KEY="YourEncodedApiKey"
# OpenTelemetry Configuration (optional)
# Enable logs and metrics export to OpenTelemetry Collector
SCS_IDXR_OTEL_LOGGING_ENABLED="true"
SCS_IDXR_OTEL_METRICS_ENABLED="true"
OTEL_SERVICE_NAME="semantic-code-search-indexer"
OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
# Optional: separate endpoints for logs and metrics
# OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://otel-collector:4318/v1/logs"
# OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="http://otel-collector:4318/v1/metrics"
# Optional: authentication headers
# OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer your-token"
# Optional: metric export interval (default: 60000ms = 60s)
# SCS_IDXR_OTEL_METRIC_EXPORT_INTERVAL_MILLIS="60000"
# Application Configuration
# Base directory where all queue databases will be stored.
# Each repository gets its own SQLite queue at SCS_IDXR_QUEUE_BASE_DIR/<repo-name>/queue.db
SCS_IDXR_QUEUE_BASE_DIR="/var/lib/indexer/queues"
# GitHub token (required for private repositories)
GITHUB_TOKEN="ghp_YourToken"
```

We will use cron, a standard time-based job scheduler, to run the indexer periodically. The unified `npm run index` command handles both scanning and indexing in a single operation.
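Before scheduling anything, it is worth confirming that the VM can reach Elasticsearch with these credentials. A minimal check, assuming standard API-key authentication:

```bash
# Export the .env values into this shell, then ping the cluster root endpoint;
# a JSON body with the cluster name indicates the credentials work
set -a; source /opt/semantic-code-search-indexer/.env; set +a
curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" "$ELASTICSEARCH_ENDPOINT"

# The file contains secrets, so keep it readable only by the service user
chmod 600 /opt/semantic-code-search-indexer/.env
```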
- **Open the Crontab**: Open the crontab file for the current user for editing.

  ```bash
  crontab -e
  ```

- **Add the Cron Job**: Add the following line to the end of the file. This configuration will run the indexer every 10 minutes for multiple repositories.

  ```bash
  */10 * * * * cd /opt/semantic-code-search-indexer && /usr/bin/flock -n /tmp/indexer.lock npm run index -- /var/lib/indexer/repos/repo-one:repo-one-index /var/lib/indexer/repos/repo-two:repo-two-index --pull >> /opt/semantic-code-search-indexer/indexer.log 2>&1
  ```
Command Breakdown:

- `*/10 * * * *`: The schedule, meaning "at every 10th minute."
- `cd /opt/semantic-code-search-indexer`: Change to the project directory.
- `/usr/bin/flock -n /tmp/indexer.lock`: Crucial for reliability. It ensures that only one instance of the indexer runs at a time; if a previous run is still active, the `-n` flag makes the new invocation exit immediately instead of waiting, preventing resource contention and potential data corruption.
- `npm run index -- <repos...>`: The unified index command that handles both scanning and indexing in one pass.
- `/var/lib/indexer/repos/repo-one:repo-one-index`: Repository path with a custom index name.
- `--pull`: Runs `git pull` before indexing to fetch the latest changes.
- `GITHUB_TOKEN`: GitHub token for private repositories (optional; set via `.env` or the environment).
- `>> /opt/semantic-code-search-indexer/indexer.log 2>&1`: Redirects all output (both standard output and standard error) to a log file within the project directory. You must ensure this file is writable by the user running the cron job.
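To see the locking behavior in isolation, hold the lock in one shell and attempt it from another; with `-n`, the second invocation fails fast instead of queueing:

```bash
# Shell 1: hold the lock for 60 seconds
flock /tmp/indexer.lock sleep 60 &

# Shell 2: -n (non-blocking) exits with status 1 instead of waiting for the lock
flock -n /tmp/indexer.lock echo "got the lock" || echo "another run is active"
```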
For watch mode (continuous indexing): If you prefer to run the indexer as a long-running process that watches for changes, use the `--watch` flag and run it as a systemd service instead of a cron job:

```bash
npm run index -- /var/lib/indexer/repos/repo-one:repo-one-index --watch
```
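A minimal unit sketch for watch mode, reusing the paths from this guide; the unit name is a placeholder, not part of the project:

```ini
# /etc/systemd/system/code-indexer.service (hypothetical unit name)
[Unit]
Description=Semantic Code Search Indexer (watch mode)
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/semantic-code-search-indexer
EnvironmentFile=/opt/semantic-code-search-indexer/.env
ExecStart=/usr/bin/npm run index -- /var/lib/indexer/repos/repo-one:repo-one-index --watch
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl daemon-reload && sudo systemctl enable --now code-indexer`.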
- **Save and Exit**: Save the file and exit your editor. `cron` will automatically install the new job.
- **Build the Project**: Ensure the project is fully built by running `npm install` and `npm run build`.

- **Check the Status**: You can check that your cron job is installed by running:

  ```bash
  crontab -l
  ```
  After the next 10-minute interval, you can check the log file for output:

  ```bash
  tail -f /opt/semantic-code-search-indexer/indexer.log
  ```
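Rather than waiting for the first scheduled run, you can trigger one manually with the same locking the cron entry uses:

```bash
# One-off run, end to end, using the same lock file as the cron job
cd /opt/semantic-code-search-indexer && \
  /usr/bin/flock -n /tmp/indexer.lock npm run index -- \
    /var/lib/indexer/repos/repo-one:repo-one-index \
    /var/lib/indexer/repos/repo-two:repo-two-index --pull
```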
For production monitoring, deploy an OpenTelemetry Collector to receive logs and metrics from the indexer. The collector buffers telemetry data, adds resource attributes, and exports to Elasticsearch.
- **Download and Install OpenTelemetry Collector:**

  ```bash
  # Download the latest collector-contrib release
  wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.91.0/otelcol-contrib_0.91.0_linux_amd64.tar.gz
  tar -xzf otelcol-contrib_0.91.0_linux_amd64.tar.gz
  sudo mv otelcol-contrib /usr/local/bin/
  ```
- **Create Collector Configuration:**

  Copy the example collector configuration from the repository:

  ```bash
  sudo mkdir -p /etc/otelcol
  sudo cp /opt/semantic-code-search-indexer/docs/otel-collector-config.yaml /etc/otelcol/config.yaml
  ```

  Edit the configuration to add your Elasticsearch credentials:

  ```bash
  sudo nano /etc/otelcol/config.yaml
  ```
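  Before wiring the collector into systemd, you can have the binary validate the file; the `validate` subcommand is available in recent collector releases:

  ```bash
  # Fails with a descriptive error if the YAML or any component configuration is invalid
  otelcol-contrib validate --config=/etc/otelcol/config.yaml
  ```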
- **Create systemd Service for Collector:**

  Create `/etc/systemd/system/otelcol.service`:

  ```ini
  [Unit]
  Description=OpenTelemetry Collector
  After=network.target

  [Service]
  Type=simple
  ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/config.yaml
  Restart=on-failure
  RestartSec=30
  Environment="ELASTICSEARCH_ENDPOINT=https://your-es-endpoint.elastic-cloud.com:9243"
  Environment="ELASTICSEARCH_API_KEY=YourEncodedApiKey"

  [Install]
  WantedBy=multi-user.target
  ```
- **Start the Collector:**

  ```bash
  sudo systemctl daemon-reload
  sudo systemctl enable otelcol
  sudo systemctl start otelcol
  sudo systemctl status otelcol
  ```
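To confirm the collector is ready to receive traffic, check that the OTLP/HTTP port from the `.env` above (4318) is listening, and watch the service logs for exporter errors:

```bash
# 4318 is the OTLP/HTTP port used by OTEL_EXPORTER_OTLP_ENDPOINT above
sudo ss -ltnp | grep 4318

# Follow collector logs; failures exporting to Elasticsearch will surface here
journalctl -u otelcol -f
```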
The indexer exports telemetry to the following Elasticsearch data streams:
- Logs: `logs-semanticcode.otel-default`
- Metrics: `metrics-semanticcode.otel-default`

These follow Elasticsearch's data stream naming conventions:

- `logs-*` / `metrics-*`: Type prefix
- `semanticcode.otel`: Dataset identifier
- `default`: Namespace
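Once telemetry is flowing, the data streams should be visible through Elasticsearch's `_data_stream` API, reusing the credentials from the indexer's `.env`:

```bash
# Lists the indexer's data streams once the collector has written documents
curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" \
  "$ELASTICSEARCH_ENDPOINT/_data_stream/*semanticcode.otel*"
```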
All logs and metrics include `repo.name` and `repo.branch` attributes, enabling repository-specific monitoring:
- **Access Kibana Discover**: Navigate to Kibana → Discover.

- **Filter by Repository**:

  ```
  repo.name: "kibana" AND repo.branch: "main"
  ```
- **Create Visualizations**: Use Kibana's visualization tools to create charts for:
  - Files processed over time: `parser.files.processed` metric
  - Queue depth: `queue.size.pending` gauge
  - Batch processing duration: `indexer.batch.duration` histogram
  - Indexing failures: `indexer.batch.failed` counter
- **Build Dashboards**: Combine visualizations into repository-specific dashboards:
  - Overview dashboard: All repositories
  - Per-repository dashboards: Filter by `repo.name`
  - Per-branch dashboards: Filter by `repo.name` and `repo.branch`
Set up alerts in Kibana for:
- High failure rate: `indexer.batch.failed` / `indexer.batch.processed` > 0.1
- Queue backlog: `queue.size.pending` > 10000
- Stale processing: No `parser.files.processed` metrics in the last hour
| Metric | What to Watch | Alert Threshold |
|---|---|---|
| `parser.files.processed` | Files indexed per minute | < 10/min may indicate issues |
| `queue.size.pending` | Documents waiting to be indexed | > 10000 indicates backlog |
| `indexer.batch.failed` | Failed batch operations | > 5% failure rate |
| `indexer.batch.duration` | Indexing performance | P95 > 30s may indicate ES issues |