
Intranet Crawler

The Intranet Crawler is a self-hosted MCP server that indexes your organization’s intranet and makes its content searchable for AI assistants — without any intranet content leaving your own infrastructure.

The service operates as two separate processes: a background crawl that periodically fetches and indexes pages from your intranet, and a query interface (MCP server) that Intric AI calls when a user’s question requires searching the indexed content.

The crawl runs entirely inside your own Kubernetes cluster on a schedule. No user data or conversation history is involved. The only data that leaves the pod during indexing is plain text chunks sent to the external embedding API to be converted into search vectors.

All data in transit is protected by TLS 1.2 or higher.

APScheduler triggers a crawl automatically — a full crawl every night at 02:00 UTC and an incremental update every 30 minutes.

No user data is involved. The scheduler runs inside the Kubernetes pod and communicates with the crawler via an internal asyncio event loop.
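To make the two cadences concrete, the following is a minimal standard-library sketch of when the next runs would fall. It is not the service's APScheduler configuration: the function names `next_full_crawl` and `next_incremental` are hypothetical, and measuring the 30-minute interval from the previous run is an assumption.

```python
from datetime import datetime, time, timedelta, timezone


def next_full_crawl(now: datetime) -> datetime:
    """Return the next 02:00 UTC that lies strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(2, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's 02:00 has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


def next_incremental(last_run: datetime) -> datetime:
    """Return the next incremental run, assumed to be 30 minutes after the last."""
    return last_run + timedelta(minutes=30)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("next full crawl:", next_full_crawl(now).isoformat())
    print("next incremental:", next_incremental(now).isoformat())
```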


When a user asks a question that requires searching the intranet, Intric AI calls the intranet crawler’s MCP server. Intric always acts as the intermediary — the language model never contacts the MCP server directly.

All data in transit is protected by TLS 1.2 or higher.

Step 1 — User interacts with Intric in the browser


The user writes a message to an assistant that has the intranet crawler MCP tool configured.

Data sent to Intric’s server:

  • The user’s message
  • Chat history
  • Any attached files

The intranet crawler is designed so that your organization’s intranet content stays within your own infrastructure. Below is a summary of what is sent to external services and what remains internal.

The only external service that receives data during indexing is the embedding API. Only plain text chunks are sent — no URLs, page titles, or any metadata accompany the chunks.

Sent to the Embedding API:

  • Plain text chunks extracted from intranet pages (no identity metadata)

Not sent to the Embedding API:

  • Source URLs and page titles
  • Intranet login credentials
  • Any user or conversation data from Intric
  • Personal data about users in Intric:

    • Name
    • Email
    • IP address
    • Organization affiliation
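The separation between chunk text and page metadata can be illustrated with a short sketch. Everything named here is hypothetical (the `Page` record, `chunk_text`, `embedding_payload`, the `"input"` field, and the chunk size); the actual chunking strategy and request format of the crawler are not described in this document.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """A crawled page as it might be held inside the cluster (hypothetical shape)."""
    url: str
    title: str
    text: str


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on whitespace into chunks of roughly max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks


def embedding_payload(page: Page) -> dict:
    """Build the request body for the embedding API from the page text only.

    The URL and title stay behind with the local index; they are never copied
    into the payload.
    """
    return {"input": chunk_text(page.text)}
```

In this sketch the mapping from each chunk back to its source URL would be kept locally, next to the vector index, so search results can still be attributed to a page without the embedding API ever seeing where the text came from.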

The MCP server is hosted in your own Kubernetes cluster — it is not a third-party service. Only the search query and a JWT for authentication are sent from Intric to the MCP server.

Sent to the MCP server:

  • The search query (natural language string generated by the language model)
  • JWT bearer token (for authentication; contains no user personal data)

Not sent to the MCP server:

  • The user’s original prompt in full
  • Chat history
  • Attached files
  • Personal data about the user in Intric interacting with the assistant, provided it does not appear in the message to the assistant:

    • Name
    • Email
    • IP address
    • Organization affiliation

MCP servers hosted by Intric are hosted in Sweden using the subprocessor Glesys AB. Custom-developed MCP servers deployed in a customer’s environment run in that environment.

When hosted in a customer-specific instance, each MCP server runs in its own Kubernetes pod with dedicated infrastructure, databases, and a strict set of rules governing what it can and cannot access. MCP servers are logically isolated from one another: one server cannot reach another unless an explicit connection between them is established.

All secrets used by the intranet crawler are stored in Kubernetes Secrets and injected as environment variables at pod startup — they are never exposed to users or the browser.

  • MCP server: JWT (HS256) using a shared secret (MCP_SERVER_JWT_SECRET). Intric AI and the MCP server share this secret to authenticate every search call.
  • Intranet login: If the intranet requires a login, form-based credentials are stored on a Persistent Volume encrypted with Fernet (AES-128-CBC). The encryption key is mounted as a Kubernetes Secret volume.
  • Embedding API: The API key (EMBEDDING_API_KEY) is stored as a Kubernetes Secret.
  • Admin interface: The web-based admin panel (/admin/) is protected with HTTP Basic Auth. An IP allowlist (ALLOWED_IPS) restricts which addresses can reach the MCP server endpoint.

Conversation history in which the intranet crawler MCP tool has been used follows the same deletion rules as conversations with other assistants.

The crawled content and vector index are stored on the Persistent Volume within your Kubernetes cluster. Pages that have not changed since the last crawl (detected via ETag/Last-Modified headers) are skipped. To clear the index, the ChromaDB collection can be reset via the admin interface or by removing the PVC.
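One common way to implement the ETag/Last-Modified check is an HTTP conditional request, sketched below. This assumes the crawler stores the validators from the previous fetch and sends them back; the `CachedPage` record and both function names are hypothetical, and the service may instead compare header values after fetching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedPage:
    """Validators remembered from the previous crawl of one URL (hypothetical)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedPage]) -> dict:
    """Build request headers that ask the server to reply 304 if nothing changed."""
    headers: dict = {}
    if cached is None:
        return headers  # First visit: nothing to compare against.
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def is_unchanged(status_code: int) -> bool:
    """A 304 Not Modified response means the page can be skipped."""
    return status_code == 304
```

With this approach an unchanged page costs one small request and no re-chunking or re-embedding, which also keeps traffic to the embedding API down.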

Administrators can monitor how the service is used via the audit log, where it is enabled.