Intranet Crawler

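The Intranet Crawler is a self-hosted MCP server that indexes your organization’s intranet and makes its content searchable for AI assistants. The crawled pages and the search index stay in your own infrastructure; the only content sent to an external service is plain text chunks, which go to the embedding API during indexing.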

The service operates as two separate processes: a background crawl that periodically fetches and indexes pages from your intranet, and a query interface (MCP server) that Intric AI calls when a user’s question requires searching the indexed content.

How content is indexed

The crawl runs entirely inside your own Kubernetes cluster on a schedule. No user data or conversation history is involved. The only data that leaves the pod during indexing is plain text chunks sent to the external embedding API to be converted into search vectors.

All data in transit is protected by TLS 1.2 or higher.

Step 1 — Scheduled crawl trigger

APScheduler triggers crawls automatically: a full crawl every night at 02:00 UTC and an incremental update every 30 minutes.

No user data is involved. The scheduler runs inside the Kubernetes pod and communicates with the crawler via an internal asyncio event loop.
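To make the schedule concrete, here is a minimal sketch that computes when the next full and incremental crawls are due. It uses only the Python standard library rather than APScheduler, and the function names are illustrative, not taken from the service. It also assumes that incremental runs are counted from the previous run, which may differ from how the service aligns them.

```python
from datetime import datetime, timedelta, timezone

FULL_CRAWL_HOUR_UTC = 2                      # full crawl at 02:00 UTC
INCREMENTAL_INTERVAL = timedelta(minutes=30)  # incremental update interval


def next_full_crawl(now: datetime) -> datetime:
    """Return the first 02:00 UTC that falls strictly after `now`."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(
        hour=FULL_CRAWL_HOUR_UTC, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def next_incremental(last_run: datetime) -> datetime:
    """Return when the next incremental update is due (assumed: last run + 30 min)."""
    return last_run.astimezone(timezone.utc) + INCREMENTAL_INTERVAL
```

For example, a pod that starts at 03:00 UTC would wait until 02:00 UTC the following day for its first full crawl, while incremental updates continue every 30 minutes in between.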

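The remaining four steps, in order: the crawler fetches pages from the intranet (Step 2), text is extracted and chunked internally (Step 3), the text chunks are sent to the Embedding API (Step 4), and the vectors and metadata are stored in ChromaDB (Step 5).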

How queries are handled

When a user asks a question that requires searching the intranet, Intric AI calls the intranet crawler’s MCP server. Intric always acts as the intermediary — the language model never contacts the MCP server directly.

All data in transit is protected by TLS 1.2 or higher.

Step 1 — User interacts with Intric in the browser

The user writes a message to an assistant that has the intranet crawler MCP tool configured.

Data sent to Intric’s server:

The user’s message
Chat history
Any attached files

Step 2 — Intric calls the MCP server

Intric AI determines that a search of the intranet is needed and calls the MCP server’s search_intranet tool. The call is authenticated with a JWT bearer token signed using HS256.

Data sent from Intric’s server to the MCP server:

The search query (a natural language string generated by the language model)
JWT bearer token (in the Authorization header)

The user’s identity — name, email, IP address — is not included in the search call. The MCP server verifies the JWT using a shared secret stored in a Kubernetes Secret.
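As an illustration of shared-secret verification, the sketch below signs and verifies an HS256 token using only the Python standard library. It is not the service's actual code: the function names and the claims in the usage example are assumptions, and a production deployment would more likely rely on an established JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    # JWT uses base64url without trailing "=" padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # Restore the padding that JWT strips before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token (the calling side of the exchange)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        signature = _b64url_decode(signature_b64)
        expected = _signature(f"{header_b64}.{payload_b64}", secret)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm so a token claiming "none" or another alg is rejected.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    # Parse the claims only after the signature has been checked.
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if not isinstance(claims, dict) or ("exp" in claims and claims["exp"] < current):
        raise ValueError("token expired or claims invalid")
    return claims
```

A token created with `sign_hs256({"iss": "intric", "exp": 2000}, secret)` verifies with the same secret while `now` is before the expiry, and is rejected with any other secret. The `iss` and `exp` claims are hypothetical; the point is that nothing in the token needs to identify the user.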

What happens inside the MCP server:

The query is converted to an embedding vector using the same embedding API used during indexing
ChromaDB performs a cosine similarity search and returns the most relevant text chunks
Any user-submitted corrections for this query are prioritized in the results
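The ranking described above can be sketched in plain Python. This is an illustration only: in the service, ChromaDB performs the similarity search, and the way corrections are prioritized is not documented here. The `is_correction` flag and the fixed `CORRECTION_BOOST` are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List

CORRECTION_BOOST = 0.2  # hypothetical bonus added to user-submitted corrections


@dataclass
class Chunk:
    text: str
    vector: List[float]
    is_correction: bool = False  # hypothetical marker for a user correction


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: List[float], chunks: List[Chunk], top_k: int = 3) -> List[Chunk]:
    """Return the top_k chunks, ranked by similarity plus any correction boost."""

    def score(chunk: Chunk) -> float:
        boost = CORRECTION_BOOST if chunk.is_correction else 0.0
        return cosine_similarity(query_vector, chunk.vector) + boost

    return sorted(chunks, key=score, reverse=True)[:top_k]
```

With a boost of this kind, a relevant correction can outrank a slightly closer ordinary chunk, while a correction unrelated to the query still ranks low.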

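The remaining two steps, in order: the results are returned to Intric (Step 3), and the user sees the response in the browser (Step 4).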

Data sharing and privacy

The intranet crawler is designed so that your organization’s intranet content stays within your own infrastructure. Below is a summary of what is sent to external services and what remains internal.

Crawl — Embedding API

The only external service that receives data during indexing is the embedding API. Only plain text chunks are sent — no URLs, page titles, or any metadata accompany the chunks.

Sent to the Embedding API:

Plain text chunks extracted from intranet pages (no identity metadata)

Not sent to the Embedding API:

Source URLs and page titles
Intranet login credentials
Any user or conversation data from Intric
Personal data about users in Intric: name, email, IP address, organization affiliation
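One way to enforce this separation is to build the embedding request from the text field alone and re-attach metadata locally once the vectors come back. The sketch below assumes a request body with `model` and `input` fields, which is common among embedding APIs but is not confirmed for the API used here. The class and function names and the model name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexedChunk:
    text: str
    source_url: str   # stays inside the cluster
    page_title: str   # stays inside the cluster


def build_embedding_request(chunks: List[IndexedChunk],
                            model: str = "example-embedding-model") -> Dict:
    """Build the outbound payload; only the plain text is copied into it."""
    return {"model": model, "input": [chunk.text for chunk in chunks]}


def attach_vectors(chunks: List[IndexedChunk],
                   vectors: List[List[float]]) -> List[Dict]:
    """Re-join vectors with local metadata by position, after the API responds."""
    if len(chunks) != len(vectors):
        raise ValueError("embedding API returned an unexpected number of vectors")
    return [
        {"vector": vector, "text": chunk.text,
         "source_url": chunk.source_url, "page_title": chunk.page_title}
        for chunk, vector in zip(chunks, vectors)
    ]
```

Because the payload is constructed from `chunk.text` only, URLs and titles cannot reach the external API by accident; they are joined back in only for local storage.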

Query — MCP Server

The MCP server is hosted in your own Kubernetes cluster — it is not a third-party service. Only the search query and a JWT for authentication are sent from Intric to the MCP server.

Sent to the MCP server:

The search query (natural language string generated by the language model)
JWT bearer token (for authentication; contains no personal data about the user)

Not sent to the MCP server:

The user’s original prompt in full
Chat history
Attached files
Personal data about the user in Intric interacting with the assistant, provided it does not appear in the message to the assistant: name, email, IP address, organization affiliation
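As a rough picture of what such a call can look like, the sketch below builds an MCP `tools/call` request in JSON-RPC 2.0 form. The tool name `search_intranet` comes from this page. The argument name `query` and the helper function are assumptions; the actual tool schema may differ.

```python
import json
from typing import Dict, Tuple


def build_search_call(search_query: str, jwt_token: str,
                      request_id: int = 1) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for a search call carrying only query and token."""
    headers = {
        "Authorization": f"Bearer {jwt_token}",
        "Content-Type": "application/json",
    }
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_intranet",
            # "query" is an assumed argument name for this illustration.
            "arguments": {"query": search_query},
        },
    }
    return headers, json.dumps(body)
```

The function accepts nothing except the generated query and the token, so the original prompt, chat history, and attachments have no path into the request.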

Hosting, authentication and credentials

Hosting

MCP servers hosted by Intric run in Sweden on infrastructure from the subprocessor Glesys AB. Custom-developed MCP servers deployed in a customer’s environment run in that environment.

When hosted in a customer-specific instance, each MCP server runs in its own Kubernetes pod with dedicated infrastructure, dedicated databases, and a strict set of rules governing what it can and cannot access. MCP servers are logically isolated from one another: one server cannot reach another unless an explicit connection between them is established.

Authentication and credentials

All secrets used by the intranet crawler are stored in Kubernetes Secrets and made available to the pod at startup, either as environment variables or, for the Fernet encryption key, as a mounted Secret volume. They are never exposed to users or the browser.

MCP server: JWT (HS256) using a shared secret (MCP_SERVER_JWT_SECRET). Intric AI and the MCP server share this secret to authenticate every search call.
Intranet login: If the intranet requires a login, the form-based login credentials are encrypted with Fernet (AES-128-CBC with HMAC-SHA256 authentication) and stored on a Persistent Volume. The encryption key is mounted as a Kubernetes Secret volume.
Embedding API: The API key (EMBEDDING_API_KEY) is stored as a Kubernetes Secret.
Admin interface: The web-based admin panel (/admin/) is protected with HTTP Basic Auth. An IP allowlist (ALLOWED_IPS) restricts which addresses can reach the MCP server endpoint.
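The IP allowlist check can be sketched with Python's standard `ipaddress` module. The format of `ALLOWED_IPS` is not specified on this page; the example assumes a comma-separated list of addresses and CIDR ranges, and the function names are illustrative.

```python
import ipaddress
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_allowlist(raw: str) -> List[Network]:
    """Parse an assumed comma-separated ALLOWED_IPS value into networks."""
    return [
        # strict=False accepts entries such as 10.0.0.5/8 by masking host bits.
        ipaddress.ip_network(item.strip(), strict=False)
        for item in raw.split(",")
        if item.strip()
    ]


def is_allowed(client_ip: str, allowlist: List[Network]) -> bool:
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # an unparseable address is rejected, not allowed by default
    return any(address in network for network in allowlist)
```

A single address such as `192.168.1.5` is treated as a one-host network, so individual hosts and whole ranges can be mixed in the same setting.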

Data retention and deletion

Conversation history from conversations in which the intranet crawler MCP tool has been used follows the same deletion rules as conversations with other assistants.

Data Retention

The crawled content and vector index are stored on the Persistent Volume within your Kubernetes cluster. Pages that have not changed since the last crawl (detected via ETag/Last-Modified headers) are skipped. To clear the index, reset the ChromaDB collection via the admin interface or delete the PersistentVolumeClaim (PVC).
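Change detection with ETag and Last-Modified normally relies on HTTP conditional requests: the crawler sends the validators it stored last time, and the server answers 304 Not Modified if the page is unchanged. The sketch below shows that logic under assumed names. How the crawler handles status codes other than 200 and 304 is not documented here, so the `retry-later` outcome is a placeholder.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers from the validators saved during the last crawl."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def crawl_action(status_code: int) -> str:
    """Map the response status to what the crawler does with the page."""
    if status_code == 304:
        return "skip"        # server confirmed the stored copy is current
    if status_code == 200:
        return "reindex"     # new or changed content was returned
    return "retry-later"     # assumed handling for errors and other statuses
```

A page crawled for the first time has no stored validators, so `conditional_headers(None, None)` returns an empty dictionary and the page is fetched unconditionally.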

Where the audit log is enabled, administrators can use it to monitor how the service is used.

