Intranet Crawler
The Intranet Crawler is a self-hosted MCP server that indexes your organization’s intranet and makes its content searchable for AI assistants — without any intranet content leaving your own infrastructure.
The service operates as two separate processes: a background crawl that periodically fetches and indexes pages from your intranet, and a query interface (MCP server) that Intric AI calls when a user’s question requires searching the indexed content.
How content is indexed
The crawl runs on a schedule, entirely inside your own Kubernetes cluster. No user data or conversation history is involved. The only data sent to an external service during indexing is plain text chunks, which go to the external embedding API to be converted into search vectors.
All data in transit is protected by TLS 1.2 or higher.
Step 1 — Scheduled crawl trigger
APScheduler triggers crawls automatically: a full crawl every night at 02:00 UTC and an incremental update every 30 minutes.
No user data is involved. The scheduler runs inside the Kubernetes pod and communicates with the crawler via an internal asyncio event loop.
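The timing rule above can be sketched with the standard library alone. This is only an illustration of the schedule, not the service's APScheduler configuration; the function and constant names are hypothetical, and the functions assume timezone-aware datetimes.

```python
from datetime import datetime, time, timedelta, timezone

# Values taken from the text above; the names are hypothetical.
FULL_CRAWL_TIME_UTC = time(hour=2, minute=0)
INCREMENTAL_INTERVAL = timedelta(minutes=30)


def next_full_crawl(now: datetime) -> datetime:
    """Return the first 02:00 UTC instant strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(
        now_utc.date(), FULL_CRAWL_TIME_UTC, tzinfo=timezone.utc
    )
    if candidate <= now_utc:
        # 02:00 has already passed today, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


def next_incremental(last_run: datetime) -> datetime:
    """Return when the next incremental update is due after `last_run`."""
    return last_run + INCREMENTAL_INTERVAL
```

In a real deployment the scheduler library owns this logic; the sketch only makes the two cadences explicit.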
Step 2 — Crawler fetches pages from the intranet
The Playwright-based crawler sends HTTP requests to your intranet. JavaScript-rendered pages are handled using headless Chromium; simpler pages use a lightweight async HTTP client.
Data sent to the intranet:
- HTTP GET requests (no user data from Intric)
Authentication: If the intranet requires a login, form-based credentials are used. These credentials are stored on a Persistent Volume within your cluster, encrypted with Fernet (AES-128-CBC). They are not shared with any external service.
Data returned from the intranet:
- HTML pages and linked documents (PDF, DOCX)
- Change-detection headers (ETag, Last-Modified) used to skip unchanged pages
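One common way to use these headers is a conditional GET: the crawler sends back the validators it saw last time, and the server answers 304 Not Modified if nothing changed. The sketch below assumes that approach; the source does not say how the crawler compares the headers, and `CachedPage`, `conditional_headers`, and `should_skip` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedPage:
    """Validators remembered from the previous crawl (hypothetical structure)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: CachedPage) -> dict[str, str]:
    """Build request headers that let the server reply 304 Not Modified."""
    headers: dict[str, str] = {}
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def should_skip(status_code: int) -> bool:
    """A 304 response means the page is unchanged and need not be re-indexed."""
    return status_code == 304
```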
Step 3 — Text extraction and chunking (internal)
The crawler extracts readable text from each fetched page:
- Navigation, headers, footers, and sidebars are removed using CSS selectors
- PDFs are parsed with pypdf/pdfplumber; DOCX files with python-docx
- Pages with fewer than 200 characters are not indexed
- Text is split into overlapping chunks of approximately 700 characters (150-character overlap), with a maximum of 30 chunks per page
This step runs entirely in-process — no data leaves the crawler.
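The chunking parameters listed above can be illustrated with a fixed-width splitter. This is a minimal sketch, assuming plain character offsets; the real crawler targets approximately 700 characters and may split on more natural boundaries. The function and constant names are hypothetical.

```python
# Values from the list above; the names are hypothetical.
MIN_PAGE_CHARS = 200
CHUNK_SIZE = 700
CHUNK_OVERLAP = 150
MAX_CHUNKS_PER_PAGE = 30


def chunk_text(text: str) -> list[str]:
    """Split page text into overlapping chunks, or return [] for short pages."""
    text = text.strip()
    if len(text) < MIN_PAGE_CHARS:
        return []  # pages under the minimum length are not indexed
    step = CHUNK_SIZE - CHUNK_OVERLAP  # each chunk starts 550 characters later
    chunks: list[str] = []
    start = 0
    while start < len(text) and len(chunks) < MAX_CHUNKS_PER_PAGE:
        chunks.append(text[start:start + CHUNK_SIZE])
        if start + CHUNK_SIZE >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

With these values, a 1,500-character page produces three chunks, and the last 150 characters of each chunk repeat at the start of the next one.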
Step 4 — Text chunks sent to the Embedding API
The crawler sends batches of up to 5 text chunks per request to the external embedding API, which converts each chunk into a numerical vector (float32) that enables semantic search.
Data sent to the Embedding API:
- Plain text chunks (no user identity, no conversation history, no URLs or page titles)
Data returned:
- Float32 embedding vectors, one per chunk
The Embedding API authenticates the request using a server-side API key stored in a Kubernetes Secret — it is never exposed to users or the browser.
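The batching described in this step can be sketched as follows. The request body shape (`{"input": [...]}`) is an assumption made for illustration, since the source does not specify the embedding API's format; `batched` and `build_payload` are hypothetical names. The point of the sketch is that the payload carries text only.

```python
from typing import Iterator

EMBED_BATCH_SIZE = 5  # from the text: up to 5 chunks per request


def batched(chunks: list[str], size: int = EMBED_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` chunks."""
    for i in range(0, len(chunks), size):
        yield chunks[i:i + size]


def build_payload(batch: list[str]) -> dict:
    """Assumed request body: chunk text only, with no URL, title, or user data."""
    return {"input": list(batch)}
```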
Step 5 — Vectors and metadata stored in ChromaDB
The embedding vectors and associated metadata are written to ChromaDB, which runs inside the Kubernetes pod.
Data stored (on your own infrastructure):
- Embedding vector
- Source URL
- Page title
- Text chunk
- Timestamp
- Source tag
Data is written to a Persistent Volume Claim (PVC) on your cluster. It does not leave your infrastructure. ChromaDB uses SQLite for metadata and an HNSW approximate nearest-neighbor index for vector search.
How queries are handled
When a user asks a question that requires searching the intranet, Intric AI calls the intranet crawler’s MCP server. Intric always acts as the intermediary: the language model never contacts the MCP server directly.
All data in transit is protected by TLS 1.2 or higher.
Step 1 — User interacts with Intric in the browser
The user writes a message to an assistant that has the intranet crawler MCP tool configured.
Data sent to Intric’s server:
- The user’s message
- Chat history
- Any attached files
Step 2 — Intric calls the MCP server
Intric AI determines that an intranet search is needed and calls the MCP server’s search_intranet tool. The call is authenticated using a JWT bearer token (HS256).
Data sent from Intric’s server to the MCP server:
- The search query (a natural language string generated by the language model)
- JWT bearer token (in the Authorization header)
The user’s identity — name, email, IP address — is not included in the search call. The MCP server verifies the JWT using a shared secret stored in a Kubernetes Secret.
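To make the HS256 check concrete, here is a minimal standard-library sketch of signing and verifying a token with a shared secret. It is illustrative only: the claim names are hypothetical, the source does not describe which claims the service uses, and production code would normally rely on a maintained JWT library rather than hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Create a compact JWT signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if the signature is valid; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if "exp" in claims and claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Because HS256 is symmetric, anyone holding the shared secret can both issue and verify tokens, which is why the secret is kept in a Kubernetes Secret on both sides.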
What happens inside the MCP server:
- The query is converted to an embedding vector using the same embedding API used during indexing
- ChromaDB performs a cosine similarity search and returns the most relevant text chunks
- Any user-submitted corrections for this query are prioritized in the results
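The similarity step can be illustrated with a brute-force cosine ranking. ChromaDB itself uses an approximate HNSW index rather than scanning every vector, so this sketch only shows how the score is computed; the record layout and function names are hypothetical, and the handling of user-submitted corrections is not shown.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector: list[float], index: list[dict], k: int = 3) -> list[dict]:
    """Return the k records whose 'vector' is most similar to the query."""
    ranked = sorted(
        index,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    return ranked[:k]
```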
Step 3 — Results returned to Intric
The MCP server returns the search results to Intric’s server.
Data sent from the MCP server to Intric:
- Formatted text chunks with source URLs and page titles
Intric forwards these results to the assistant’s selected language model, which uses them alongside the user’s original question to formulate a response.
Step 4 — User sees the response in the browser
The response is displayed to the user in the Intric browser interface.
Data stored on Intric’s servers:
- The generated response and the history from the user’s interaction with the assistant (according to the assistant’s deletion settings)
- Metadata about tool calls and results in the conversation history
Conversation history and related metadata stored in Intric are protected at rest by infrastructure-level database encryption.
Data sharing and privacy
The intranet crawler is designed so that your organization’s intranet content stays within your own infrastructure. Below is a summary of what is sent to external services and what remains internal.
Crawl — Embedding API
The only external service that receives data during indexing is the embedding API. Only plain text chunks are sent; no URLs, page titles, or other metadata accompany them.
| Sent to the Embedding API | Not sent to the Embedding API |
|---|---|
| Plain text chunks | User identity, conversation history, URLs, page titles, and other metadata |
Query — MCP Server
The MCP server is hosted in your own Kubernetes cluster; it is not a third-party service. Only the search query and a JWT for authentication are sent from Intric to the MCP server.
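| Sent to the MCP server | Not sent to the MCP server |
|---|---|
| The search query and the JWT bearer token (in the Authorization header) | The user’s identity (name, email, IP address) |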
Hosting, authentication and credentials
Hosting
MCP servers hosted by Intric run in Sweden, using the subprocessor Glesys AB. Custom-developed MCP servers deployed in a customer’s environment run in that environment.
When hosted in a customer-specific instance, each MCP server runs in its own Kubernetes pod with dedicated infrastructure, databases, and a strict set of rules governing what it can and cannot access. MCP servers are logically isolated from one another: one server cannot reach another unless an explicit connection between them is established.
Authentication and credentials
All secrets used by the intranet crawler are stored in Kubernetes Secrets and injected as environment variables at pod startup. They are never exposed to users or the browser.
- MCP server: JWT (HS256) using a shared secret (MCP_SERVER_JWT_SECRET). Intric AI and the MCP server share this secret to authenticate every search call.
- Intranet login: If the intranet requires a login, form-based credentials are stored on a Persistent Volume encrypted with Fernet (AES-128-CBC). The encryption key is mounted as a Kubernetes Secret volume.
- Embedding API: The API key (EMBEDDING_API_KEY) is stored as a Kubernetes Secret.
- Admin interface: The web-based admin panel (/admin/) is protected with HTTP Basic Auth. An IP allowlist (ALLOWED_IPS) restricts which addresses can reach the MCP server endpoint.
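An IP allowlist check of this kind can be sketched with Python's ipaddress module. The comma-separated format assumed for ALLOWED_IPS, the support for CIDR ranges, and the function names are all assumptions for illustration; the source does not document the variable's syntax.

```python
import ipaddress


def parse_allowlist(raw: str) -> list:
    """Parse a comma-separated string of IPs or CIDR ranges (assumed format)."""
    return [
        ipaddress.ip_network(entry.strip(), strict=False)
        for entry in raw.split(",")
        if entry.strip()
    ]


def is_allowed(client_ip: str, allowlist: list) -> bool:
    """Return True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowlist)
```

A single address such as 192.168.1.5 is treated as a one-host network, so exact addresses and ranges can be mixed in the same list.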
Data retention and deletion
Conversation history in which the intranet crawler MCP tool has been used follows the same deletion rules as conversation history for other assistants.
The crawled content and vector index are stored on the Persistent Volume within your Kubernetes cluster. Pages that have not changed since the last crawl (detected via ETag/Last-Modified headers) are skipped. To clear the index, the ChromaDB collection can be reset via the admin interface or by removing the PVC.
Administrators can monitor how the service is used via the audit log where enabled.