The Web Crawler connector allows Simpplr Enterprise Search to index content from public websites you designate - documentation portals, help centers, product sites, and other informational pages - so employees can discover that content in search alongside other enterprise sources.
With this connector, you can:
Centralize discovery of website documentation and help content through Enterprise Search.
Reduce manual browsing across multiple sites by surfacing titles, excerpts, and URLs in results.
Support keyword and semantic relevance using structured fields such as title, headings, body text, and URL facets (depending on how search is configured for your organization).
Content indexed by the Web Crawler connector is available in the following Simpplr Enterprise Search experiences:
Main search listing
Smart Answers (when your deployment wires crawler fields into those experiences)
Area | Summary |
Content types | Public HTML web pages reached from seed URLs; one searchable document per successfully fetched page. |
Metadata | Page title, URL, meta description (when present), headings (h1–h6), extracted links, parsed URL path segments for filtering. |
Permissions | No source ACL mapping for public sites—treat indexed pages as visible to everyone who can query the destination search experience unless your platform applies separate controls. |
Indexing | Headless Chromium crawl with JavaScript enabled; breadth-first expansion within the same site as each seed; bounded by crawl depth and maximum pages; robots.txt is respected for crawling policy. |
Filters | Scope crawling via seed URLs, depth, and max pages; advanced include/exclude rules are not currently supported (see Q4). Only same-site links are followed; external domains are not crawled from discovered links. |
Search features | Keyword and semantic/hybrid behavior depend on index mapping (see Field mapping); suggested field wiring is documented in internal Confluence (see References). |
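The breadth-first, same-site expansion summarized in the Indexing row can be sketched as follows. This is a minimal illustration, not the connector's implementation: `plan_crawl`, `same_site`, and the in-memory `link_graph` are hypothetical names, and treating "same site" as "same host" is an assumption. A real crawl would render each page in the browser instead of reading links from a dictionary.

```python
from collections import deque
from urllib.parse import urlsplit


def same_site(url: str, seed: str) -> bool:
    """Assumption: two URLs belong to the same site when their hosts match."""
    return urlsplit(url).hostname == urlsplit(seed).hostname


def plan_crawl(seed, link_graph, crawl_depth=2, max_pages=500):
    """Return the URLs a bounded breadth-first crawl would process, in order.

    link_graph is a hypothetical dict mapping a URL to the links found on it.
    """
    order = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == crawl_depth:
            continue  # depth limit reached: index the page, do not follow links
        for link in link_graph.get(url, []):
            if link not in seen and same_site(link, seed):
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With `crawl_depth=0` only the seed is returned, which matches the "0 = seeds only" behavior described for the depth setting.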
Each indexed item is a single web page (one document per URL processed in the crawl).
Indexed:
HTML pages reachable over HTTPS from your seeds, following internal links on the same host/site only.
Visible text derived from rendered HTML (titles, headings, body copy, lists/tables as the DOM exposes them).
Link URLs extracted from the page for relationship context.
Not indexed (or not supported as separate objects):
Pages that require login, SSO, MFA, or form-based authentication.
Comments, forums, feedback threads, or other user-generated modules as first-class records (only text that appears as normal page content may be captured).
Arbitrary external websites discovered via outbound links (those links may appear in metadata but are not crawled).
Guaranteed completeness on sites built entirely as heavy client-side applications—extraction quality depends on what appears in the DOM after scripts run.
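As a rough illustration of what "text and links as the DOM exposes them" means, the sketch below pulls a title, meta description, heading texts, and link URLs out of an HTML string with Python's standard `html.parser`. The class and function names are hypothetical, and the connector's own extraction (performed by the crawler engine on the rendered page) may differ in detail.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title, meta description, heading texts, and link URLs."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.headings = []
        self.links = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in self.HEADINGS:
            self._capture, self._buffer = tag, []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            # Collapse runs of whitespace so headings read as single lines.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.headings.append(text)
            self._capture = None


def extract_fields(html: str) -> dict:
    """Return the captured fields for one HTML document."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta_description": parser.meta_description,
        "headings": parser.headings,
        "links": parser.links,
    }
```

Text that scripts never write into the DOM cannot be captured this way, which is why heavily client-side sites may index incompletely.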
For each indexed page, Web Crawler captures:
Title — from the HTML <title> element when HTML is available.
URL / link — canonical page URL.
Created / modified — the connector records indexing-oriented timestamps (_timestamp, last_crawled_at); it does not ingest authoritative “last modified” from the origin server as a separate source-of-truth field.
Path / location — parsed URL components: scheme, host, port, path, and the first three path segments (url_path_dir1 … url_path_dir3) for filtering and context.
File type and size — not applicable as file attachments; body length varies by page.
Description & structure — optional <meta name="description">, ordered list of heading texts, plain-text body for snippets (body_content), and markup (body_markup: markdown when supplied by the crawler engine, otherwise HTML).
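The URL-derived fields listed above can be approximated with Python's standard `urllib.parse`. This is a sketch under assumptions: the function name `url_fields` is hypothetical, and whether the connector fills in a default port when the URL omits one is not stated here, so that behavior is assumed for illustration.

```python
from urllib.parse import urlsplit

# Assumption: fall back to the scheme's well-known port when none is given.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_fields(url: str) -> dict:
    """Split a URL into scheme, host, port, path, and up to three path segments."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        "url_port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "url_path": parts.path or "/",
    }
    # Only the first three segments become filterable fields.
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment
    return fields
```

For `https://docs.example.com/guides/install/linux/ubuntu.html`, the three segment fields would hold `guides`, `install`, and `linux`; deeper segments remain available only through `url_path`.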
Permissions are not read from the website as enterprise ACLs. Public pages are indexed without document-level security enforced from the source.
Users and groups: There is no synchronization of site memberships into Simpplr.
Public or link-shared content: Anything reachable without authentication at crawl time may be indexed; do not point seeds at sensitive paths.
Access removed on the website: The index may continue to show old content until the URL is recrawled and then updated or removed per your pipeline behavior; there is no per-user ACL shrink or expansion from the source.
Topic | Detail |
Supported sources | Public websites served over HTTPS. |
Not supported | HTTP-only seeds (non-HTTPS URLs are rejected); authenticated employee portals; crawling that violates applicable terms or robots policies. |
Runtime requirements | The connector uses Crawl4AI with a headless Chromium browser. Your deployment must allow this stack to run (memory/CPU consistent with browser automation). |
Before you begin, ensure the following:
Your organization is permitted to crawl and index the target sites (copyright, terms of use, and internal policy).
robots.txt allows crawling for the relevant paths. The connector evaluates policy using the elastic-webcrawler identity for seed URL checks; ensure your site’s robots rules align with your intent.
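To check robots rules before configuring seeds, something like the following can help. It is a minimal sketch using Python's standard `urllib.robotparser`; the `allowed_seeds` helper is hypothetical, and the connector's own evaluation may differ. The user agent string is the `elastic-webcrawler` identity mentioned above. The robots.txt text is passed in as a string so the check runs without network access.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "elastic-webcrawler"  # identity named in this guide


def allowed_seeds(robots_txt: str, seeds: list[str]) -> list[str]:
    """Return the seeds that the given robots.txt text permits for USER_AGENT."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [seed for seed in seeds if parser.can_fetch(USER_AGENT, seed)]
```

An empty result means every seed is disallowed, so the seeds or the site's robots rules need to change before a crawl can start.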
Credentials: Not required. There is no OAuth client, API key, or login for public crawling.
Outbound HTTPS from the connector runtime to every host you crawl (and to fetch robots.txt).
Allow browser automation dependencies if your environment restricts sandbox-capable workloads.
Simpplr Enterprise Search connects to websites using standard HTTPS requests and a headless browser. No credentials are stored or sent for site login.
Aspect | Detail |
Auth type | None (anonymous public fetch). |
Scopes / API permissions | N/A. |
Topic | Customer-facing guidance |
Data storage and residency | Indexed documents follow your Enterprise Search / Elasticsearch deployment’s region and tenancy choices—refer to your platform documentation. |
Encryption in transit | HTTPS to origin sites; TLS within your indexing pipeline per platform defaults. |
Encryption at rest | Determined by your search/index infrastructure. |
Permission enforcement | Query-time ACL filtering based on website permissions is not applied. Assume crawled documents are visible to all authorized search users unless separate index or app controls exist. |
Confirm target experiences are public over HTTPS.
Verify robots.txt permits crawling intended paths for the crawler.
Choose seed URLs that represent the sections you want discovered (e.g., documentation root paths).
In Simpplr, go to: Enterprise Search → Connectors → Add connector.
Select Web Crawler.
Enter basic information:
Name: A label for this connector instance.
Provide crawling parameters:
Field (Internal Name) | Required | Description | Default (if omitted / invalid) |
Seed URLs (start_urls) | Yes | One or more entry URLs, comma-separated. Crawling starts here and expands within the same site. Example: https://docs.example.com,https://help.example.com | — |
Maximum crawl depth (crawl_depth) | No | Maximum link depth from each seed (0 = seeds only). | 2 |
Maximum pages to crawl (max_pages) | No | Upper bound on pages processed in a sync run. | 500 |
Save the configuration.
Configure audience-based filtering:
Include audiences
Exclude audiences
Initial sync completes successfully.
Document counts grow as expected within depth and page caps.
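The crawling parameters from the setup table (`start_urls`, `crawl_depth`, `max_pages`) and their documented defaults can be modeled as below. This is an illustrative sketch, not the connector's validation code: `parse_config` and `_int_or_default` are hypothetical names, and the minimum values (0 for depth, 1 for pages) are assumptions about what counts as "invalid".

```python
from urllib.parse import urlsplit


def _int_or_default(raw, default: int, minimum: int) -> int:
    """Fall back to the default when the value is missing or invalid."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return value if value >= minimum else default


def parse_config(raw: dict) -> dict:
    """Normalize connector settings; reject empty or non-HTTPS seed lists."""
    seeds = [u.strip() for u in (raw.get("start_urls") or "").split(",") if u.strip()]
    if not seeds:
        raise ValueError("start_urls is required")
    for seed in seeds:
        parts = urlsplit(seed)
        if parts.scheme != "https" or not parts.hostname:
            raise ValueError(f"seed must be an https:// URL: {seed}")
    return {
        "start_urls": seeds,
        "crawl_depth": _int_or_default(raw.get("crawl_depth"), default=2, minimum=0),
        "max_pages": _int_or_default(raw.get("max_pages"), default=500, minimum=1),
    }
```

Running the seed list through a check like this before saving can catch the common causes of an immediate sync failure: an `http://` seed or a malformed comma-separated list.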
What is indexed: All pages reachable under the same-site expansion rules from each allowed seed, up to depth and max_pages, honoring robots.txt.
Duration: Depends on site size, latency, and limits—large max_pages or deep trees increase runtime and load on both connector hosts and target sites.
Conceptual mapping from crawled web content to common Enterprise Search concepts:
Source (web page) | Index / search concept |
<title> / crawler title | title |
Canonical URL | url |
(no explicit author) | author (typically unset or derived by platform defaults) |
Crawl / index timestamps | last_modified / created_at analogs (use _timestamp, last_crawled_at per schema) |
Body length | size (may be absent—implementation-specific) |
Presentation | Field | Notes |
Document ID | _id | Stable hash derived from URL. |
Indexed at | _timestamp | ISO-8601 UTC. |
Page URL | url | Canonical URL. |
Last crawled | last_crawled_at | Crawl pass timestamp. |
Site origin | domains | e.g. ["https://www.example.com"]. |
URL structure | url_scheme, url_host, url_port, url_path, url_path_dir1 … url_path_dir3 | Faceting / navigation. |
Plain text body | body_content | Keyword snippets / chunk source. |
Rich markup | body_markup | Markdown preferred; else HTML. |
Title | title | From <title> when HTML present. |
Meta description | meta_description | From meta tag when present. |
Headings | headings | Ordered h1–h6 text. |
Links | links | Extracted URLs (list-shaped in ingestion). |
Internal vs external | additional_urls | { internal: [...], external: [...] } style grouping from crawl helpers. |
Markdown | markdown | Raw markdown when available. |
Object type | object_type | page. |
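To make the field table concrete, the sketch below assembles a small subset of these fields for one page. It is illustrative only: `build_document` is a hypothetical helper, and SHA-256 is an assumed choice for the "stable hash derived from URL" (the actual hash function is not specified here).

```python
import hashlib
from datetime import datetime, timezone


def build_document(url: str, title: str, body_content: str, now=None) -> dict:
    """Assemble a subset of the indexed fields for one crawled page."""
    now = now or datetime.now(timezone.utc)
    # ISO-8601 in UTC, written with a trailing "Z".
    stamp = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="seconds")
        .replace("+00:00", "Z")
    )
    return {
        # Assumption: SHA-256 of the URL; the same URL always yields the same ID.
        "_id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "_timestamp": stamp,
        "last_crawled_at": stamp,
        "url": url,
        "title": title,
        "body_content": body_content,
        "object_type": "page",
    }
```

Because the ID depends only on the URL, recrawling a page updates the existing document rather than creating a duplicate.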
Result layout: Typically icon/source label, title link, snippet from body_content, URL, optional heading/meta context—exact UI depends on Enterprise Search configuration.
Smart Answers / Q&A: Supported when administrators map crawler body fields into answering pipelines; long pages benefit from chunking (recommended in Data Model guidance).
Semantic / hybrid ranking: Supported when embeddings/chunking are configured—see internal references.
Limitation | Detail |
Unsupported content | Password-protected pages, paywalls, SSO gates; some dynamic SPAs may yield incomplete text. |
Rate limits | Respect site performance—large crawls can trigger server throttling; reduce concurrency/scope if targets rate-limit browsers. |
Preview | Preview behavior depends on Enterprise Search UI; mostly URL + text excerpt for HTML pages. |
robots.txt enforcement | Seeds that robots.txt disallows for the crawler are skipped; if every seed is disallowed, validation fails and the crawl does not start. |
SSL/TLS | Browser stack may tolerate certain certificate validation edge cases (ignore_https_errors=True)—validate against your security policy. |
Navigate to: Enterprise Search → [Connector name] → Health (labels may vary slightly by release).
Typical metrics:
Last sync status (Success / Warning / Failed)
Last sync time / next scheduled sync
Total items indexed
Error or warning counts
Issue: Sync fails immediately on start
Possible causes:
Non-HTTPS seeds or malformed URL list.
Every seed URL is disallowed by robots.txt.
Resolution:
Use only https:// URLs and comma-separated lists.
Update robots.txt or choose allowed seeds.
Issue: Authentication failed
Possible causes:
N/A for public crawler—usually a misclassified error.
Resolution:
Confirm the site does not redirect to a login page for crawled URLs.
Issue: Rate limit or quota exceeded
Possible causes:
Crawler concurrency (DEFAULT_CONCURRENT_REQUESTS = 10 in connector defaults) plus depth/max pages stressing weak origins.
Resolution:
Reduce max_pages / depth, widen schedules, or coordinate with site owners.
Issue: Few or missing documents
Possible causes:
Depth too shallow; max_pages exhausted early.
Content behind heavy JavaScript not rendered into text.
Resolution:
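Increase crawl_depth or max_pages if the limits are being reached; adjust seeds to deeper entry points; test simpler documentation URLs to confirm that rendered text is extractable.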
Contact Support when:
Failures persist after validating HTTPS seeds and robots policy.
Sync stalls unexpectedly for extended periods.
Partial indexing cannot be explained by depth/max page bounds.
Include:
Connector name / instance ID (if shown)
Organization URL
Approximate time of failure
Error messages / screenshots
Seeds (start URLs) and recent configuration changes
Q1. Can I connect multiple websites or domains?
A. Yes. Provide multiple comma-separated seed URLs in a single connector. Multiple connectors with different seed URLs are also supported.
Q2. How often does Web Crawler sync data?
A. It’s configured as a daily full sync for freshness. Actual runtime depends on site size and the configured crawl limits (crawl_depth, max_pages).
Q3. Are comments, revisions, or version history indexed?
A. No dedicated handling—only visible page content in the HTML snapshot is captured.
Q4. Can I exclude certain folders or paths from indexing?
A. At this stage, fine‑grained filtering is not supported (no advanced include/exclude rules). The only supported way to control scope is by choosing the right seed URLs, then limiting the crawl with crawl_depth and max_pages.