Web Crawler connector for Simpplr Enterprise Search

Updated last month

Introduction

The Web Crawler connector allows Simpplr Enterprise Search to index content from public websites you designate - documentation portals, help centers, product sites, and other informational pages - so employees can discover that content in search alongside other enterprise sources.

With this connector, you can:

Centralize discovery of website documentation and help content through Enterprise Search.
Reduce manual browsing across multiple sites by surfacing titles, excerpts, and URLs in results.
Support keyword and semantic relevance using structured fields such as title, headings, body text, and URL facets (depending on how search is configured for your organization).

Content indexed by the Web Crawler connector is available in:

Main search listing
Smart answers (when your deployment wires crawler fields into those experiences)

Capabilities at a glance

Area	Summary
Content types	Public HTML web pages reached from seed URLs; one searchable document per successfully fetched page.
Metadata	Page title, URL, meta description (when present), headings (h1–h6), extracted links, parsed URL path segments for filtering.
Permissions	No source ACL mapping for public sites—treat indexed pages as visible to everyone who can query the destination search experience unless your platform applies separate controls.
Indexing	Headless Chromium crawl with JavaScript enabled; breadth-first expansion within the same site as each seed; bounded by crawl depth and maximum pages; robots.txt is respected for crawling policy.
Filters	Scope crawling via seed URLs, depth, and max pages; advanced rules may be available where your connector framework exposes them. Same-site link following only—external domains are not crawled from discovered links.
Search features	Keyword and semantic/hybrid behavior depend on index mapping (see Field mapping); suggested field wiring is documented in internal Confluence (see References).

Objects and content supported

Objects

Each indexed item is a single web page (one document per URL processed in the crawl).

Indexed:

HTML pages reachable over HTTPS from your seeds, following internal links on the same host/site only.
Visible text derived from rendered HTML (titles, headings, body copy, lists/tables as the DOM exposes them).
Link URLs extracted from the page for relationship context.

Not indexed (or not supported as separate objects):

Pages that require login, SSO, MFA, or form-based authentication.
Comments, forums, feedback threads, or other user-generated modules as first-class records (only text that appears as normal page content may be captured).
Arbitrary external websites discovered via outbound links (those links may appear in metadata but are not crawled).
Guaranteed completeness on sites built entirely as heavy client-side applications—extraction quality depends on what appears in the DOM after scripts run.

Metadata

For each indexed page, Web Crawler captures:

Title — from the HTML <title> element when HTML is available.
URL / link — canonical page URL.
Created / modified — the connector records indexing-oriented timestamps (_timestamp, last_crawled_at); it does not ingest authoritative “last modified” from the origin server as a separate source-of-truth field.
Path / location — parsed URL components: scheme, host, port, path, and the first three path segments (url_path_dir1 … url_path_dir3) for filtering and context.
File type and size — not applicable as file attachments; body length varies by page.
Description & structure — optional <meta name="description">, ordered list of heading texts, plain-text body for snippets (body_content), and markup (body_markup: markdown when supplied by the crawler engine, otherwise HTML).
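
As an illustration of the path fields listed above, the following minimal sketch splits a URL into scheme, host, port, path, and the first three path segments using Python's standard library. The function name `url_fields`, the default-port fallback, and the choice to count a trailing file name as a segment are assumptions made for the example, not confirmed connector behavior.

```python
from urllib.parse import urlsplit


def url_fields(url: str) -> dict:
    """Split a page URL into url_* style fields (illustrative only)."""
    parts = urlsplit(url)
    # Assumption: use the scheme's default port when the URL has no explicit port.
    port = parts.port or {"https": 443, "http": 80}.get(parts.scheme)
    segments = [s for s in parts.path.split("/") if s]
    fields = {
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        "url_port": port,
        "url_path": parts.path or "/",
    }
    # Keep only the first three path segments: url_path_dir1 … url_path_dir3.
    for i, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{i}"] = segment
    return fields


if __name__ == "__main__":
    print(url_fields("https://docs.example.com/guides/setup/install/linux.html"))
```

Fields shaped like this let a search administrator filter results by host or by top-level section (for example, everything under `guides`).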

Permissions model

Permissions are not read from the website as enterprise ACLs. Public pages are indexed without document-level security enforced from the source.

Users and groups: There is no synchronization of site memberships into Simpplr.
Public or link-shared content: Anything reachable without authentication at crawl time may be indexed; do not point seeds at sensitive paths.
Access removed on the website: The index may still reflect old content until the URL is recrawled and then updated or removed per your pipeline behavior; there is no per-user ACL shrink/expansion from the source.

Versions and editions supported

Topic	Detail
Supported sources	Public websites served over HTTPS.
Not supported	HTTP-only seeds (non-HTTPS URLs are rejected); authenticated employee portals; crawling that violates applicable terms or robots policies.
Runtime requirements	The connector uses Crawl4AI with a headless Chromium browser. Your deployment must allow this stack to run (memory/CPU consistent with browser automation).

Prerequisites

Before you begin, ensure the following:

Source system permissions

Your organization is permitted to crawl and index the target sites (copyright, terms of use, and internal policy).
robots.txt allows crawling for the relevant paths. The connector evaluates policy using the elastic-webcrawler identity for seed URL checks; ensure your site’s robots rules align with your intent.
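
To pre-check seeds before configuring the connector, a rough approximation is possible with Python's standard `urllib.robotparser`. This is a sketch only: the helper name `allowed_seeds` and the sample rules are hypothetical, and the standard-library matcher may not evaluate every rule exactly as the connector does. The `elastic-webcrawler` identity is the one named above.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "elastic-webcrawler"  # identity named in this article


def allowed_seeds(robots_txt: str, seeds: list[str]) -> list[str]:
    """Return the seeds that the given robots.txt text allows for USER_AGENT."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return [seed for seed in seeds if parser.can_fetch(USER_AGENT, seed)]


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: elastic-webcrawler
Disallow: /internal/

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    seeds = [
        "https://docs.example.com/guides/",
        "https://docs.example.com/internal/notes",
    ]
    print(allowed_seeds(SAMPLE_ROBOTS, seeds))
```

In practice you would download the site's real `robots.txt` and pass its text to the helper; an empty result suggests that every seed is disallowed.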

Application / service account

Not required. There is no OAuth client, API key, or login for public crawling.

Network and firewall

Outbound HTTPS from the connector runtime to every host you crawl (and to fetch robots.txt).
Allow browser automation dependencies if your environment restricts sandbox-capable workloads.

Authentication and security

Authentication mechanism

Simpplr Enterprise Search connects to websites using standard HTTPS requests and a headless browser. No credentials are stored or sent for site login.

Aspect	Detail
Auth type	None (anonymous public fetch).
Scopes / API permissions	N/A.

Data security

Topic	Customer-facing guidance
Data storage and residency	Indexed documents follow your Enterprise Search / Elasticsearch deployment’s region and tenancy choices—refer to your platform documentation.
Encryption in transit	HTTPS to origin sites; TLS within your indexing pipeline per platform defaults.
Encryption at rest	Determined by your search/index infrastructure.
Permission enforcement	Query-time ACL filtering based on website permissions is not applied. Assume crawled documents are visible to all authorized search users unless separate index or app controls exist.

Setup and configuration

Step 1 — Prepare your websites

Confirm target experiences are public over HTTPS.
Verify robots.txt permits crawling intended paths for the crawler.
Choose seed URLs that represent the sections you want discovered (e.g., documentation root paths).

Step 2 — Create the connector in Simpplr Enterprise Search

1. In Simpplr, go to: Enterprise Search → Connectors → Add connector.
2. Select Web Crawler.
3. Enter basic information:
- Name: A label for this connector instance.
4. Provide crawling parameters:

Field (Internal Name)	Required	Description	Default (if omitted / invalid)
Seed URLs (start_urls)	Yes	One or more entry URLs, comma-separated. Crawling starts here and expands within the same site. Example: https://docs.example.com,https://help.example.com	—
Maximum crawl depth (crawl_depth)	No	Maximum link depth from each seed (0 = seeds only).	2
Maximum pages to crawl (max_pages)	No	Upper bound on pages processed in a sync run.	500

5. Save the configuration.
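
The parameter rules in the table above (comma-separated HTTPS seeds, and defaults of 2 and 500 when a value is omitted or invalid) can be sketched as follows. The function names, the error messages, and the minimum of 1 for `max_pages` are assumptions for illustration; the connector's actual validation code is not shown in this article.

```python
from urllib.parse import urlsplit

DEFAULT_CRAWL_DEPTH = 2   # default from the parameter table
DEFAULT_MAX_PAGES = 500   # default from the parameter table


def parse_start_urls(raw: str) -> list[str]:
    """Split a comma-separated seed list and reject non-HTTPS entries."""
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    if not urls:
        raise ValueError("start_urls must contain at least one URL")
    bad = [u for u in urls
           if urlsplit(u).scheme != "https" or not urlsplit(u).hostname]
    if bad:
        raise ValueError(f"Only https:// seed URLs are accepted: {bad}")
    return urls


def int_or_default(value, default: int, minimum: int) -> int:
    """Return value as an int, or the default if it is missing or invalid."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return default
    return number if number >= minimum else default


def normalize_config(raw: dict) -> dict:
    """Apply the documented defaults to a raw configuration mapping."""
    return {
        "start_urls": parse_start_urls(raw.get("start_urls", "")),
        "crawl_depth": int_or_default(raw.get("crawl_depth"), DEFAULT_CRAWL_DEPTH, 0),
        # Assumption: a page cap below 1 is treated as invalid.
        "max_pages": int_or_default(raw.get("max_pages"), DEFAULT_MAX_PAGES, 1),
    }


if __name__ == "__main__":
    print(normalize_config({
        "start_urls": "https://docs.example.com, https://help.example.com",
        "crawl_depth": "0",
    }))
```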

Step 3 — Filters (optional)

Configure audience-based filtering:
- Include audiences
- Exclude audiences

Step 4 — Monitor

Use your connector dashboard to confirm:

Initial sync completes successfully.
Document counts grow as expected within depth and page caps.

Crawling and sync behavior

Initial full crawl

What is indexed: All pages reachable under the same-site expansion rules from each allowed seed, up to depth and max_pages, honoring robots.txt.
Duration: Depends on site size, latency, and limits—large max_pages or deep trees increase runtime and load on both connector hosts and target sites.
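
How depth and page limits interact is easier to see in a small model. The sketch below performs a breadth-first traversal over a hypothetical in-memory link graph, following only same-host links and stopping at `crawl_depth` and `max_pages`. It deliberately omits page fetching, rendering, and robots.txt checks, and the names `crawl_order` and `SITE` are invented for the example; it is not the connector's implementation.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_order(seeds, get_links, crawl_depth=2, max_pages=500):
    """Return URLs in breadth-first order, bounded by depth and page count."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= crawl_depth:
            continue  # page is processed, but its links are not expanded
        for link in get_links(url):
            same_host = urlsplit(link).hostname == urlsplit(url).hostname
            if same_host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real page fetches.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
        "https://other.example.org/",  # external host: never queued
    ],
    "https://docs.example.com/a": ["https://docs.example.com/a/1"],
    "https://docs.example.com/b": ["https://docs.example.com/"],
}

if __name__ == "__main__":
    links = lambda url: SITE.get(url, [])
    print(crawl_order(["https://docs.example.com/"], links, crawl_depth=1))
```

With `crawl_depth=1` the model visits the seed and its two same-host children; raising the depth to 2 also reaches `/a/1`, while a low `max_pages` cuts the list short regardless of depth.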

Field mapping and search experience

Default field mapping

Conceptual mapping from crawled web content to common Enterprise Search concepts:

Source (web page)	Index / search concept
<title> / crawler title	title
Canonical URL	url
(no explicit author)	author (typically unset or derived by platform defaults)
Crawl / index timestamps	last_modified / created_at analogs (use _timestamp, last_crawled_at per schema)
Body length	size (may be absent—implementation-specific)

Document schema (connector fields)

Presentation	Field	Notes
Document ID	_id	Stable hash derived from URL.
Indexed at	_timestamp	ISO-8601 UTC.
Page URL	url	Canonical URL.
Last crawled	last_crawled_at	Crawl pass timestamp.
Site origin	domains	e.g. ["https://www.example.com"].
URL structure	url_scheme, url_host, url_port, url_path, url_path_dir1 … url_path_dir3	Faceting / navigation.
Plain text body	body_content	Keyword snippets / chunk source.
Rich markup	body_markup	Markdown preferred; else HTML.
Title	title	From <title> when HTML present.
Meta description	meta_description	From meta tag when present.
Headings	headings	Ordered h1–h6 text.
Links	links	Extracted URLs (list-shaped in ingestion).
Internal vs external	additional_urls	{ internal: [...], external: [...] } style grouping from crawl helpers.
Markdown	markdown	Raw markdown when available.
Object type	object_type	page.
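
For orientation, here is a minimal sketch of what a document built from a subset of these fields might look like. The schema says only that `_id` is a stable hash derived from the URL; the use of SHA-256 here is an assumption, as is the helper name `build_document` and the choice to set both timestamps to the same value.

```python
import hashlib
from datetime import datetime, timezone


def build_document(url: str, title: str, body: str, headings: list[str]) -> dict:
    """Assemble an illustrative page document with a URL-derived ID."""
    now = datetime.now(timezone.utc).isoformat()  # ISO-8601, UTC
    return {
        # Assumption: SHA-256 of the URL; the real hash function is not documented here.
        "_id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "_timestamp": now,
        "last_crawled_at": now,
        "url": url,
        "title": title,
        "headings": headings,
        "body_content": body,
        "object_type": "page",
    }


if __name__ == "__main__":
    doc = build_document(
        "https://docs.example.com/guides/setup",
        "Setup guide",
        "Install the agent, then restart the service.",
        ["Setup guide", "Install", "Restart"],
    )
    print(doc["_id"], doc["object_type"])
```

Because the ID depends only on the URL, recrawling the same page updates the existing document instead of creating a duplicate.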

Search experience

Result layout: Typically icon/source label, title link, snippet from body_content, URL, optional heading/meta context—exact UI depends on Enterprise Search configuration.
Smart Answers / Q&A: Supported when administrators map crawler body fields into answering pipelines; long pages benefit from chunking (recommended in Data Model guidance).
Semantic / hybrid ranking: Supported when embeddings/chunking are configured—see internal references.
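
Chunking, mentioned above for long pages, can be as simple as splitting `body_content` into overlapping word windows before embedding. The sketch below shows one such approach; the function name `chunk_text` and the window and overlap sizes are illustrative assumptions, not the values or method your deployment uses.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of max_words, each sharing `overlap` words
    with the previous window so sentences cut at a boundary keep some context."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    print(chunk_text(sample, max_words=4, overlap=1))
```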

Known limitations

Limitation	Detail
Unsupported content	Password-protected pages, paywalls, SSO gates; some dynamic SPAs may yield incomplete text.
Rate limits	Respect site performance—large crawls can trigger server throttling; reduce concurrency/scope if targets rate-limit browsers.
Preview	Preview behavior depends on Enterprise Search UI; mostly URL + text excerpt for HTML pages.
robots.txt enforcement	Seeds disallowed for the crawler are skipped; if all seeds are disallowed, the configuration fails validation and the crawl does not start.
SSL/TLS	Browser stack may tolerate certain certificate validation edge cases (ignore_https_errors=True)—validate against your security policy.

Monitoring and troubleshooting

Connector health and monitoring

Navigate to: Enterprise Search → [Connector name] → Health (labels may vary slightly by release).

Typical metrics:

Last sync status (Success / Warning / Failed)
Last sync time / next scheduled sync
Total items indexed
Error or warning counts

Common issues and resolutions

Issue: Sync fails immediately on start

Possible causes:

Non-HTTPS seeds or malformed URL list.
Every seed URL is disallowed by robots.txt.

Resolution:

Use only https:// URLs and comma-separated lists.
Update robots.txt or choose allowed seeds.

Issue: Authentication failed

Possible causes:

N/A for public crawler—usually a misclassified error.

Resolution:

Confirm the site does not redirect to a login page for crawled URLs.

Issue: Rate limit or quota exceeded

Possible causes:

Crawler concurrency (DEFAULT_CONCURRENT_REQUESTS = 10 in connector defaults) plus depth/max pages stressing weak origins.

Resolution:

Reduce max_pages / depth, widen schedules, or coordinate with site owners.

Issue: Few or missing documents

Possible causes:

Depth too shallow; max_pages exhausted early.
Content behind heavy JavaScript not rendered into text.

Resolution:

Increase crawl_depth or max_pages, or adjust seeds to deeper entry points; test simpler documentation URLs.

When to contact support

Contact Support when:

Failures persist after validating HTTPS seeds and robots policy.
Sync stalls unexpectedly for extended periods.
Partial indexing cannot be explained by depth/max page bounds.

Include:

Connector name / instance ID (if shown)
Organization URL
Approximate time of failure
Error messages / screenshots
Seeds (start URLs) and recent configuration changes

Frequently asked questions (FAQ)

Q1. Can I connect multiple websites or domains?
A. Yes. Provide multiple comma-separated seed URLs in a single connector. Multiple connectors with different seed URLs are also supported.

Q2. How often does Web Crawler sync data?
A. It’s configured as a daily full sync for freshness. Actual runtime depends on site size and the configured crawl limits (crawl_depth, max_pages).

Q3. Are comments, revisions, or version history indexed?
A. No dedicated handling—only visible page content in the HTML snapshot is captured.

Q4. Can I exclude certain folders or paths from indexing?
A. At this stage, fine‑grained filtering is not supported (no advanced include/exclude rules). The only supported way to control scope is by choosing the right seed URLs, then limiting the crawl with crawl_depth and max_pages.
