Websites

Alhena AI supports crawling and indexing content from publicly accessible websites to answer queries. This includes:

  • Landing pages

  • Sitemaps

  • Product pages

  • Help articles

  • Notion docs

  • Support articles

  • Developer docs

  • Zendesk support articles

  • Links to CSV files hosted on public cloud storage

For each website link, there are two different modes of crawling:

Crawl multiple pages: In multi-page crawl, Alhena AI finds child pages and keeps crawling as long as each child page shares the root path of the parent URL. It crawls up to 5,000 pages per URL. If you need to crawl more than 5,000 pages or have other specific requirements, contact support. For sitemaps, choose multi-page crawl, because it also crawls child pages.

Crawl single page: In single-page crawl, Alhena AI crawls only the page at the given URL.
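The multi-page scoping rule can be illustrated with a minimal sketch. It assumes that "same root path" means the same host plus a path that starts with the parent URL's path; that reading, and the helper names in_scope and select_pages, are illustrative assumptions and not Alhena's actual implementation. Only the 5,000-page limit comes from the text above.

```python
from urllib.parse import urlsplit

MAX_PAGES = 5000  # per-URL limit stated in the documentation


def in_scope(parent_url: str, child_url: str) -> bool:
    """Return True if child_url shares the parent's host and path prefix (assumed rule)."""
    parent, child = urlsplit(parent_url), urlsplit(child_url)
    if parent.netloc.lower() != child.netloc.lower():
        return False
    # Compare on whole path segments so /docs does not match /docs-old.
    root = parent.path.rstrip("/")
    return child.path == root or child.path.startswith(root + "/")


def select_pages(parent_url: str, candidates: list[str], limit: int = MAX_PAGES) -> list[str]:
    """Keep in-scope candidates, in order, without repeats, up to the limit."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in candidates:
        if url in seen or not in_scope(parent_url, url):
            continue
        seen.add(url)
        kept.append(url)
        if len(kept) >= limit:
            break
    return kept
```

Under this assumption, adding https://example.com/docs would keep https://example.com/docs/setup but skip https://example.com/blog/post, so choosing a parent URL close to the content you want keeps the crawl focused.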

Figure: Alhena AI website / URL crawling options

Automatic Site Metadata Discovery

When you add a website URL for multi-page crawling, Alhena automatically discovers structured metadata files from your domain to improve coverage. No extra configuration is needed.

What gets discovered

| File | Purpose |
|---|---|
| robots.txt | Reads Sitemap: directives to find your sitemaps |
| sitemap.xml | Discovers all pages on your site (used as fallback if not listed in robots.txt) |
| llms.txt | A curated file of your site's most important content, designed for AI consumption |
| llms-full.txt | Your entire site content in a single AI-ready document |
| llms-ctx.txt | An expanded version of llms.txt with linked content included inline |

This means you no longer need to add your sitemap URL manually: Alhena finds and crawls it automatically.
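To make the robots.txt and sitemap.xml relationship concrete, here is a minimal sketch of how sitemap locations could be derived: read the Sitemap: directives from robots.txt and fall back to /sitemap.xml when none are listed. The function name sitemap_candidates is hypothetical, and the sketch illustrates the described behavior rather than Alhena's own code.

```python
from urllib.parse import urljoin


def sitemap_candidates(site_url: str, robots_txt: str) -> list[str]:
    """List sitemap URLs named in robots.txt, falling back to /sitemap.xml."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            url = urljoin(site_url, value.strip())
            if url not in found:
                found.append(url)
    # Fallback when robots.txt lists no sitemap at all.
    return found or [urljoin(site_url, "/sitemap.xml")]
```

You can use a check like this on your own robots.txt to confirm that the sitemap you expect to be crawled is actually advertised there.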

llms.txt

llms.txt is a growing web standard in which site owners publish a curated Markdown file at /llms.txt that describes their site's most important content for AI systems.

When Alhena finds an llms.txt file on your domain:

  • The file content itself (such as inline FAQs or product descriptions) is ingested as training data

  • All links in the file are extracted and added as pages to crawl

  • If llms-full.txt or llms-ctx.txt exists, each is ingested as a single training document

If you maintain the website being trained, publishing an llms.txt file is one of the best ways to ensure Alhena learns from your most important content.
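The link-extraction step described above can be sketched as follows. The example assumes links in llms.txt use the standard Markdown inline form [label](target); the regular expression and the function name extract_links are illustrative assumptions, not Alhena's parser.

```python
import re
from urllib.parse import urljoin

# Markdown inline links: [label](target), optionally followed by a quoted title.
LINK_RE = re.compile(r"\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def extract_links(llms_txt: str, base_url: str) -> list[str]:
    """Return absolute http(s) link targets from the file, in order, without repeats."""
    links: list[str] = []
    for match in LINK_RE.finditer(llms_txt):
        # Resolve relative targets against the location of llms.txt itself.
        url = urljoin(base_url, match.group(2))
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links
```

Running a check like this against your own llms.txt before publishing is a quick way to see which pages the file points to and to spot relative or malformed links.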

Troubleshooting discovery

  • Sitemap not found? Verify the file is accessible at https://yoursite.com/sitemap.xml or referenced in your robots.txt. Files that return an HTML page instead of XML will be skipped.

  • llms.txt not picked up? The file must be at the domain root (https://yoursite.com/llms.txt) and must not return an HTML response.

  • Duplicate URLs? Alhena deduplicates automatically. If you already added a sitemap URL manually, it won't be crawled twice.
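Because files that return an HTML page are skipped, it can help to pre-check what your server actually returns for /sitemap.xml or /llms.txt. The heuristic below is a minimal sketch: it takes a response's Content-Type header and body and reports whether the response looks like HTML. How Alhena detects HTML is not documented here, so both the function looks_like_html and its rules are assumptions.

```python
def looks_like_html(content_type: str, body: str) -> bool:
    """Heuristic: True if a response appears to be an HTML page."""
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return True
    # Some servers mislabel error pages, so also inspect the start of the body.
    head = body.lstrip("\ufeff \t\r\n")[:200].lower()
    return head.startswith(("<!doctype html", "<html"))
```

A common cause of a positive result is a catch-all route or a custom 404 page that serves HTML at the requested path instead of the expected XML or Markdown file.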
