Websites
Alhena AI supports crawling and indexing content from publicly accessible websites to answer queries. This includes:
Landing pages
Sitemaps
Product pages
Help articles
Notion docs
Support articles
Developer docs
Zendesk support articles
CSV file links hosted on public cloud
For each website link, there are two different modes of crawling:
Crawl multiple pages: In multi-page crawl, Alhena AI will find the child pages and continue crawling as long as the root path of the child pages is the same as the root path of the parent URL. We crawl up to 5,000 pages per URL. If you have specific needs or require crawling more than 5,000 pages, contact support. For sitemaps, choose the multi-page crawl as it will also crawl child pages.
Crawl single page: In single-page crawl, Alhena AI crawls only the page at the given URL and does not continue to child pages.
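The multi-page scope rule can be illustrated with a small sketch. This is not Alhena's implementation: it assumes that "same root path" means the child URL is on the same host and its path sits at or below the parent URL's path. The function name `in_crawl_scope` and the constant `MAX_PAGES_PER_URL` are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit

# Page limit per URL as stated above; the constant name is hypothetical.
MAX_PAGES_PER_URL = 5000


def in_crawl_scope(parent_url: str, child_url: str) -> bool:
    """Return True if child_url sits at or below the root path of parent_url.

    Assumption: "same root path" means same host plus a path-prefix match.
    """
    parent, child = urlsplit(parent_url), urlsplit(child_url)
    if parent.netloc.lower() != child.netloc.lower():
        return False
    root = parent.path.rstrip("/")
    # Match the root itself or anything nested below it, so that
    # /docs matches /docs/api but not /docs-old.
    return child.path.rstrip("/") == root or child.path.startswith(root + "/")
```

Under this reading, adding `https://example.com/docs` would cover `https://example.com/docs/api/auth` but not `https://example.com/blog/post`. To cover a whole site, add the top-level URL.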

Automatic Site Metadata Discovery
When you add a website URL for multi-page crawling, Alhena automatically discovers structured metadata files from your domain to improve coverage. No extra configuration is needed.
What gets discovered
robots.txt: Reads Sitemap: directives to find your sitemaps
sitemap.xml: Discovers all pages on your site (used as fallback if not listed in robots.txt)
llms.txt: A curated file of your site's most important content, designed for AI consumption
llms-full.txt: Your entire site content in a single AI-ready document
llms-ctx.txt: An expanded version of llms.txt with linked content included inline
This means you no longer need to manually add your sitemap URL — Alhena finds and crawls it automatically.
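To preview what discovery is likely to find on your own site, the lookup order described above can be sketched as follows. This is a rough illustration, not Alhena's code: the helper names are hypothetical, and the sketch assumes every file is checked at the domain root and that `/sitemap.xml` is used only when robots.txt lists no sitemap.

```python
from urllib.parse import urljoin

# File names taken from the list above. Checking all of them at the
# domain root is an assumption of this sketch.
DISCOVERY_FILES = ["robots.txt", "sitemap.xml", "llms.txt", "llms-full.txt", "llms-ctx.txt"]


def candidate_urls(site_url: str) -> list[str]:
    """Build the domain-root URL for each discovery file."""
    root = urljoin(site_url, "/")
    return [urljoin(root, name) for name in DISCOVERY_FILES]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the values of all Sitemap: directives (case-insensitive)."""
    sitemaps = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


def resolve_sitemaps(site_url: str, robots_txt: str) -> list[str]:
    """Prefer sitemaps listed in robots.txt; otherwise fall back to /sitemap.xml."""
    found = sitemaps_from_robots(robots_txt)
    return found or [urljoin(urljoin(site_url, "/"), "sitemap.xml")]
```

Passing the text of your robots.txt to `resolve_sitemaps` shows which sitemap URLs a crawler following this order would pick up.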
llms.txt
llms.txt is a growing web standard where site owners publish a curated markdown file at /llms.txt that describes their site's most important content for AI systems.
When Alhena finds an llms.txt file on your domain:
The file content itself (such as inline FAQs or product descriptions) is ingested as training data
All links in the file are extracted and added as pages to crawl
If llms-full.txt or llms-ctx.txt exist, they are ingested as single training documents
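The link-extraction step can be approximated with a short sketch, useful for checking which pages your own llms.txt would contribute. It assumes links are written as inline markdown links of the form `[title](url)` and that relative links resolve against the file's URL; how Alhena actually parses the file is not documented here, and `extract_links` is a hypothetical name.

```python
import re
from urllib.parse import urljoin

# Matches inline markdown links of the form [title](url).
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def extract_links(llms_txt: str, base_url: str) -> list[str]:
    """Return absolute link targets in first-seen order, without duplicates."""
    seen = set()
    links = []
    for _title, href in MARKDOWN_LINK.findall(llms_txt):
        url = urljoin(base_url, href)  # resolve relative links against the file URL
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Running this over your llms.txt before publishing is a quick way to spot broken or missing link targets.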
If you maintain the website being trained, publishing an llms.txt file is one of the best ways to ensure Alhena learns from your most important content.
Troubleshooting discovery
Sitemap not found? Verify the file is accessible at https://yoursite.com/sitemap.xml or referenced in your robots.txt. Files that return an HTML page instead of XML will be skipped.
llms.txt not picked up? The file must be at the domain root (https://yoursite.com/llms.txt) and must not return an HTML response.
Duplicate URLs? Alhena deduplicates automatically. If you already added a sitemap URL manually, it won't be crawled twice.
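A quick way to test for the HTML-response problem is to fetch the file yourself and inspect what comes back. The sketch below is a heuristic under stated assumptions, not the check Alhena runs: it treats a response as HTML if the Content-Type header says `text/html` or the body starts with an HTML doctype or `<html` tag. The function names and the `discovery-check` user agent are hypothetical.

```python
import urllib.request


def looks_like_html(content_type: str, body: bytes) -> bool:
    """Heuristic: True if a response appears to be an HTML page."""
    if "text/html" in content_type.lower():
        return True
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def check_discovery_file(url: str, timeout: float = 10.0) -> str:
    """Fetch url and report whether it looks usable as a sitemap or llms.txt.

    urlopen raises urllib.error.HTTPError for responses such as 404.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "discovery-check"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read(512)
    return "html (likely skipped)" if looks_like_html(content_type, body) else "ok"
```

For example, calling `check_discovery_file("https://yoursite.com/llms.txt")` on a site that serves a styled "not found" page with a 200 status would report the HTML case, which matches the symptom described above.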