Training Data Sources
Alhena AI supports various types of data sources for a bot's knowledge base.
Supported Data Crawling Sources in Alhena AI
Alhena AI can crawl, extract, and index data from a wide range of sources. This article outlines all currently supported crawling integrations.
1. Websites & Web Pages
- General Websites – HTML pages, blogs, wikis, etc. 
- Sitemap XML – Crawl all URLs listed in an XML sitemap. 
- Confluence Pages – Crawl pages from Atlassian Confluence. 
- Notion Pages – Crawl public Notion pages. 
- Private Notion Pages – Crawl private Notion pages with appropriate authentication. 
- ServiceNow Documentation – Crawl content from a ServiceNow knowledge base. 
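To illustrate what crawling a sitemap involves, the sketch below collects every URL listed in an XML sitemap using only Python's standard library. It is not Alhena AI's implementation: the function name `extract_sitemap_urls` and the sample URLs are hypothetical, and it assumes the sitemap follows the standard sitemaps.org schema.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Illustrative sitemap content; the URLs are placeholders.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/docs/faq</loc></url>
</urlset>
"""

for url in extract_sitemap_urls(sample_sitemap):
    print(url)
```

A list like this is useful for checking in advance which pages a sitemap-based crawl would be expected to cover.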
2. Google Drive
- Google Docs 
- Google Sheets (spreadsheets) 
- Google Slides (presentations) 
- Google Drive Folders – Recursive crawling supported. 
- Google Drive Files – Generic support for any file inside Drive. 
3. Document Files
- PDF (.pdf)
- Word Documents (.doc, .docx)
- Excel Spreadsheets (.xls, .xlsx)
- PowerPoint Presentations (.ppt, .pptx)
- CSV / TSV (.csv, .tsv)
- Plain Text (.txt), Markdown (.md), RST (.rst)
- Rich Text Format (.rtf)
- OpenDocument Files (.odt, .ods, .odp)
- Apple iWork (.pages, .numbers, .key)
- Email Files (.eml, .msg)
- EPUB (.epub), Org-mode (.org)
- Config/Data Files (.ini, .yaml, .toml, .xml, .json)
- Images – .jpg, .jpeg, .png, .webp, etc.
4. Video & Media
- YouTube Videos – Crawled via GeminiVideoScraper or similar. 
- Other Video Files – .mp4, .avi, .mkv, .mov, and more.
5. Social & Communication Platforms
- Discord Messages 
- Slack Messages 
- Twitter Pages – Accounts, posts. 
6. Helpdesk & Ticketing Systems
- Zendesk Articles 
- Freshdesk Articles 
- Freshservice Articles 
- Gladly Articles 
7. Ecommerce Platforms
- Shopify Products 
- WooCommerce Products 
- Salesforce Commerce Cloud 
- Magento 
- Generic Product Pages – Custom HTML product extraction. 
8. GitHub
- Code Repositories – Source code, documentation, config files. 
- Issues 
- Discussions 
9. Custom / Other Sources
- S3 Uploaded Files 
- Custom Data Sources – Extensible scraper support. 
- PDF Crawling – Via URL or file upload. 
- Document Uploads – Supports any of the formats listed above. 
Supported File Extensions (from INCLUDE_PATTERNS)
- Source Code – .py, .js, .java, .c, .cpp, .cs, .rb, .go, .rs, .php, .swift, .kt, .ts, .scala, .pl, .r, .sh
- Web – .html
- Config & Data – .yaml, .yml, .xml, .ini, .toml
- Documentation – .md, .rst, .txt
- Presentations – .ppt, .pptx, .odp, .key
- Spreadsheets – .xls, .xlsx, .ods, .numbers
- Archives – .zip, .rar, .7z, .tar, .gz, .bz2, .xz, .iso
- Media – .png, .jpg, .jpeg, .gif, .bmp, .tiff, .svg, .webp, .ico, .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma, .mp4, .avi, .mkv, .flv, .mov, .wmv, .m4v, .webm, .vob, .ogv
- Other – .pdf, .doc, .docx, .pages, .eml, .msg, .epub, .org, .tsv, .rtf, .dockerfile, etc.
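The sketch below shows one way to check a file name against these extension groups before uploading. It is only an illustration: the actual structure of `INCLUDE_PATTERNS` is not documented here, so the dictionary `INCLUDE_EXTENSIONS` and the helper `category_for` are hypothetical names, and the media group is shortened for brevity.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical grouping that mirrors the categories listed above.
# The real INCLUDE_PATTERNS format is an assumption; "media" is abbreviated
# and "other" is not exhaustive (the article ends that list with "etc.").
INCLUDE_EXTENSIONS = {
    "source_code": {".py", ".js", ".java", ".c", ".cpp", ".cs", ".rb", ".go", ".rs",
                    ".php", ".swift", ".kt", ".ts", ".scala", ".pl", ".r", ".sh"},
    "web": {".html"},
    "config_data": {".yaml", ".yml", ".xml", ".ini", ".toml"},
    "documentation": {".md", ".rst", ".txt"},
    "presentations": {".ppt", ".pptx", ".odp", ".key"},
    "spreadsheets": {".xls", ".xlsx", ".ods", ".numbers"},
    "archives": {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2", ".xz", ".iso"},
    "media": {".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp3", ".wav", ".mp4", ".mov"},
    "other": {".pdf", ".doc", ".docx", ".pages", ".eml", ".msg", ".epub", ".org",
              ".tsv", ".rtf", ".dockerfile"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the category whose extension set contains the file's suffix, else None."""
    suffix = PurePosixPath(filename).suffix.lower()  # case-insensitive match
    for category, extensions in INCLUDE_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


for name in ["report.PDF", "main.py", "backup.tar.gz", "notes"]:
    print(name, "->", category_for(name))
```

Only the final suffix is compared, so a name such as `backup.tar.gz` is matched on `.gz`, and a file with no extension matches no category.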
Summary Table
- Websites – General, Sitemap, Confluence, Notion, ServiceNow
- Google Drive – Docs, Sheets, Slides, Folders, Files
- Documents – PDF, Word, Excel, PowerPoint, CSV, TSV, TXT, Markdown, RST, RTF, ODT, ODS, ODP, Pages, Numbers, Key, EML, MSG, EPUB, Org, Images
- Video – YouTube, MP4, AVI, MKV, MOV, etc.
- Social/Comm – Discord, Slack, Twitter
- Helpdesk – Zendesk, Freshdesk, Freshservice, Gladly
- Ecommerce – Shopify, WooCommerce, Salesforce, Magento, Generic Product Pages
- GitHub – Code, Issues, Discussions
- Custom/Other – S3, Custom sources, PDF crawling, Uploads