Training Data Sources
Alhena AI supports various types of data sources for a bot's knowledge base.
Supported Data Crawling Sources in Alhena AI
Alhena AI supports a wide range of data crawling capabilities, enabling seamless extraction and indexing of data from diverse sources. This article outlines all currently supported crawling integrations.
1. Websites & Web Pages
General Websites – HTML pages, blogs, wikis, etc.
Sitemap XML – Crawl all URLs listed in an XML sitemap.
Confluence Pages – Crawl pages from Atlassian Confluence.
Notion Pages – Crawl public Notion pages.
Private Notion Pages – With appropriate authentication.
ServiceNow Documentation – Crawl content from ServiceNow knowledge base.
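To illustrate what crawling "all URLs listed in an XML sitemap" involves, here is a minimal Python sketch that reads the `<loc>` entries from a sitemap document. It is not Alhena AI's implementation: the function name `extract_sitemap_urls` and the sample sitemap are assumptions made for illustration, and only the namespace comes from the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sample sitemap, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/getting-started</loc></url>
</urlset>"""

print(extract_sitemap_urls(SAMPLE_SITEMAP))
```

The same lookup also returns the `<loc>` entries of a sitemap index file, which point at further sitemaps rather than at pages, so a real crawler would need to tell the two apart.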
2. Google Drive
Google Docs
Google Sheets (spreadsheets)
Google Slides (presentations)
Google Drive Folders – Recursive crawling supported.
Google Drive Files – Generic support for any file inside Drive.
3. Document Files
PDF (.pdf)
Word Documents (.doc, .docx)
Excel Spreadsheets (.xls, .xlsx)
PowerPoint Presentations (.ppt, .pptx)
CSV / TSV (.csv, .tsv)
Plain Text (.txt), Markdown (.md), RST (.rst)
Rich Text Format (.rtf)
OpenDocument Files (.odt, .ods, .odp)
Apple iWork (.pages, .numbers, .key)
Email Files (.eml, .msg)
EPUB (.epub), Org-mode (.org)
Config/Data Files (.ini, .yaml, .toml, .xml, .json)
Images – .jpg, .jpeg, .png, .webp, etc.
4. Video & Media
YouTube Videos – Crawled via GeminiVideoScraper or similar.
Other Video Files – .mp4, .avi, .mkv, .mov, and more.
5. Social & Communication Platforms
Discord Messages
Slack Messages
Twitter Pages – Accounts, posts.
6. Helpdesk & Ticketing Systems
Zendesk Articles
Freshdesk Articles
Freshservice Articles
Gladly Articles
7. Ecommerce Platforms
Shopify Products
WooCommerce Products
Salesforce Commerce Cloud
Magento
Generic Product Pages – Custom HTML product extraction.
8. GitHub
Code Repositories – Source code, documentation, config files.
Issues
Discussions
9. Custom / Other Sources
S3 Uploaded Files
Custom Data Sources – Extensible scraper support.
PDF Crawling – Via URL or file upload.
Document Uploads – Supports any of the formats listed above.
Supported File Extensions (from INCLUDE_PATTERNS)
Source Code – .py, .js, .java, .c, .cpp, .cs, .rb, .go, .rs, .php, .swift, .kt, .ts, .scala, .pl, .r, .sh
Web – .html
Config & Data – .yaml, .yml, .xml, .ini, .toml
Documentation – .md, .rst, .txt
Presentations – .ppt, .pptx, .odp, .key
Spreadsheets – .xls, .xlsx, .ods, .numbers
Archives – .zip, .rar, .7z, .tar, .gz, .bz2, .xz, .iso
Media – .png, .jpg, .jpeg, .gif, .bmp, .tiff, .svg, .webp, .ico, .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma, .mp4, .avi, .mkv, .flv, .mov, .wmv, .m4v, .webm, .vob, .ogv
Other – .pdf, .doc, .docx, .pages, .eml, .msg, .epub, .org, .tsv, .rtf, .dockerfile, etc.
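As a rough illustration of how an include list like this can be used to decide whether a file is picked up, the Python sketch below matches file names against glob-style patterns. The format of Alhena AI's actual INCLUDE_PATTERNS is not documented here, so the glob syntax, the shortened pattern list, and the `is_included` helper are all assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of patterns; the real INCLUDE_PATTERNS format is an assumption.
INCLUDE_PATTERNS = ["*.py", "*.md", "*.pdf", "*.yaml", "*.docx"]


def is_included(filename: str, patterns: list[str] = INCLUDE_PATTERNS) -> bool:
    """Return True if the file name matches any include pattern, ignoring case."""
    name = filename.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in ["README.MD", "notes/guide.pdf", "photo.heic"]:
    print(candidate, is_included(candidate))
```

Lowercasing the name before matching means that files such as `README.MD` are treated the same as `readme.md`; whether the real crawler is case-insensitive is not stated in this article.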
Summary Table
Websites – General, Sitemap, Confluence, Notion, ServiceNow
Google Drive – Docs, Sheets, Slides, Folders, Files
Documents – PDF, Word, Excel, PowerPoint, CSV, TSV, TXT, Markdown, RST, RTF, ODT, ODS, ODP, Pages, Numbers, Key, EML, MSG, EPUB, Org, Images
Video – YouTube, MP4, AVI, MKV, MOV, etc.
Social/Comm – Discord, Slack, Twitter
Helpdesk – Zendesk, Freshdesk, Freshservice, Gladly
Ecommerce – Shopify, WooCommerce, Salesforce Commerce Cloud, Magento, Generic Product Pages
GitHub – Code, Issues, Discussions
Custom/Other – S3, Custom sources, PDF crawling, Uploads