Training Data Sources

Alhena AI supports various types of data sources for a bot's knowledge base.

Supported Data Crawling Sources in Alhena AI

Alhena AI supports a wide range of data crawling capabilities, enabling seamless extraction and indexing of data from diverse sources. This article outlines all currently supported crawling integrations.


1. Websites & Web Pages

  • General Websites – HTML pages, blogs, wikis, etc.

  • Sitemap XML – Crawl all URLs listed in an XML sitemap.

  • Confluence Pages – Crawl pages from Atlassian Confluence.

  • Notion Pages – Crawl public Notion pages.

  • Private Notion Pages – With appropriate authentication.

  • ServiceNow Documentation – Crawl content from a ServiceNow knowledge base.
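
As an illustration of the Sitemap XML option above, the sketch below shows one way to pull the URL list out of a sitemap that follows the sitemaps.org protocol. It is not Alhena AI's implementation; the function name `extract_sitemap_urls` and the sample sitemap are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL in a sitemap document, in document order.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Made-up sample sitemap used only for this example.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/getting-started</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(EXAMPLE_SITEMAP):
        print(url)
```

Because the search matches every `<loc>` element, the same helper also returns the child sitemap URLs of a sitemap index file, which a crawler would then fetch and parse in turn.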


2. Google Drive

  • Google Docs

  • Google Sheets (spreadsheets)

  • Google Slides (presentations)

  • Google Drive Folders – Recursive crawling supported.

  • Google Drive Files – Generic support for any file inside Drive.


3. Document Files

  • PDF (.pdf)

  • Word Documents (.doc, .docx)

  • Excel Spreadsheets (.xls, .xlsx)

  • PowerPoint Presentations (.ppt, .pptx)

  • CSV / TSV (.csv, .tsv)

  • Plain Text (.txt), Markdown (.md), RST (.rst)

  • Rich Text Format (.rtf)

  • OpenDocument Files (.odt, .ods, .odp)

  • Apple iWork (.pages, .numbers, .key)

  • Email Files (.eml, .msg)

  • EPUB (.epub), Org-mode (.org)

  • Config/Data Files (.ini, .yaml, .toml, .xml, .json)

  • Images – .jpg, .jpeg, .png, .webp, etc.


4. Video & Media

  • YouTube Videos – Crawled via GeminiVideoScraper or similar.

  • Other Video Files – .mp4, .avi, .mkv, .mov, and more.


5. Social & Communication Platforms

  • Discord Messages

  • Slack Messages

  • Twitter Pages – Accounts, posts.


6. Helpdesk & Ticketing Systems

  • Zendesk Articles

  • Freshdesk Articles

  • Freshservice Articles

  • Gladly Articles


7. Ecommerce Platforms

  • Shopify Products

  • WooCommerce Products

  • Salesforce Commerce Cloud

  • Magento

  • Generic Product Pages – Custom HTML product extraction.


8. GitHub

  • Code Repositories – Source code, documentation, config files.

  • Issues

  • Discussions


9. Custom / Other Sources

  • S3 Uploaded Files

  • Custom Data Sources – Extensible scraper support.

  • PDF Crawling – Via URL or file upload.

  • Document Uploads – Supports any of the formats listed above.


Supported File Extensions (from INCLUDE_PATTERNS)

Source Code

.py, .js, .java, .c, .cpp, .cs, .rb, .go, .rs, .php, .swift, .kt, .ts, .scala, .pl, .r, .sh

Web

.html

Config & Data

.yaml, .yml, .xml, .ini, .toml

Documentation

.md, .rst, .txt

Presentations

.ppt, .pptx, .odp, .key

Spreadsheets

.xls, .xlsx, .ods, .numbers

Archives

.zip, .rar, .7z, .tar, .gz, .bz2, .xz, .iso

Media

.png, .jpg, .jpeg, .gif, .bmp, .tiff, .svg, .webp, .ico, .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma, .mp4, .avi, .mkv, .flv, .mov, .wmv, .m4v, .webm, .vob, .ogv

Other

.pdf, .doc, .docx, .pages, .eml, .msg, .epub, .org, .tsv, .rtf, .dockerfile, etc.
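
To make the grouping above concrete, the following sketch maps a file name to one of the listed categories by its extension. The `EXTENSION_CATEGORIES` dictionary and the `categorize` function are hypothetical names that copy a subset of the lists above; the real structure of INCLUDE_PATTERNS is not shown in this article and may differ.

```python
from pathlib import PurePosixPath
from typing import Optional

# Illustrative subset copied from the lists above.
# This is an assumption for the example, not the actual INCLUDE_PATTERNS.
EXTENSION_CATEGORIES = {
    "Source Code": {".py", ".js", ".java", ".go", ".rs", ".ts"},
    "Web": {".html"},
    "Config & Data": {".yaml", ".yml", ".xml", ".ini", ".toml"},
    "Documentation": {".md", ".rst", ".txt"},
    "Presentations": {".ppt", ".pptx", ".odp", ".key"},
    "Spreadsheets": {".xls", ".xlsx", ".ods", ".numbers"},
    "Archives": {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2", ".xz", ".iso"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the category for a file name, or None if its extension is not listed."""
    # Only the final suffix is considered, compared case-insensitively.
    suffix = PurePosixPath(filename).suffix.lower()
    for category, extensions in EXTENSION_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    print(categorize("docs/README.MD"))  # Documentation
    print(categorize("backup.tar.gz"))   # Archives
    print(categorize("notes.xyz"))       # None
```

Only the last suffix is checked, so `backup.tar.gz` is matched through `.gz`. A file with no extension at all returns `None` in this sketch.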


Summary Table

| Category | Examples / Platforms Supported |
| --- | --- |
| Websites | General, Sitemap, Confluence, Notion, ServiceNow |
| Google Drive | Docs, Sheets, Slides, Folders, Files |
| Documents | PDF, Word, Excel, PowerPoint, CSV, TSV, TXT, Markdown, RST, RTF, ODT, ODS, ODP, Pages, Numbers, Key, EML, MSG, EPUB, Org, Images |
| Video | YouTube, MP4, AVI, MKV, MOV, etc. |
| Social/Comm | Discord, Slack, Twitter |
| Helpdesk | Zendesk, Freshdesk, Freshservice, Gladly |
| Ecommerce | Shopify, WooCommerce, Salesforce Commerce Cloud, Magento, Generic Product Pages |
| GitHub | Code, Issues, Discussions |
| Custom/Other | S3, Custom sources, PDF crawling, Uploads |

