Data Sources

Alhena AI supports several types of data sources for a bot's knowledge base.

Supported Data Crawling Sources in Alhena AI

Alhena AI can crawl, extract, and index content from the sources listed below. This article covers all currently supported crawling integrations.


1. Websites

  • General Websites – HTML pages, blogs, wikis, etc.

  • Sitemap XML – Crawl all URLs listed in an XML sitemap.

  • Confluence Pages – Crawl pages from Atlassian Confluence.

  • ServiceNow Documentation – Crawl content from a ServiceNow knowledge base.
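
To illustrate what crawling "all URLs listed in an XML sitemap" involves, the following minimal sketch extracts the `<loc>` entries from a sitemap that follows the standard sitemaps.org schema. It is not Alhena AI's implementation; the function name `urls_from_sitemap` and the sample URLs are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap; a real crawler would download this over HTTP.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/getting-started</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE))
```

Each URL returned this way would then be fetched and indexed like a general website page.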


2. Google Drive

  • Google Docs

  • Google Sheets (spreadsheets)

  • Google Slides (presentations)

  • Google Drive Folders – Recursive crawling supported.

  • Google Drive Files – Generic support for any file inside Drive.


3. Document Files

  • PDF (.pdf)

  • Word Documents (.doc, .docx)

  • Excel Spreadsheets (.xls, .xlsx)

  • PowerPoint Presentations (.ppt, .pptx)

  • CSV / TSV (.csv, .tsv)

  • Plain Text (.txt), Markdown (.md), RST (.rst)

  • Rich Text Format (.rtf)

  • OpenDocument Files (.odt, .ods, .odp)

  • Apple iWork (.pages, .numbers, .key)

  • Email Files (.eml, .msg)

  • EPUB (.epub), Org-mode (.org)

  • Config/Data Files (.ini, .yaml, .toml, .xml, .json)

  • Images (.jpg, .jpeg, .png, .webp, etc.)


4. Video & Media

  • YouTube Videos – Transcripts and content from videos.

  • Other Video Files (.mp4, .avi, .mkv, .mov, and more)


5. Connected Workspaces & Communication Platforms

  • Notion

  • Discord

  • Slack

  • Twitter


6. Helpdesk & Ticketing Systems

  • Zendesk

  • Freshdesk

  • Freshchat

  • Salesforce Service Cloud


7. Ecommerce Platforms

  • Shopify

  • WooCommerce

  • Salesforce

  • Magento

  • Generic Product Pages


8. GitHub

  • Code Repositories – Source code, documentation, config files.

  • Issues

  • Discussions


9. Custom / Other Sources

  • S3

  • Custom sources

  • PDF crawling

  • Uploads


Supported File Extensions (from INCLUDE_PATTERNS)

| Category | Extensions |
| --- | --- |
| Source Code | .py, .js, .java, .c, .cpp, .cs, .rb, .go, .rs, .php, .swift, .kt, .ts, .scala, .pl, .r, .sh |
| Web | .html |
| Config & Data | .yaml, .yml, .xml, .ini, .toml |
| Documentation | .md, .rst, .txt |
| Presentations | .ppt, .pptx, .odp, .key |
| Spreadsheets | .xls, .xlsx, .ods, .numbers |
| Archives | .zip, .rar, .7z, .tar, .gz, .bz2, .xz, .iso |
| Media | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .svg, .webp, .ico, .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma, .mp4, .avi, .mkv, .flv, .mov, .wmv, .m4v, .webm, .vob, .ogv |
| Other | .pdf, .doc, .docx, .pages, .eml, .msg, .epub, .org, .tsv, .rtf, .dockerfile, etc. |
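
As a rough illustration of how an extension list like the one above can be used to decide whether a file is eligible for crawling, here is a minimal sketch. The actual structure of `INCLUDE_PATTERNS` is not shown in this article, so the set `SUPPORTED_EXTENSIONS` (a small subset of the extensions above) and the helper `is_supported` are assumptions made for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical subset of the extensions listed above; the real
# INCLUDE_PATTERNS value and its format are not documented here.
SUPPORTED_EXTENSIONS = {".py", ".md", ".pdf", ".yaml", ".png", ".mp4"}


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    # Compare case-insensitively so "README.MD" matches ".md".
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["docs/README.MD", "src/main.py", "build/app.exe"]:
    print(name, is_supported(name))
```

Only the final suffix is checked in this sketch, so a name such as `backup.tar.gz` would be matched against `.gz`, not `.tar.gz`.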


Summary Table

| Category | Examples / Platforms Supported |
| --- | --- |
| Websites | General, Sitemap, Confluence, ServiceNow |
| Google Drive | Docs, Sheets, Slides, Folders, Files |
| Documents | PDF, Word, Excel, PowerPoint, CSV, TSV, TXT, Markdown, RST, RTF, ODT, ODS, ODP, Pages, Numbers, Key, EML, MSG, EPUB, Org, Images |
| Video | YouTube, MP4, AVI, MKV, MOV, etc. |
| Workspaces | Notion, Discord, Slack, Twitter |
| Helpdesk | Zendesk, Freshdesk, Freshchat, Salesforce Service Cloud |
| Ecommerce | Shopify, WooCommerce, Salesforce, Magento, Generic Product Pages |
| GitHub | Code, Issues, Discussions |
| Custom/Other | S3, Custom sources, PDF crawling, Uploads |

