πŸ“š Document Loaders

Document loaders are a powerful feature for importing content from various sources (files, directories, URLs, databases) into a standardized Document format that can be processed by AI workflows, stored in vector databases, or used for retrieval-augmented generation (RAG).

πŸ”„ Document Loading Flow

Overview

The document loading system provides:

  • Multiple Loader Types: Text, Markdown, CSV, JSON, XML, PDF, Log, HTTP, Feed, SQL, Directory, and WebCrawler loaders

  • Consistent Document Format: All loaders produce Document objects with content, metadata, id, and embedding properties

  • Fluent API: Chain methods for easy configuration and transformation

  • Memory Integration: Load directly into AI memory systems for RAG workflows via toMemory()

  • Chunking Support: Automatic text chunking for large documents

  • Multi-Memory Fan-out: Ingest to multiple memory systems simultaneously

  • Async Support: Load documents asynchronously with loadAsync()

  • Filter/Transform: Apply filters and transforms during loading

BIF Reference

| BIF | Purpose | Returns |
| --- | ------- | ------- |
| `aiDocuments()` | Create fluent document loader | `IDocumentLoader` |

Quick Start

Using aiDocuments()

The main entry point for document loading, returning a fluent loader:
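
A minimal sketch (passing the source path as the first argument and the `getContent()` accessor naming are assumptions; `load()` is documented under Loading Methods):

```boxlang
// Load a file into an array of Document objects
// (the path argument shape is an assumption)
documents = aiDocuments( "data/notes.txt" ).load();

documents.each( ( doc ) => {
	// getContent() follows BoxLang accessor conventions (assumed)
	println( doc.getContent() );
} );
```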

Fluent Configuration

The aiDocuments() BIF returns a loader that can be fluently configured:
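
For example, configuring a Markdown load (fluent setter names mirroring the MarkdownLoader options below are an assumption):

```boxlang
docs = aiDocuments( "docs/guide.md" )
	.splitByHeaders( true )    // one document per section
	.headerLevel( 2 )          // split at ## headers
	.removeCodeBlocks( true )  // drop fenced code before embedding
	.load();
```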

Memory Integration with toMemory()

Ingest documents into memory with comprehensive reporting:
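
A sketch, assuming `memory` is an existing AI memory instance and that `toMemory()` accepts the target plus the ingestion options documented under Loading to Memory:

```boxlang
report = aiDocuments( "docs/" ).toMemory( memory, {
	chunkSize : 800,
	overlap   : 150,
	trackCost : true
} );
```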

Ingestion Report Structure:
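
An illustrative shape only; the key names here are assumptions, not the documented API:

```boxlang
// Hypothetical report keys
{
	documents : 42,    // documents processed
	chunks    : 180,   // chunks created when chunking is enabled
	tokens    : 52000, // populated when trackTokens is true
	cost      : 0.01,  // estimated when trackCost is true
	errors    : []     // collected when continueOnError is true
}
```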

Filter and Transform

Apply filters and transforms during loading:
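
A hedged sketch; the `filter()`/`transform()` method names follow the capability described in the Overview but are assumptions:

```boxlang
docs = aiDocuments( "articles/" )
	// keep only non-empty documents
	.filter( ( doc ) => doc.getContent().trim().len() > 0 )
	// normalize whitespace before ingestion (setContent() is assumed)
	.transform( ( doc ) => doc.setContent( doc.getContent().trim() ) )
	.load();
```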

Document Structure

Each Document object has:
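
The four documented properties, shown as an illustrative struct:

```boxlang
{
	id        : "doc-0001",                    // unique identifier
	content   : "The extracted text",          // document content
	metadata  : { source : "data/notes.txt" }, // source-specific metadata
	embedding : []                             // vector embedding, when populated
}
```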

Document Methods

The Document class provides utility methods:

Available Loaders

TextLoader

Loads plain text files (.txt, .text).

MarkdownLoader

Loads Markdown files with optional header-based splitting.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `splitByHeaders` | boolean | `false` | Split document by headers |
| `headerLevel` | numeric | `2` | Header level to split at (1-6) |
| `removeCodeBlocks` | boolean | `false` | Remove fenced code blocks |
| `removeImages` | boolean | `false` | Remove image references |
| `removeLinks` | boolean | `false` | Remove links (keeps text) |

HTTPLoader

Loads content from HTTP/HTTPS URLs with automatic content type detection. This is the primary loader for all web-based content including HTML pages, JSON APIs, and XML feeds.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `contentType` | string | `"auto"` | Content type (`auto`, `text`, `html`, `json`, `xml`) |
| `method` | string | `"GET"` | HTTP method |
| `headers` | struct | `{}` | Request headers |
| `body` | string | `""` | Request body |
| `timeout` | numeric | `30` | Request timeout in seconds |
| `connectionTimeout` | numeric | `30` | Connection timeout in seconds |
| `redirect` | boolean | `true` | Follow redirects |
| `extractText` | boolean | `true` | Extract text from HTML |
| `removeScripts` | boolean | `true` | Remove script tags from HTML |
| `removeStyles` | boolean | `true` | Remove style tags from HTML |

Fluent HTTP Methods:

  • .get() - Set GET method

  • .post() - Set POST method

  • .put() - Set PUT method

  • .delete() - Set DELETE method

  • .method( "PATCH" ) - Set custom method

  • .header( name, value ) - Add single header

  • .headers( { name: value } ) - Add multiple headers

  • .body( content ) - Set request body

  • .timeout( seconds ) - Set request timeout

  • .connectionTimeout( seconds ) - Set connection timeout

  • .redirect( true/false ) - Enable/disable redirects

  • .proxy( server, port, user?, password? ) - Configure proxy
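
For example, posting JSON to an API endpoint (the URL is illustrative; the fluent methods are those listed above):

```boxlang
docs = aiDocuments( "https://api.example.com/articles" )
	.post()
	.header( "Authorization", "Bearer #apiToken#" )
	.body( jsonSerialize( { limit : 50 } ) )
	.timeout( 60 )
	.load();
```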

CSVLoader

Loads CSV files with header support and row-as-document options.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `delimiter` | string | `","` | Column delimiter |
| `hasHeaders` | boolean | `true` | First row contains headers |
| `rowsAsDocuments` | boolean | `false` | Create document per row |
| `columns` | array | `[]` | Columns to include |
| `skipRows` | numeric | `0` | Rows to skip at start |

JSONLoader

Loads JSON files with field extraction options.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `contentField` | string | `""` | Field to use as content |
| `metadataFields` | array | `[]` | Fields to extract as metadata |
| `arrayAsDocuments` | boolean | `false` | Create document per array item |

PDFLoader

Loads PDF documents with text extraction and metadata support using Apache PDFBox.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `sortByPosition` | boolean | `false` | Sort text by position on page |
| `addMoreFormatting` | boolean | `false` | Add additional formatting |
| `startPage` | numeric | `1` | First page to extract |
| `endPage` | numeric | `0` | Last page to extract (0 = all) |
| `suppressDuplicateOverlappingText` | boolean | `true` | Remove duplicate overlapping text |
| `includeMetadata` | boolean | `true` | Extract PDF metadata |

Metadata Fields Extracted:

  • title - Document title

  • author - Document author

  • subject - Document subject

  • keywords - Document keywords

  • creator - Application that created the PDF

  • producer - PDF producer software

  • creationDate - When the PDF was created

  • pageCount - Total number of pages

  • pdfVersion - PDF version (e.g., "1.7")

  • isEncrypted - Whether the PDF is encrypted

LogLoader

Loads and parses application log files with pattern matching and filtering.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `pattern` | string | Auto-detect | Regex pattern to parse log entries |
| `filterByLevel` | string/array | `""` | Log level(s) to include |
| `excludePattern` | string | `""` | Regex pattern to exclude entries |
| `startDate` | string | `""` | Include logs after this date |
| `endDate` | string | `""` | Include logs before this date |
| `maxLines` | numeric | `0` | Max lines to load (0 = unlimited) |
| `includeTimestamp` | boolean | `true` | Include timestamp in metadata |

Supported Log Formats:

  • Standard format: [2024-01-01 10:00:00] ERROR: Message

  • Syslog format: Jan 1 10:00:00 hostname app: Message

  • Custom regex patterns via pattern() configuration

DirectoryLoader

Loads all files from a directory using appropriate loaders.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `recursive` | boolean | `false` | Scan subdirectories |
| `extensions` | array | `[]` | File extensions to include |
| `excludePatterns` | array | `[]` | Regex patterns to exclude |
| `includeHidden` | boolean | `false` | Include hidden files |
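
For example (`recursive()` and `extensions()` are referenced under Best Practices; passing the directory path to `aiDocuments()` is an assumption):

```boxlang
docs = aiDocuments( "src/docs/" )
	.recursive( true )             // scan subdirectories
	.extensions( [ "md", "txt" ] ) // only Markdown and text files
	.load();
```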

XMLLoader

Loads and parses XML documents with XPath support. Useful for config files, RSS feeds, and legacy data.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `elementPath` | string | `""` | XPath to extract elements from |
| `elementsAsDocuments` | boolean | `false` | Create document per matching element |
| `contentElements` | array | `[]` | XPath expressions for content |
| `metadataElements` | array | `[]` | XPath expressions for metadata |
| `preserveWhitespace` | boolean | `false` | Preserve whitespace in text |
| `includeAttributes` | boolean | `true` | Include attribute values |
| `namespaceAware` | boolean | `true` | Handle XML namespaces |
| `namespaces` | struct | `{}` | Namespace prefix mappings |

FeedLoader

Loads RSS and Atom feeds, creating a document per feed item. Perfect for blog aggregation, news feeds, and content syndication.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `includeDescription` | boolean | `true` | Include item description |
| `includeContent` | boolean | `true` | Include full content |
| `stripHtml` | boolean | `true` | Strip HTML tags from content |
| `maxItems` | numeric | `0` | Maximum items to load (0 = all) |
| `sinceDate` | date | `""` | Only load items since date |
| `categories` | array | `[]` | Filter by categories |
| `timeout` | numeric | `30` | HTTP timeout for URL feeds |

SQLLoader

Loads documents from database queries. Converts query results to Document objects for RAG over structured data.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `datasource` | string | `""` | Datasource name to use |
| `contentColumn` | string | `""` | Column to use as document content |
| `contentColumns` | array | `[]` | Array of columns to combine as content |
| `contentTemplate` | string | `""` | Template with `${column}` placeholders |
| `metadataColumns` | array | `[]` | Columns to extract as metadata |
| `idColumn` | string | `""` | Column to use as document ID |
| `params` | struct | `{}` | Query parameters |
| `maxRows` | numeric | `0` | Maximum rows to return (0 = all) |
| `rowsAsDocuments` | boolean | `true` | Create document per row |
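
A sketch, assuming the query is passed to `aiDocuments()` and that fluent setters mirror the option names above:

```boxlang
docs = aiDocuments( "SELECT id, title, body FROM articles" )
	.datasource( "blogDB" )
	.contentTemplate( "${title}: ${body}" ) // documented placeholder syntax
	.idColumn( "id" )
	.metadataColumns( [ "title" ] )
	.load();
```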

WebCrawlerLoader

Crawls multiple web pages by following links. Respects robots.txt and supports depth-limited crawling. Uses JSoup for HTML parsing.

Configuration Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `maxPages` | numeric | `10` | Maximum pages to crawl |
| `maxDepth` | numeric | `2` | Maximum link depth |
| `followExternalLinks` | boolean | `false` | Follow links to other domains |
| `allowedDomains` | array | `[]` | Domains allowed for external links |
| `allowedPaths` | array | `[]` | Path prefixes to allow |
| `excludedPaths` | array | `[]` | Path prefixes to exclude |
| `urlPatterns` | array | `[]` | URL regex patterns to match |
| `excludeUrlPatterns` | array | `[]` | URL regex patterns to exclude |
| `respectRobotsTxt` | boolean | `true` | Respect robots.txt rules |
| `contentSelector` | string | `""` | CSS selector for content extraction |
| `excludeSelectors` | array | `[]` | CSS selectors to exclude from content |
| `delay` | numeric | `1000` | Delay between requests in ms |
| `userAgent` | string | `"BoxLang-WebCrawler/1.0"` | User agent string |
| `deduplicateContent` | boolean | `true` | Skip pages with duplicate content |
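
A sketch, assuming the start URL is passed to `aiDocuments()` and fluent setters mirror the option names above:

```boxlang
docs = aiDocuments( "https://example.com/docs/" )
	.maxPages( 25 )
	.maxDepth( 3 )
	.contentSelector( "main article" ) // extract only the article body
	.load();
```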

Loading Methods

All loaders support these loading methods:

| Method | Description |
| ------ | ----------- |
| `load()` | Load all documents synchronously |
| `loadAsync()` | Load all documents asynchronously (returns BoxFuture) |
| `loadAsStream()` | Load as Java Stream for lazy processing |
| `loadBatch( batchSize )` | Load documents in batches |
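
For example (the source argument is an assumption; the methods are those in the table above):

```boxlang
// Non-blocking load: loadAsync() returns a BoxFuture
aiDocuments( "data/" ).loadAsync().then( ( docs ) => {
	println( "Loaded #docs.len()# documents" );
} );

// Lazy processing via a Java Stream
aiDocuments( "data/" ).loadAsStream().forEach( ( doc ) => println( doc ) );
```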

Loading to Memory

Using loadTo() Method

Loaders can store documents directly into AI memory systems:
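
A sketch, assuming `memory` is an existing AI memory instance:

```boxlang
aiDocuments( "docs/" ).loadTo( memory );
```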

For comprehensive ingestion with reporting:
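
For example, fanning out to several memories at once (the array form reflects the multi-memory fan-out described in the Overview; `vectorMemory` and `keywordMemory` are assumed instances):

```boxlang
report = aiDocuments( "docs/" ).toMemory(
	[ vectorMemory, keywordMemory ], // fan-out to multiple memory systems
	{
		chunkSize : 800,
		overlap   : 150,
		dedupe    : true,
		async     : true // non-blocking multi-memory ingestion
	}
);
```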

Ingestion Options:

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `chunkSize` | numeric | `0` | Chunk size (0 = no chunking) |
| `overlap` | numeric | `0` | Overlap between chunks |
| `strategy` | string | `"recursive"` | Chunking strategy |
| `dedupe` | boolean | `false` | Enable deduplication |
| `dedupeThreshold` | numeric | `0.95` | Similarity threshold |
| `trackTokens` | boolean | `true` | Track token counts |
| `trackCost` | boolean | `true` | Estimate costs |
| `async` | boolean | `false` | Use async for multi-memory |
| `batchSize` | numeric | `100` | Batch size for processing |
| `continueOnError` | boolean | `true` | Continue on document errors |

The Document Class

The Document class provides a consistent interface for working with loaded content:
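
A hedged sketch; the accessor names follow BoxLang property conventions and are assumptions:

```boxlang
docs = aiDocuments( "data/notes.txt" ).load();
doc  = docs.first();

println( doc.getId() );
println( doc.getContent() );
println( doc.getMetadata().source ?: "unknown" );
```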

Error Handling

Loaders track errors encountered during loading:
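
A sketch; `getErrors()` is a hypothetical accessor for the tracked errors:

```boxlang
loader = aiDocuments( "data/" );
docs   = loader.load();

if ( loader.getErrors().len() ) {
	loader.getErrors().each( ( err ) => println( err ) );
}
```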

Custom Loaders

You can create custom loaders by extending BaseDocumentLoader:
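
A minimal sketch; only the `BaseDocumentLoader` parent is documented here, so the `load()` override, the source handling, and the `Document` constructor shown are assumptions:

```boxlang
// MyLoader.bx
class extends="BaseDocumentLoader" {

	function load(){
		var raw = fileRead( variables.source ); // source handling is assumed
		return [
			new Document( content = raw, metadata = { source : variables.source } )
		];
	}

}
```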

RAG Pipeline Example

Here's a complete example of building a RAG pipeline with document loaders:
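
A sketch combining the documented pieces (directory loading, chunking, and `toMemory()` ingestion); `memory` is an assumed vector memory instance:

```boxlang
// Load a documentation directory, filtered to Markdown files,
// then chunk and ingest into a vector memory with cost tracking
report = aiDocuments( "docs/" )
	.recursive( true )
	.extensions( [ "md" ] )
	.toMemory( memory, {
		chunkSize : 800,
		overlap   : 150,
		strategy  : "recursive",
		dedupe    : true,
		trackCost : true
	} );

// The populated memory can now back retrieval-augmented generation queries
```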

Best Practices

  1. Use toMemory() for Production: It provides comprehensive reporting, error handling, and multi-memory support.

  2. Configure Chunking: For vector memory, use appropriate chunk sizes (500-1000 chars) with overlap (100-200 chars).

  3. Use Metadata: Leverage metadata for filtering and context in RAG queries.

  4. Handle Large Directories: Use recursive() sparingly and filter by extensions() to avoid loading unnecessary files.

  5. Monitor Costs: Use trackCost: true in ingestion options to estimate embedding costs before large ingestions.

  6. Multi-Memory for Redundancy: Use array of memories for fan-out ingestion to multiple vector stores.

  7. Use Async for Large Loads: Use loadAsync() or ingestOptions.async for non-blocking operations.

  8. Use HTTPLoader for Web Content: The HTTPLoader handles all URL-based content including HTML pages, JSON APIs, and XML feeds.

See Also
