# Document Loaders
Document loaders are a powerful feature for importing content from various sources (files, directories, URLs, databases) into a standardized Document format that can be processed by AI workflows, stored in vector databases, or used for retrieval-augmented generation (RAG).
## Document Loading Flow
## Overview
The document loading system provides:
- **Multiple Loader Types**: Text, Markdown, CSV, JSON, XML, PDF, Log, HTTP, Feed, SQL, Directory, and WebCrawler loaders
- **Consistent Document Format**: All loaders produce `Document` objects with content, metadata, id, and embedding properties
- **Fluent API**: Chain methods for easy configuration and transformation
- **Memory Integration**: Load directly into AI memory systems for RAG workflows via `toMemory()`
- **Chunking Support**: Automatic text chunking for large documents
- **Multi-Memory Fan-out**: Ingest into multiple memory systems simultaneously
- **Async Support**: Load documents asynchronously with `loadAsync()`
- **Filter/Transform**: Apply filters and transforms during loading
## BIF Reference

| BIF | Description | Returns |
| --- | --- | --- |
| `aiDocuments()` | Create a fluent document loader | `IDocumentLoader` |
## Quick Start

### Using aiDocuments()

The main entry point for document loading is `aiDocuments()`, which returns a fluent loader:
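A minimal sketch (the source argument and the `getContent()` accessor are reasonable assumptions; only `aiDocuments()` and `load()` are documented here):

```
// Load a single text file into an array of Document objects
docs = aiDocuments( "data/notes.txt" ).load();

// Each entry is a Document with content, metadata, id, and embedding
for( doc in docs ) {
	println( doc.getContent() );
}
```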
### Fluent Configuration
The aiDocuments() BIF returns a loader that can be fluently configured:
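For example, a Markdown source could be configured before loading. This sketch assumes each configuration option is exposed as a same-named fluent method (the docs show this pattern for `recursive()`, `extensions()`, and `pattern()`):

```
docs = aiDocuments( "docs/guide.md" )
	.splitByHeaders( true )     // option names mirror the configuration tables below
	.headerLevel( 2 )
	.removeCodeBlocks( true )
	.load();
```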
### Memory Integration with toMemory()

Ingest documents into memory with comprehensive reporting:
Ingestion Report Structure:
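A hedged sketch of an ingestion call; the report field names shown in the comments are illustrative assumptions, not the definitive structure:

```
// Ingest a directory into a vector memory, chunking as we go
report = aiDocuments( "docs/" )
	.recursive( true )
	.toMemory( vectorMemory, { chunkSize: 800, overlap: 150 } );

// The report aggregates ingestion results; keys such as
// documentsLoaded, chunksCreated, tokens, cost, and errors are
// plausible given the ingestion options below, but illustrative only.
```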
### Filter and Transform
Apply filters and transforms during loading:
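A sketch of filtering and transforming during a load; the `filter()`/`transform()` callback signatures and the `getContent()`/`setContent()` accessors are assumptions:

```
docs = aiDocuments( "logs/" )
	.filter( function( doc ) {
		// keep only non-empty documents
		return doc.getContent().len() > 0;
	} )
	.transform( function( doc ) {
		// normalize whitespace before storage
		return doc.setContent( doc.getContent().trim() );
	} )
	.load();
```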
## Document Structure
Each Document object has:
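Accessing the documented properties might look like this (getter names are assumptions based on the property names listed above):

```
doc = aiDocuments( "data/notes.txt" ).load().first();

content  = doc.getContent();    // the text payload
metadata = doc.getMetadata();   // struct of source-specific metadata
id       = doc.getId();         // unique document identifier
```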
## Document Methods
The Document class provides utility methods:
## Available Loaders
### TextLoader
Loads plain text files (.txt, .text).
### MarkdownLoader
Loads Markdown files with optional header-based splitting.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `splitByHeaders` | boolean | `false` | Split document by headers |
| `headerLevel` | numeric | `2` | Header level to split at (1-6) |
| `removeCodeBlocks` | boolean | `false` | Remove fenced code blocks |
| `removeImages` | boolean | `false` | Remove image references |
| `removeLinks` | boolean | `false` | Remove links (keeps text) |
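A sketch of loading a Markdown file split into per-section documents (option-named fluent methods are assumed, per the fluent configuration pattern):

```
sections = aiDocuments( "README.md" )
	.splitByHeaders( true )   // one Document per level-2 section
	.headerLevel( 2 )
	.removeImages( true )
	.load();
```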
### HTTPLoader
Loads content from HTTP/HTTPS URLs with automatic content type detection. This is the primary loader for all web-based content including HTML pages, JSON APIs, and XML feeds.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `contentType` | string | `"auto"` | Content type (`auto`, `text`, `html`, `json`, `xml`) |
| `method` | string | `"GET"` | HTTP method |
| `headers` | struct | `{}` | Request headers |
| `body` | string | `""` | Request body |
| `timeout` | numeric | `30` | Request timeout in seconds |
| `connectionTimeout` | numeric | `30` | Connection timeout in seconds |
| `redirect` | boolean | `true` | Follow redirects |
| `extractText` | boolean | `true` | Extract text from HTML |
| `removeScripts` | boolean | `true` | Remove script tags from HTML |
| `removeStyles` | boolean | `true` | Remove style tags from HTML |
Fluent HTTP Methods:
- `.get()` - Set GET method
- `.post()` - Set POST method
- `.put()` - Set PUT method
- `.delete()` - Set DELETE method
- `.method( "PATCH" )` - Set custom method
- `.header( name, value )` - Add single header
- `.headers( { name: value } )` - Add multiple headers
- `.body( content )` - Set request body
- `.timeout( seconds )` - Set request timeout
- `.connectionTimeout( seconds )` - Set connection timeout
- `.redirect( true/false )` - Enable/disable redirects
- `.proxy( server, port, user?, password? )` - Configure proxy
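Putting the fluent HTTP methods together (passing the URL to `aiDocuments()` is an assumption):

```
apiDocs = aiDocuments( "https://api.example.com/articles" )
	.get()
	.header( "Accept", "application/json" )
	.timeout( 15 )
	.load();
```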
### CSVLoader
Loads CSV files with header support and row-as-document options.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `delimiter` | string | `","` | Column delimiter |
| `hasHeaders` | boolean | `true` | First row contains headers |
| `rowsAsDocuments` | boolean | `false` | Create document per row |
| `columns` | array | `[]` | Columns to include |
| `skipRows` | numeric | `0` | Rows to skip at start |
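A sketch of loading a CSV with one document per row (option-named fluent methods assumed; column names are hypothetical):

```
rows = aiDocuments( "data/products.csv" )
	.hasHeaders( true )
	.rowsAsDocuments( true )                  // one Document per row
	.columns( [ "name", "description" ] )     // include only these columns
	.load();
```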
### JSONLoader
Loads JSON files with field extraction options.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `contentField` | string | `""` | Field to use as content |
| `metadataFields` | array | `[]` | Fields to extract as metadata |
| `arrayAsDocuments` | boolean | `false` | Create document per array item |
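A sketch of loading a JSON array of posts (option-named fluent methods assumed; field names are hypothetical):

```
items = aiDocuments( "data/posts.json" )
	.arrayAsDocuments( true )                // one Document per array item
	.contentField( "body" )                  // use the "body" field as content
	.metadataFields( [ "author", "date" ] )  // carry these fields as metadata
	.load();
```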
### PDFLoader
Loads PDF documents with text extraction and metadata support using Apache PDFBox.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `sortByPosition` | boolean | `false` | Sort text by position on page |
| `addMoreFormatting` | boolean | `false` | Add additional formatting |
| `startPage` | numeric | `1` | First page to extract |
| `endPage` | numeric | `0` | Last page to extract (0 = all) |
| `suppressDuplicateOverlappingText` | boolean | `true` | Remove duplicate overlapping text |
| `includeMetadata` | boolean | `true` | Extract PDF metadata |
Metadata Fields Extracted:
- `title` - Document title
- `author` - Document author
- `subject` - Document subject
- `keywords` - Document keywords
- `creator` - Application that created the PDF
- `producer` - PDF producer software
- `creationDate` - When the PDF was created
- `pageCount` - Total number of pages
- `pdfVersion` - PDF version (e.g., "1.7")
- `isEncrypted` - Whether the PDF is encrypted
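A sketch of extracting a page range from a PDF (option-named fluent methods assumed):

```
pages = aiDocuments( "reports/annual.pdf" )
	.startPage( 1 )
	.endPage( 10 )              // 0 would mean "all pages"
	.includeMetadata( true )    // title, author, pageCount, etc. in metadata
	.load();
```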
### LogLoader
Loads and parses application log files with pattern matching and filtering.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `pattern` | string | Auto-detect | Regex pattern to parse log entries |
| `filterByLevel` | string/array | `""` | Log level(s) to include |
| `excludePattern` | string | `""` | Regex pattern to exclude entries |
| `startDate` | string | `""` | Include logs after this date |
| `endDate` | string | `""` | Include logs before this date |
| `maxLines` | numeric | `0` | Max lines to load (0 = unlimited) |
| `includeTimestamp` | boolean | `true` | Include timestamp in metadata |
Supported Log Formats:
- Standard format: `[2024-01-01 10:00:00] ERROR: Message`
- Syslog format: `Jan 1 10:00:00 hostname app: Message`
- Custom regex patterns via the `pattern()` configuration
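A sketch of pulling only error-level entries from a log file (option-named fluent methods assumed, following the documented `pattern()` style):

```
errors = aiDocuments( "logs/app.log" )
	.filterByLevel( [ "ERROR", "FATAL" ] )   // include only these levels
	.startDate( "2024-01-01" )
	.maxLines( 10000 )
	.load();
```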
### DirectoryLoader
Loads all files from a directory using appropriate loaders.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `recursive` | boolean | `false` | Scan subdirectories |
| `extensions` | array | `[]` | File extensions to include |
| `excludePatterns` | array | `[]` | Regex patterns to exclude |
| `includeHidden` | boolean | `false` | Include hidden files |
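A sketch of loading a content directory with filtering (`recursive()` and `extensions()` are referenced in the best practices below; `excludePatterns()` is assumed to follow the same fluent pattern):

```
docs = aiDocuments( "content/" )
	.recursive( true )                  // scan subdirectories
	.extensions( [ "md", "txt" ] )      // only these file types
	.excludePatterns( [ "draft.*" ] )   // skip drafts by regex
	.load();
```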
### XMLLoader
Loads and parses XML documents with XPath support. Useful for config files, RSS feeds, and legacy data.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `elementPath` | string | `""` | XPath to extract elements from |
| `elementsAsDocuments` | boolean | `false` | Create document per matching element |
| `contentElements` | array | `[]` | XPath expressions for content |
| `metadataElements` | array | `[]` | XPath expressions for metadata |
| `preserveWhitespace` | boolean | `false` | Preserve whitespace in text |
| `includeAttributes` | boolean | `true` | Include attribute values |
| `namespaceAware` | boolean | `true` | Handle XML namespaces |
| `namespaces` | struct | `{}` | Namespace prefix mappings |
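A sketch of splitting an XML file into one document per element (option-named fluent methods assumed; the XPath is hypothetical):

```
entries = aiDocuments( "data/catalog.xml" )
	.elementPath( "//product" )      // XPath selecting the elements
	.elementsAsDocuments( true )     // one Document per matching element
	.load();
```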
### FeedLoader
Loads RSS and Atom feeds, creating a document per feed item. Perfect for blog aggregation, news feeds, and content syndication.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `includeDescription` | boolean | `true` | Include item description |
| `includeContent` | boolean | `true` | Include full content |
| `stripHtml` | boolean | `true` | Strip HTML tags from content |
| `maxItems` | numeric | `0` | Maximum items to load (0 = all) |
| `sinceDate` | date | `""` | Only load items since date |
| `categories` | array | `[]` | Filter by categories |
| `timeout` | numeric | `30` | HTTP timeout for URL feeds |
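A sketch of aggregating recent posts from an RSS feed (option-named fluent methods assumed; the feed URL is hypothetical):

```
posts = aiDocuments( "https://blog.example.com/feed.xml" )
	.maxItems( 25 )
	.stripHtml( true )
	.sinceDate( "2024-01-01" )
	.load();
```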
### SQLLoader
Loads documents from database queries. Converts query results to Document objects for RAG over structured data.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `datasource` | string | `""` | Datasource name to use |
| `contentColumn` | string | `""` | Column to use as document content |
| `contentColumns` | array | `[]` | Array of columns to combine as content |
| `contentTemplate` | string | `""` | Template with `${column}` placeholders |
| `metadataColumns` | array | `[]` | Columns to extract as metadata |
| `idColumn` | string | `""` | Column to use as document ID |
| `params` | struct | `{}` | Query parameters |
| `maxRows` | numeric | `0` | Maximum rows to return (0 = all) |
| `rowsAsDocuments` | boolean | `true` | Create document per row |
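A sketch of turning query rows into documents. The `sql()` method and how the statement itself is supplied are assumptions; the datasource and column names are hypothetical, while the other options come from the table above:

```
articles = aiDocuments()
	.datasource( "myDB" )                          // hypothetical datasource
	.sql( "SELECT id, title, body FROM articles" ) // assumed way to supply the query
	.contentTemplate( "${title}: ${body}" )        // combine columns into content
	.metadataColumns( [ "id" ] )
	.load();
```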
### WebCrawlerLoader
Crawls multiple web pages by following links. Respects robots.txt and supports depth-limited crawling. Uses JSoup for HTML parsing.
Configuration Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxPages` | numeric | `10` | Maximum pages to crawl |
| `maxDepth` | numeric | `2` | Maximum link depth |
| `followExternalLinks` | boolean | `false` | Follow links to other domains |
| `allowedDomains` | array | `[]` | Domains allowed for external links |
| `allowedPaths` | array | `[]` | Path prefixes to allow |
| `excludedPaths` | array | `[]` | Path prefixes to exclude |
| `urlPatterns` | array | `[]` | URL regex patterns to match |
| `excludeUrlPatterns` | array | `[]` | URL regex patterns to exclude |
| `respectRobotsTxt` | boolean | `true` | Respect robots.txt rules |
| `contentSelector` | string | `""` | CSS selector for content extraction |
| `excludeSelectors` | array | `[]` | CSS selectors to exclude from content |
| `delay` | numeric | `1000` | Delay between requests in ms |
| `userAgent` | string | `"BoxLang-WebCrawler/1.0"` | User agent string |
| `deduplicateContent` | boolean | `true` | Skip pages with duplicate content |
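A sketch of a polite, depth-limited crawl of a docs site (option-named fluent methods assumed; the URL and selector are hypothetical):

```
pages = aiDocuments( "https://docs.example.com" )
	.maxPages( 50 )
	.maxDepth( 3 )
	.contentSelector( "article" )   // CSS selector for the main content
	.delay( 500 )                   // be polite: 500ms between requests
	.load();
```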
## Loading Methods
All loaders support these loading methods:
| Method | Description |
| --- | --- |
| `load()` | Load all documents synchronously |
| `loadAsync()` | Load all documents asynchronously (returns a BoxFuture) |
| `loadAsStream()` | Load as a Java Stream for lazy processing |
| `loadBatch( batchSize )` | Load documents in batches |
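The same loader can be driven by any of these methods; the `then()` continuation on the returned BoxFuture is an assumption:

```
// Synchronous: blocks until every document is loaded
docs = aiDocuments( "docs/" ).load();

// Asynchronous: returns a BoxFuture immediately
aiDocuments( "docs/" ).loadAsync().then( function( docs ) {
	println( "Loaded #docs.len()# documents" );
} );

// Batched: process documents in groups of 50
aiDocuments( "docs/" ).loadBatch( 50 );
```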
## Loading to Memory
### Using the loadTo() Method

Loaders can store documents directly into AI memory systems:
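A minimal sketch, assuming `myMemory` is an existing AI memory instance:

```
// Load and store in one step; no ingestion report is returned
aiDocuments( "docs/" ).loadTo( myMemory );
```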
### Using toMemory() for Full Reporting (Recommended)

For comprehensive ingestion with reporting:
Ingestion Options:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `chunkSize` | numeric | `0` | Chunk size (0 = no chunking) |
| `overlap` | numeric | `0` | Overlap between chunks |
| `strategy` | string | `"recursive"` | Chunking strategy |
| `dedupe` | boolean | `false` | Enable deduplication |
| `dedupeThreshold` | numeric | `0.95` | Similarity threshold |
| `trackTokens` | boolean | `true` | Track token counts |
| `trackCost` | boolean | `true` | Estimate costs |
| `async` | boolean | `false` | Use async for multi-memory |
| `batchSize` | numeric | `100` | Batch size for processing |
| `continueOnError` | boolean | `true` | Continue on document errors |
## The Document Class
The Document class provides a consistent interface for working with loaded content:
## Error Handling
Loaders track errors encountered during loading:
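A sketch of inspecting loader errors after a run; the `getErrors()` accessor name is an assumption based on the statement that loaders track errors:

```
loader = aiDocuments( "content/" ).recursive( true );
docs   = loader.load();

// Inspect errors collected during loading (accessor name assumed)
errors = loader.getErrors();
for( err in errors ) {
	println( err );
}
```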
## Custom Loaders
You can create custom loaders by extending BaseDocumentLoader:
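A skeleton sketch; the class declaration style and the Document construction call are assumptions, with only the `BaseDocumentLoader` parent and `load()` contract taken from the text:

```
class extends="BaseDocumentLoader" {

	function load() {
		// Build Document objects from your own source here
		var docs = [];
		// docs.append( createDocument( content, metadata ) ); // construction API assumed
		return docs;
	}

}
```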
## RAG Pipeline Example
Here's a complete example of building a RAG pipeline with document loaders:
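A hedged end-to-end sketch: ingest a docs directory into vector memory, then query against it. The `aiChat()` options struct for wiring in memory is illustrative, not documented above:

```
// 1. Load, chunk, and ingest documentation into vector memory
report = aiDocuments( "docs/" )
	.recursive( true )
	.extensions( [ "md" ] )
	.toMemory( vectorMemory, {
		chunkSize : 800,
		overlap   : 150,
		strategy  : "recursive"
	} );

// 2. Ask a question grounded in the ingested documents
// (the memory option shown here is an assumption)
answer = aiChat( "How do I configure logging?", { memory: vectorMemory } );
```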
## Best Practices

1. **Use `toMemory()` for Production**: It provides comprehensive reporting, error handling, and multi-memory support.
2. **Configure Chunking**: For vector memory, use appropriate chunk sizes (500-1000 chars) with overlap (100-200 chars).
3. **Use Metadata**: Leverage metadata for filtering and context in RAG queries.
4. **Handle Large Directories**: Use `recursive()` sparingly and filter by `extensions()` to avoid loading unnecessary files.
5. **Monitor Costs**: Use `trackCost: true` in ingestion options to estimate embedding costs before large ingestions.
6. **Multi-Memory for Redundancy**: Use an array of memories for fan-out ingestion to multiple vector stores.
7. **Use Async for Large Loads**: Use `loadAsync()` or `ingestOptions.async` for non-blocking operations.
8. **Use HTTPLoader for Web Content**: The HTTPLoader handles all URL-based content, including HTML pages, JSON APIs, and XML feeds.
## See Also

- Memory Systems - Standard and vector memory types
- `aiChunk()` BIF - Text chunking strategies
- Agents - Using agents with loaded documents