# aiChunk
Chunk text into smaller, manageable segments for AI processing, embeddings, or semantic search. Supports multiple chunking strategies including recursive, character, word, sentence, and paragraph-based splitting.
## Syntax

```
aiChunk( text, options )
```

## Parameters
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `text` | string | Yes | - | The text to chunk into segments |
| `options` | struct | No | `{}` | Configuration struct for chunking behavior |
## Options Structure
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `chunkSize` | numeric | `1000` | Maximum size of each chunk, in characters |
| `overlap` | numeric | `200` | Number of overlapping characters between consecutive chunks |
| `strategy` | string | `"recursive"` | One of `"recursive"`, `"characters"`, `"words"`, `"sentences"`, or `"paragraphs"` |
## Chunking Strategies
- **recursive**: Intelligently splits on paragraph, then sentence, then word boundaries
- **characters**: Fixed-size character splitting (simple, but may break words)
- **words**: Splits on word boundaries (preserves whole words)
- **sentences**: Splits on sentence boundaries (preserves complete sentences)
- **paragraphs**: Splits on paragraph boundaries (preserves complete paragraphs)
## Returns

Returns an array of text chunks (strings). Each chunk respects the `chunkSize` limit and includes overlap characters from the previous chunk for context continuity.
## Examples
### Basic Chunking
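A minimal sketch in BoxLang-style script using only the documented signature and its defaults (1000-character chunks with 200 characters of overlap); the `text` variable is a placeholder for your own content:

```
text   = "A long document that needs to be split for AI processing...";
chunks = aiChunk( text );
// chunks is an array of strings, each up to 1000 characters
```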
### Custom Chunk Size
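The same call with explicit sizing, using the documented `chunkSize` and `overlap` options:

```
// Smaller chunks with proportionally smaller overlap
chunks = aiChunk( text, { chunkSize: 500, overlap: 100 } );
```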
### Word-Based Strategy
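A sketch of the `"words"` strategy, which keeps whole words intact:

```
chunks = aiChunk( text, { strategy: "words", chunkSize: 800 } );
```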
### Sentence-Based Strategy
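A sketch of the `"sentences"` strategy, useful when each chunk should read as complete sentences:

```
chunks = aiChunk( text, { strategy: "sentences", chunkSize: 1000 } );
```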
### Paragraph-Based Strategy
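A sketch of the `"paragraphs"` strategy, which keeps paragraphs whole and suits well-structured prose:

```
chunks = aiChunk( text, { strategy: "paragraphs", chunkSize: 1500 } );
```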
### Recursive Strategy (Default)
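Passing the default strategy explicitly; this call is equivalent to omitting `strategy` altogether:

```
// Splits on paragraphs first, then sentences, then words as needed
chunks = aiChunk( text, { strategy: "recursive", chunkSize: 1000, overlap: 200 } );
```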
### Processing Large Documents
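A sketch of chunking a file and handling each chunk independently; the file path is hypothetical, and `fileRead()` / `expandPath()` are the standard BIFs:

```
text   = fileRead( expandPath( "/docs/large-report.txt" ) ); // hypothetical path
chunks = aiChunk( text, { chunkSize: 1200, overlap: 200 } );

for( chunk in chunks ) {
	// Process one chunk at a time: summarize, index, or send to a model
}
```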
### For Vector Search
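A sketch tuned for semantic search, following the note below that 400-600 character chunks work best; `articleText` is a placeholder:

```
chunks = aiChunk( articleText, { chunkSize: 500, overlap: 75 } );
```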
### Estimate Before Chunking
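A sketch that checks chunks against a model's token limit; the exact signature of `aiTokens()` is not documented on this page, so treating it as `aiTokens( text )` returning a numeric estimate is an assumption:

```
chunks = aiChunk( text, { chunkSize: 1000 } );

for( chunk in chunks ) {
	estimated = aiTokens( chunk ); // assumed: returns an estimated token count
	if( estimated > 8192 ) {
		// Re-chunk with a smaller chunkSize to fit the embedding model's limit
	}
}
```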
### Chunk and Embed
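A sketch of the chunk-then-embed pipeline; `aiEmbed( text )` returning one embedding per chunk is an assumption based on its description under Related Functions:

```
chunks     = aiChunk( documentText, { chunkSize: 500, overlap: 100 } );
embeddings = chunks.map( chunk => aiEmbed( chunk ) ); // assumed: one vector per chunk
```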
### Overlapping Context
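A sketch illustrating how overlap carries context across chunk boundaries; the substring inspection is only for demonstration:

```
chunks = aiChunk( text, { chunkSize: 1000, overlap: 200 } );

// The last 200 characters of one chunk reappear at the start of the next,
// so ideas that straddle a boundary are not lost
firstTail  = right( chunks[ 1 ], 200 );
secondHead = left( chunks[ 2 ], 200 );
```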
## Notes
- 📏 **Size Management**: Chunk size is measured in characters, not tokens (use `aiTokens()` to estimate)
- 🔄 **Overlap Benefits**: Overlap prevents context loss at chunk boundaries
- 🎯 **Strategy Selection**: Choose the strategy based on content type and use case
- 💾 **Memory Efficiency**: Chunks large documents without loading the entire content into memory
- 🔍 **Search Optimization**: Smaller chunks (400-600 characters) work best for semantic search
- 📚 **Embedding Limits**: Most embedding models have token limits (e.g., 8192 tokens)
## Related Functions
- `aiEmbed()` - Generate embeddings for chunks
- `aiTokens()` - Estimate token counts
- `aiMemory()` - Store chunked documents with embeddings
- `aiDocuments()` - Load and process documents
## Best Practices
- ✅ **Match chunk size to use case** - Smaller for search (~500 characters), larger for summarization (~1500)
- ✅ **Use overlap for continuity** - 15-20% overlap prevents context loss
- ✅ **Choose an appropriate strategy** - Recursive for mixed content, sentences for natural breaks
- ✅ **Test different settings** - The optimal size varies by content type and model
- ✅ **Estimate tokens first** - Use `aiTokens()` to verify chunks fit model limits
- ❌ **Don't chunk too small** - Very small chunks lose context (aim for at least ~200 characters)
- ❌ **Don't ignore overlap** - Zero overlap can break semantic meaning across boundaries
- ❌ **Don't rely on the fixed-character strategy** - It breaks words and sentences unnaturally