aiChunk

Chunk text into smaller, manageable segments for AI processing, embeddings, or semantic search. Supports multiple chunking strategies including recursive, character, word, sentence, and paragraph-based splitting.

Syntax

```
aiChunk( text, options )
```

Parameters

| Parameter | Type   | Required | Default | Description                                |
| --------- | ------ | -------- | ------- | ------------------------------------------ |
| `text`    | string | Yes      | -       | The text to chunk into segments            |
| `options` | struct | No       | `{}`    | Configuration struct for chunking behavior |

Options Structure

| Option      | Type    | Default       | Description                                                                                      |
| ----------- | ------- | ------------- | ------------------------------------------------------------------------------------------------ |
| `chunkSize` | numeric | `1000`        | The maximum size of each chunk, in characters                                                    |
| `overlap`   | numeric | `200`         | The number of overlapping characters between consecutive chunks                                  |
| `strategy`  | string  | `"recursive"` | Chunking strategy: `"recursive"`, `"characters"`, `"words"`, `"sentences"`, or `"paragraphs"`    |

Chunking Strategies

  • recursive: Intelligently splits on paragraph, sentence, then word boundaries

  • characters: Fixed-size character splitting (simple but may break words)

  • words: Splits on word boundaries (preserves whole words)

  • sentences: Splits on sentence boundaries (preserves complete sentences)

  • paragraphs: Splits on paragraph boundaries (preserves complete paragraphs)

Returns

Returns an array of text chunks (strings). Each chunk respects the chunkSize limit and includes overlap characters from the previous chunk for context continuity.

Examples

Basic Chunking
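
With no options, `aiChunk()` applies the defaults above: the `recursive` strategy, 1000-character chunks, and a 200-character overlap. A minimal sketch, assuming a local `document.txt`:

```
text = fileRead( "document.txt" );
// Default options: recursive strategy, chunkSize 1000, overlap 200
chunks = aiChunk( text );
println( "Created #chunks.len()# chunks" );
```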

Custom Chunk Size
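
Smaller chunks with a proportionally smaller overlap, which the Notes below recommend for semantic search. An illustrative sketch:

```
// ~500-character chunks with a 20% overlap, a common search-oriented setting
chunks = aiChunk( text, {
	chunkSize : 500,
	overlap   : 100
} );
```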

Word-Based Strategy
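
The `words` strategy splits on word boundaries, so no chunk ever cuts a word in half. For example:

```
// Each chunk ends on a whole word, never mid-word
chunks = aiChunk( text, {
	chunkSize : 800,
	strategy  : "words"
} );
```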

Sentence-Based Strategy
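
The `sentences` strategy keeps complete sentences together, which suits prose where each sentence carries a self-contained idea:

```
// Chunks break only between sentences
chunks = aiChunk( text, {
	chunkSize : 1000,
	strategy  : "sentences"
} );
```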

Paragraph-Based Strategy
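
The `paragraphs` strategy keeps whole paragraphs intact, useful for documents whose paragraphs are already coherent units:

```
// Chunks break only between paragraphs
chunks = aiChunk( text, {
	chunkSize : 1500,
	strategy  : "paragraphs"
} );
```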

Recursive Strategy (Default)
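
The `recursive` strategy tries paragraph breaks first, falls back to sentence breaks, and only splits on word boundaries as a last resort. Passing it explicitly is equivalent to the default:

```
// Same behavior as omitting `strategy` entirely
chunks = aiChunk( text, {
	chunkSize : 1000,
	overlap   : 200,
	strategy  : "recursive"
} );
```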

Processing Large Documents
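
A sketch of chunking a large file and processing each segment independently; the per-chunk work here is just a placeholder for whatever you do with each segment (summarize, index, etc.):

```
text   = fileRead( "large-report.txt" );
chunks = aiChunk( text, { chunkSize : 1500, overlap : 300 } );

chunks.each( ( chunk, index ) => {
	// Placeholder: replace with your per-chunk processing
	println( "Chunk #index#: #chunk.len()# characters" );
} );
```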

Estimate Before Chunking
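
Since `chunkSize` counts characters while embedding models count tokens, the Notes below suggest `aiTokens()` for estimation. A sketch that checks the chunks against an assumed 8192-token model limit:

```
text = fileRead( "document.txt" );
println( "Estimated tokens in document: #aiTokens( text )#" );

chunks = aiChunk( text, { chunkSize : 1000 } );
chunks.each( ( chunk ) => {
	// 8192 is an example limit; check your embedding model's actual maximum
	if ( aiTokens( chunk ) > 8192 ) {
		println( "Warning: chunk exceeds the model token limit" );
	}
} );
```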

Chunk and Embed
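
A sketch of the chunk-then-embed pipeline. The `aiEmbed()` call here is a hypothetical stand-in; substitute whatever embedding call your setup provides:

```
chunks = aiChunk( text, { chunkSize : 500, overlap : 100 } );

embeddings = chunks.map( ( chunk ) => {
	// `aiEmbed()` is hypothetical; use your actual embedding function
	return aiEmbed( chunk );
} );
```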

Overlapping Context
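
Overlap repeats the tail of each chunk at the head of the next, so sentences spanning a boundary keep their surrounding context. A sketch that makes the overlap visible:

```
chunks = aiChunk( text, { chunkSize : 300, overlap : 60 } );

if ( chunks.len() >= 2 ) {
	// The last ~60 characters of chunk 1 reappear at the start of chunk 2
	println( "End of chunk 1:   " & right( chunks[ 1 ], 60 ) );
	println( "Start of chunk 2: " & left( chunks[ 2 ], 60 ) );
}
```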

Notes

  • 📏 Size Management: Chunk size is in characters, not tokens (use aiTokens() to estimate)

  • 🔄 Overlap Benefits: Overlap prevents context loss at chunk boundaries

  • 🎯 Strategy Selection: Choose strategy based on content type and use case

  • 💾 Memory Efficiency: Chunks large documents without loading entire content in memory

  • 🔍 Search Optimization: Smaller chunks (400-600 chars) work best for semantic search

  • 📚 Embedding Limits: Most embedding models have token limits (e.g., 8192 tokens)

Best Practices

  • Match chunk size to use case: smaller for search (~500 characters), larger for summarization (~1500)

  • Use overlap for continuity: 15-20% overlap prevents context loss

  • Choose an appropriate strategy: recursive for mixed content, sentences for natural breaks

  • Test different settings: the optimal size varies by content type and model

  • Estimate tokens first: use aiTokens() to verify chunks fit model limits

  • Don't chunk too small: very small chunks (under ~200 characters) lose context

  • Don't skip overlap: zero overlap can break semantic meaning across boundaries

  • Don't default to the characters strategy: fixed-size character splitting breaks words and sentences unnaturally
