# docfind **Repository Path**: mirrors_microsoft/docfind ## Basic Information - **Project Name**: docfind - **Description**: A high-performance document search engine built in Rust with WebAssembly support. - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-11-16 - **Last Updated**: 2025-12-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # docfind A high-performance document search engine built in Rust with WebAssembly support. Combines full-text search using FST (Finite State Transducers) with FSST compression for efficient storage and fast fuzzy matching capabilities. ## Live Demo Check out the [interactive demo](https://microsoft.github.io/docfind/). The demo showcases docfind searching through 50,000 news articles from the AG News dataset, running entirely in your browser with WebAssembly. **Demo Performance Metrics:** - **Dataset**: 50,000 news articles (AG News Classification Dataset) - **Dataset Size**: 17.14 MB ([uncompressed JSON](https://github.com/microsoft/docfind/raw/refs/heads/main/static/documents.json)) - **Index Size**: 11.48 MB ([WASM file](https://github.com/microsoft/docfind/raw/refs/heads/main/static/docfind_bg.wasm)) - **Compressed Size**: 5.20 MB ([compressed with Brotli](https://github.com/microsoft/docfind/raw/refs/heads/main/static/docfind_bg.wasm.br)) - **Index Build Time**: ~1.1 seconds - **Load Time**: ~100ms (depending on network and browser) - **Search Speed**: ~1-3ms per query ## Features - **Fast Fuzzy Search**: Uses FST for efficient keyword matching with Levenshtein distance support - **Compact Storage**: FSST compression reduces index size while maintaining fast decompression - **RAKE Keyword Extraction**: Automatic keyword extraction from document content using the RAKE algorithm - **WebAssembly Ready**: Compile to WASM for browser-based search with no server required - **Standalone CLI Tool**: Self-contained CLI tool to build a .wasm file out of a collection of documents, no Rust tooling required ## Installation ### Quick Install **macOS/Linux:** ```bash curl -fsSL https://microsoft.github.io/docfind/install.sh | sh ``` **Windows (PowerShell):** ```powershell irm https://microsoft.github.io/docfind/install.ps1 | iex ``` The installer will: - Download the latest release binary for your platform - Install it to `~/.local/bin` (Unix) or `~\.docfind\bin` (Windows) - Provide instructions for adding it to your PATH if needed ### Manual Installation Download the binary for your platform from the [latest release](https://github.com/microsoft/docfind/releases/latest): - **macOS (Intel)**: `docfind-x86_64-apple-darwin` - **macOS (Apple Silicon)**: `docfind-aarch64-apple-darwin` - **Linux (x64)**: `docfind-x86_64-unknown-linux-musl` - **Linux (ARM64)**: `docfind-aarch64-unknown-linux-musl` - **Windows (x64)**: `docfind-x86_64-pc-windows-msvc.exe` - **Windows (ARM64)**: `docfind-aarch64-pc-windows-msvc.exe` Rename it to `docfind` (or `docfind.exe` on Windows), make it executable, and place it in your PATH. ### Building from Source #### Prerequisites Before building from source, ensure you have the following installed: 1. **Rust** - [rustup.rs](https://rustup.rs/) 2. **wasm-pack** - [drager.github.io/wasm-pack](https://drager.github.io/wasm-pack/) 3. **Node.js** - [nodejs.org](https://nodejs.org/) (required for esbuild) #### Build ```bash ./scripts/build.sh ``` The compiled binary will be available at `./target/release/docfind`. ## Usage ### Creating a Search Index Prepare a JSON file with your documents: ```json [ { "title": "Getting Started", "category": "docs", "href": "/docs/getting-started", "body": "This guide will help you get started." }, { "title": "API Reference", "category": "reference", "href": "/docs/api", "body": "Complete API documentation for all search functions and configuration options." } ] ``` Build the index and generate a WASM module: ```bash docfind documents.json output ``` This creates: - `output/docfind.js` - JavaScript bindings - `output/docfind_bg.wasm` - WebAssembly module with embedded index ### Using in the Browser ```html ``` ## How It Works ```mermaid flowchart LR A([documents.json]) --> B[docfind] B --> C[Keyword Extraction
RAKE] B --> E[FSST Compression
document strings] C --> D[FST Map
keywords → docs] D --> F[[Index]] E --> F F --> G([docfind_bg.wasm
+ docfind.js]) style A fill:#e1f5ff style G fill:#e1f5ff style F fill:#ffffcc ``` 1. **Indexing Phase** (CLI): - Extracts keywords from document titles, categories, and bodies - Uses RAKE algorithm to identify important multi-word phrases - Assigns relevance scores based on keyword source (metadata > title > body) - Builds an FST mapping keywords to document indices - Compresses all document strings using FSST - Serializes the index using Postcard (binary format) 2. **Embedding Phase** (CLI): - Parses the pre-compiled WASM module - Expands WASM memory to accommodate the index - Patches global variables (`INDEX_BASE`, `INDEX_LEN`) with actual values - Adds the index as a new data segment in the WASM binary 3. **Search Phase** (WASM): - Deserializes the embedded index on first use - Performs fuzzy matching using Levenshtein automaton - Combines results from multiple keywords with score accumulation - Decompresses matching document strings on demand - Returns ranked results as JavaScript objects ## Dependencies - **fst**: Fast finite state transducer library with Levenshtein support - **fsst-rs**: Fast string compression for text data - **rake**: Rapid Automatic Keyword Extraction algorithm - **serde/postcard**: Efficient serialization - **wasm-bindgen**: WebAssembly bindings for Rust - **wasm-encoder/wasmparser**: WASM manipulation tools ## Performance The combination of FST and FSST provides: - Sub-millisecond search times for typical queries - 60-80% compression ratio for document storage - Instant startup with lazy index loading - Zero network requests after initial load ## References ### Prior Art This project builds on the rich ecosystem of search technologies: - **[Algolia](https://www.algolia.com/)** - Server-side search-as-a-service platform - **[TypeSense](https://typesense.org/)** - Open-source server-side search engine - **[Lunr.js](https://lunrjs.com/)** - Client-side full-text search library for JavaScript - **[Stork Search](https://stork-search.net/)** - WebAssembly-powered search for static sites - **[Tinysearch](https://endler.dev/2019/tinysearch/)** - Minimalist WASM-based search engine ### Technical Foundations Key technologies and concepts that inspired and power docfind: - **[Finite State Transducers](https://burntsushi.net/transducers/)** - Andrew Gallant's comprehensive article on FSTs, the core data structure for efficient search - **[RAKE Algorithm](https://docs.rs/rake/latest/rake/)** - Rapid Automatic Keyword Extraction for identifying important phrases - **[FSST Compression](https://docs.rs/fsst-rs/latest/fsst/index.html)** - Fast Static Symbol Table compression for efficient text storage