
ParadeDB: Full-Text Search Engine for PostgreSQL

Published: April 7, 2026  •  postgresql, go

ParadeDB is a PostgreSQL extension that adds BM25 full-text search operators and ranking functions directly to the database. Instead of running a separate search service, you declare a BM25 index on a table and use search operators alongside normal SQL.

In this blog post, we take a look at ParadeDB's search capabilities. If you are familiar with Elasticsearch or Solr, you will find many of the same terms: BM25, tokenizers, token filters, relevance scoring, and so on.

The advantage of using ParadeDB is that the data is always in sync with the search index. You don't have to worry about syncing data between a primary database and a separate search engine. You can also combine full-text search with SQL features like joins, aggregations, and transactions in a single query.

In the following blog post, I will use a Go application to demonstrate how to import and search with ParadeDB. However, the same principles apply to any programming language that can connect to PostgreSQL.

Setup

The demo runs a single ParadeDB container. The Docker image paradedb/paradedb includes everything you need to get started. You can find the docker-compose file I use for this demo in the GitHub repo.

Data Model

First, the application enables ParadeDB with CREATE EXTENSION.

CREATE EXTENSION IF NOT EXISTS pg_search;

00001_bootstrap.sql

Then it creates a books table to hold the metadata and full text. The program will use texts from Project Gutenberg as a sample corpus.

CREATE TABLE IF NOT EXISTS books (
  gutenberg_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title TEXT NOT NULL,
  author TEXT NOT NULL,
  language TEXT NOT NULL,
  subjects JSONB NOT NULL DEFAULT '[]'::jsonb,
  bookshelves JSONB NOT NULL DEFAULT '[]'::jsonb,
  download_count INTEGER NOT NULL DEFAULT 0,
  source_url TEXT NOT NULL,
  text_url TEXT NOT NULL,
  content TEXT NOT NULL,
  content_bytes BIGINT NOT NULL DEFAULT 0,
  imported_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

00001_bootstrap.sql

The CREATE INDEX statement then uses a ParadeDB-specific access method, bm25, to declare a BM25 index on the books table. This is what enables full-text search on the indexed columns.

CREATE INDEX IF NOT EXISTS books_search_idx ON books
USING bm25 (
  gutenberg_id,
  title,
  author,
  language,
  subjects,
  bookshelves,
  download_count,
  content_bytes,
  content
)
WITH (key_field = 'gutenberg_id');

00001_bootstrap.sql

A few constraints govern how the index is structured: the key_field option is required and must name a column with unique values, and changing the indexed column list afterwards means rebuilding the index, for example with:

REINDEX INDEX books_search_idx;

Plan your column list carefully before running CREATE INDEX on a large table.

Both CREATE INDEX and REINDEX can be slow on large tables. To monitor progress, open a second PostgreSQL connection and query pg_stat_progress_create_index:

SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;

blocks_done relative to blocks_total gives a good approximation of how far along the build is. When the two values are equal, all rows have been indexed and the index is being flushed to disk.
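
That ratio can be computed directly in the query; a small sketch (the NULLIF guard only avoids division by zero before blocks_total is populated):

SELECT pid, phase,
       round(100.0 * blocks_done / NULLIF(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;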

Tokenizers

During indexing, text columns are processed by a tokenizer, which breaks the raw text into discrete tokens. The choice of tokenizer affects how the text is indexed and how queries match against it.

By default, text columns are tokenized using the Unicode tokenizer. It splits text on Unicode word boundaries and lowercases all characters. This works well for most Western languages, but ParadeDB provides several other tokenizers optimized for different use cases and languages. You can find the full list in the documentation.

To use a non-default tokenizer, cast the column to the tokenizer type in the index definition. In this example, the icu tokenizer is used, which provides better support for non-Western languages.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.icu), category)
WITH (key_field='id');

Token Filters

After tokenization, token filters can apply further processing to each token, including stemming to reduce words to their root form, ASCII folding to strip accents, and more. Filters are passed as configuration strings to the tokenizer:

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('stemmer=english', 'ascii_folding=true')), category)
WITH (key_field='id');

Token filters can be added to any tokenizer except literal, which by definition must preserve the source text exactly. You can find the full list of available token filters in the documentation.

Importing Data

Importing data into ParadeDB is the same as importing into PostgreSQL: the BM25 index is updated automatically as rows are inserted or updated. In the demo, a Go importer populates the books table with metadata and full text from Project Gutenberg. It preserves the original Gutenberg ID and upserts by gutenberg_id, which keeps the import idempotent:

INSERT INTO books (
  gutenberg_id,
  title,
  author,
  language,
  subjects,
  bookshelves,
  download_count,
  source_url,
  text_url,
  content,
  content_bytes,
  imported_at
)
OVERRIDING SYSTEM VALUE
VALUES (
  sqlc.arg(gutenberg_id),
  sqlc.arg(title),
  sqlc.arg(author),
  sqlc.arg(language),
  sqlc.arg(subjects)::jsonb,
  sqlc.arg(bookshelves)::jsonb,
  sqlc.arg(download_count),
  sqlc.arg(source_url),
  sqlc.arg(text_url),
  sqlc.arg(content),
  sqlc.arg(content_bytes),
  NOW()
)
ON CONFLICT (gutenberg_id)
DO UPDATE SET
  title = EXCLUDED.title,
  author = EXCLUDED.author,
  language = EXCLUDED.language,
  subjects = EXCLUDED.subjects,
  bookshelves = EXCLUDED.bookshelves,
  download_count = EXCLUDED.download_count,
  source_url = EXCLUDED.source_url,
  text_url = EXCLUDED.text_url,
  content = EXCLUDED.content,
  content_bytes = EXCLUDED.content_bytes,
  imported_at = NOW();

books.sql

Searching

Text search in ParadeDB is built on token matching. At index time, a tokenizer breaks each column's text into discrete tokens. At query time, the engine looks for matches against those tokens. This is different from substring matching — token matching enables relevance scoring, typo tolerance, and more expressive query types.

ParadeDB provides several query types, each expressed as a SQL operator or function.


Match (||| and &&&)

Match is the go-to query type. The query string is tokenized with the same tokenizer as the indexed column, and the result tokens are compared against the index.

-- OR: returns rows with "running" or "shoes" or both
SELECT description FROM mock_items WHERE description ||| 'running shoes';

-- AND: returns only rows that contain both "running" and "shoes"
SELECT description FROM mock_items WHERE description &&& 'running shoes';
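
The same operators work against the demo's books table; a sketch (whether rows come back naturally depends on what the imported corpus contains):

-- Books whose full text contains both "white" and "whale"
SELECT gutenberg_id, title
FROM books
WHERE content &&& 'white whale';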

Phrase (###)

Phrase queries work like match conjunction but are stricter: the tokens must appear in the exact same order as the query string.

-- Matches "Sleek running shoes" but not "shoes running" or "running sleek shoes"
SELECT description FROM mock_items WHERE description ### 'running shoes';

Slop relaxes this by allowing a configurable number of positional changes. A slop of 2 permits a single transposition:

SELECT description FROM mock_items WHERE description ### 'shoes running'::pdb.slop(2);

Term (===)

Term queries skip tokenization of the query string entirely and look for an exact token match. Because most tokenizers lowercase their output, a term query for RUNNING returns no results even if running is indexed.

-- Matches; "running" is a token in the index
SELECT description FROM mock_items WHERE description === 'running';

-- No match; the indexed token is lowercase "running", not "RUNNING"
SELECT description FROM mock_items WHERE description === 'RUNNING';

Term is useful when you want to match a pre-formed token exactly — for instance, a language code or a category slug.
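
In the demo schema, the language column is a natural fit; a sketch, assuming the corpus stores lowercase codes such as 'en' so the default tokenizer leaves them unchanged:

-- Exact token match on the language code, combined with a match query
SELECT title, author
FROM books
WHERE language === 'en' AND content ||| 'whale';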


Fuzzy

Fuzzy extends match and term queries to tolerate typos. Casting the query string to pdb.fuzzy(n) allows token matches within a Levenshtein edit distance of n (maximum 2):

-- "runing" and "shose" are within edit distance 2 of "running" and "shoes"
SELECT id, description FROM mock_items WHERE description ||| 'runing shose'::pdb.fuzzy(2);

Fuzzy prefix matching is also supported — useful for autocomplete scenarios:

-- Matches tokens that start with approximately "rann" (e.g. "running")
SELECT id, description FROM mock_items WHERE description === 'rann'::pdb.fuzzy(1, t);

Highlighting (pdb.snippet)

pdb.snippet returns a short excerpt from the matched document with the matching tokens wrapped in HTML tags. It can be added to any query that uses a ParadeDB search operator:

SELECT id, pdb.snippet(description)
FROM mock_items
WHERE description ||| 'shoes'
LIMIT 5;
-- Returns: "Sleek running <b>shoes</b>"

The tags default to <b></b> and can be overridden. pdb.snippets (plural) returns an array of all matching excerpts from a document — useful when a long document has several relevant matches spread across it. Because snippet generation is expensive, always pair it with a LIMIT.
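
The same function works on the demo corpus; a sketch against the books table (the matches depend on the imported texts):

SELECT gutenberg_id, title, pdb.snippet(content)
FROM books
WHERE content ||| 'harpoon'
LIMIT 3;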


Proximity (## and ##>)

Proximity queries match documents where two tokens appear within a given token distance of each other. This is more flexible than a phrase query because it does not require exact adjacency.

-- Matches if "sleek" is within 1 token of "shoes" (order-independent)
SELECT description FROM mock_items WHERE description @@@ ('sleek' ## 1 ## 'shoes');

-- Same, but "sleek" must appear before "shoes"
SELECT description FROM mock_items WHERE description @@@ ('sleek' ##> 1 ##> 'shoes');

Proximity also accepts regular expressions via pdb.prox_regex for matching a pattern near a fixed token:

-- Any token matching "sl.*" within 1 token of "shoes"
SELECT description FROM mock_items WHERE description @@@ (pdb.prox_regex('sl.*') ## 1 ## 'shoes');

Sorting and Relevance Tuning

BM25 Score (pdb.score)

pdb.score(key_field) returns the BM25 relevance score for each matched row. Any column that contributes to the score must be present in the BM25 index definition. Ordering by pdb.score(id) DESC surfaces the most relevant results first:

SELECT id, pdb.score(id), description
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY pdb.score(id) DESC
LIMIT 5;

Boosting

Individual query terms can be weighted up or down by casting to pdb.boost(n), where n is a multiplicative factor. This lets you promote matches in one column (or on one term) without changing the query structure:

-- "shoes" matches count twice as much as "footwear" matches
SELECT id, pdb.score(id), description
FROM mock_items
WHERE description ||| 'shoes'::pdb.boost(2) OR category ||| 'footwear'
ORDER BY pdb.score(id) DESC
LIMIT 5;

Boost can be stacked with other casts — for example 'shose'::pdb.fuzzy(2)::pdb.boost(2) applies both typo tolerance and a score multiplier. pdb.const(n) is the degenerate case: it assigns every matching document the same fixed score, which is useful when you want boolean filtering without any relevance ranking. See the relevance tuning docs for the full range of options.
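
Written out as a full query, the stacked casts mentioned above look like this:

-- Typo-tolerant match on "shoes", weighted twice as heavily as "footwear" matches
SELECT id, pdb.score(id), description
FROM mock_items
WHERE description ||| 'shose'::pdb.fuzzy(2)::pdb.boost(2) OR category ||| 'footwear'
ORDER BY pdb.score(id) DESC
LIMIT 5;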


Top K

When a query includes ORDER BY ... LIMIT, ParadeDB can execute a highly optimized Top K scan instead of sorting the full result set. For this to kick in, all ORDER BY fields must be indexed, at least one ParadeDB search operator must be present, and the query must have a LIMIT. You can verify the optimization is active by checking for TopKScanExecState in EXPLAIN output.

Text columns used in ORDER BY must be indexed with the literal tokenizer (which preserves the source value exactly). Up to three ORDER BY columns are handled efficiently.
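
A quick way to check is to run EXPLAIN on a candidate query and look for the Top K scan node; a sketch (the exact plan shape depends on your ParadeDB version):

EXPLAIN
SELECT id, description
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY pdb.score(id) DESC
LIMIT 5;
-- Look for TopKScanExecState in the resulting plan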

Filtering

Adding SQL WHERE conditions alongside a full-text predicate is all that is needed to filter results by metadata:

SELECT description, rating, category
FROM mock_items
WHERE description ||| 'running shoes' AND rating > 2;

When the filtered columns are included in the BM25 index, ParadeDB can push the filter down into the index scan itself, avoiding a separate heap pass. Numeric, date, boolean, UUID, and range types all support filter pushdown with standard comparison operators. Text columns used in equality filters (category = 'Footwear' or category IN (...)) must be indexed with the literal tokenizer for pushdown to apply.

In other words, with the current syntax, filters are usually expressed as ordinary SQL predicates rather than special query-builder wrappers:

SELECT title, author, download_count
FROM books
WHERE gutenberg_id @@@ pdb.all()
  AND download_count BETWEEN 5000 AND 100000
ORDER BY download_count DESC
LIMIT 5;

Advanced Query Functions

Beyond the infix operators, ParadeDB exposes query builder functions via the @@@ operator. These cover query types that cannot be expressed as a simple cast:

-- Regex: matches tokens against a regular expression
WHERE description @@@ pdb.regex('key.*rd');


-- Boolean combination: must/should/must_not clauses
WHERE description @@@ pdb.boolean(
  must     => ARRAY[pdb.parse('running')],
  must_not => ARRAY[pdb.parse('expensive')]
);

You can find the full list of builder functions in the documentation. All builder functions can be combined, boosted, and used alongside regular SQL predicates in the same WHERE clause.

Wrapping Up

ParadeDB keeps search inside PostgreSQL. The schema definition and the BM25 index definition sit next to each other, and metadata filters and full-text ranking live in the same query path. The payoff is simplicity and consistency: there is no separate search engine to keep in sync, and full-text search composes with joins, aggregations, and transactions in a single query.