popoto.fields._tokenizer
Shared tokenization logic for text-indexing fields.
Used by ExistenceFilter (Bloom filter) and BM25Field (ranked keyword search). The tokenization logic lives in this shared module so both fields apply identical preprocessing: lowercase, split on non-word characters, drop tokens shorter than three characters, and remove common English stop words.
tokenize(text, unique=True)
Tokenize a text string into individual terms for indexing.
Lowercases the input, splits on non-word characters, filters out tokens shorter than 3 characters and common English stop words.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text string to tokenize. | *required* |
| `unique` | `bool` | If `True` (default), deduplicate tokens. Set to `False` to preserve the raw term counts needed for BM25 term frequency. | `True` |
Returns:
| Type | Description |
|---|---|
| `list[str]` | Tokens suitable for indexing. Deduplicated if `unique=True`, raw (with repeats) if `unique=False`. Returns an empty list if no tokens survive filtering. |
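The preprocessing pipeline described above can be sketched as follows. This is a minimal illustration, not the library's actual source: the stop-word set here is a small placeholder (the real module's list is not shown), and the regex split on `\W+` is an assumption consistent with "split on non-word characters".

```python
import re

# Placeholder stop-word set for illustration; the actual list used by
# popoto.fields._tokenizer is larger and not reproduced here.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "that", "this"}

def tokenize(text: str, unique: bool = True) -> list[str]:
    """Lowercase, split on non-word chars, drop short tokens and stop words."""
    tokens = [
        t
        for t in re.split(r"\W+", text.lower())  # split on non-word characters
        if len(t) >= 3 and t not in STOP_WORDS   # filter short tokens and stop words
    ]
    if unique:
        # dict.fromkeys deduplicates while preserving first-seen order
        tokens = list(dict.fromkeys(tokens))
    return tokens
```

With `unique=False`, repeated terms survive, which is what BM25 needs to compute term frequency; with the default `unique=True`, the output is suited to membership-style indexes such as a Bloom filter.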