popoto.fields._tokenizer

Shared tokenization logic for text-indexing fields.

Used by ExistenceFilter (Bloom filter) and BM25Field (ranked keyword search). Keeping the tokenization logic in one shared module guarantees that both fields apply identical preprocessing: lowercase, split on non-word characters, filter short tokens, and remove common English stop words.
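
For illustration, here is how the pipeline behaves on a sample string. This is a hedged sketch: the module's actual stop-word list is not shown on this page, so the example only assumes that an everyday word like "the" is in it.

>>> from popoto.fields._tokenizer import tokenize
>>> # "The" is assumed to be a stop word; "of" and "DB" are dropped because
>>> # they are shorter than 3 characters; everything else is lowercased and kept.
>>> tokenize("The Speed of Redis DB Queries")
['speed', 'redis', 'queries']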

tokenize(text, unique=True)

Tokenize a text string into individual terms for indexing.

Lowercases the input, splits on non-word characters, and filters out both tokens shorter than 3 characters and common English stop words.

Parameters:

Name    Description                                                        Default
text    The text string to tokenize.                                       required
unique  If True (default), deduplicate tokens. Set to False to preserve    True
        raw term counts needed for BM25 term frequency.

Returns:

Type       Description
list[str]  Tokens suitable for indexing. Deduplicated if unique=True, raw
           (with repeats) if unique=False. Returns an empty list if no
           tokens survive filtering.
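
A hedged usage sketch of the two modes (again assuming "the" is in the module's stop-word list; "is" falls to the length filter):

>>> tokenize("Redis stores the tokens. Redis is fast.")
['redis', 'stores', 'tokens', 'fast']
>>> tokenize("Redis stores the tokens. Redis is fast.", unique=False)
['redis', 'stores', 'tokens', 'redis', 'fast']
>>> tokenize("a an of")  # every token is shorter than 3 characters
[]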

Source code in src/popoto/fields/_tokenizer.py
def tokenize(text, unique=True):
    """Tokenize a text string into individual terms for indexing.

    Lowercases the input, splits on non-word characters, filters out tokens
    shorter than 3 characters and common English stop words.

    Args:
        text: The text string to tokenize.
        unique: If True (default), deduplicate tokens. Set to False to
            preserve raw term counts needed for BM25 term frequency.

    Returns:
        list[str]: Tokens suitable for indexing. Deduplicated if unique=True,
            raw (with repeats) if unique=False.
            Returns an empty list if no tokens survive filtering.
    """
    if not text:
        return []
    lowered = text.lower()
    raw_tokens = _SPLIT_PATTERN.split(lowered)
    if unique:
        seen = set()
        tokens = []
        for t in raw_tokens:
            if len(t) >= MIN_TOKEN_LENGTH and t not in STOP_WORDS and t not in seen:
                seen.add(t)
                tokens.append(t)
        return tokens
    else:
        return [
            t
            for t in raw_tokens
            if len(t) >= MIN_TOKEN_LENGTH and t not in STOP_WORDS
        ]