popoto.fields.bm25_field

BM25Field: Ranked keyword search using BM25 scoring in Redis.

Maintains term frequency / document frequency statistics in Redis sorted sets and computes BM25(k1=1.2, b=0.75) scores at query time via Lua scripts. No Redis modules required -- works on both Redis and Valkey.
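The Lua scoring script itself is not reproduced on this page. As a rough sketch of what a BM25 score with k1=1.2 and b=0.75 looks like, the pure-Python function here applies the textbook BM25 formula with the IDF expression documented for get_idf(). The name `bm25_score` and its arguments are illustrative assumptions, not part of popoto, and the real script may differ in detail.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a tokenized query.

    Hypothetical helper for illustration; popoto computes scores in Lua.
    doc_tf maps term -> count in the document; df maps term -> number of
    documents that contain the term.
    """
    if n_docs == 0 or avgdl == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        term_df = df.get(term, 0)
        idf = math.log((n_docs - term_df + 0.5) / (term_df + 0.5) + 1)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * doc_len / avgdl)
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score


# Same term frequency, different document lengths.
short_doc = bm25_score(["redis"], {"redis": 1}, 3, 6.0, 10, {"redis": 2})
long_doc = bm25_score(["redis"], {"redis": 1}, 12, 6.0, 10, {"redis": 2})
print(short_doc > long_doc)
```

With equal term frequency, the shorter document scores higher because of the length normalization controlled by b.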

Design

BM25Field is a "side-effect field" like ExistenceFilter -- it does not store a value on the model instance. It maintains an inverted index and corpus statistics via on_save()/on_delete() hooks. At query time, a Lua script computes BM25 scores server-side and returns ranked results.

Tokenization reuses the shared tokenizer from fields/_tokenizer.py (same as ExistenceFilter): lowercase, split on non-word chars, filter short tokens, remove stop words.
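The shared tokenizer is not shown on this page. A minimal sketch of the four steps just listed might look like the following. The stop-word set, the minimum token length of 2, and the name `tokenize_sketch` are assumptions made for illustration; the real values in fields/_tokenizer.py may differ. The `unique` flag mirrors the `tokenize(source_value, unique=False)` call in on_save(), and its default here is also an assumption.

```python
import re

# Hypothetical values; the real tokenizer's stop words and cutoff may differ.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}
MIN_TOKEN_LENGTH = 2


def tokenize_sketch(text, unique=True):
    """Lowercase, split on non-word chars, drop short tokens and stop words."""
    tokens = [
        tok
        for tok in re.split(r"\W+", text.lower())
        if len(tok) >= MIN_TOKEN_LENGTH and tok not in STOP_WORDS
    ]
    if unique:
        # Deduplicate while preserving first-seen order.
        return list(dict.fromkeys(tokens))
    return tokens


print(tokenize_sketch("The Redis deployment: a guide to Redis!", unique=False))
# ['redis', 'deployment', 'guide', 'redis']
```

Keeping duplicates (unique=False) matters at index time, because the repeat count of a term is its term frequency.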

Redis Key Patterns
  • $BM25:{Class}:{field}:inv:{term} -- ZSET {doc_key: tf} (inverted index)
  • $BM25:{Class}:{field}:tf:{doc_key} -- ZSET {term: tf} (forward index)
  • $BM25:{Class}:{field}:df -- ZSET {term: df} (document frequency)
  • $BM25:{Class}:{field}:dl -- ZSET {doc_key: doc_length}
  • $BM25:{Class}:{field}:n -- STRING doc_count
  • $BM25:{Class}:{field}:avgdl -- STRING avg_doc_length
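To make the key layout concrete, this small helper builds the key names from the patterns listed here. `bm25_keys` is a hypothetical function written for this illustration, and the sample document key `Memory:abc123` is made up; popoto builds these keys internally.

```python
def bm25_keys(class_name, field_name, doc_key=None, term=None):
    """Build the Redis key names for one BM25Field (hypothetical helper)."""
    prefix = f"$BM25:{class_name}:{field_name}"
    keys = {
        "df": f"{prefix}:df",        # ZSET {term: df}
        "dl": f"{prefix}:dl",        # ZSET {doc_key: doc_length}
        "n": f"{prefix}:n",          # STRING doc_count
        "avgdl": f"{prefix}:avgdl",  # STRING avg_doc_length
    }
    if doc_key is not None:
        keys["tf"] = f"{prefix}:tf:{doc_key}"  # forward index for one document
    if term is not None:
        keys["inv"] = f"{prefix}:inv:{term}"   # inverted index for one term
    return keys


keys = bm25_keys("Memory", "content", doc_key="Memory:abc123", term="redis")
print(keys["inv"])
# $BM25:Memory:content:inv:redis
```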
Example

    class Memory(popoto.Model):
        key = popoto.AutoKeyField()
        raw_content = ContentField()
        content = BM25Field(source="raw_content")

    # After saving documents...
    results = BM25Field.search(Memory, "content", "redis deployment", limit=10)
    # Returns [(redis_key, bm25_score), ...]

BM25Field

Bases: Field

BM25 ranked keyword search field backed by Redis sorted sets.

Maintains an inverted index and corpus statistics (tf, df, dl, n, avgdl) in Redis. Computes BM25 scores at query time via a Lua script.

This is a "side-effect field" -- it does not store a value on the model instance. It reads content from a source field and maintains search indexes via on_save()/on_delete() hooks.

Parameters:
  • source (str) -- Name of the field to read content from for indexing. Required: the source field should contain text content. The signature default is None, but omitting source raises ValueError.
  • **kwargs -- Standard Field keyword arguments. Default: {}

Class Constants

  • BM25_K1 -- Term frequency saturation parameter. Default 1.2.
  • BM25_B -- Document length normalization parameter. Default 0.75.
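The effect of the two constants can be seen in the term-frequency part of the textbook BM25 formula. This sketch assumes popoto's Lua script follows that formula; `tf_component` is a hypothetical helper name, not a popoto function.

```python
def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency part of the textbook BM25 formula (illustrative)."""
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return tf * (k1 + 1) / (tf + norm)


# Saturation: repeated occurrences add less and less, bounded by k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_component(tf, doc_len=10, avgdl=10), 3))

# Length normalization: with b=0 the document length has no effect.
print(tf_component(3, doc_len=100, avgdl=10, b=0.0) == tf_component(3, doc_len=10, avgdl=10, b=0.0))
```

A larger k1 lets repeated terms keep adding to the score for longer, while b closer to 1 penalizes long documents more strongly.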

Redis Keys
  • $BM25:{Class}:{field}:inv:{term} -- inverted index per term
  • $BM25:{Class}:{field}:tf:{doc_key} -- forward index per doc
  • $BM25:{Class}:{field}:df -- document frequency
  • $BM25:{Class}:{field}:dl -- document lengths
  • $BM25:{Class}:{field}:n -- total document count
  • $BM25:{Class}:{field}:avgdl -- average document length
Example

    class Memory(popoto.Model):
        key = popoto.AutoKeyField()
        raw_content = ContentField()
        content = BM25Field(source="raw_content")

    # Save some documents
    m = Memory(raw_content="kubernetes deployment guide")
    m.save()

    # Ranked keyword search
    results = BM25Field.search(Memory, "content", "kubernetes", limit=10)
    # Returns [(redis_key, bm25_score), ...]

Source code in src/popoto/fields/bm25_field.py
class BM25Field(Field):
    """BM25 ranked keyword search field backed by Redis sorted sets.

    Maintains an inverted index and corpus statistics (tf, df, dl, n, avgdl)
    in Redis. Computes BM25 scores at query time via a Lua script.

    This is a "side-effect field" -- it does not store a value on the model
    instance. It reads content from a ``source`` field and maintains search
    indexes via on_save()/on_delete() hooks.

    Args:
        source: Name of the field to read content from for indexing.
            Required -- the source field should contain text content.
        **kwargs: Standard Field keyword arguments.

    Class Constants:
        BM25_K1: Term frequency saturation parameter. Default 1.2.
        BM25_B: Document length normalization parameter. Default 0.75.

    Redis Keys:
        - ``$BM25:{Class}:{field}:inv:{term}`` -- inverted index per term
        - ``$BM25:{Class}:{field}:tf:{doc_key}`` -- forward index per doc
        - ``$BM25:{Class}:{field}:df`` -- document frequency
        - ``$BM25:{Class}:{field}:dl`` -- document lengths
        - ``$BM25:{Class}:{field}:n`` -- total document count
        - ``$BM25:{Class}:{field}:avgdl`` -- average document length

    Example:
        class Memory(popoto.Model):
            key = popoto.AutoKeyField()
            raw_content = ContentField()
            content = BM25Field(source="raw_content")

        # Save some documents
        m = Memory(raw_content="kubernetes deployment guide")
        m.save()

        # Ranked keyword search
        results = BM25Field.search(Memory, "content", "kubernetes", limit=10)
        # Returns [(redis_key, bm25_score), ...]
    """

    # BM25 tuning parameters -- override via subclass for experimentation
    BM25_K1 = 1.2  # Term frequency saturation
    BM25_B = 0.75  # Document length normalization

    # Override Field defaults -- BM25Field does not store a value
    type: type = str
    null: bool = True
    default = None

    def __init__(self, source: str = None, **kwargs):
        if source is None:
            raise ValueError("BM25Field requires a 'source' parameter")
        self.source = source
        super().__init__(**kwargs)

    def _key_prefix(self, model_class):
        """Build the Redis key prefix for this BM25Field's data structures.

        Returns:
            str: Prefix like ``$BM25:{ClassName}:{field_name}``
        """
        return f"$BM25:{model_class.__name__}:{self.name}"

    @classmethod
    def on_save(
        cls,
        model_instance,
        field_name,
        field_value,
        pipeline=None,
        **kwargs,
    ):
        """Update BM25 index when a model instance is saved.

        Reads the source field value, tokenizes it, and atomically updates
        all BM25 data structures (tf, df, dl, n, avgdl, inverted index)
        via a single Lua script.

        Args:
            model_instance: The model instance being saved.
            field_name: Name of this field on the model.
            field_value: Current field value (unused -- BM25Field is side-effect only).
            pipeline: Optional Redis pipeline (not used for Lua eval).
            **kwargs: Additional context.

        Returns:
            The pipeline if provided, otherwise None.
        """
        field = model_instance._meta.fields[field_name]
        if not isinstance(field, BM25Field):
            return pipeline if pipeline else None

        model_class = type(model_instance)
        prefix = field._key_prefix(model_class)

        # Read source field content
        source_value = getattr(model_instance, field.source, None)
        if source_value is None:
            source_value = ""

        # Handle ContentField references
        if isinstance(source_value, str) and source_value.startswith("$CF:"):
            source_field = model_instance._meta.fields.get(field.source)
            if hasattr(source_field, "store"):
                try:
                    source_value = source_field.store.read(source_value)
                except Exception:
                    source_value = ""

        source_value = str(source_value)
        # Use unique=False to preserve raw term counts for accurate tf
        tokens = tokenize(source_value, unique=False)

        # Get the document's Redis key
        doc_key = model_instance.db_key.redis_key

        # Build Lua KEYS and ARGV
        tf_key = f"{prefix}:tf:{doc_key}"
        df_key = f"{prefix}:df"
        dl_key = f"{prefix}:dl"
        n_key = f"{prefix}:n"
        avgdl_key = f"{prefix}:avgdl"
        inv_prefix = f"{prefix}:inv:"

        keys = [tf_key, df_key, dl_key, n_key, avgdl_key]
        argv = [doc_key, inv_prefix] + tokens

        POPOTO_REDIS_DB.eval(BM25_SAVE_LUA, len(keys), *keys, *argv)

        return pipeline if pipeline else None

    @classmethod
    def on_delete(
        cls,
        model_instance,
        field_name,
        field_value,
        pipeline=None,
        **kwargs,
    ):
        """Remove a document from the BM25 index when deleted.

        Atomically reverses the save operation: removes terms from the
        inverted index, updates df, removes dl entry, decrements n,
        and recomputes avgdl.

        Args:
            model_instance: The model instance being deleted.
            field_name: Name of this field on the model.
            field_value: Current field value (unused).
            pipeline: Optional Redis pipeline (not used for Lua eval).
            **kwargs: Additional context.

        Returns:
            The pipeline if provided, otherwise None.
        """
        field = model_instance._meta.fields[field_name]
        if not isinstance(field, BM25Field):
            return pipeline if pipeline else None

        model_class = type(model_instance)
        prefix = field._key_prefix(model_class)
        doc_key = model_instance.db_key.redis_key

        tf_key = f"{prefix}:tf:{doc_key}"
        df_key = f"{prefix}:df"
        dl_key = f"{prefix}:dl"
        n_key = f"{prefix}:n"
        avgdl_key = f"{prefix}:avgdl"
        inv_prefix = f"{prefix}:inv:"

        keys = [tf_key, df_key, dl_key, n_key, avgdl_key]
        argv = [doc_key, inv_prefix]

        POPOTO_REDIS_DB.eval(BM25_DELETE_LUA, len(keys), *keys, *argv)

        return pipeline if pipeline else None

    @classmethod
    def search(cls, model_class, field_name, query_text, limit=10):
        """Search the BM25 index and return ranked results.

        Tokenizes the query, executes the BM25 scoring Lua script, and
        returns results sorted by BM25 score descending.

        Args:
            model_class: The Model class to search.
            field_name: Name of the BM25Field on the model.
            query_text: The search query string.
            limit: Maximum number of results to return. Default 10.

        Returns:
            list[tuple[str, float]]: List of (redis_key, bm25_score) tuples
                sorted by score descending. Returns empty list if query
                produces no tokens or corpus is empty.

        Raises:
            QueryException: If field_name does not refer to a BM25Field.
        """
        from ..models.query import QueryException

        field = model_class._meta.fields.get(field_name)
        if not isinstance(field, BM25Field):
            raise QueryException(
                f"keyword_search() requires a BM25Field. "
                f"'{field_name}' is {type(field).__name__ if field else 'not found'}"
            )

        query_tokens = tokenize(query_text or "")
        if not query_tokens:
            return []

        prefix = field._key_prefix(model_class)
        df_key = f"{prefix}:df"
        dl_key = f"{prefix}:dl"
        n_key = f"{prefix}:n"
        avgdl_key = f"{prefix}:avgdl"
        inv_prefix = f"{prefix}:inv:"

        keys = [df_key, dl_key, n_key, avgdl_key]
        argv = [inv_prefix, limit, field.BM25_K1, field.BM25_B] + query_tokens

        result = POPOTO_REDIS_DB.eval(BM25_SEARCH_LUA, len(keys), *keys, *argv)

        if not result:
            return []

        # Parse flat array: [key1, score1, key2, score2, ...]
        scored = []
        for i in range(0, len(result), 2):
            key = result[i]
            score = result[i + 1]
            if isinstance(key, bytes):
                key = key.decode()
            if isinstance(score, bytes):
                score = score.decode()
            scored.append((key, float(score)))

        return scored

    @classmethod
    def recompute_stats(cls, model_class, field_name):
        """Recompute avgdl from scratch to correct floating-point drift.

        Reads all document lengths from the dl sorted set and recomputes
        the average. Also resets n to the number of entries in that set.

        Args:
            model_class: The Model class.
            field_name: Name of the BM25Field on the model.
        """
        field = model_class._meta.fields.get(field_name)
        if not isinstance(field, BM25Field):
            return

        prefix = field._key_prefix(model_class)
        dl_key = f"{prefix}:dl"
        n_key = f"{prefix}:n"
        avgdl_key = f"{prefix}:avgdl"

        # Get all document lengths
        all_dl = POPOTO_REDIS_DB.zrangebyscore(dl_key, "-inf", "+inf", withscores=True)

        actual_n = len(all_dl)
        total_dl = sum(score for _, score in all_dl)

        POPOTO_REDIS_DB.set(n_key, str(actual_n))
        if actual_n > 0:
            POPOTO_REDIS_DB.set(avgdl_key, str(total_dl / actual_n))
        else:
            POPOTO_REDIS_DB.set(avgdl_key, "0")

    @classmethod
    def get_idf(cls, model_class, field_name, tokens):
        """Get IDF scores for tokens without running a full search.

        Reads document frequency from the existing BM25 df sorted set
        and total doc count. Computes standard BM25 IDF:
            idf = log((N - df + 0.5) / (df + 0.5) + 1)

        Uses ZMSCORE (Redis >= 6.2, Valkey compatible) for batch df lookup.
        Falls back to individual ZSCORE calls if ZMSCORE is unavailable.

        Args:
            model_class: The Model class.
            field_name: Name of the BM25Field.
            tokens: Single token string or list of token strings.

        Returns:
            dict[str, float]: Mapping of token -> IDF score. Tokens not in
                the corpus get maximum IDF (log(2N + 2) when df=0).
                Returns empty dict for empty token list.
                Returns {token: 0.0} for all tokens when corpus is empty (N=0).
        """
        field = model_class._meta.fields.get(field_name)
        if not isinstance(field, BM25Field):
            from ..models.query import QueryException

            raise QueryException(
                f"get_idf() requires a BM25Field. "
                f"'{field_name}' is "
                f"{type(field).__name__ if field else 'not found'}"
            )

        # Normalize single token to list
        if isinstance(tokens, str):
            tokens = [tokens]
        if not tokens:
            return {}

        prefix = field._key_prefix(model_class)
        n_key = f"{prefix}:n"
        df_key = f"{prefix}:df"

        # Read total doc count
        n_raw = POPOTO_REDIS_DB.get(n_key)
        N = int(n_raw) if n_raw else 0

        if N == 0:
            return {token: 0.0 for token in tokens}

        # Batch df lookup via ZMSCORE (Redis >= 6.2) with ZSCORE fallback
        df_values = cls._batch_zscore(df_key, tokens)

        # Compute IDF for each token
        result = {}
        for token, df_raw in zip(tokens, df_values):
            df = float(df_raw) if df_raw is not None else 0.0
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            result[token] = idf

        return result

    @classmethod
    def _batch_zscore(cls, key, members):
        """Batch-read sorted set scores using ZMSCORE with ZSCORE fallback.

        ZMSCORE was added in Redis 6.2 and is supported by Valkey.
        Falls back to individual ZSCORE calls if ZMSCORE is unavailable.

        Args:
            key: Redis sorted set key.
            members: List of member strings to look up.

        Returns:
            list: Scores for each member (None if member not in set).
        """
        try:
            return POPOTO_REDIS_DB.zmscore(key, members)
        except Exception:
            # ZMSCORE not available -- fall back to individual ZSCORE
            return [POPOTO_REDIS_DB.zscore(key, m) for m in members]

    @classmethod
    def filter_selective_tokens(
        cls, model_class, field_name, tokens, min_idf=1.0
    ):
        """Filter tokens to only those with IDF at or above a threshold.

        Useful for pre-filtering keywords before running search().
        Tokens not in a non-empty corpus get the maximum IDF, log(2N + 2),
        and are kept whenever min_idf does not exceed it. If the corpus is
        empty (N=0), every token scores 0.0 and is dropped for min_idf > 0.

        Args:
            model_class: The Model class.
            field_name: Name of the BM25Field.
            tokens: List of token strings.
            min_idf: Minimum IDF score to keep. Default 1.0.

        Returns:
            list[str]: Tokens with IDF >= min_idf, preserving original order.
        """
        if not tokens:
            return []

        idf_scores = cls.get_idf(model_class, field_name, tokens)
        return [t for t in tokens if idf_scores.get(t, 0.0) >= min_idf]

on_save(model_instance, field_name, field_value, pipeline=None, **kwargs) classmethod

Update BM25 index when a model instance is saved.

Reads the source field value, tokenizes it, and atomically updates all BM25 data structures (tf, df, dl, n, avgdl, inverted index) via a single Lua script.

Parameters:
  • model_instance -- The model instance being saved. Required.
  • field_name -- Name of this field on the model. Required.
  • field_value -- Current field value (unused -- BM25Field is side-effect only). Required.
  • pipeline -- Optional Redis pipeline (not used for Lua eval). Default: None
  • **kwargs -- Additional context. Default: {}

Returns:
  • The pipeline if provided, otherwise None.
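The save script is not shown on this page. As a rough mental model, the bookkeeping it performs can be sketched with plain dictionaries standing in for the Redis structures. The class `InMemoryBM25Index` is hypothetical, and the choice to replace the old entry when a document is saved again is an assumption about the script's behavior.

```python
from collections import Counter


class InMemoryBM25Index:
    """Dict-based stand-in for the Redis structures (illustration only)."""

    def __init__(self):
        self.inv = {}     # term -> {doc_key: tf}   (inverted index)
        self.tf = {}      # doc_key -> {term: tf}   (forward index)
        self.df = {}      # term -> number of documents containing it
        self.dl = {}      # doc_key -> document length in tokens
        self.n = 0        # total document count
        self.avgdl = 0.0  # average document length

    def save(self, doc_key, tokens):
        if doc_key in self.tf:
            self._remove(doc_key)  # assumed: re-saving replaces the old entry
        counts = Counter(tokens)
        self.tf[doc_key] = dict(counts)
        for term, count in counts.items():
            self.inv.setdefault(term, {})[doc_key] = count
            self.df[term] = self.df.get(term, 0) + 1
        self.dl[doc_key] = len(tokens)
        self.n += 1
        self.avgdl = sum(self.dl.values()) / self.n

    def _remove(self, doc_key):
        # The forward index tells us which postings to undo.
        for term in self.tf.pop(doc_key):
            del self.inv[term][doc_key]
            self.df[term] -= 1
        del self.dl[doc_key]
        self.n -= 1


idx = InMemoryBM25Index()
idx.save("d1", ["redis", "redis", "guide"])
idx.save("d2", ["redis"])
print(idx.inv["redis"], idx.df["redis"], idx.n, idx.avgdl)
```

In Redis the same updates run inside one script, so readers should not observe a half-updated index.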

Source code in src/popoto/fields/bm25_field.py
@classmethod
def on_save(
    cls,
    model_instance,
    field_name,
    field_value,
    pipeline=None,
    **kwargs,
):
    """Update BM25 index when a model instance is saved.

    Reads the source field value, tokenizes it, and atomically updates
    all BM25 data structures (tf, df, dl, n, avgdl, inverted index)
    via a single Lua script.

    Args:
        model_instance: The model instance being saved.
        field_name: Name of this field on the model.
        field_value: Current field value (unused -- BM25Field is side-effect only).
        pipeline: Optional Redis pipeline (not used for Lua eval).
        **kwargs: Additional context.

    Returns:
        The pipeline if provided, otherwise None.
    """
    field = model_instance._meta.fields[field_name]
    if not isinstance(field, BM25Field):
        return pipeline if pipeline else None

    model_class = type(model_instance)
    prefix = field._key_prefix(model_class)

    # Read source field content
    source_value = getattr(model_instance, field.source, None)
    if source_value is None:
        source_value = ""

    # Handle ContentField references
    if isinstance(source_value, str) and source_value.startswith("$CF:"):
        source_field = model_instance._meta.fields.get(field.source)
        if hasattr(source_field, "store"):
            try:
                source_value = source_field.store.read(source_value)
            except Exception:
                source_value = ""

    source_value = str(source_value)
    # Use unique=False to preserve raw term counts for accurate tf
    tokens = tokenize(source_value, unique=False)

    # Get the document's Redis key
    doc_key = model_instance.db_key.redis_key

    # Build Lua KEYS and ARGV
    tf_key = f"{prefix}:tf:{doc_key}"
    df_key = f"{prefix}:df"
    dl_key = f"{prefix}:dl"
    n_key = f"{prefix}:n"
    avgdl_key = f"{prefix}:avgdl"
    inv_prefix = f"{prefix}:inv:"

    keys = [tf_key, df_key, dl_key, n_key, avgdl_key]
    argv = [doc_key, inv_prefix] + tokens

    POPOTO_REDIS_DB.eval(BM25_SAVE_LUA, len(keys), *keys, *argv)

    return pipeline if pipeline else None

on_delete(model_instance, field_name, field_value, pipeline=None, **kwargs) classmethod

Remove a document from the BM25 index when deleted.

Atomically reverses the save operation: removes terms from the inverted index, updates df, removes dl entry, decrements n, and recomputes avgdl.

Parameters:
  • model_instance -- The model instance being deleted. Required.
  • field_name -- Name of this field on the model. Required.
  • field_value -- Current field value (unused). Required.
  • pipeline -- Optional Redis pipeline (not used for Lua eval). Default: None
  • **kwargs -- Additional context. Default: {}

Returns:
  • The pipeline if provided, otherwise None.
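on_delete() passes only the document key and the inverted-index prefix to its script, which suggests the script reads the document's terms from the forward index rather than re-tokenizing the content. A dict-based sketch of that reversal follows. `remove_document` and the `index` layout are hypothetical, and pruning entries that reach zero is an assumption.

```python
def remove_document(index, doc_key):
    """Reverse a save on a dict-based index (illustration only).

    `index` is a hypothetical dict with keys inv, tf, df, dl, n, avgdl
    that mirror the Redis structures.
    """
    terms = index["tf"].pop(doc_key, None)
    if terms is None:
        return False  # document was never indexed
    for term in terms:
        postings = index["inv"][term]
        del postings[doc_key]
        if not postings:
            del index["inv"][term]  # assumed: empty postings are pruned
        index["df"][term] -= 1
        if index["df"][term] == 0:
            del index["df"][term]
    del index["dl"][doc_key]
    index["n"] -= 1
    total = sum(index["dl"].values())
    index["avgdl"] = total / index["n"] if index["n"] else 0.0
    return True


index = {
    "inv": {"redis": {"d1": 2, "d2": 1}, "guide": {"d1": 1}},
    "tf": {"d1": {"redis": 2, "guide": 1}, "d2": {"redis": 1}},
    "df": {"redis": 2, "guide": 1},
    "dl": {"d1": 3, "d2": 1},
    "n": 2,
    "avgdl": 2.0,
}
removed = remove_document(index, "d1")
print(removed, index["df"], index["n"], index["avgdl"])
```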

Source code in src/popoto/fields/bm25_field.py
@classmethod
def on_delete(
    cls,
    model_instance,
    field_name,
    field_value,
    pipeline=None,
    **kwargs,
):
    """Remove a document from the BM25 index when deleted.

    Atomically reverses the save operation: removes terms from the
    inverted index, updates df, removes dl entry, decrements n,
    and recomputes avgdl.

    Args:
        model_instance: The model instance being deleted.
        field_name: Name of this field on the model.
        field_value: Current field value (unused).
        pipeline: Optional Redis pipeline (not used for Lua eval).
        **kwargs: Additional context.

    Returns:
        The pipeline if provided, otherwise None.
    """
    field = model_instance._meta.fields[field_name]
    if not isinstance(field, BM25Field):
        return pipeline if pipeline else None

    model_class = type(model_instance)
    prefix = field._key_prefix(model_class)
    doc_key = model_instance.db_key.redis_key

    tf_key = f"{prefix}:tf:{doc_key}"
    df_key = f"{prefix}:df"
    dl_key = f"{prefix}:dl"
    n_key = f"{prefix}:n"
    avgdl_key = f"{prefix}:avgdl"
    inv_prefix = f"{prefix}:inv:"

    keys = [tf_key, df_key, dl_key, n_key, avgdl_key]
    argv = [doc_key, inv_prefix]

    POPOTO_REDIS_DB.eval(BM25_DELETE_LUA, len(keys), *keys, *argv)

    return pipeline if pipeline else None

search(model_class, field_name, query_text, limit=10) classmethod

Search the BM25 index and return ranked results.

Tokenizes the query, executes the BM25 scoring Lua script, and returns results sorted by BM25 score descending.

Parameters:
  • model_class -- The Model class to search. Required.
  • field_name -- Name of the BM25Field on the model. Required.
  • query_text -- The search query string. Required.
  • limit -- Maximum number of results to return. Default: 10

Returns:
  • list[tuple[str, float]] -- List of (redis_key, bm25_score) tuples sorted by score descending. Returns an empty list if the query produces no tokens or the corpus is empty.

Raises:
  • QueryException -- If field_name does not refer to a BM25Field.
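The script reply is a flat array of alternating keys and scores, which search() turns into tuples. The standalone function here reproduces that parsing step so it can be tried without a Redis server. `parse_flat_reply` and the sample keys are made up for illustration. Scores are parsed from strings or bytes, presumably because Redis truncates Lua numbers to integers in script replies.

```python
def parse_flat_reply(reply):
    """Turn [key1, score1, key2, score2, ...] into [(key, score), ...]."""
    pairs = []
    for i in range(0, len(reply) - 1, 2):
        key, score = reply[i], reply[i + 1]
        # Clients that do not decode responses hand back bytes.
        if isinstance(key, bytes):
            key = key.decode()
        if isinstance(score, bytes):
            score = score.decode()
        pairs.append((key, float(score)))
    return pairs


reply = [b"Memory:abc", b"1.75", b"Memory:def", b"0.5"]
print(parse_flat_reply(reply))
# [('Memory:abc', 1.75), ('Memory:def', 0.5)]
```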

Source code in src/popoto/fields/bm25_field.py
@classmethod
def search(cls, model_class, field_name, query_text, limit=10):
    """Search the BM25 index and return ranked results.

    Tokenizes the query, executes the BM25 scoring Lua script, and
    returns results sorted by BM25 score descending.

    Args:
        model_class: The Model class to search.
        field_name: Name of the BM25Field on the model.
        query_text: The search query string.
        limit: Maximum number of results to return. Default 10.

    Returns:
        list[tuple[str, float]]: List of (redis_key, bm25_score) tuples
            sorted by score descending. Returns empty list if query
            produces no tokens or corpus is empty.

    Raises:
        QueryException: If field_name does not refer to a BM25Field.
    """
    from ..models.query import QueryException

    field = model_class._meta.fields.get(field_name)
    if not isinstance(field, BM25Field):
        raise QueryException(
            f"keyword_search() requires a BM25Field. "
            f"'{field_name}' is {type(field).__name__ if field else 'not found'}"
        )

    query_tokens = tokenize(query_text or "")
    if not query_tokens:
        return []

    prefix = field._key_prefix(model_class)
    df_key = f"{prefix}:df"
    dl_key = f"{prefix}:dl"
    n_key = f"{prefix}:n"
    avgdl_key = f"{prefix}:avgdl"
    inv_prefix = f"{prefix}:inv:"

    keys = [df_key, dl_key, n_key, avgdl_key]
    argv = [inv_prefix, limit, field.BM25_K1, field.BM25_B] + query_tokens

    result = POPOTO_REDIS_DB.eval(BM25_SEARCH_LUA, len(keys), *keys, *argv)

    if not result:
        return []

    # Parse flat array: [key1, score1, key2, score2, ...]
    scored = []
    for i in range(0, len(result), 2):
        key = result[i]
        score = result[i + 1]
        if isinstance(key, bytes):
            key = key.decode()
        if isinstance(score, bytes):
            score = score.decode()
        scored.append((key, float(score)))

    return scored

recompute_stats(model_class, field_name) classmethod

Recompute avgdl from scratch to correct floating-point drift.

Reads all document lengths from the dl sorted set and recomputes the average. Also resets n to the number of entries in that set. Does nothing if field_name does not refer to a BM25Field.

Parameters:
  • model_class -- The Model class. Required.
  • field_name -- Name of the BM25Field on the model. Required.

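The recomputation amounts to counting the entries of the dl sorted set and averaging their scores. A minimal sketch, using a list of (doc_key, length) pairs in place of the sorted-set reply; `recompute` is a hypothetical name:

```python
def recompute(doc_lengths):
    """Return (n, avgdl) from (doc_key, length) pairs."""
    n = len(doc_lengths)
    total = sum(length for _, length in doc_lengths)
    return n, (total / n if n else 0.0)


print(recompute([("d1", 3.0), ("d2", 1.0), ("d3", 5.0)]))
# (3, 3.0)
```

An empty index yields n = 0 and avgdl = 0, which matches the "0" that the method writes in that case.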
Source code in src/popoto/fields/bm25_field.py
@classmethod
def recompute_stats(cls, model_class, field_name):
    """Recompute avgdl from scratch to correct floating-point drift.

    Reads all document lengths from the dl sorted set and recomputes
    the average. Also resets n to the number of entries in that set.

    Args:
        model_class: The Model class.
        field_name: Name of the BM25Field on the model.
    """
    field = model_class._meta.fields.get(field_name)
    if not isinstance(field, BM25Field):
        return

    prefix = field._key_prefix(model_class)
    dl_key = f"{prefix}:dl"
    n_key = f"{prefix}:n"
    avgdl_key = f"{prefix}:avgdl"

    # Get all document lengths
    all_dl = POPOTO_REDIS_DB.zrangebyscore(dl_key, "-inf", "+inf", withscores=True)

    actual_n = len(all_dl)
    total_dl = sum(score for _, score in all_dl)

    POPOTO_REDIS_DB.set(n_key, str(actual_n))
    if actual_n > 0:
        POPOTO_REDIS_DB.set(avgdl_key, str(total_dl / actual_n))
    else:
        POPOTO_REDIS_DB.set(avgdl_key, "0")

get_idf(model_class, field_name, tokens) classmethod

Get IDF scores for tokens without running a full search.

Reads document frequency from the existing BM25 df sorted set and total doc count. Computes standard BM25 IDF: idf = log((N - df + 0.5) / (df + 0.5) + 1)

Uses ZMSCORE (Redis >= 6.2, Valkey compatible) for batch df lookup. Falls back to individual ZSCORE calls if ZMSCORE is unavailable.

Parameters:
  • model_class -- The Model class. Required.
  • field_name -- Name of the BM25Field. Required.
  • tokens -- Single token string or list of token strings. Required.

Returns:
  • dict[str, float] -- Mapping of token -> IDF score. Tokens not in the corpus get maximum IDF (log(2N + 2) when df=0). Returns an empty dict for an empty token list. Returns {token: 0.0} for all tokens when the corpus is empty (N=0).
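A short worked example of the IDF expression. For df = 0 the argument of the logarithm is (N + 0.5) / 0.5 + 1 = 2N + 2, so an unseen token scores log(2N + 2). The helper `idf` is a standalone restatement of the formula, not a popoto function.

```python
import math


def idf(n_docs, df):
    """BM25 IDF as documented for get_idf(); 0.0 for an empty corpus."""
    if n_docs == 0:
        return 0.0
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


n = 100
for df in (0, 1, 10, 50, 100):
    print(df, round(idf(n, df), 3))

# An unseen token (df=0) gets the maximum value, log(2N + 2).
print(math.isclose(idf(n, 0), math.log(2 * n + 2)))
```

The "+ 1" inside the logarithm keeps the IDF positive even for a term that appears in every document.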

Source code in src/popoto/fields/bm25_field.py
@classmethod
def get_idf(cls, model_class, field_name, tokens):
    """Get IDF scores for tokens without running a full search.

    Reads document frequency from the existing BM25 df sorted set
    and total doc count. Computes standard BM25 IDF:
        idf = log((N - df + 0.5) / (df + 0.5) + 1)

    Uses ZMSCORE (Redis >= 6.2, Valkey compatible) for batch df lookup.
    Falls back to individual ZSCORE calls if ZMSCORE is unavailable.

    Args:
        model_class: The Model class.
        field_name: Name of the BM25Field.
        tokens: Single token string or list of token strings.

    Returns:
        dict[str, float]: Mapping of token -> IDF score. Tokens not in
            the corpus get maximum IDF (log(2N + 2) when df=0).
            Returns empty dict for empty token list.
            Returns {token: 0.0} for all tokens when corpus is empty (N=0).
    """
    field = model_class._meta.fields.get(field_name)
    if not isinstance(field, BM25Field):
        from ..models.query import QueryException

        raise QueryException(
            f"get_idf() requires a BM25Field. "
            f"'{field_name}' is "
            f"{type(field).__name__ if field else 'not found'}"
        )

    # Normalize single token to list
    if isinstance(tokens, str):
        tokens = [tokens]
    if not tokens:
        return {}

    prefix = field._key_prefix(model_class)
    n_key = f"{prefix}:n"
    df_key = f"{prefix}:df"

    # Read total doc count
    n_raw = POPOTO_REDIS_DB.get(n_key)
    N = int(n_raw) if n_raw else 0

    if N == 0:
        return {token: 0.0 for token in tokens}

    # Batch df lookup via ZMSCORE (Redis >= 6.2) with ZSCORE fallback
    df_values = cls._batch_zscore(df_key, tokens)

    # Compute IDF for each token
    result = {}
    for token, df_raw in zip(tokens, df_values):
        df = float(df_raw) if df_raw is not None else 0.0
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        result[token] = idf

    return result

filter_selective_tokens(model_class, field_name, tokens, min_idf=1.0) classmethod

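Filter tokens to only those with IDF at or above a threshold.

Useful for pre-filtering keywords before running search(). Tokens not in a non-empty corpus get the maximum IDF, log(2N + 2), and are kept whenever min_idf does not exceed it. If the corpus is empty (N=0), every token scores 0.0 and is dropped for min_idf > 0.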

Parameters:
  • model_class -- The Model class. Required.
  • field_name -- Name of the BM25Field. Required.
  • tokens -- List of token strings. Required.
  • min_idf -- Minimum IDF score to keep. Default: 1.0

Returns:
  • list[str] -- Tokens with IDF >= min_idf, preserving original order.
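The filtering step is a single comparison per token. The sketch here applies it to a hand-written IDF mapping, so the scores are invented for illustration and `filter_by_idf` is a hypothetical name.

```python
def filter_by_idf(tokens, idf_scores, min_idf=1.0):
    """Keep tokens whose IDF is at least min_idf, preserving order."""
    return [t for t in tokens if idf_scores.get(t, 0.0) >= min_idf]


# Made-up scores: "guide" is common, "kubernetes" is rare.
idf_scores = {"kubernetes": 3.2, "guide": 0.4, "deployment": 1.0}
print(filter_by_idf(["guide", "kubernetes", "deployment"], idf_scores))
# ['kubernetes', 'deployment']
```

A token whose IDF equals the threshold exactly is kept, since the comparison is >=.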

Source code in src/popoto/fields/bm25_field.py
@classmethod
def filter_selective_tokens(
    cls, model_class, field_name, tokens, min_idf=1.0
):
    """Filter tokens to only those with IDF at or above a threshold.

    Useful for pre-filtering keywords before running search().
    Tokens not in a non-empty corpus get the maximum IDF, log(2N + 2),
    and are kept whenever min_idf does not exceed it. If the corpus is
    empty (N=0), every token scores 0.0 and is dropped for min_idf > 0.

    Args:
        model_class: The Model class.
        field_name: Name of the BM25Field.
        tokens: List of token strings.
        min_idf: Minimum IDF score to keep. Default 1.0.

    Returns:
        list[str]: Tokens with IDF >= min_idf, preserving original order.
    """
    if not tokens:
        return []

    idf_scores = cls.get_idf(model_class, field_name, tokens)
    return [t for t in tokens if idf_scores.get(t, 0.0) >= min_idf]