bi_etl.lookups.autodisk_range_lookup module

Created on Jan 5, 2016

@author: Derek Wood

class bi_etl.lookups.autodisk_range_lookup.AutoDiskRangeLookup(lookup_name: str, lookup_keys: list, parent_component: ETLComponent, begin_date, end_date, config: BI_ETL_Config_Base = None, use_value_cache: bool = True, path=None)

Bases: AutoDiskLookup, RangeLookup

Automatic memory / disk lookup cache.

This version divides the cache into N chunks (default is 10). If RAM usage gets beyond limits, it starts moving chunks to disk. Once a chunk is on disk, it stays there.
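The chunk-and-spill behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the bi_etl implementation: the class name ChunkedCacheSketch, the row-count limit standing in for the RAM check, and the pickle files used for disk storage are all assumptions made for the example.

```python
import os
import pickle
import tempfile


class ChunkedCacheSketch:
    """Hypothetical stand-in for a hash-chunked memory/disk cache."""

    def __init__(self, chunk_count=10, max_rows_in_memory=1000):
        self.chunk_count = chunk_count
        # Assumption: a simple row count replaces the real RAM usage check.
        self.max_rows_in_memory = max_rows_in_memory
        self.memory_chunks = {i: {} for i in range(chunk_count)}
        self.disk_chunks = {}  # chunk index -> path of the pickle file
        self.directory = tempfile.mkdtemp()

    def _chunk_index(self, key):
        # A hash function decides which chunk owns a key.
        return hash(key) % self.chunk_count

    def _rows_in_memory(self):
        return sum(len(chunk) for chunk in self.memory_chunks.values())

    def _read_disk_chunk(self, index):
        with open(self.disk_chunks[index], "rb") as f:
            return pickle.load(f)

    def _write_disk_chunk(self, index, chunk):
        path = os.path.join(self.directory, f"chunk_{index}.pickle")
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        self.disk_chunks[index] = path

    def _spill_largest_chunk(self):
        # Move the fullest in-memory chunk to disk; it never comes back.
        index = max(self.memory_chunks, key=lambda i: len(self.memory_chunks[i]))
        self._write_disk_chunk(index, self.memory_chunks.pop(index))

    def put(self, key, row):
        index = self._chunk_index(key)
        if index in self.disk_chunks:
            # Once a chunk is on disk, additions go to the disk copy.
            chunk = self._read_disk_chunk(index)
            chunk[key] = row
            self._write_disk_chunk(index, chunk)
            return
        self.memory_chunks[index][key] = row
        if self._rows_in_memory() > self.max_rows_in_memory:
            self._spill_largest_chunk()

    def get(self, key):
        index = self._chunk_index(key)
        if index in self.disk_chunks:
            return self._read_disk_chunk(index)[key]
        return self.memory_chunks[index][key]
```

Rewriting a whole pickle file on every insert is deliberately naive; it keeps the sketch short while still showing that a spilled chunk stays on disk.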

TODO: For use cases where the lookup will be used in a mostly sequential fashion, it would be useful to have a version that uses ranges instead of a hash function. Then when find_in_cache is called on a disk segment, we could swap a different segment out and bring that segment in. That’s a lot more complicated. We’d also want to maintain a last used date for each segment so that if we add rows to the cache, we can choose the best segment to swap to disk.

Also worth considering is that if we bring a segment in from disk, it would be best to keep the disk version. At that point any additions to that segment would need to go to both places.

COLLECTION_INDEX = datetime.datetime(1900, 1, 1, 0, 0)

DB_LOOKUP_WARNING = 1000

ROW_TYPES: alias of Union[Row, Sequence]

VERSION_COLLECTION_TYPE: alias of OOBTree

__init__(lookup_name: str, lookup_keys: list, parent_component: ETLComponent, begin_date, end_date, config: BI_ETL_Config_Base = None, use_value_cache: bool = True, path=None): The optional parameter path controls where the data is persisted.

add_size_to_stats() → None

cache_row(row: Row, allow_update: bool = True, allow_insert: bool = True)

Adds the given row to the cache for this lookup.

Parameters:

row (Row) – The row to cache.
allow_update (boolean) – Allow this method to update an existing row in the cache.
allow_insert (boolean) – Allow this method to insert a new row into the cache.

Raises:

ValueError – If allow_update is False and an already existing row (lookup key) is passed in.

cache_set(lk_tuple: tuple, version_collection: OOBTree[datetime, Row], allow_update: bool = True)

Adds the given set of rows to the cache for this lookup.

Parameters:

lk_tuple – The key tuple to store the rows under.
version_collection – The set of rows to cache.
allow_update (boolean) – Allow this method to update an existing row in the cache.

Raises:

ValueError – If allow_update is False and an already existing row (lookup key) is passed in.

check_estimate_row_size(force_now=False)

clear_cache(): Removes cache and resets to un-cached state

commit(): Placeholder for other implementations that might need it

estimated_row_size()

find(row: ROW_TYPES, fallback_to_db: bool = True, maintain_cache: bool = True, stats: Statistics = None, **kwargs) → Row

find_in_cache(row, **kwargs): Find a matching row in the lookup based on the lookup index (keys).

find_in_remote_table(row: Row | Sequence, **kwargs) → Row

Find a matching row in the lookup based on the lookup index (keys)

Only works if parent_component is based on bi_etl.components.readonlytable

find_matches_in_cache(row: ROW_TYPES, **kwargs) → Sequence[Row]

find_versions_list(row: ROW_TYPES, fallback_to_db: bool = True, maintain_cache: bool = True, stats: Statistics = None) → list

Parameters:

row – Row or tuple to find.
fallback_to_db – Use the DB to search if the row is not found in the cached copy.
maintain_cache – Add DB lookup rows to the cached copy?
stats – Statistics to maintain.

Return type:

A MutableMapping of rows

find_versions_list_in_remote_table(row: Row | Sequence) → list

Find a matching row in the lookup based on the lookup index (keys)

Only works if parent_component is based on bi_etl.components.readonlytable

find_where(key_names: Sequence, key_values_dict: Mapping, limit: int = None): Scan all cached rows (expensive) to find list of rows that match criteria.
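A full scan of this kind can be sketched as below. This is only an illustration under stated assumptions: the function name find_where_sketch and the cached_rows argument (a plain iterable of dict-like rows) are hypothetical, and the real method works against the lookup's own cache structure.

```python
def find_where_sketch(cached_rows, key_names, key_values_dict, limit=None):
    """Return rows whose values match key_values_dict for every name in key_names."""
    matches = []
    # Every cached row is visited, which is why the scan is expensive.
    for row in cached_rows:
        if all(row[name] == key_values_dict[name] for name in key_names):
            matches.append(row)
            # Stop early once the optional limit has been reached.
            if limit is not None and len(matches) >= limit:
                break
    return matches
```

The cost grows with the number of cached rows rather than with the number of matches, so a limit only helps when matches are found early in the scan.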

flush_to_disk()

get_disk_size()

get_hashable_combined_key(row: ROW_TYPES) → Sequence

get_list_of_lookup_column_values(row: ROW_TYPES) → list

get_memory_size()

get_versions_collection(row: Row | Sequence) → MutableMapping[datetime, Row]

This method exists for compatibility with range caches

Parameters: row – The row containing the keys to search for
Return type: A MutableMapping of rows
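To illustrate how a datetime-keyed versions collection can serve a range lookup, here is a minimal sketch. It rests on assumptions that this page does not confirm: that each version is stored under its begin date, and that each row carries an end date (the "end_date" key below is hypothetical). A plain dict and the standard bisect module stand in for the OOBTree named in VERSION_COLLECTION_TYPE.

```python
from bisect import bisect_right
from datetime import datetime


def find_effective_version(versions, effective_date):
    """Return the row whose date range covers effective_date.

    versions: mapping of begin date -> row (assumed layout).
    Each row is a dict with a hypothetical "end_date" entry.
    """
    begin_dates = sorted(versions)
    # Locate the last version that begins on or before effective_date.
    position = bisect_right(begin_dates, effective_date)
    if position == 0:
        raise KeyError(effective_date)
    row = versions[begin_dates[position - 1]]
    # Reject the candidate if its range ended before effective_date.
    if effective_date > row["end_date"]:
        raise KeyError(effective_date)
    return row
```

An ordered container such as OOBTree keeps its keys sorted, so the sort step here would not be needed with it.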

has_done_get_estimate_row_size()

has_row(row: ROW_TYPES) → bool

Does the row exist in the cache (for any date, if it’s a date range cache)?

Parameters: row – The row (or sequence of key values) to check for

init_cache(): Initializes the cache as empty.

init_disk_cache()

property lookup_keys_set

memory_limit_reached() → bool

report_on_value_cache_effectiveness(lookup_name: str = None)

row_iteration_header_has_lookup_keys(row_iteration_header: RowIterationHeader) → bool

static rstrip_key_value(val: object) → object

Since most, if not all, DBs consider two strings that only differ in trailing blanks to be equal, we need to rstrip any string values so that the lookup does the same.

Parameters: val – The key value to normalize
Returns: val, with trailing blanks stripped when it is a string
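The normalization described above might look like the following sketch. The function name rstrip_key_value_sketch is hypothetical, and passing non-string values through unchanged is an assumption; the actual method may differ in detail, for example in which trailing characters it removes.

```python
def rstrip_key_value_sketch(val):
    """Strip trailing whitespace from strings so keys compare as most DBs compare them."""
    if isinstance(val, str):
        return val.rstrip()
    # Assumption: non-string key values are returned untouched.
    return val
```

With this in place, "ABC   " and "ABC" produce the same lookup key, while leading blanks and non-string values are left alone.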

uncache_row(row: Row | Sequence)

uncache_set(row: Row | Sequence)

uncache_where(key_names: Sequence, key_values_dict: Mapping): Scan all cached rows (expensive) to find rows to remove.