bi_etl.components.hst_table module

Created on Nov 4, 2014

@author: Derek Wood

class bi_etl.components.hst_table.HistoryTable(task: ETLTask | None, database: DatabaseMetadata, table_name: str, table_name_case_sensitive: bool = True, schema: str = None, exclude_columns: frozenset = None, default_effective_date: datetime = None, **kwargs)[source]

Bases: Table

ETL target component for a table that stores history of updates. Also usable as a source.

Parameters:

task (ETLTask) – The ETLTask instance to register this component in (if not None)
database (DatabaseMetadata) – The database to find the table/view in.
table_name (str) – The name of the table/view.
exclude_columns (frozenset) – Optional. A set of columns to exclude from the table/view. These columns will not be included in SELECT, INSERT, or UPDATE statements.

begin_date

Name of the begin date field

Type:: str

end_date

Name of the end date field

Type:: str

inserts_use_default_begin_date

Should inserts use the default begin date instead of the class effective date? This allows records to match up in joins with other history tables where the effective date of the ‘primary’ might be before the first version effective date. Default = True

Type:: boolean

default_begin_date

Default begin date to assign for begin_date. Used for new records if inserts_use_default_begin_date is True. Also used for get_missing_row(), get_invalid_row(), get_not_applicable_row(), and get_various_row(). Default = 1900-1-1

Type:: date

default_end_date

Default end date to assign for end_date for active rows. Also used for get_missing_row(), get_invalid_row(), get_not_applicable_row(), and get_various_row(). Default = 9999-1-1

Type:: date
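The role of the two default dates can be pictured with a small sketch. It is not bi_etl code: the version rows, the column names begin/end, and the helper version_on are hypothetical, and it assumes both range bounds are inclusive. The first version starts at the default begin date, so any earlier effective date from a joined history table still finds a match, and the active version is the one ending at the default end date.

```python
from datetime import date

DEFAULT_BEGIN = date(1900, 1, 1)  # documented default_begin_date
DEFAULT_END = date(9999, 1, 1)    # documented default_end_date

# Two versions of one hypothetical customer record.
versions = [
    {"cust": 7, "city": "Oslo", "begin": DEFAULT_BEGIN, "end": date(2020, 5, 31)},
    {"cust": 7, "city": "Rome", "begin": date(2020, 6, 1), "end": DEFAULT_END},
]


def version_on(rows, effective):
    """Return the version whose date range covers `effective`, else None."""
    for row in rows:
        # Assumption: begin and end are both inclusive bounds.
        if row["begin"] <= effective <= row["end"]:
            return row
    return None


# Active rows are the ones still carrying the default end date.
active = [row for row in versions if row["end"] == DEFAULT_END]
```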

auto_generate_key

Should the primary key be automatically generated by the insert/upsert process? If True, the process will get the current maximum value and then increment it with each insert. (inherited from Table)

Type:: boolean
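The get-the-maximum-then-increment behaviour described for auto_generate_key can be sketched as follows. This is an illustration only, not the library's implementation: KeyGenerator and next_key are hypothetical names, and the starting maximum is passed in by hand where a real load would read it from the table.

```python
class KeyGenerator:
    """Hypothetical helper that hands out keys above a known maximum."""

    def __init__(self, current_max: int = 0):
        # A real load would obtain this from the table, e.g. SELECT MAX(pk).
        self._last = current_max

    def next_key(self) -> int:
        # Each insert receives the previous value plus one.
        self._last += 1
        return self._last


gen = KeyGenerator(current_max=41)
rows = [{"name": "a"}, {"name": "b"}]
for row in rows:
    row["id"] = gen.next_key()
```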

batch_size

How many rows should be inserted/updated/deleted in a single batch. (inherited from Table)

Type:: int

delete_flag

The name of the delete_flag column, if any. Optional. (inherited from ReadOnlyTable)

Type:: str

delete_flag_yes

The value of delete_flag for deleted rows. Optional. (inherited from ReadOnlyTable)

Type:: str

delete_flag_no

The value of delete_flag for not deleted rows. Optional. (inherited from ReadOnlyTable)

Type:: str

default_date_format

The date parsing format to use for str -> date conversions. If more than one date format exists in the source, then explicit conversions will be required.

Default = ‘%m/%d/%Y’ (inherited from Table)

Type:: str
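The default format string follows the standard strptime directives. A minimal sketch using only the standard library shows what the default accepts and why a source with a second date layout needs an explicit conversion; the sample values are made up.

```python
from datetime import datetime

DEFAULT_DATE_FORMAT = "%m/%d/%Y"  # the documented default

# A value in the default layout parses directly.
parsed = datetime.strptime("11/04/2014", DEFAULT_DATE_FORMAT)

# A value in another layout is rejected by the default format...
try:
    datetime.strptime("2014-11-04", DEFAULT_DATE_FORMAT)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# ...so it needs an explicit conversion with its own format.
iso_parsed = datetime.strptime("2014-11-04", "%Y-%m-%d")
```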

force_ascii

Should text values be forced into the ascii character set before passing to the database? Default = False (inherited from Table)

Type:: boolean

last_update_date

Name of the column which we should update when table updates are made. Default = None (inherited from Table)

Type:: str

log_first_row

Should we log progress on the first row read? Only applies if used as a source. (inherited from ETLComponent)

Type:: boolean

max_rows

The maximum number of rows to read. Only applies if Table is used as a source. Optional. (inherited from ETLComponent)

Type:: int

primary_key

The name of the primary key column(s). Only impacts trace messages. Default=None. If not passed in, will use the database value, if any. (inherited from ETLComponent)

Type:: list

natural_key

The list of natural key columns (as Column objects). The default is the list of non-begin/end date primary key columns. The default is not appropriate for dimension tables with surrogate keys.

Type:: list

progress_frequency

How often (in seconds) to output progress messages. Optional. (inherited from ETLComponent)

Type:: int

progress_message

The progress message to print. Optional. Default is "{logical_name} row # {row_number}". Note: logical_name and row_number substitutions are applied via format(). (inherited from ETLComponent)

Type:: str
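Since the substitutions go through format(), a template behaves like any Python format string. The sketch below uses plain str.format with a made-up component name; the second template is the DEFAULT_PROGRESS_MESSAGE value listed on this page, whose {row_number:,} field adds thousands separators.

```python
# Template from the progress_message description above.
template = "{logical_name} row # {row_number}"
message = template.format(logical_name="customer_hst", row_number=1500)

# DEFAULT_PROGRESS_MESSAGE uses a ',' format spec for thousands separators.
default_template = "{logical_name} current row # {row_number:,}"
with_separators = default_template.format(
    logical_name="customer_hst", row_number=1234567
)
```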

special_values_descriptive_columns

A list of columns that get longer descriptive text in get_missing_row(), get_invalid_row(), get_not_applicable_row(), and get_various_row(). Optional. (inherited from ReadOnlyTable)

Type:: list

track_source_rows

Should the upsert() method keep a set container of source row keys that it has processed? That set would then be used by update_not_processed(), logically_delete_not_processed(), and delete_not_processed(). (inherited from Table)

Type:: boolean
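The bookkeeping behind track_source_rows amounts to a set difference. The sketch below is a plain-Python illustration, not the library's code; the key tuples and the id column are hypothetical.

```python
# Keys currently present in the target table (hypothetical).
existing_keys = {(1,), (2,), (3,)}

# Keys seen while upserting source rows.
processed = set()
for source_row in [{"id": 1}, {"id": 3}]:
    processed.add((source_row["id"],))

# Rows a *_not_processed() style method would then act on.
not_processed = existing_keys - processed
```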

type_1_surrogate

The name of the type 1 surrogate key. The value is automatically generated as equal to the type 2 key on inserts and equal to the existing value on updates. Optional.

Type:: str
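The rule stated for type_1_surrogate can be sketched in a few lines. This is an illustration under assumptions: assign_keys and the column names type2_key and type1_key are hypothetical, and the new type 2 key is supplied by the caller.

```python
def assign_keys(existing_versions, new_type2_key):
    """Return (type 2 key, type 1 key) for a new version row."""
    if existing_versions:
        # Update: the new version keeps the entity's existing type 1 value.
        return new_type2_key, existing_versions[0]["type1_key"]
    # Insert of a new entity: the type 1 key starts equal to the type 2 key.
    return new_type2_key, new_type2_key


first = assign_keys([], 100)
second = assign_keys([{"type2_key": 100, "type1_key": 100}], 101)
```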

DEFAULT_BATCH_SIZE = 5000

DEFAULT_PROGRESS_FREQUENCY = 10: Default for number of seconds between progress messages when reading from this component. See ETLComponent.progress_frequency to override.

DEFAULT_PROGRESS_MESSAGE = '{logical_name} current row # {row_number:,}': Default progress message when reading from this component. See ETLComponent.progress_message to override.

class DeleteMethod(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

__init__(*args, **kwds)

as_integer_ratio()

Return a pair of integers, whose ratio is equal to the original int.

The ratio is in lowest terms and has a positive denominator.

>>> (10).as_integer_ratio()
(10, 1)
>>> (-10).as_integer_ratio()
(-10, 1)
>>> (0).as_integer_ratio()
(0, 1)

bit_count()

Number of ones in the binary representation of the absolute value of self.

Also known as the population count.

>>> bin(13)
'0b1101'
>>> (13).bit_count()
3

bit_length()

Number of bits necessary to represent self in binary.

>>> bin(37)
'0b100101'
>>> (37).bit_length()
6

bulk_load = 3

conjugate(): Returns self, the complex conjugate of any int.

denominator: the denominator of a rational number in lowest terms

execute_many = 1

from_bytes(bytes, byteorder='big', *, signed=False)

Return the integer represented by the given array of bytes.

bytes: Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
byteorder: The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value. Default is to use ‘big’.
signed: Indicates whether two’s complement is used to represent the integer.

imag: the imaginary part of a complex number

is_integer(): Returns True. Exists for duck type compatibility with float.is_integer.

numerator: the numerator of a rational number in lowest terms

real: the real part of a complex number

to_bytes(length=1, byteorder='big', *, signed=False)

Return an array of bytes representing an integer.

length: Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes. Default is length 1.
byteorder: The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value. Default is to use ‘big’.
signed: Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.
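Because DeleteMethod derives from IntEnum, its members behave as integers, which is why the int methods above appear on it. The sketch below defines a local stand-in with the two documented member values rather than importing the real class.

```python
from enum import IntEnum


class DeleteMethod(IntEnum):
    # Local stand-in mirroring the member values documented above.
    execute_many = 1
    bulk_load = 3


by_value = DeleteMethod(3)           # look up a member by its integer value
by_name = DeleteMethod["bulk_load"]  # look up a member by its name
is_int = isinstance(DeleteMethod.execute_many, int)
```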

FULL_ITERATION_HEADER = 'full': Constant value passed into ETLComponent.Row() to request all columns in the row. Deprecated: Please use ETLComponent.full_row_instance() to get a row with all columns.

class InsertMethod(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

bulk_load = 3

conjugate(): Returns self, the complex conjugate of any int.

denominator: the denominator of a rational number in lowest terms

execute_many = 1

insert_values_list = 2

is_integer(): Returns True. Exists for duck type compatibility with float.is_integer.

numerator: the numerator of a rational number in lowest terms

real: the real part of a complex number

to_bytes(length=1, byteorder='big', *, signed=False)

Return an array of bytes representing an integer.

length: Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes. Default is length 1.
byteorder: The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use `sys.byteorder’ as the byte order value. Default is to use ‘big’.
signed: Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.

class IntEnum(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: int, ReprEnum

Enum where members are also (and must be) ints

NAN_REPLACEMENT_VALUE = None

NK_LOOKUP = 'NK'

PK_LOOKUP = 'PK'

Row(data: MutableMapping | Iterator | None = None, iteration_header: RowIterationHeader | str | None = None) → Row: Make a new empty row with this component’s structure.

class TimePrecision(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

day = 'd'

hour = 'h'

microsecond = 'µs'

millisecond = 'ms'

minute = 'm'

second = 's'
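One plausible use of such a precision setting is to truncate begin and end timestamps to a common granularity. The source does not show how begin_end_precision is applied, so the sketch below is an assumption throughout: it defines a local stand-in enum with the documented member values and a hypothetical truncate helper.

```python
from datetime import datetime
from enum import Enum


class TimePrecision(Enum):
    # Local stand-in mirroring the documented member values.
    microsecond = 'µs'
    millisecond = 'ms'
    second = 's'
    minute = 'm'
    hour = 'h'
    day = 'd'


def truncate(ts: datetime, precision: TimePrecision) -> datetime:
    """Hypothetical helper: drop everything finer than `precision`."""
    if precision is TimePrecision.microsecond:
        return ts
    if precision is TimePrecision.millisecond:
        return ts.replace(microsecond=ts.microsecond // 1000 * 1000)
    if precision is TimePrecision.second:
        return ts.replace(microsecond=0)
    if precision is TimePrecision.minute:
        return ts.replace(second=0, microsecond=0)
    if precision is TimePrecision.hour:
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


ts = datetime(2014, 11, 4, 13, 45, 59, 123456)
```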

class UpdateMethod(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: IntEnum

bulk_load = 3

conjugate(): Returns self, the complex conjugate of any int.

denominator: the denominator of a rational number in lowest terms

execute_many = 1

__init__(task: ETLTask | None, database: DatabaseMetadata, table_name: str, table_name_case_sensitive: bool = True, schema: str = None, exclude_columns: frozenset = None, default_effective_date: datetime = None, **kwargs)[source]

apply_updates(row, changes_list: MutableSequence[ColumnDifference] = None, additional_update_values: dict | Row = None, add_to_cache: bool = True, allow_insert=True, stat_name: str = 'update', parent_stats: Statistics = None, **kwargs)[source]

This method should only be called with a row that has already been transformed into the correct datatypes and column names.

The update values can be in any of the following parameters:

row (also used for PK)
changes_list
additional_update_values

Parameters:

row – The row to update with (needs to have at least PK values)
changes_list – A list of ColumnDifference objects to apply to the row
additional_update_values – A Row or dict of additional values to apply to the row
add_to_cache – Should this method update the cache (not if caller will)
allow_insert (boolean) – Allow this method to insert a new row into the cache
stat_name – Name of this step for the ETLTask statistics. Default = ‘update’
parent_stats –
kwargs – effective_date (datetime): The effective date to use for the update

autogenerate_key(row: Row, force_override: bool = True)

autogenerate_sequence(row: Row, seq_column: str, force_override: bool = True)

property batch_size

begin(connection_name: str | None = None)

property begin_date_column: str

property begin_end_precision: TimePrecision

build_nk()[source]

build_row(source_row: Row, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'build rows', parent_stats: Statistics = None) → Row[source]

Use a source row to build a row with correct data types for this table.

Parameters:

source_row –
source_excludes –
target_excludes –
stat_name – Name of this step for the ETLTask statistics. Default = ‘build rows’
parent_stats –

Return type:

Row

build_row_dynamic_source(source_row: Row, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'build_row_dynamic_source', parent_stats: Statistics | None = None) → Row

Use a source row to build a row with correct data types for this table. This version expects dynamically changing source rows, so it sanity checks all rows.

Parameters:

source_row –
source_excludes –
target_excludes –
stat_name – Name of this step for the ETLTask statistics. Default = ‘build_row_dynamic_source’
parent_stats –

Return type:

Row

bulk_load_from_cache(temp_table: str | None = None, stat_name: str = 'bulk_load_from_cache', parent_stats: Statistics | None = None)

cache_commit()

cache_iterable()

cache_row(row: Row, allow_update: bool = False, allow_insert: bool = True)

property check_row_limit

cleanup_versions(remove_spurious_deletes: bool = False, remove_redundant_versions: bool = None, lookup_name: str = None, exclude_from_compare: frozenset = None, criteria_list: list = None, criteria_dict: dict = None, use_cache_as_source: bool = True, repeat_until_clean: bool = True, max_begin_date_warnings: int = 100, max_end_date_warnings: int = 100, max_passes: int = 10, parent_stats=None, progress_message='{table} cleanup_versions pass {pass_number} current row # {row_number:,}')[source]

This routine looks for and removes versions that have no material difference from the prior version. That can happen during loads if rows come in out of order. The routine can also optionally remove spurious deletes (see remove_spurious_deletes below). It also checks for version dates that are not set correctly.

Parameters:

remove_spurious_deletes (boolean) – Should the routine check for and remove versions that tag a record as deleted only for it to be un-deleted later? Defaults to False.
remove_redundant_versions (boolean) – Should the routine delete rows that are exactly the same as the previous version? Defaults to the opposite of the auto_generate_key value.
lookup_name (str) – Name passed into define_lookup()
criteria_list (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict (dict) – Dict keys should be columns, values are set using = or in
exclude_from_compare (frozenset) – Columns to exclude from comparisons. Defaults to begin date, end date, and last update date. Any values passed in are added to that list.
use_cache_as_source (bool) – Attempt to read existing rows from the cache?
repeat_until_clean – Repeat the loop until all issues are cleaned. Multiple bad rows in a key set can require multiple passes.
max_begin_date_warnings – How many warning messages to print about bad begin dates
max_end_date_warnings – How many warning messages to print about bad end dates
max_passes – How many times should we loop over the dataset if we keep finding fixes. Note: Some situations do require multiple passes to fully correct.
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
progress_message (str) – The progress message to print.
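The redundant-version part of this cleanup can be sketched in plain Python. This is not the library's algorithm, only an illustration of the idea: drop_redundant_versions, the begin/end column names, and the sample rows are hypothetical, and the input is assumed to hold the versions of a single natural key sorted by begin date.

```python
from datetime import date


def drop_redundant_versions(versions, compare_cols):
    """Merge consecutive versions whose compared columns are identical."""
    cleaned = []
    for version in versions:
        if cleaned and all(cleaned[-1][c] == version[c] for c in compare_cols):
            # No material change: stretch the prior version over this one.
            cleaned[-1]["end"] = version["end"]
        else:
            cleaned.append(dict(version))  # copy so the input is untouched
    return cleaned


history = [
    {"city": "Oslo", "begin": date(1900, 1, 1), "end": date(2019, 12, 31)},
    {"city": "Oslo", "begin": date(2020, 1, 1), "end": date(2020, 5, 31)},
    {"city": "Rome", "begin": date(2020, 6, 1), "end": date(9999, 1, 1)},
]
cleaned = drop_redundant_versions(history, compare_cols=["city"])
```

The begin and end columns are left out of compare_cols here, which mirrors the documented default of excluding the date columns from the comparison.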

clear_cache(): Clear all lookup caches. Sets to un-cached state (unknown state v.s. empty state which is what init_cache gives)

clear_statistics()

close(error: bool = False)

close_connection(connection_name: str = None)

close_connections(exceptions: set | None = None)

column_coerce_type(target_column_object: ColumnElement, target_column_value: object): This is the slower, non-dynamic-code-based data type conversion routine.

property column_names: List[str]: The list of column names for this component.

property column_names_set: set: A set containing the column names for this component. Usable to quickly check if the component contains a certain column.

property columns: List[Column]: A name-based collection of sqlalchemy.sql.expression.ColumnElement objects in this table/view.

commit(stat_name: str = 'commit', parent_stats: Statistics | None = None, print_to_log: bool = True, connection_name: str | None = None, begin_new: bool = True)

Flush any buffered deletes, updates, or inserts

Parameters:

stat_name (str) – Name of this step for the ETLTask statistics. Default = ‘commit’
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
print_to_log (bool) – Should this add a debug log entry for each commit? Defaults to true.
connection_name – Name of the pooled connection to use. Defaults to DEFAULT_CONNECTION_NAME.
begin_new – Start a new transaction after commit

connection(connection_name: str | None = None, open_if_not_exist: bool = True, open_if_closed: bool = True) → Connection

count(column: str = None, where=None) → int

Query the table/view to get the count of a given column.

Parameters:

column (str or sqlalchemy.sql.expression.ColumnElement) – The column to get the count of
where (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where() http://docs.sqlalchemy.org/en/rel_1_0/core/selectable.html?highlight=where#sqlalchemy.sql.expression.Select.where

Returns:

count

Return type:

int
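Assuming count() maps onto SQL COUNT (the source does not show the generated statement), the difference between counting rows and counting a column is worth keeping in mind: COUNT(column) skips NULLs. The sketch below uses the standard-library sqlite3 module rather than this component, with a made-up table and data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_hst (cust_id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO cust_hst VALUES (?, ?)",
    [(1, "Oslo"), (2, None), (3, "Rome")],
)

# COUNT(*) counts rows; COUNT(city) counts only non-NULL city values.
all_rows = conn.execute("SELECT COUNT(*) FROM cust_hst").fetchone()[0]
with_city = conn.execute("SELECT COUNT(city) FROM cust_hst").fetchone()[0]

# A where clause narrows the rows before counting.
filtered = conn.execute(
    "SELECT COUNT(city) FROM cust_hst WHERE cust_id > ?", (1,)
).fetchone()[0]
conn.close()
```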

debug_log(state: bool = True)

property default_begin_date

property default_effective_date

property default_end_date

define_lookup(lookup_name, lookup_keys, lookup_class=None, lookup_class_kwargs=None)[source]

Define a new lookup.

Parameters:

lookup_name (str) – Name for the lookup. Used to refer to it later.
lookup_keys (list) – list of lookup key columns
lookup_class (Class) – Optional python class to use for the lookup. Defaults to value of default_lookup_class attribute.
lookup_class_kwargs (dict) – Optional dict of additional parameters to pass to lookup constructor. Defaults to empty dict. begin_date and end_date are added automatically.

delete(**kwargs)[source]: Not implemented for history table. Instead use physically_delete_version.

property delete_flag

property delete_method

delete_not_in_set(set_of_key_tuples: set, lookup_name: str = None, criteria_list: list = None, criteria_dict: dict = None, use_cache_as_source: bool = True, stat_name: str = 'delete_not_in_set', progress_frequency: int = None, parent_stats: Statistics = None, **kwargs)[source]

Overridden to call logical delete.

Deletes rows matching criteria that are not in the set_of_key_tuples passed in.

Parameters:

set_of_key_tuples – Set of tuples comprising the primary key values. This set represents the rows that should not be deleted.
lookup_name (str) – The name of the lookup to use to find key tuples.
criteria_list (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict (dict) – Dict keys should be columns, values are set using = or in
use_cache_as_source (bool) – Attempt to read existing rows from the cache?
stat_name (string) – Name of this step for the ETLTask statistics. Default = ‘delete_not_in_set’
progress_frequency (int) – How often (in seconds) to output progress messages. Default 10. None for no progress messages.
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
kwargs – effective_date (datetime): The effective date to use for the update
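A logical delete on a history table can be pictured as closing the active version and adding a new version flagged as deleted. The sketch below works under that assumption and is not taken from the library: the function name, the column names (key, begin, end, deleted), the ‘Y’/‘N’ flag values, and day-level precision are all hypothetical.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 1, 1)  # documented default_end_date


def logically_delete_not_in_set(versions, keep_keys, effective):
    """Flag active rows whose key is not in keep_keys as deleted."""
    result = []
    for v in versions:
        is_active = v["end"] == HIGH_DATE
        if is_active and v["key"] not in keep_keys and v["deleted"] == "N":
            # Close the current version the day before the effective date...
            closed = dict(v, end=effective - timedelta(days=1))
            # ...and open a new version that carries the delete flag.
            tombstone = dict(v, begin=effective, deleted="Y")
            result.extend([closed, tombstone])
        else:
            result.append(v)
    return result


table = [
    {"key": (1,), "begin": date(1900, 1, 1), "end": HIGH_DATE, "deleted": "N"},
    {"key": (2,), "begin": date(1900, 1, 1), "end": HIGH_DATE, "deleted": "N"},
]
after = logically_delete_not_in_set(
    table, keep_keys={(1,)}, effective=date(2024, 3, 1)
)
```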

delete_not_processed(criteria_list: list = None, criteria_dict: dict = None, use_cache_as_source: bool = True, stat_name: str = 'delete_not_processed', parent_stats: Statistics = None, **kwargs)[source]

Overridden to call logical delete.

Logically deletes rows matching criteria that are not in the Table memory of rows passed to upsert().

Parameters:

criteria_list (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict (dict) – Dict keys should be columns, values are set using = or in
use_cache_as_source (bool) – Attempt to read existing rows from the cache?
stat_name (string) – Name of this step for the ETLTask statistics.
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

property empty_iteration_header: RowIterationHeader

property end_date_column: str

ensure_nk_lookup()

exclude_columns(columns_to_exclude: set)

Exclude columns from the table. Removes them from all SQL statements.

Parameters:

columns_to_exclude (set) – The columns to exclude when reading the table/view.

execute(statement, *list_params, connection_name: str = None, **params) → LegacyCursorResult

Parameters:

statement – The SQL statement to execute. Note: caller must handle the transaction begin/end.
connection_name – Name of the pooled connection to use. Defaults to DEFAULT_CONNECTION_NAME.

Return type:

sqlalchemy.engine.ResultProxy with results

fill_cache(progress_frequency: float = 10, progress_message='{component} fill_cache current row # {row_number:,}', criteria_list: list = None, criteria_dict: dict = None, column_list: list = None, exclude_cols: frozenset = None, order_by: list = None, assume_lookup_complete: bool = None, allow_duplicates_in_src: bool = False, row_limit: int = None, parent_stats: Statistics = None)[source]

Fill all lookup caches from the table.

Parameters:

column_list (list) – Optional. Specific columns to include when filling the cache.
exclude_cols (frozenset) – Optional. Columns to exclude from the cached rows.
progress_frequency (int) – How often (in seconds) to output progress messages. Default 10. None for no progress messages. Optional.
progress_message (str) – The progress message to print. Default is "{component} fill_cache current row # {row_number:,}". Note: substitutions are applied via format(). Optional.
criteria_list (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict (dict) – Dict keys should be columns, values are set using = or in
assume_lookup_complete (boolean) – Should later lookup calls assume the cache is complete (and thus raise an Exception if a key combination is not found)? Defaults to False if filtering criteria were used, otherwise defaults to True.
allow_duplicates_in_src – Should we quietly let the source provide multiple rows with the same key values? Default = False
row_limit (int) – Limit on the number of rows to cache.
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
order_by (list) – Columns to order by when pulling data. Sometimes required to build the cache correctly.
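The effect of assume_lookup_complete can be sketched with a small stand-in cache. This is an illustration under assumptions, not the library's lookup class: CachedLookup, its attributes, and the dict standing in for the database are hypothetical. With a complete cache a miss is a definite "not found"; with a partial (filtered) cache a miss falls through to the database.

```python
class CachedLookup:
    """Hypothetical in-memory lookup keyed by tuple."""

    def __init__(self, assume_complete, db_fetch):
        self.cache = {}
        self.assume_complete = assume_complete
        self.db_fetch = db_fetch  # callable(key) -> row or None
        self.db_calls = 0

    def fill(self, rows, key_col):
        for row in rows:
            self.cache[(row[key_col],)] = row

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        if self.assume_complete:
            # Complete cache: no need to ask the database.
            raise KeyError(key)
        self.db_calls += 1
        row = self.db_fetch(key)
        if row is None:
            raise KeyError(key)
        self.cache[key] = row
        return row


database = {(1,): {"id": 1, "city": "Oslo"}, (2,): {"id": 2, "city": "Rome"}}

# Filtered fill: only id 1 is cached, so the cache is not complete.
partial = CachedLookup(assume_complete=False, db_fetch=database.get)
partial.fill([database[(1,)]], key_col="id")
rome = partial.get((2,))

# Unfiltered fill: a miss raises without touching the database.
full = CachedLookup(assume_complete=True, db_fetch=database.get)
full.fill(database.values(), key_col="id")
try:
    full.get((9,))
    miss_raised = False
except KeyError:
    miss_raised = True
```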

fill_cache_from_source(source: ETLComponent, progress_frequency: float = 10, progress_message='{component} fill_cache current row # {row_number:,}', criteria_list: list = None, criteria_dict: dict = None, column_list: list = None, exclude_cols: frozenset = None, order_by: list = None, assume_lookup_complete: bool = None, allow_duplicates_in_src: bool = False, row_limit: int = None, parent_stats: Statistics = None)

Fill all lookup caches from the given source component.

Parameters:

source – Source component to get rows from.
progress_frequency (int, optional) – How often (in seconds) to output progress messages. Default 10. None for no progress messages.
progress_message (str, optional) – The progress message to print. Default is "{component} fill_cache current row # {row_number:,}". Note: substitutions are applied via format().
criteria_list (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict (dict) – Dict keys should be columns, values are set using = or in
column_list – List of columns to include
exclude_cols – Optional. Columns to exclude when filling the cache
order_by (list) – List of columns to sort by when filling the cache (helps range caches)
assume_lookup_complete (boolean) – Should later lookup calls assume the cache is complete? If so, lookups will raise an Exception if a key combination is not found. Defaults to False if filtering criteria were used, otherwise defaults to True.
allow_duplicates_in_src – Should we quietly let the source provide multiple rows with the same key values? Default = False
row_limit (int) – Limit on the number of rows to cache.
parent_stats (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

property full_iteration_header: RowIterationHeader

full_row_instance(data: MutableMapping | Iterator | None = None) → Row

Build a full row (all columns) using the source data.

Note: If data is passed here, it uses bi_etl.components.row.row.Row.update() to map the data into the columns. That is nicely automatic, but slower since it has to try various ways to read the data container object.

Consider using the appropriate one of the more specific update methods based on the source data container.

generate_iteration_header(logical_name: str | None = None, columns_in_order: list | None = None, result_primary_key: list | None = None) → RowIterationHeader

get_bind_name(column_name: str) → str

get_by_key(source_row: Row, stats_id: str = 'get_by_key', parent_stats: Statistics = None) → Row: Get by the primary key.

get_by_lookup(lookup_name: str, source_row: Row, stats_id: str = 'get_by_lookup', parent_stats: Statistics | None = None, fallback_to_db: bool = False) → Row[source]

Get by an alternate key. Returns a row.

Parameters:

lookup_name¶ (str) – Name passed into define_lookup()
source_row¶ (Row) – Row to get lookup keys from (including effective date)
stats_id¶ (str) – Statistics name to use
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
fallback_to_db¶ – Should we check the DB if a record is not found in the cache

Raises:

NoResultFound: – If key doesn’t exist.
BeforeAllExisting: – If the effective date provided is before all existing records.

get_by_lookup_and_effective_date(lookup_name, source_row, effective_date, stats_id='get_by_lookup_and_effective_date', parent_stats=None, fallback_to_db: bool = False)[source]

Get by an alternate key. Returns a row.

Parameters:

lookup_name¶ (str) – Name passed into define_lookup()
source_row¶ (Row) – Row to get lookup keys from
effective_date¶ (date) – Effective date to use for lookup
stats_id¶ (str) – Statistics name to use
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
fallback_to_db¶ – Should we check the DB if a record is not found in the cache

Raises:

NoResultFound: – If key doesn’t exist.
BeforeAllExisting: – If the effective date provided is before all existing records.
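
The effective-date lookup described above can be pictured with a small stand-alone sketch. It does not call bi_etl: the versions_by_key layout, the find_version helper, and the two exception classes are hypothetical stand-ins that only mirror the names in the Raises section. It assumes each version is in effect from its begin date until the next version begins.

```python
from bisect import bisect_right
from datetime import date


class NoResultFound(Exception):
    """Stand-in for the library exception: the key does not exist."""


class BeforeAllExisting(Exception):
    """Stand-in: the effective date precedes every stored version."""


def find_version(versions_by_key, key, effective_date):
    """Return the version of `key` that is in effect on `effective_date`.

    `versions_by_key` maps a key tuple to a list of (begin_date, row)
    pairs sorted by begin_date (an assumed layout for this sketch).
    """
    versions = versions_by_key.get(key)
    if not versions:
        raise NoResultFound(key)
    begin_dates = [begin for begin, _ in versions]
    # Index of the last version whose begin date is <= effective_date.
    position = bisect_right(begin_dates, effective_date) - 1
    if position < 0:
        raise BeforeAllExisting(key, effective_date)
    return versions[position][1]


versions_by_key = {
    ("C1",): [
        (date(2020, 1, 1), {"name": "Acme"}),
        (date(2021, 6, 1), {"name": "Acme Corp"}),
    ]
}
row = find_version(versions_by_key, ("C1",), date(2021, 1, 15))
```

Keeping the two failure modes separate lets a caller treat "unknown key" differently from "known key, but the date is too early".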

get_coerce_method(target_column_object: str | ColumnElement) → Callable

get_column(column: str | Column) → Column: Get the sqlalchemy.sql.expression.ColumnElement object for a given column name.

get_column_name(column): Get the column name given a possible sqlalchemy.sql.expression.ColumnElement object.

get_column_special_value(column: Column, short_char: str, long_char: str, int_value: int, date_value: datetime, use_custom_special_values: bool = 'Y') → object

get_current_time() → datetime

get_default_lookup(row_iteration_header: RowIterationHeader) → Lookup

get_invalid_row()

Get a Row with the Invalid special values filled in for all columns.

Type	Value
Integer	-8888
Short Text	‘!’
Long Text	‘Invalid’
Date	9999-8-1

get_lookup(lookup_name: str | None) → Lookup

get_lookup_keys(lookup_name: str) → list

get_lookup_tuple(lookup_name: str, row: Row) → tuple

get_missing_row()

Get a Row with the Missing special values filled in for all columns.

Type	Value
Integer	-9999
Short Text	‘?’
Long Text	‘Missing’
Date	9999-9-1

get_natural_key_tuple(row) → tuple

get_natural_key_value_list(row: Row) → list

get_nk_lookup() → Lookup

get_nk_lookup_name()

get_none_selected_row()

Get a Row with the None Selected special values filled in for all columns.

Type	Value
Integer	-5555
Short Text	‘#’
Long Text	‘None Selected’
Date	9999-5-1

get_not_applicable_row()

Get a Row with the Not Applicable special values filled in for all columns.

Type	Value
Integer	-7777
Short Text	‘~’
Long Text	‘Not Available’
Date	9999-7-1

get_one(statement=None)

Executes and gets one row from the statement.

Parameters:

statement¶ – The SQL statement to execute

Returns:

row – The row returned

Return type:

Row

Raises:

NoResultFound – No rows returned.
MultipleResultsFound – More than one row was returned.

get_pk_lookup() → Lookup

get_primary_key_value_list(row) → list

get_primary_key_value_tuple(row) → tuple

get_qualified_lookup_name(base_lookup_name: str) → str

get_special_row(short_char: str, long_char: str, int_value: int, date_value: datetime)

get_stats_entry(stats_id: str, parent_stats: Statistics | None = None, print_start_stop_times: bool | None = None)

get_unique_stats_entry(stats_id: str, parent_stats: Statistics | None = None, print_start_stop_times: bool | None = None)

get_various_row()

Get a Row with the Various special values filled in for all columns.

Type	Value
Integer	-6666
Short Text	‘*’
Long Text	‘Various’
Date	9999-6-1
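
The special-value tables in this section can be gathered into one mapping. The sketch below is illustrative only: the values are copied from the tables, but the SPECIAL_VALUES dict, the build_special_row helper, and the column-type labels are hypothetical and are not part of the bi_etl API.

```python
from datetime import date

# Values copied from the Type/Value tables in this section.
SPECIAL_VALUES = {
    "missing": {"int": -9999, "short": "?", "long": "Missing", "date": date(9999, 9, 1)},
    "invalid": {"int": -8888, "short": "!", "long": "Invalid", "date": date(9999, 8, 1)},
    "not_applicable": {"int": -7777, "short": "~", "long": "Not Available", "date": date(9999, 7, 1)},
    "various": {"int": -6666, "short": "*", "long": "Various", "date": date(9999, 6, 1)},
    "none_selected": {"int": -5555, "short": "#", "long": "None Selected", "date": date(9999, 5, 1)},
}


def build_special_row(kind, column_types):
    """Fill every column with the special value for its type.

    `column_types` maps a column name to one of 'int', 'short',
    'long', or 'date' (labels invented for this sketch).
    """
    values = SPECIAL_VALUES[kind]
    return {name: values[type_name] for name, type_name in column_types.items()}


example = build_special_row(
    "missing",
    {"customer_id": "int", "status_code": "short",
     "customer_name": "long", "created_on": "date"},
)
```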

property in_bulk_mode

include_only_columns(columns_to_include: set)

Include only specified columns in the table definition. Columns that are not included are removed from all SQL statements.

columns_to_include¶ (list) – A list of columns to include when reading the table/view.

init_cache(): Initialize all lookup caches as empty.

Insert a row or list of rows in the table.

Parameters:

source_row¶ (Row or list thereof) – Row(s) to insert
additional_insert_values¶ (dict) – Additional values to set on each row.
source_excludes¶ – list of Row source columns to exclude when mapping to this Table.
target_excludes¶ – list of Table columns to exclude when mapping from the source Row(s)
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

property insert_method

insert_row(source_row: Row, additional_insert_values: dict = None, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'insert', parent_stats: Statistics = None, **kwargs) → Row[source]

Inserts a row into the database (batching rows as batch_size)

Parameters:

source_row¶ – The row with values to insert
additional_insert_values¶ –
source_excludes¶ – set of source columns to exclude
target_excludes¶ – set of target columns to exclude
stat_name¶ –
parent_stats¶ –
**kwargs¶ –

Return type:

new_row

property is_closed

is_connected(connection_name: str | None = None) → bool

is_date_key_column(column)[source]

Yields:: row (Row) – next row

static kwattrs_order() → Dict[str, int][source]: Certain values need to be set before others in order to work correctly. This method should return a dict mapping each such argument name to a value less than the default of 9999, which is used for any argument not explicitly listed here.
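
One way to read the ordering rule is as a sort key with a default. The sketch below is a hypothetical illustration, not the library code: order_kwargs and the priorities shown are invented, and it assumes that lower values are applied first.

```python
DEFAULT_ORDER = 9999  # default documented for arguments not explicitly listed


def order_kwargs(kwargs, explicit_order):
    """Return (name, value) pairs sorted by their order value.

    Names missing from `explicit_order` get DEFAULT_ORDER. sorted() is
    stable, so those names keep their original relative order.
    """
    return sorted(
        kwargs.items(),
        key=lambda item: explicit_order.get(item[0], DEFAULT_ORDER),
    )


# Hypothetical priorities: begin_date must be set before end_date.
explicit_order = {"begin_date": 1, "end_date": 2}
ordered = order_kwargs(
    {"batch_size": 500, "end_date": "end_dt", "begin_date": "start_dt"},
    explicit_order,
)
names = [name for name, _ in ordered]
```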

log_progress(row: Row, stats: Statistics)

logging_level_reported = False: Has the logging level of this component been reported (logged) yet? Stored at class level so that it can be logged only once.

logically_delete_not_in_set(set_of_key_tuples: set, lookup_name: str | None = None, criteria_list: list | None = None, criteria_dict: dict | None = None, use_cache_as_source: bool = True, stat_name: str = 'logically_delete_not_in_set', progress_frequency: int | None = 10, parent_stats: Statistics | None = None, **kwargs)

Logically deletes rows matching criteria that are not in the set_of_key_tuples passed in.

Parameters:

set_of_key_tuples¶ – Set of tuples comprising the primary key values. This set represents the rows that should not be logically deleted.
lookup_name¶ – Name of the lookup to use
criteria_list¶ – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict¶ – Dict keys should be columns, values are set using = or in
use_cache_as_source¶ (bool) – Attempt to read existing rows from the cache?
stat_name¶ – Name of this step for the ETLTask statistics. Default = ‘logically_delete_not_in_set’
progress_frequency¶ – How often (in seconds) to output progress messages. Default = 10.
parent_stats¶ – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
kwargs¶ –

IF child class HistoryTable

effective_date:
The effective date to use for this operation.
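
Conceptually, the operation flags every existing key that is absent from set_of_key_tuples. The following stand-alone sketch shows that idea on plain dicts. The delete_flag column name and the "Y" marker are assumptions for illustration, and the real method also handles criteria, lookups, and effective dates.

```python
def logically_delete_missing(rows_by_key, set_of_key_tuples, delete_flag="delete_flag"):
    """Flag rows whose key is not in `set_of_key_tuples`.

    Returns the list of keys that were newly flagged. Rows already
    flagged are skipped so repeated runs do not count them again.
    """
    newly_deleted = []
    for key, row in rows_by_key.items():
        if key not in set_of_key_tuples and row.get(delete_flag) != "Y":
            row[delete_flag] = "Y"
            newly_deleted.append(key)
    return newly_deleted


rows_by_key = {
    (1,): {"name": "a", "delete_flag": "N"},
    (2,): {"name": "b", "delete_flag": "N"},
    (3,): {"name": "c", "delete_flag": "N"},
}
keys_seen_in_source = {(1,), (3,)}
deleted = logically_delete_missing(rows_by_key, keys_seen_in_source)
```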

logically_delete_not_in_source(source: ReadOnlyTable, source_criteria_list: list | None = None, source_criteria_dict: dict | None = None, target_criteria_list: list | None = None, target_criteria_dict: dict | None = None, use_cache_as_source: bool | None = True, parent_stats: Statistics | None = None)

Logically deletes rows matching criteria that are not in the source component passed to this method. The primary use case for this method is when the upsert method is only passed new/changed records and so cannot build a complete set of source keys in source_keys_processed.

Parameters:

source¶ – The source to read to get the source keys.
source_criteria_list¶ (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
source_criteria_dict¶ – Dict keys should be columns, values are set using = or in
target_criteria_list¶ (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
target_criteria_dict¶ – Dict keys should be columns, values are set using = or in
use_cache_as_source¶ – Attempt to read existing rows from the cache?
parent_stats¶ – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

logically_delete_not_processed(criteria_list: list | None = None, criteria_dict: dict | None = None, use_cache_as_source: bool = True, allow_delete_all: bool = False, stat_name='logically_delete_not_processed', parent_stats: Statistics | None = None, **kwargs)

Logically deletes rows matching criteria that are not in the Table memory of rows passed to upsert().

Parameters:

criteria_list¶ – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict¶ – Dict keys should be columns, values are set using = or in
use_cache_as_source¶ (bool) – Attempt to read existing rows from the cache?
allow_delete_all¶ – Allow this method to delete all rows. Defaults to False as a safeguard in case an error prevented any rows from being processed.
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
stat_name¶ (string) – Name of this step for the ETLTask statistics. Default = ‘logically_delete_not_processed’

property lookups

property maintain_cache_during_load: bool

max(column, where=None, connection_name: str = 'max')

Query the table/view to get the maximum value of a given column.

Parameters:

column¶ (str or sqlalchemy.sql.expression.ColumnElement.) – The column to get the max value of
where¶ (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where() http://docs.sqlalchemy.org/en/rel_1_0/core/selectable.html?highlight=where#sqlalchemy.sql.expression.Select.where
connection_name¶ – Name of the pooled connection to use. Defaults to ‘max’.

Returns:

max

Return type:

depends on column datatype

property natural_key: list: Get this table’s natural key.

order_by(order_by: list, stats_id: str = None, parent_stats: Statistics = None) → Iterable[Row]

Iterate over all rows in order provided.

Parameters:

order_by¶ (string or list of strings) – Each value should represent a column to order by.
stats_id¶ (string) – Name of this step for the ETLTask statistics.
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

Yields:

row (Row) – Row object with contents of a table/view row

physically_delete_version(row_to_be_deleted: Row, remove_from_cache=True, prior_row: Row = Ellipsis, stat_name='delete_version', parent_stats=None)[source]

Physically delete a given version row. Corrects the preceding end date value.

Parameters:

row_to_be_deleted¶ (Row) – Expected to be an entire existing row.
remove_from_cache¶ (boolean) –
Optional. Remove the row from the cache?

Default = True
prior_row¶ (bi_etl.components.row.row_case_insensitive.Row) – Optional. The prior row, if already available. If None, it will be obtained via get_by_lookup_and_effective_date()
stat_name¶ (str) – Name of this step for the ETLTask statistics. Default = ‘delete_version’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional. Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
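
The "corrects the preceding end date" behaviour can be sketched on an in-memory version list. This is an assumption about what the correction means, namely that the prior version inherits the deleted version's end date so that no gap is left. The delete_version helper and the begin/end field names are hypothetical.

```python
from datetime import date


def delete_version(versions, index):
    """Remove versions[index] and extend the prior version to cover it.

    `versions` is a list of dicts sorted by 'begin'. If there is no
    prior version, nothing needs correcting.
    """
    removed = versions.pop(index)
    if index > 0:
        versions[index - 1]["end"] = removed["end"]
    return removed


history = [
    {"begin": date(1900, 1, 1), "end": date(2021, 6, 1), "name": "Acme"},
    {"begin": date(2021, 6, 1), "end": date(9999, 1, 1), "name": "Acme Corp"},
]
removed = delete_version(history, 1)
```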

property primary_key: list: The name of the primary key column(s). Only impacts trace messages. Default=Empty list.

property primary_key_tuple: tuple: The name of the primary key column(s) in a tuple. Used when a hashable PK definition is needed.

property progress_frequency: int: How often (in seconds) to output progress messages. None for no progress messages.

property qualified_table_name: The table name

property quoted_qualified_table_name: The table name

rollback(stat_name: str = 'rollback', parent_stats: Statistics | None = None, connection_name: str | None = None, begin_new: bool = True)

Rollback any uncommitted deletes, updates, or inserts.

Parameters:

stat_name¶ (str) – Name of this step for the ETLTask statistics. Default = ‘rollback’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
connection_name¶ – Name of the connection to rollback
begin_new¶ – Start a new transaction after rollback

property row_name

property rows_read: int: The number of rows read and returned.

sanity_check_example_row(example_source_row, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, ignore_source_not_in_target: bool | None = None, ignore_target_not_in_source: bool | None = None)

sanity_check_source_mapping(source_definition, source_name=None, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, ignore_source_not_in_target=None, ignore_target_not_in_source=None, raise_on_source_not_in_target=None, raise_on_target_not_in_source=None)[source]

select(column_list: list | None = None, exclude_cols: frozenset | None = None) → GenerativeSelect

Builds a select statement for this table.

Return type:: statement

set_bulk_loader(bulk_loader: BulkLoader)

set_columns(columns)

set_kwattrs(**kwargs)

set_last_update_date(row)

sql_upsert(source_table: ReadOnlyTable, source_effective_date_column: str, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, skip_update_check_on: frozenset | None = None, check_for_deletes: bool = None, connection_name: str = 'sql_upsert', temp_table_prefix: str = '', commit_each_table: bool = False, stat_name: str = 'upsert_db_exclusive', parent_stats: Statistics | None = None)[source]

property statistics

property table

property table_key_memory

property table_name: The table name

property trace_data: bool: Should a debug message be printed with the parsed contents (as columns) of each row?

transaction(connection_name: str | None = None)

truncate(timeout: int = 60, stat_name: str = 'truncate', parent_stats: Statistics | None = None)

Truncate the table if possible, else delete all.

Parameters:

timeout¶ (int) – How long in seconds to wait for the truncate. Oracle only.
stat_name¶ (str) – Name of this step for the ETLTask statistics. Default = ‘truncate’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

uncache_row(row)

uncache_where(key_names, key_values_dict)

unique(): Class decorator for enumerations ensuring unique member values.

Directly performs a database update. Invalidates caching. THIS METHOD IS SLOW! If you have a full target row, use apply_updates instead.

Parameters:

updates_to_make¶ – Updates to make to the rows matching the criteria. Can also be used to pass the key_values, so you can pass a single Row or dict to the call and have it automatically get the filter values and updates from it.
key_names¶ – Optional. List of columns to apply criteria to (see key_values). Defaults to Primary Key columns.
key_values¶ – Optional. List of values to apply as criteria (see key_names). If not provided, and update_all_rows is False, look in updates_to_make for values.
lookup_name¶ (str) – Name of the lookup to use
update_all_rows¶ – Optional. Defaults to False. If set to True, key_names and key_values are not required.
source_excludes¶ – Optional. list of Row source columns to exclude when mapping to this Table.
target_excludes¶ – Optional. list of Table columns to exclude when mapping from the source Row(s)
stat_name¶ (str) – Name of this step for the ETLTask statistics. Default = ‘direct update’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional. Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
connection_name¶ – Name of the pooled connection to use. Defaults to DEFAULT_CONNECTION_NAME.

property update_method

update_not_in_set(updates_to_make: dict | Row, set_of_key_tuples: set, lookup_name: str = None, criteria_list: list = None, criteria_dict: dict = None, use_cache_as_source: bool = True, progress_frequency: int = None, stat_name: str = 'update_not_in_set', parent_stats: Statistics = None, **kwargs)[source]

Applies update to rows matching criteria that are not in the set_of_key_tuples passed in.

Parameters:

updates_to_make¶ (Row) – Row or dict of updates to make
set_of_key_tuples¶ (set) – Set of tuples comprising the primary key values. This set represents the rows that should not be updated.
lookup_name¶ (str) – The name of the lookup (see define_lookup()) to use when searching for an existing row.
criteria_list¶ (string or list of strings) – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict¶ (dict) – Dict keys should be columns, values are set using = or in
use_cache_as_source¶ (bool) – Attempt to read existing rows from the cache?
stat_name¶ (string) – Name of this step for the ETLTask statistics. Default = ‘update_not_in_set’
progress_frequency¶ (int) – How often (in seconds) to output progress messages. Default = 10. None for no progress messages.
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
kwargs¶ –

effective_date: datetime
The effective date to use for the update

update_not_processed(update_row, lookup_name: str | None = None, criteria_list: Iterable | None = None, criteria_dict: MutableMapping | None = None, use_cache_as_source: bool = True, stat_name: str | None = 'update_not_processed', parent_stats: Statistics | None = None, **kwargs)

Applies update to all rows matching criteria that are not in the Table memory of rows passed to upsert().

Parameters:

update_row¶ – Row or dict of updates to make
criteria_list¶ – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). https://goo.gl/JlY9us
criteria_dict¶ – Dict keys should be columns, values are set using = or in
lookup_name¶ – Optional lookup name those key values are from.
use_cache_as_source¶ – Attempt to read existing rows from the cache?
stat_name¶ – Name of this step for the ETLTask statistics.
parent_stats¶ – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
kwargs¶ –

IF HistoryTable or child thereof:

effective_date: datetime
The effective date to use for the update

Updates the table using the primary key for the where clause.

Parameters:

updates_to_make¶ – Updates to make to the rows matching the criteria. Can also be used to pass the key_values, so you can pass a single Row or dict to the call and have it automatically get the filter values and updates from it.
key_values¶ – Optional. Dict or list of key values to apply as criteria.
source_excludes¶ – Optional. Set of source column names to exclude when mapping to this Table.
target_excludes¶ – Optional. Set of target column names to exclude when mapping from the source to target Row(s).
stat_name¶ – Name of this step for the ETLTask statistics. Default = ‘upsert_by_pk’
parent_stats¶ – Optional. Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

upsert(source_row: Row | List[Row], lookup_name: str = None, skip_update_check_on: frozenset | None = None, do_not_update: list = None, additional_update_values: dict = None, additional_insert_values: dict = None, update_callback: Callable[[MutableSequence, Row], None] = None, insert_callback: Callable[[Row], None] = None, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'upsert', parent_stats: Statistics = None, **kwargs)[source]

Update (if changed) or Insert a row in the table. Returns the row found/inserted, with the auto-generated key (if that feature is enabled)

Parameters:

source_row¶ (Row) – Row to upsert
lookup_name¶ (str) – The name of the lookup (see define_lookup()) to use when searching for an existing row.
skip_update_check_on¶ (list) – List of column names to not compare old vs new for updates.
do_not_update¶ (list) – List of columns to never update.
additional_update_values¶ (dict) – Additional updates to apply when updating
additional_insert_values¶ (dict) – Additional values to set on each row when inserting.
update_callback¶ (func) – Function to pass updated rows to. Function should not modify row.
insert_callback¶ (func) – Function to pass inserted rows to. Function should not modify row.
source_excludes¶ (frozenset) – Set of source columns to exclude when mapping to this Table.
target_excludes¶ (frozenset) – Set of Table columns to exclude when mapping from the source row.
stat_name¶ (string) – Name of this step for the ETLTask statistics. Default = ‘upsert’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.
kwargs¶ –

effective_date: datetime
The effective date to use for the update
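
For a history table, "update if changed" generally means closing the current version and adding a new one instead of overwriting values. The sketch below illustrates that pattern on an in-memory list and is not the library implementation. The two default dates come from the class attributes documented for HistoryTable. The history_upsert helper, the begin/end/values field names, and the choice to set the closed version's end date equal to the new begin date are assumptions.

```python
from datetime import date

DEFAULT_BEGIN_DATE = date(1900, 1, 1)  # documented default_begin_date
DEFAULT_END_DATE = date(9999, 1, 1)    # documented default_end_date


def history_upsert(versions, new_values, effective_date,
                   inserts_use_default_begin_date=True):
    """Apply `new_values` to a sorted version list as of `effective_date`.

    Returns 'inserted', 'unchanged', or 'new_version'.
    """
    if not versions:
        # First version: optionally back-date it to the default begin date.
        begin = DEFAULT_BEGIN_DATE if inserts_use_default_begin_date else effective_date
        versions.append({"begin": begin, "end": DEFAULT_END_DATE,
                         "values": dict(new_values)})
        return "inserted"
    current = versions[-1]
    if current["values"] == new_values:
        return "unchanged"
    # Close the active version and open a new one (assumed end-date rule).
    current["end"] = effective_date
    versions.append({"begin": effective_date, "end": DEFAULT_END_DATE,
                     "values": dict(new_values)})
    return "new_version"


history = []
results = [
    history_upsert(history, {"name": "Acme"}, date(2020, 1, 1)),
    history_upsert(history, {"name": "Acme"}, date(2020, 6, 1)),
    history_upsert(history, {"name": "Acme Corp"}, date(2021, 6, 1)),
]
```

Back-dating the first version matches the stated purpose of inserts_use_default_begin_date: joins against other history tables still find a row for dates earlier than the first load.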

upsert_by_pk(source_row: Row, stat_name='upsert_by_pk', parent_stats: Statistics | None = None, **kwargs)

Used by bi_etl.components.table.Table.upsert_special_values_rows() to find and update rows by the full PK. Not expected to be useful outside that use case.

Parameters:

source_row¶ (Row) – Row to upsert
stat_name¶ (string) – Name of this step for the ETLTask statistics. Default = ‘upsert_by_pk’
parent_stats¶ (bi_etl.statistics.Statistics) – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

upsert_special_values_rows(stat_name: str = 'upsert_special_values_rows', parent_stats: Statistics = None)[source]

Send all special values rows to upsert to ensure they exist and are current. Rows come from:

get_missing_row()

get_invalid_row()

get_not_applicable_row()

get_various_row()

Parameters:

stat_name¶ – Name of this step for the ETLTask statistics. Default = ‘upsert_special_values_rows’
parent_stats¶ – Optional Statistics object to nest this step’s statistics in. Default is to place statistics in the ETLTask level statistics.

where(criteria_list: list = None, criteria_dict: dict = None, order_by: list = None, column_list: List[Column | str] = None, exclude_cols: FrozenSet[Column | str] = None, use_cache_as_source: bool = None, connection_name: str = 'select', progress_frequency: int = None, stats_id: str = None, parent_stats: Statistics = None) → Iterable[Row]

Parameters:

criteria_list¶ – Each string value will be passed to sqlalchemy.sql.expression.Select.where(). http://docs.sqlalchemy.org/en/rel_1_0/core/selectable.html?highlight=where#sqlalchemy.sql.expression.Select.where
criteria_dict¶ – Dict keys should be columns, values are set using = or in
order_by¶ – List of sort keys
column_list¶ – List of columns (str or Column)
exclude_cols¶ –
use_cache_as_source¶ –
connection_name¶ – Name of the pooled connection to use
progress_frequency¶ –
stats_id¶ –
parent_stats¶ –

Return type:

rows
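
The criteria_dict rule ("values are set using = or in") can be illustrated without a database. In this hedged sketch a scalar value is treated as an equality test and a list, tuple, or set as a membership test. The row_matches helper is hypothetical, and the library itself builds SQL conditions rather than filtering in Python.

```python
def row_matches(row, criteria_dict):
    """Return True if `row` satisfies every entry of `criteria_dict`."""
    for column, wanted in criteria_dict.items():
        if isinstance(wanted, (list, tuple, set, frozenset)):
            # Collection value: behaves like SQL "column IN (...)".
            if row[column] not in wanted:
                return False
        elif row[column] != wanted:
            # Scalar value: behaves like SQL "column = value".
            return False
    return True


rows = [
    {"region": "EU", "status": "A"},
    {"region": "US", "status": "A"},
    {"region": "EU", "status": "X"},
]
selected = [r for r in rows if row_matches(r, {"region": "EU", "status": ["A", "B"]})]
```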