bi_etl.components.xlsx_writer module
Created on Apr 2, 2015
- class bi_etl.components.xlsx_writer.XLSXWriter(task: ETLTask | None, file_name: str | Path, logical_name: str = None, write_only: bool = True, **kwargs)[source]
Bases:
XLSXReader
XLSXWriter will write rows to a Microsoft Excel XLSX formatted workbook.
- Parameters:
- workbook
The workbook that was opened.
- Type:
openpyxl.workbook.workbook.Workbook
- log_first_row
Should we log progress on the first row read. Only applies if used as a source. (inherited from ETLComponent)
- Type:
boolean
- max_rows
The maximum number of rows to read. Only applies if Table is used as a source. (inherited from ETLComponent)
- Type:
int, optional
- primary_key
The name of the primary key column(s). Only impacts trace messages. Default=None. (inherited from ETLComponent)
- Type:
- progress_frequency
How often (in seconds) to output progress messages. None for no progress messages. (inherited from ETLComponent)
- Type:
- progress_message
The progress message to print. Default is
"{logical_name} row # {row_number}"
. Notelogical_name
androw_number
subs. (inherited from ETLComponent)- Type:
- restkey
Column name to catch extra long rows (more columns than we have column names) when reading values (extra values).
- Type:
- restval
The value to put in columns that are in the column_names but not present in a given row when reading (missing values).
- Type:
- DEFAULT_PROGRESS_FREQUENCY = 10
Default for number of seconds between progress messages when reading from this component. See
ETLComponent.progress_frequency`
to override.
- DEFAULT_PROGRESS_MESSAGE = '{logical_name} current row # {row_number:,}'
Default progress message when reading from this component. See
ETLComponent.progress_message`
to override.
- FULL_ITERATION_HEADER = 'full'
Constant value passed into
ETLComponent.Row()
to request all columns in the row. Deprecated: Please useETLComponent.full_row_instance()
to get a row with all columns.
- Row(data: MutableMapping | Iterator | None = None, iteration_header: RowIterationHeader | str | None = None) Row
Make a new empty row with this components structure.
- __init__(task: ETLTask | None, file_name: str | Path, logical_name: str = None, write_only: bool = True, **kwargs)[source]
- property active_worksheet
- build_row(source_row: Row, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'build_row_safe', parent_stats: Statistics | None = None) Row
Use a source row to build a row with correct data types for this table.
- build_row_dynamic_source(source_row: Row, source_excludes: frozenset | None = None, target_excludes: frozenset | None = None, stat_name: str = 'build_row_dynamic_source', parent_stats: Statistics | None = None) Row
Use a source row to build a row with correct data types for this table. This version expects dynamically changing source rows, so it sanity checks all rows.
- cache_commit()
- property check_row_limit
- clear_cache()
Clear all lookup caches. Sets to un-cached state (unknown state v.s. empty state which is what init_cache gives)
- clear_statistics()
- close(error: bool = False)[source]
Close the xlsx file, saving first if
error
is false.- Parameters:
error¶ – Did we run into an error during processing? Errors cause a rollback, which skips the save of the file.
- property column_names_set: set
A set containing the column names for this component. Usable to quickly check if the component contains a certain column.
- define_lookup(lookup_name: str, lookup_keys: list, lookup_class: Type[Lookup] = None, lookup_class_kwargs: dict | None = None)
Define a new lookup.
- Parameters:
lookup_name¶ – Name for the lookup. Used to refer to it later.
lookup_keys¶ – list of lookup key columns
lookup_class¶ – Optional python class to use for the lookup. Defaults to value of default_lookup_class attribute.
lookup_class_kwargs¶ – Optional dict of additional parameters to pass to lookup constructor. Defaults to empty dict.
- property empty_iteration_header: RowIterationHeader
- fill_cache(progress_frequency: float = 10, progress_message='{component} fill_cache current row # {row_number:,}', criteria_list: list = None, criteria_dict: dict = None, column_list: list = None, exclude_cols: frozenset = None, order_by: list = None, assume_lookup_complete: bool = None, allow_duplicates_in_src: bool = False, row_limit: int = None, parent_stats: Statistics = None)
Fill all lookup caches from the table.
- Parameters:
progress_frequency¶ – How often (in seconds) to output progress messages. Default 10. None for no progress messages.
progress_message¶ – The progress message to print. Default is
"{component} fill_cache current row # {row_number:,}"
. Notelogical_name
androw_number
substitutions applied viaformat()
.criteria_list¶ – Each string value will be passed to
sqlalchemy.sql.expression.Select.where()
. https://goo.gl/JlY9uscriteria_dict¶ – Dict keys should be columns, values are set using = or in
column_list¶ – List of columns to include
exclude_cols¶ (frozenset) – Optional. Columns to exclude when filling the cache
order_by¶ – list of columns to sort by when filling the cache (helps range caches)
assume_lookup_complete¶ – Should later lookup calls assume the cache is complete? If so, lookups will raise an Exception if a key combination is not found. Default to False if filtering criteria was used, otherwise defaults to True.
allow_duplicates_in_src¶ – Should we quietly let the source provide multiple rows with the same key values? Default = False
row_limit¶ – limit on number of rows to cache.
row_limit¶ – limit on number of rows to cache.
parent_stats¶ – Optional Statistics object to nest this steps statistics in. Default is to place statistics in the ETLTask level statistics.
- fill_cache_from_source(source: ETLComponent, progress_frequency: float = 10, progress_message='{component} fill_cache current row # {row_number:,}', criteria_list: list = None, criteria_dict: dict = None, column_list: list = None, exclude_cols: frozenset = None, order_by: list = None, assume_lookup_complete: bool = None, allow_duplicates_in_src: bool = False, row_limit: int = None, parent_stats: Statistics = None)
Fill all lookup caches from the database table. Note that filtering criteria can be specified so that the resulting cache is not the entire current contents. See
assume_lookup_complete
for how the lookup will handle cache misses – note only database table backed components have the ability to fall back to querying the existing data on cache misses.- Parameters:
source¶ – Source component to get rows from.
progress_frequency¶ – How often (in seconds) to output progress messages. Default 10. None for no progress messages.
progress_message¶ – The progress message to print. Default is
"{component} fill_cache current row # {row_number:,}"
. Notelogical_name
androw_number
substitutions applied viaformat()
.criteria_list¶ – Each string value will be passed to
sqlalchemy.sql.expression.Select.where()
. https://goo.gl/JlY9uscriteria_dict¶ – Dict keys should be columns, values are set using = or in
column_list¶ – List of columns to include
exclude_cols¶ – Optional. Columns to exclude when filling the cache
order_by¶ – list of columns to sort by when filling the cache (helps range caches)
assume_lookup_complete¶ – Should later lookup calls assume the cache is complete? If so, lookups will raise an Exception if a key combination is not found. Default to False if filtering criteria was used, otherwise defaults to True.
allow_duplicates_in_src¶ – Should we quietly let the source provide multiple rows with the same key values? Default = False
row_limit¶ – limit on number of rows to cache.
parent_stats¶ – Optional Statistics object to nest this steps statistics in. Default is to place statistics in the ETLTask level statistics.
- property full_iteration_header: RowIterationHeader
- full_row_instance(data: MutableMapping | Iterator | None = None) Row
Build a full row (all columns) using the source data.
Note: If data is passed here, it uses
bi_etl.components.row.row.Row.update()
to map the data into the columns. That is nicely automatic, but slower since it has to try various ways to read the data container object.Consider using the appropriate one of the more specific update methods based on the source data container.
- generate_iteration_header(logical_name: str | None = None, columns_in_order: list | None = None, result_primary_key: list | None = None) RowIterationHeader
- get_by_lookup(lookup_name: str, source_row: Row, stats_id: str = 'get_by_lookup', parent_stats: Statistics | None = None, fallback_to_db: bool = False) Row
Get by an alternate key. Returns a
Row
- Throws:
NoResultFound
- get_sheet_by_name(name)
Returns a worksheet by its name.
- get_sheet_names()
- get_stats_entry(stats_id: str, parent_stats: Statistics | None = None, print_start_stop_times: bool | None = None)
- get_unique_stats_entry(stats_id: str, parent_stats: Statistics | None = None, print_start_stop_times: bool | None = None)
- init_cache()
Initialize all lookup caches as empty.
- insert(source_row: Row | list, parent_stats: Statistics = None, **kwargs)[source]
Insert a row or list of rows in the table.
- insert_row(source_row: Row, additional_insert_values: dict = None, stat_name: str = 'insert', parent_stats: Statistics = None) Row [source]
Inserts a row into the database (batching rows as batch_size)
- Parameters:
source_row¶ – The row with values to insert
additional_insert_values¶ – Values to add / override in the row before inserting.
stat_name¶ – Name of this step for the ETLTask statistics.
parent_stats¶ – Optional Statistics object to nest this steps statistics in. Default is to place statistics in the ETLTask level statistics.
- Return type:
new_row
- property is_closed
- iter_result(result_list: object, columns_in_order: list | None = None, criteria_dict: dict | None = None, logical_name: str | None = None, progress_frequency: int | None = None, stats_id: str | None = None, parent_stats: Statistics | None = None) Iterable[Row]
- Yields:
row (
Row
) – next row
- static kwattrs_order() Dict[str, int]
Certain values need to be set before others in order to work correctly. This method should return a dict mapping those key values = arg name to a value less than the default of 9999, which will be used for any arg not explicitly listed here.
- property line_num
The current line number in the source file. line_num differs from rows_read in that rows_read deals with rows that would be returned to the caller
- log_progress(row: Row, stats: Statistics)
- logging_level_reported = False
Has the logging level of this component been reported (logged) yet? Stored at class level so that it can be logged only once.
- property lookups
- make_table_from_inserted_data(style_name: str = 'TableStyleMedium2')[source]
Format the newly inserted worksheet data as a table.
- Parameters:
style_name¶ – The name of the Excel style to apply to the table.
- property primary_key: list
The name of the primary key column(s). Only impacts trace messages. Default=Empty list.
- property primary_key_tuple: tuple
The name of the primary key column(s) in a tuple. Used when a hashable PK definition is needed.
- property progress_frequency: int
How often (in seconds) to output progress messages. None for no progress messages.
- read_header_row()
- property rows_inserted
- property rows_inserted_this_sheet
- sanity_check_example_row(example_source_row, source_excludes=None, target_excludes=None, ignore_source_not_in_target=None, ignore_target_not_in_source=None)
- sanity_check_source_mapping(source_definition: ETLComponent, source_name: str = None, source_excludes: frozenset = None, target_excludes: frozenset = None, ignore_source_not_in_target: bool = None, ignore_target_not_in_source: bool = None, raise_on_source_not_in_target: bool = None, raise_on_target_not_in_source: bool = None)
- set_active_worksheet_by_number(sheet_number: int)[source]
Change to an existing worksheet based on the sheet number.
- set_columns_and_widths(columns_dict: Dict[str, float])[source]
Set the column names and widths at the same time. See
set_widths()
.
- set_kwattrs(**kwargs)
- set_widths(column_widths: Iterable[float])[source]
Set the column widths in the xlsx for the currently active worksheet.
- property statistics
- property trace_data: bool
boolean Should a debug message be printed with the parsed contents (as columns) of each row.
- uncache_row(row)
- uncache_where(key_names, key_values_dict)
- where(criteria_list: list | None = None, criteria_dict: dict | None = None, order_by: list | None = None, column_list: List[Column | str] = None, exclude_cols: FrozenSet[Column | str] = None, use_cache_as_source: bool | None = None, progress_frequency: int | None = None, stats_id: str | None = None, parent_stats: Statistics | None = None) Iterable[Row]
- Parameters:
criteria_list¶ – Each string value will be passed to
sqlalchemy.sql.expression.Select.where()
. https://docs.sqlalchemy.org/en/14/core/selectable.html?highlight=where#sqlalchemy.sql.expression.Select.wherecriteria_dict¶ – Dict keys should be columns, values are set using = or in
order_by¶ – List of sort keys
column_list¶ – List of columns (str or Column)
exclude_cols¶ –
use_cache_as_source¶ –
progress_frequency¶ –
stats_id¶ –
parent_stats¶ –
- Return type:
rows
- property workbook