Source / Target Components

Within a task you will use source / target components to extract and load the data.

CSVReader

Documentation Link: CSVReader

Can read any delimited file (see ‘’delimiter’’ parameter) It is based on the Python csv module. See https://docs.python.org/3.5/library/csv.html

Usable as Source: Yes

Usable as Target: No

CSVWriter

Documentation Link: CSVWriter

Can write any delimited file (see ‘’delimiter’’ parameter) It is based on the Python csv module. See https://docs.python.org/3.5/library/csv.html

Usable as Source: Yes, but does not support updates so use CSVReader instead

Usable as Target: Yes

XLSXReader

Documentation Link: XLSXReader

Reads from Excel files; although only those in xlsx format.

Usable as Source: Yes Usable as Target: No

XLSXWriter

Documentation Link: XLSXWriter

Writes to Excel xlsx files (can also read/update files).

Usable as Source: Yes, including for updates

Usable as Target: Yes

SQLQuery

Documentation Link: SQLQuery

Reads from the result of a SQL query.

Usable as Source: Yes

Usable as Target: No

ReadOnlyTable

Useful when reading all columns from a database table or view. Rows can be filtered using the where method.

Usable as Source: Yes

Usable as Target: No

Table

Documentation Link: Table

Inherits from ReadOnlyTable. Added features:
  • lookups, optional data cache

  • insert, update, delete and upsert

  • delete_not_in_set, and delete_not_processed

  • logically_delete_not_in_set, and not_processed

  • update_not_in_set, update_not_processed

Usable as Source: Yes

Usable as Target: Yes

HistoryTable

Documentation Link: HistoryTable

Inherits from Table. Adds the ability to correctly load versioned tables. Supports both type 2 dimensions and date versioned warehouse tables. Also has cleanup_versions method to remove version rows that are not needed (due to being redundant).

Usable as Source: Yes

Usable as Target: Yes

HistoryTableSourceBased

Documentation Link: HistoryTableSourceBased

Inherits from HistoryTable. Changes the versioning processing so that the source can restate the version history as needed. Versions are not removed from the target, but rather the values are changed to match the active source version at that time. This prevents “breaking” any fact tables that refer to that version.

Usable as Source: Yes

Usable as Target: Yes

PyArrowDatasetReader

Documentation Link: PyArrowDatasetReader

PyArrowDatasetReader will read rows using pyarrow.dataset functionality but presented using the common bi_etl interface including Row objects.

Usable as Source: Yes

Usable as Target: No

W3CReader

Documentation Link: W3CReader

Reads W3C based log files (web server logs).

Usable as Source: Yes

Usable as Target: No

DataAnalyzer

Documentation Link: DataAnalyzer

Produces a summary of the columns in the data rows passed to the analyze_row() method. The output currently goes to the task log.

Usable as Source: No

Usable as Target: Yes

Functionality common to all sources

All source components share the following common functionality.

The source can output progress messages to the task log every X seconds. This defaults to every 10 seconds with the message format being "{logical_name} current row # {row_number:,}". See parameters progress_frequency, and progress_message.

They can limit the number of rows to process. See parameter max_rows (Default None)

They can print a debug trace of all rows processed. See class property trace_data (default False).

They can print a debug trace of the first row processed. See parameter and class property log_first_row (default False).

They track statistics on how long it took to retrieve the first row and all rows. The read timer is starts and stops as rows are passed onto other code, so it should represent just the read elapsed time.