15 Useful extra columns for ETL jobs
Author: Wuttichai Kaewlomsap (@wuttichaihung)
Introduction
ETL (Extract, Transform, Load) jobs are critical components of data pipelines, allowing organizations to move data from source systems to target systems and transform it along the way. As you design and implement ETL jobs, it's important to consider the additional columns that can provide valuable information and context for downstream data consumers. In this blog post, we'll explore some of the extra columns that you might consider including in your ETL jobs.
List of extra columns
ppn_tm
: This could refer to "process start time" or "processing time", depending on the context. Recording the time at which a process started, or how long it took to process the data, can help with performance analysis and optimization.

source_system
: This column indicates the source system from which the data was extracted. It is useful for tracking data lineage and identifying issues or errors that originated from a specific source.

load_date
: The date on which the data was loaded. This helps with tracking data freshness and identifying data quality issues that occurred during the loading process.

record_status
: The status of a record, such as new, updated, or deleted, or a processing outcome such as OK or ERROR (as in the sample data below). This helps with identifying changes to data over time and ensuring data consistency.

error_message
: Captures any error messages or exceptions raised during the ETL process, which helps with troubleshooting and identifying issues with data quality or processing.

transformation_version
: The version of the ETL transformation or job that processed the data. This helps with tracking changes to the ETL process over time and tying issues to specific versions.

record_type
: The type of record (e.g., customer, transaction, product). This helps with organizing data and identifying patterns or trends.

source_file
: The name or location of the source file. This helps with tracking data lineage and identifying issues or errors that originated from a specific file.

source_row
: The row number or identifier of the record in the source data. This is useful for locating and correcting data quality issues at their origin.

source_timestamp
: The timestamp of the source data. This helps with tracking data freshness and identifying issues tied to specific time periods.

target_timestamp
: The timestamp at which the data was loaded into the target system. This helps with tracking data lineage and ensuring data consistency.

data_hash
: A hash value computed over the data, which helps with detecting changes or inconsistencies between loads.

derived_column
: Any new or derived data generated during the ETL process. This helps with organizing and analyzing data.

batch_id
: The ID of the batch or job that processed the data. This helps with tracking data lineage and tying issues to specific batches.

processing_time
: The amount of time it took to process the data, which helps with performance analysis and optimization. The code sketch after this list shows how several of these columns might be populated.
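To make this concrete, here is a minimal PySpark sketch of attaching several of these columns during the transform step. The file path, the SourceA label, the batch ID, and the business columns (id, name, age, gender, matching the sample table below) are illustrative assumptions, not a fixed recipe:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-audit-columns").getOrCreate()

# Hypothetical extract: a CSV file landed from "SourceA".
df = spark.read.csv("/landing/source_a/customers.csv", header=True)

df = (
    df
    # ppn_tm in yyyyMMddHHmmss form, matching the sample data below
    .withColumn("ppn_tm", F.date_format(F.current_timestamp(), "yyyyMMddHHmmss"))
    .withColumn("source_system", F.lit("SourceA"))
    .withColumn("load_date", F.current_date())
    # Spark can record which input file each row was read from
    .withColumn("source_file", F.input_file_name())
    .withColumn("transformation_version", F.lit("1.2"))
    .withColumn("batch_id", F.lit("1234"))
    # Hash only the business columns, so reloading identical rows
    # produces identical hashes and changed rows stand out
    .withColumn(
        "data_hash",
        F.sha2(F.concat_ws("||", "id", "name", "age", "gender"), 256),
    )
)
```

Note that the hash deliberately excludes the audit columns themselves: if ppn_tm or load_date were included, every run would produce a new hash and change detection would be useless.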
Extra columns with transposed sample data
| Column | Row 1 | Row 2 | Row 3 | Row 4 | Row 5 |
| --- | --- | --- | --- | --- | --- |
| id | 1 | 2 | 3 | 4 | 5 |
| name | John | Jane | Bob | Alice | David |
| age | 25 | 30 | 40 | 35 | 28 |
| gender | Male | Female | Male | Female | Male |
| ppn_tm | 20220307120134 | 20220307120140 | 20220307120200 | 20220307120310 | 20220307120430 |
| source_system | SourceA | SourceB | SourceA | SourceC | SourceB |
| load_date | 2022-03-07 | 2022-03-07 | 2022-03-07 | 2022-03-07 | 2022-03-07 |
| record_status | OK | OK | OK | ERROR | OK |
| error_message | NULL | NULL | NULL | Invalid data | NULL |
| transformation_version | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| record_type | New | New | New | New | New |
| source_row | 112 | 113 | 114 | 115 | 116 |
| source_timestamp | 2022-03-07 12:01:34 | 2022-03-07 12:01:40 | 2022-03-07 12:02:00 | 2022-03-07 12:03:10 | 2022-03-07 12:04:30 |
| target_timestamp | 2022-03-07 12:01:35 | 2022-03-07 12:01:41 | 2022-03-07 12:02:01 | 2022-03-07 12:03:11 | 2022-03-07 12:04:31 |
| data_hash | asj329dke | skw339qlf | def423qwe | ghi293plk | lmn923fgh |
| derived_column | Calculated | Calculated | Calculated | Calculated | Calculated |
| batch_id | 1234 | 1234 | 1234 | 1234 | 1234 |
| processing_time | 32.56 | 45.23 | 12.10 | 87.45 | 54.32 |
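Once these columns are in place, downstream checks become simple queries. As a rough sketch, reusing the hypothetical df and imports from the snippet above (and assuming the table carries the columns shown in the sample data):

```python
# Surface failed records with enough context to trace them to the source.
errors = df.filter(F.col("record_status") == "ERROR").select(
    "batch_id", "source_system", "source_row", "error_message"
)
errors.show()

# Approximate per-record load latency: casting a timestamp to long
# yields epoch seconds in Spark, so the difference is in seconds.
latency = df.withColumn(
    "load_latency_sec",
    F.col("target_timestamp").cast("timestamp").cast("long")
    - F.col("source_timestamp").cast("timestamp").cast("long"),
)
latency.select("id", "load_latency_sec").show()
```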
Conclusion
Adding extra columns to your ETL jobs can provide valuable context and information for downstream data consumers, allowing them to better understand the source and quality of the data. By including these columns, you can improve the quality and reliability of your data pipelines and make it easier for downstream consumers to extract value from your data.