15 Useful extra columns for ETL jobs
Introduction

ETL (Extract, Transform, Load) jobs are critical components of data pipelines, allowing organizations to move data from source systems to target systems and transform it along the way. As you design and implement ETL jobs, it's important to consider the additional columns that can provide valuable information and context for downstream data consumers. In this blog post, we'll explore some of the extra columns that you might consider including in your ETL jobs.

List of extra columns

  1. ppn_tm: A terse name that usually expands to "process time" or "process start time", depending on the team's convention. Recording when a process started, or how long it took, helps with performance analysis and optimization (see the code sketch after this list).
  2. source_system: This column can be used to indicate the source system from which data was extracted. This can be useful for tracking data lineage and identifying any issues or errors that may have originated from a specific source.
  3. load_date: Adding a column for the date on which the data was loaded can help with tracking data freshness and identifying any data quality issues that may have occurred during the loading process.
  4. record_status: This column can be used to indicate the status of a record (e.g., new, updated, deleted). This can help with identifying changes to data over time and ensuring data consistency.
  5. error_message: Adding a column to capture any error messages or exceptions that occur during the ETL process can help with troubleshooting and identifying issues with data quality or processing.
  6. transformation_version: This column can be used to indicate the version of the ETL transformation or job that processed the data. This can help with tracking changes to the ETL process over time and identifying any issues that may be related to specific versions.
  7. record_type: This column can be used to indicate the type of record (e.g., customer, transaction, product). This can help with organizing data and identifying patterns or trends.
  8. source_file: Adding a column to capture the name or location of the source file can help with tracking data lineage and identifying any issues or errors that may have originated from a specific file.
  9. source_row: This column can be used to capture the row number or identifier of the source data. This can be useful for identifying and correcting data quality issues.
  10. source_timestamp: Adding a column for the timestamp of the source data can help with tracking data freshness and identifying any issues that may be related to specific time periods.
  11. target_timestamp: This column can be used to capture the timestamp at which the data was loaded into the target system. This can help with tracking data lineage and ensuring data consistency.
  12. data_hash: Adding a column to capture a hash value of the data can help with detecting any changes or inconsistencies in the data.
  13. derived_column: This column can be used to capture any new or derived data that is generated during the ETL process. This can help with organizing and analyzing data.
  14. batch_id: Adding a column for the ID of the batch or job that processed the data can help with tracking data lineage and identifying any issues that may be related to specific batches.
  15. processing_time: This column can be used to capture the amount of time it took to process the data. This can help with performance analysis and optimization.
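To tie these together, here is a minimal Python sketch of how a row-based ETL step might populate many of these columns in one pass. The transform and enrich functions, the batch constants, and the validation rule are all hypothetical illustrations, not the API of any particular ETL framework.

```python
import hashlib
import json
import time
from datetime import datetime, timezone

BATCH_ID = "1234"                 # hypothetical batch identifier
TRANSFORMATION_VERSION = "1.2"    # hypothetical job version

def transform(row: dict) -> dict:
    """Hypothetical business transformation; rejects obviously bad input."""
    if row.get("age") is not None and row["age"] < 0:
        raise ValueError("Invalid data")
    return {**row, "derived_column": "Calculated"}

def enrich(row: dict, source_system: str, source_file: str, source_row: int) -> dict:
    """Run the transformation and attach the audit columns described above."""
    started = time.perf_counter()
    now = datetime.now(timezone.utc)
    out = {
        "source_system": source_system,
        "source_file": source_file,
        "source_row": source_row,
        "batch_id": BATCH_ID,
        "transformation_version": TRANSFORMATION_VERSION,
        "load_date": now.date().isoformat(),
        "target_timestamp": now.isoformat(timespec="seconds"),
        # Hash only the raw business fields so audit columns never affect it.
        "data_hash": hashlib.sha256(
            json.dumps(row, sort_keys=True, default=str).encode()
        ).hexdigest(),
    }
    try:
        out.update(transform(row))
        out["record_status"] = "OK"
        out["error_message"] = None
    except ValueError as exc:
        out["record_status"] = "ERROR"
        out["error_message"] = str(exc)
    out["processing_time"] = time.perf_counter() - started
    return out

# Example: a row that fails validation is flagged rather than silently dropped.
bad_row = {"id": 4, "name": "Alice", "age": -1, "gender": "Female"}
print(enrich(bad_row, "SourceC", "customers.csv", 115)["record_status"])  # ERROR
```

A design note on the error handling: capturing the failure in record_status and error_message keeps the bad row visible to downstream consumers instead of losing it, which is what makes the troubleshooting workflow described in items 4 and 5 possible.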

Extra columns with transposed sample data

| Column | Row 1 | Row 2 | Row 3 | Row 4 | Row 5 |
| --- | --- | --- | --- | --- | --- |
| id | 1 | 2 | 3 | 4 | 5 |
| name | John | Jane | Bob | Alice | David |
| age | 25 | 30 | 40 | 35 | 28 |
| gender | Male | Female | Male | Female | Male |
| ppn_tm | 20220307120134 | 20220307120140 | 20220307120200 | 20220307120310 | 20220307120430 |
| source_system | SourceA | SourceB | SourceA | SourceC | SourceB |
| load_date | 2022-03-07 | 2022-03-07 | 2022-03-07 | 2022-03-07 | 2022-03-07 |
| record_status | OK | OK | OK | ERROR | OK |
| error_message | NULL | NULL | NULL | Invalid data | NULL |
| transformation_version | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| record_type | New | New | New | New | New |
| source_row | 112 | 113 | 114 | 115 | 116 |
| source_timestamp | 2022-03-07 12:01:34 | 2022-03-07 12:01:40 | 2022-03-07 12:02:00 | 2022-03-07 12:03:10 | 2022-03-07 12:04:30 |
| target_timestamp | 2022-03-07 12:01:35 | 2022-03-07 12:01:41 | 2022-03-07 12:02:01 | 2022-03-07 12:03:11 | 2022-03-07 12:04:31 |
| data_hash | asj329dke | skw339qlf | def423qwe | ghi293plk | lmn923fgh |
| derived_column | Calculated | Calculated | Calculated | Calculated | Calculated |
| batch_id | 1234 | 1234 | 1234 | 1234 | 1234 |
| processing_time | 32.56 | 45.23 | 12.10 | 87.45 | 54.32 |
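To make the data_hash column concrete, here is a small sketch of how a later run could use it to detect rows that changed since the load shown above. Only the business fields are hashed, so the audit columns never perturb the comparison; the incoming values below are invented for the demonstration and do not come from the sample table.

```python
import hashlib
import json

def row_hash(row: dict, business_keys: list[str]) -> str:
    """Hash only the business fields so audit columns never change the hash."""
    payload = {k: row.get(k) for k in business_keys}
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode()
    ).hexdigest()

keys = ["id", "name", "age", "gender"]
previous = {"id": 4, "name": "Alice", "age": 35, "gender": "Female"}  # prior load
incoming = {"id": 4, "name": "Alice", "age": 36, "gender": "Female"}  # hypothetical new extract

if row_hash(incoming, keys) != row_hash(previous, keys):
    print("id 4 changed since the last load")  # e.g. set record_status to "Updated"
```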

Conclusion

Adding extra columns to your ETL jobs can provide valuable context and information for downstream data consumers, allowing them to better understand the source and quality of the data. By including these extra columns, you can improve the quality and reliability of your data pipelines and make it easier for downstream consumers to extract value from your data.