The next post will look at the logical layer in more detail. We will also look at the sync process for descriptions in lookups and near-real-time aggregation.
Tim Seyfried, Principal Consultant, Data Analytics Practice
The leap from a traditional batch-loaded DW to a near-real-time (NRT) DW is smaller than the jump from transactional reporting to the data warehouse.
For Step 1, the structure you want at the end is conceptually the same as a traditional DW. This can be broken down into three discrete steps:
- Transactional to STG/ODS
- STG/ODS to physical reporting
- Logical layer between physical and end user
Looking specifically at the second step, STG/ODS to physical reporting:
As with the previous step, the capture and apply are controlled by the CDC tool (like GoldenGate), and the refresh and consumption are handled by the ETL tool (like Oracle Data Integrator). The big difference here is that the CDC process doesn't span two databases as it does from transactional to ODS/STG; it is entirely internal to the data warehouse. Since the CDC tool always has limitations on what transformations it can apply, I don't allow any transformations as part of CDC. All transformations done as part of data movement should be done in the ETL tool. This centralizes all transformations, allowing for easier maintenance and discovery going forward.
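As a sketch of this separation (all table, column, and bind-variable names here are illustrative, not from the original post): the CDC tool applies raw changes 1:1 into a staging table inside the warehouse, and the ETL tool then merges those rows into the reporting table, so every transformation lives in one place.

```sql
-- CDC apply (GoldenGate or similar): a straight 1:1 copy into STG_ORDERS,
-- with no transformations -- shown here only for context.

-- ETL consumption (e.g. ODI-generated SQL): all transformations live in
-- this one MERGE, making them easy to find and maintain.
MERGE INTO rpt_orders r
USING (
  SELECT o.order_id,
         TRIM(o.status_code)                    AS status_code,
         NVL(TRIM(o.customer_name), 'UNKNOWN')  AS customer_name
  FROM   stg_orders o
  WHERE  o.cdc_applied_ts > :last_run_ts        -- only rows applied since the last run
) s
ON (r.order_id = s.order_id)
WHEN MATCHED THEN UPDATE SET
  r.status_code   = s.status_code,
  r.customer_name = s.customer_name
WHEN NOT MATCHED THEN INSERT (order_id, status_code, customer_name)
  VALUES (s.order_id, s.status_code, s.customer_name);
```

If a transformation ever needs to change, only this ETL mapping is touched; the CDC configuration stays a dumb, reliable replicator.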
The refresh and CDC consumption jobs handled by the ETL tool are where you want to transform data, but not all transformations belong in this layer. The transformations to apply here are the ones that are unlikely to be changed or manipulated by future changes in business processes. Deciding where to do a transformation is not an exact science; the goal is to balance performance against future development needs and functionality.
The transforms that should be done as part of the ETL into Reporting are:
- Denormalization (Joins)
- *Lookups (Descriptions for Codes)
- Trim everything (makes for a more predictable result)
- NVL everything (pick an analytical specific default that is different from the transactional systems like [X])
*Note: A future post will discuss how to keep descriptions for codes in sync, along with strategies for real-time aggregation.
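The four reporting-layer transforms above can be combined in a single ETL query. A minimal sketch, with hypothetical staging tables and a placeholder analytical default:

```sql
SELECT o.order_id,
       TRIM(o.status_code)                    AS status_code,
       NVL(sc.status_desc, 'No Description')  AS status_desc,    -- lookup: code -> description
       TRIM(c.customer_name)                  AS customer_name,  -- denormalized via join
       NVL(o.discount_rate, 0)                AS discount_rate   -- NVL to an analytical default
FROM   stg_orders o
JOIN   stg_customers    c  ON c.customer_id = o.customer_id          -- denormalization (join)
LEFT JOIN stg_status_codes sc ON sc.status_code = TRIM(o.status_code); -- lookup table
```

Trimming before the lookup join makes the match predictable regardless of how the transactional system padded the code.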
The transformations that you should defer to the logical layer are ones that get manipulated often:
- Measure calculations
- Application of rates
Reporting Table Structures
The reporting layer is where we start moving away from the table structures of the source system by removing transactional constraints that are needed for controls but are not needed in a data warehouse. All text columns should be converted to [nvarchar(250)], and all numeric fields should be translated to a [number] with no precision. Removing these constraints makes it more straightforward to add analytical functionality, or data from another system, in the future.
Primary key constraints should be the only constraints carried forward into the reporting structures. The only reason to carry forward the primary key constraint and index is to ensure the correct insert and update logic is applied by the ETL tool. It is recommended, especially on an Exadata machine, that the data columns associated with a primary key also be repeated without the index and constraint. By doing this, you have the flexibility to have the index ignored by the reporting engine for performance purposes but not by the ETL engine. This is especially useful when one process needs an index but another does better without it.
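A minimal DDL sketch of this pattern, with illustrative table and column names (the duplicated-column naming is my own convention, not from the original post):

```sql
CREATE TABLE rpt_orders (
  order_id       NUMBER NOT NULL,   -- PK column: drives the ETL tool's insert/update logic
  order_id_rpt   NUMBER,            -- same value, no constraint or index, so the
                                    -- reporting engine can scan it without the index
  status_code    NVARCHAR2(250),    -- relaxed text width
  order_total    NUMBER,            -- number with no precision
  CONSTRAINT pk_rpt_orders PRIMARY KEY (order_id)
);
```

The ETL merge keys on `order_id` and benefits from the PK index, while reports that filter or join on the duplicated `order_id_rpt` column never touch it.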
Codes and Descriptions
In the reporting structure, wherever you have a code and a description, you are also going to create a concatenated field. Many front-end reporting tools can do this, but I have found that doing it at the database level helps indexing strategies and spares users from having to remember how best to apply a filter.
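A sketch of the concatenated field, with hypothetical column names and a separator of my own choosing:

```sql
ALTER TABLE rpt_orders ADD (
  status_code_desc NVARCHAR2(250)   -- populated during ETL, e.g.
                                    -- status_code || ' - ' || status_desc
);

-- Users then filter on one predictable, indexable column:
--   WHERE status_code_desc = 'SH - Shipped'
-- instead of deciding whether to filter on the code or the description.
```

Because the column is physical, it can carry its own index, which a concatenation done in the reporting tool never could.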