Analytics systems are in huge demand among organizations that deal with massive data on a daily basis. Among the many requirements such organizations have for analytics extraction, one in great demand is extracting data from the same source but with some additional information (columns), or from a completely new source (table). Here’s a case study of the process of building an analytics system for one of our clients who wanted to support analytics extraction based on the above requirement.
The data needs to be extracted from 30 different tables, with each table handled by its own package. This is not a flexible system, because for every new source table we will either have to:
- create a new package or
- manually open a package and modify it if a new column is required from an existing source table
Things could get worse if the same column were added to all 30 tables. We would have to open each of the 30 packages individually, manually change the metadata to add the extra column, and then deploy every one of them.
A more generic extraction strategy was needed, one that not only worked in the same manner as the existing one but also easily supported different versions.
Alternate Extraction Strategy
After looking into the extraction packages, we developed another way of managing them by controlling their development, deployment and execution through code. This strategy proved to be ideal because:
- Maintaining one package is easier. Since the packages share the same signature (settings, tasks), generic code can replace them. For future changes, a single code change covers all the tables, instead of changing every package and testing each one individually.
- Code debugging is much easier than SSIS debugging.
- Settings, when moved to the database, allow individual changes (order, timestamp, etc.) to be made in the database itself. Changing and deploying a new package is not necessary.
- Conversion errors can be reduced considerably. Static SSIS doesn’t support automatic conversions such as Unicode to non-Unicode values. However, when we generate SSIS from code, new versions of conversion routines, error handling, etc. can be added.
- Better logging can be implemented using code instead of SSIS’s own logging.
- Different extraction modes can be implemented per table and easily visualized:
- Full refresh on master tables
- Time stamp based extraction on frequently updated tables
- Transaction log based extraction on tables with bulk commits and transactions.
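As a conceptual sketch of these three modes (in Python rather than the actual SSIS/.NET code; the table, column and mode names here are hypothetical examples, not the client's schema), the mode stored against each table in the control metadata could drive which extraction query is built:

```python
# Illustrative sketch: pick an extraction query per table based on the
# "mode" recorded in the control metadata. Column/table names are assumed.

def build_extraction_sql(row):
    """Return the SELECT used to pull rows for one control-table entry."""
    if row["mode"] == "full_refresh":        # master tables
        return f"SELECT * FROM {row['source_table']}"
    if row["mode"] == "timestamp":           # frequently updated tables
        return (f"SELECT * FROM {row['source_table']} "
                f"WHERE last_modified > '{row['last_update']}'")
    if row["mode"] == "transaction_log":     # bulk commits and transactions
        return (f"SELECT * FROM changes_{row['source_table']} "
                f"WHERE lsn > '{row['last_lsn']}'")
    raise ValueError(f"unknown extraction mode: {row['mode']}")

print(build_extraction_sql({"mode": "full_refresh",
                            "source_table": "dim_customer"}))
```

Because the mode lives in metadata, switching a table from full refresh to timestamp-based extraction is a data change, not a package change.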
Comparison with the Existing Extraction Strategy
- No manual package metadata changes are required for extracting extra information from the existing sources. Any additional columns can just be added in the controlling table.
- No new package is needed for any new source of extraction. Just a new entry needs to be made in the controlling table. Each row in the table will be equivalent to a package.
- Writing code to generate dynamic SSIS may seem cumbersome. However, once this code is written, no version-specific changes are required in the future. Any change that applies to all packages can be made in a single place.
- Upgrading and downgrading is a matter of just one click once the new system is in place.
Below is the step-by-step process of implementing the Alternate Extraction Strategy:
Step 1: A generic SSIS package was generated which could be run for all extractions.
Step 2: A database table contains all package settings, such as source table, destination table, source SQL procedure and last update date. The package automatically reads these settings.
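A minimal sketch of this settings ("controlling") table, using an in-memory SQLite database in place of the real server database (the column names here are illustrative assumptions based on the settings listed above):

```python
# Sketch of the controlling table: each row is equivalent to one package.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE package_settings (
        source_table      TEXT,
        destination_table TEXT,
        source_procedure  TEXT,
        last_update       TEXT
    )
""")
conn.execute("INSERT INTO package_settings VALUES (?, ?, ?, ?)",
             ("src.Orders", "stg.Orders", "dbo.usp_GetOrders", "2015-01-01"))

# The generic package reads these rows instead of carrying hard-coded metadata.
for source, dest, proc, last_update in conn.execute(
        "SELECT * FROM package_settings"):
    print(f"extract {source} -> {dest} via {proc} (since {last_update})")
```

Adding a new extraction source is then just another `INSERT` into this table.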
Step 3: The package can be run in two ways for each row:
- For each row in the settings table, a package can be created, loaded and executed immediately, or
- All packages can be created first by one process, and then another process in the database or on the server loads and executes them one by one.
The generic package is created programmatically, so the code will:
- Create a control and data flow task (same as drag and drop, but by code)
- Create individual tasks and place them in the control and data flows, and
- Configure mappings and settings within the data flow task.
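The three steps above can be modeled conceptually as follows. This is not the actual implementation (which would use the SSIS object model from .NET); it is a Python sketch showing the shape of a programmatically built package, with assumed setting and mapping names:

```python
# Conceptual model of programmatic package construction, mirroring the
# three steps above: create flows, add tasks, then configure mappings.

def build_package(setting, column_map):
    package = {"tasks": [], "mappings": {}}
    # 1. Create control and data flow tasks (code equivalent of drag and drop).
    package["tasks"].append({"type": "control_flow"})
    data_flow = {"type": "data_flow", "components": []}
    package["tasks"].append(data_flow)
    # 2. Create individual components and place them in the data flow.
    data_flow["components"].append(
        {"role": "source", "table": setting["source_table"]})
    data_flow["components"].append(
        {"role": "destination", "table": setting["destination_table"]})
    # 3. Configure column mappings and settings within the data flow task.
    package["mappings"] = dict(column_map)
    return package

pkg = build_package(
    {"source_table": "src.Orders", "destination_table": "stg.Orders"},
    {"OrderID": "order_id", "Amount": "amount"})
print(len(pkg["tasks"]), len(pkg["mappings"]))
```

Because the package is assembled from the control-table row, the same builder serves every source table; only the metadata passed in differs.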