
Data Profiling – Cross-Database Validation

With a collection of quick and easy checks, data profiling gives you a better understanding of your data. You can find problems before you commit to a data project, problems that would cost far more to correct later in the project life cycle.

In this article we focus on one of the more advanced aspects of data profiling: cross-database validation. Unfortunately, many tools do not support cross-database analysis, so you will often need to load all relevant sources into the same database or repository to perform these checks.

But even given this extra step, cross-database validation is a worthwhile exercise and will pay for itself handsomely in any data initiative:

* Data integration projects, by their very nature, will require the analysis and comparison of multiple data sources.

* In any data migration project, you’ll want to validate both the source and the loaded data sets.

* Even with a “single” database project, you’ll find that there are usually multiple authoritative data sources scattered throughout the company (often in the form of Excel spreadsheets and personal data sets) that need to be checked against the destination database.

To deal with all of this, you’ll want to perform a series of cross-database checks. In effect, you will be profiling data from various sources and comparing their resulting profiles. Specifically, you should consider:

* Compare the codes used in the different systems. If they are not identical, is there a proper mapping between the codes?

* If a field has too many distinct values to compare one by one, such as Social Security Numbers, compare the patterns/formats of the values instead.

* If entities are expected in more than one system, check the keys in both systems for duplicate or missing entries. And of course, if you expect data to be unique across systems, you still need to search for and investigate any duplicates.
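The three checks above can be sketched in a few lines of Python. This is only an illustration, assuming the relevant columns have already been extracted from each source into plain lists. The function names (`compare_code_sets`, `pattern_profile`, `compare_keys`) and the sample values are hypothetical and are not taken from any particular profiling tool.

```python
import re
from collections import Counter


def compare_code_sets(source_codes, target_codes, mapping=None):
    """Compare two code lists, applying an optional source-to-target mapping first."""
    mapping = mapping or {}
    # Codes without a mapping entry are carried over unchanged.
    mapped = {mapping.get(code, code) for code in source_codes}
    target = set(target_codes)
    return {
        "unmapped_in_source": sorted(mapped - target),
        "unused_in_target": sorted(target - mapped),
    }


def value_pattern(value):
    """Reduce a value to its shape: every digit becomes 9, every ASCII letter becomes A."""
    shape = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", shape)


def pattern_profile(values):
    """Count how often each shape occurs in a column."""
    return Counter(value_pattern(v) for v in values)


def compare_keys(source_keys, target_keys):
    """Report keys missing from either side and keys duplicated within a side."""
    source_counts = Counter(source_keys)
    target_counts = Counter(target_keys)
    return {
        "missing_in_target": sorted(set(source_counts) - set(target_counts)),
        "missing_in_source": sorted(set(target_counts) - set(source_counts)),
        "duplicates_in_source": sorted(k for k, n in source_counts.items() if n > 1),
        "duplicates_in_target": sorted(k for k, n in target_counts.items() if n > 1),
    }


# Hypothetical sample data standing in for columns pulled from two systems.
status_map = {"A": "ACTIVE", "I": "INACTIVE"}
print(compare_code_sets(["A", "I", "P"], ["ACTIVE", "INACTIVE"], status_map))
print(pattern_profile(["123-45-6789", "123456789"]))
print(compare_keys([1, 2, 2, 3], [2, 3, 4]))
```

Comparing the two pattern profiles side by side (for example, one system storing `999-99-9999` and the other `999999999`) shows format mismatches without having to match individual values. In practice the same checks could be written as SQL once all sources sit in one database; plain Python is used here only to keep the sketch self-contained.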

Cross-database validation is not trivial, but it’s not that difficult either. The checks are easy to understand and communicate, and any issues found are usually significant. Therefore, it is something you should always do as part of any data profiling exercise.
