Author: Latheesh NK

Data Cleaning – A detailed view for Data Analysts

Definition of Data Cleaning

Data cleaning is the process of preparing data for analysis by removing or correcting data that is incorrect, duplicated, incomplete, irrelevant or improperly formatted. According to industry experts, most data analysts/scientists spend around 70% of their effort on data cleaning. Data can be considered good if it exhibits the following characteristics: Validity, Accuracy, Consistency, Completeness and Uniformity.

Data Cleaning Techniques

Remove bad characters

Removing unnecessary whitespace and bad characters is one of the most important activities in the data cleaning process.

References:
  • SQL Server: FN_RemoveBadCharacter
  • Excel: TRIM
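As a minimal sketch in the spirit of the FN_RemoveBadCharacter function referenced above (its actual definition is not shown here, so treat this as an assumption), the helper below keeps only letters, digits and spaces and trims leading/trailing whitespace:

CREATE OR ALTER FUNCTION dbo.FN_RemoveBadCharacter (@Input NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
BEGIN
    DECLARE @Clean NVARCHAR(4000) = N'';
    DECLARE @i INT = 1;
    WHILE @i <= LEN(@Input + N'x') - 1            -- LEN ignores trailing spaces, so pad before measuring
    BEGIN
        DECLARE @c NCHAR(1) = SUBSTRING(@Input, @i, 1);
        IF @c LIKE N'[A-Za-z0-9 ]' SET @Clean += @c;   -- keep letters, digits and spaces; drop everything else
        SET @i += 1;
    END;
    RETURN LTRIM(RTRIM(@Clean));                  -- trim leading/trailing whitespace
END;
GO

SELECT dbo.FN_RemoveBadCharacter(N'  Mr.# John Doe!! ') AS CleanedValue;  -- bad characters removed, ends trimmed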

Remove blank/null data

Blank data is a concern for all analysts, because decisions cannot be made when the right or required information is unavailable. We may need to make our system intelligent enough to handle null values.

References:
  • SQL Server: Default constraint in SQL Server
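As a minimal sketch (the table and column names are assumptions, not from the post), a DEFAULT constraint ensures that rows loaded without a value land as a known placeholder, and ISNULL can substitute a value at query time for nulls that already exist elsewhere:

CREATE TABLE dbo.Customer
(
    CustomerId   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    Country      NVARCHAR(50)  NOT NULL
        CONSTRAINT DF_Customer_Country DEFAULT (N'Unknown')   -- rows loaded without a country get 'Unknown'
);

-- For nulls already present in other sources, substitute a value at query time
SELECT CustomerName, ISNULL(Country, N'Unknown') AS Country
FROM dbo.Customer;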

Remove Duplicates

Duplicates are always a pain point for data analysts. They can lead to a wrong understanding or perception of the data and end in bad conclusions. A few common reasons for duplicate data are redundant processing after failures, absence of exception handling, multiple manual entries, etc. In a database management ecosystem, we can enforce a UNIQUE constraint to prevent duplicate entries in column(s). Alternatively, we can identify the duplicate values and delete the dirty data (this can be scheduled to run regularly).

References:
  • Python: Identify duplicates in Python
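A minimal T-SQL sketch of both approaches mentioned above (table and column names are assumptions): first clean up existing duplicates with ROW_NUMBER, then add a UNIQUE constraint so new duplicates are rejected:

-- Remove existing duplicates, keeping the earliest row per email
WITH Ranked AS
(
    SELECT CustomerId,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY CustomerId) AS rn
    FROM dbo.Customer
)
DELETE FROM Ranked
WHERE rn > 1;

-- Prevent new duplicates from being inserted
ALTER TABLE dbo.Customer
    ADD CONSTRAINT UQ_Customer_Email UNIQUE (Email);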

Typographical Errors

Typographical errors are another important source of dirty data. There are different types of typographical errors, and they can cause difficulty for many analysts and scientists.

Improper kerning/leading: The spacing between letters or words is always a concern and can lead to confusion and ambiguity. For example, without proper kerning the word “away” can become “a way”, which has a completely different meaning. Leading, on the other hand, is the space between two lines, which matters for readability; it rarely changes the meaning, but getting it right makes text easier to read.

Capitalization inconsistencies

These errors occur when a writer does not capitalize a word where required, or capitalizes one where it is not. For example, the name of a person, country, state, month, etc. needs to start with a capital letter. A few common rules are given in this post.
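As a hedged sketch (the table and column are assumptions), inconsistent casing can be normalized in T-SQL, for example forcing a single-word value into proper case:

-- Normalize e.g. 'iNDIA', 'india', 'INDIA' to 'India' (works for single-word values only)
SELECT Country,
       UPPER(LEFT(Country, 1)) + LOWER(SUBSTRING(Country, 2, LEN(Country))) AS CountryProperCase
FROM dbo.Customer;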

Data Type Inconsistencies

Data is stored according to the type of the field it is defined in. If the source and target data types differ, data can be lost or truncated, for example when a decimal value is loaded into an integer column. Many tools now support automatically identifying the data type from unstructured data.
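A small sketch, assuming a staging table dbo.StagingSales with a text column Amount (both names are assumptions): TRY_CAST returns NULL instead of raising an error when a value does not fit the target type, which makes inconsistent rows easy to spot:

SELECT Amount,
       TRY_CAST(Amount AS INT)            AS AsInteger,        -- NULL when the value is not a whole number
       TRY_CAST(Amount AS DECIMAL(10, 2)) AS AsDecimal
FROM dbo.StagingSales
WHERE TRY_CAST(Amount AS DECIMAL(10, 2)) IS NULL;              -- rows that cannot be converted at all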

Formatting Irregularities

Formatting issues in the source are always a problem for data engineers/analysts during data preparation, transformation, etc. This post explains some common issues with Excel formatting.

Advantages of Data Cleaning

At the end of the data cleaning process, the system benefits in the following ways:

  • Business-qualified and quantified data
  • Improved decision making
  • Boosted revenue
  • Reduced costs
  • Increased productivity

Disadvantages of Data Cleaning 

  • Data cleaning can sometimes eliminate outliers, which can cause some insights to be missed.
  • The process is very time consuming and expensive.
  • Automation of cleaning process needs to be drafted with utmost care and validations.

If you enjoyed this blog post, please share it with your friends!

UnPivot data in Excel using Power Query

It is a very frequent requirement to transform the data in your Excel sheet to meet a particular need or to make it easier to understand. Today, we are going to see how we can transform (unpivot) data using Power Query in Excel.

Steps to transform/unpivot data in Excel

Select the data range to be considered -> Click “From Table/Range” under the “Data” menu.

It will open a new Power Query window as below.

Right-click on the first column (“Name”) and select “Unpivot Other Columns”.

Rename the column names as required in the Query Editor

Finally, click “Close & Load”, which closes the query window and loads the result into a new table sheet.

If you enjoyed this blog post, please share it with your friends!

A glimpse of Dimensional Modelling

Why Data Warehouse?

A data warehouse is important in the modern technology world for several reasons:

  • To ensure the consistency and quality of data.
  • To ensure the governance of data in large enterprises.
  • To analyze data from different sources (homogeneous and heterogeneous).
  • To separate analytical data from the transactional model.
  • To improve the speed and efficiency of accessing data.
  • To support decision making.

Transactional Versus Reporting

Transactional Database Design | Reporting Database Design (DW)
Performance is mostly about data manipulation statements | Performance is about retrieving data
Locking/blocking must be minimal | Locking is not a concern
Small sets of data are retrieved | Large sets of data are retrieved
Data is normalized | Data is in denormalized form

Dimensional Modelling

“Dimensional Modelling is a design technique for databases intended to support end-user queries in a data warehouse.”

— Ralph Kimball

Dimensional modelling is designed to answer the questions the business would like to understand, rather than to serve the operational system. For example, the business might want to get the sales orders for last year, or the growth over a period of time. Dimensional modelling ensures an easy and efficient way of retrieving such data.

Dimensional Modelling primarily consists of two types of entities: Dimension and Fact.

Fact:

A measurable metric, typically used to calculate figures such as the number of sales, quantities or sales order amounts. A fact table contains the primary keys of the dimensions it is associated with; in other words, a fact contains facts (measures) and pointers to dimensions.

Dimension:

Descriptive information about a measurement in the Fact. There will normally be more Dimension tables than Fact tables. A Dimension can contain many dimension attributes.

How do we understand the Dimension & Fact

One of the most widely appreciated ways is based on the 7 Ws of data warehouse design, applied to the data story. The question words are:

How, What, When, Who, Where and Why – determine Dimensions.

How many/much – determines the Fact.

Star Schema:

A pictorial representation of a star schema is shown below. A few points to be noted (a minimal DDL sketch follows the list):

  • The Fact points to the Dimensions.
  • Dimensions are descriptive and in denormalized form; they do not point to other dimensions.
  • Dimension keys are normally surrogate keys, generated by the load process rather than by the source. This ensures that a change in the source does not impact the data warehouse.
  • The dimension attributes determine the granularity, called the grain, of the Fact.
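As a minimal sketch of these points (table and column names are illustrative assumptions, not taken from the post), a star schema in T-SQL could look like this:

CREATE TABLE dbo.DimDate
(
    DateKey      INT PRIMARY KEY,                   -- surrogate key generated by the load process
    CalendarDate DATE NOT NULL,
    CalendarYear INT  NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey      INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, independent of the source id
    SourceProductId INT NOT NULL,                   -- the id the source system uses
    ProductName     NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey       INT NOT NULL REFERENCES dbo.DimDate (DateKey),       -- fact points to dimensions
    ProductKey    INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount   DECIMAL(18, 2) NOT NULL,                             -- measurable facts
    OrderQuantity INT NOT NULL
);

Here, one FactSales row per date and product defines the grain of the fact.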

Snowflake Schema:

  • The Fact points to a Dimension.
  • A Dimension can point to another Dimension, as shown in the picture below (and in the sketch after this list).
  • This works well for transactional database design, but it is usually not the most efficient choice for a data warehouse (unless proven otherwise).
  • It can incur performance issues because several joins are required to retrieve the data the business asks for in reports.
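A hedged sketch of the second point (again with illustrative names, not from the post): in a snowflake, a dimension references another dimension, so reaching the category from the fact requires an extra join:

CREATE TABLE dbo.DimProductCategory
(
    ProductCategoryKey INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName       NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.DimProductSnowflaked
(
    ProductKey         INT IDENTITY(1,1) PRIMARY KEY,
    ProductName        NVARCHAR(100) NOT NULL,
    ProductCategoryKey INT NOT NULL
        REFERENCES dbo.DimProductCategory (ProductCategoryKey)   -- dimension pointing to another dimension
);

A report that groups sales by category now has to join the fact to DimProductSnowflaked and then to DimProductCategory, which is the join overhead noted above.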

Bus Matrix:

The bus matrix is a key artifact in the requirements phase of a data warehouse project. It captures the relationship between dimensions and subject areas in matrix form. The matrix is built from the various sources of truth, for example through user interviews and surveys. It gives a clear picture of the shared/common dimensions used by the various subject areas and their dependencies. It is important for any DW project to keep this artifact ready and up to date.

The picture below is a representation of a bus matrix for the AdventureWorks DW, prepared in Excel.

Changing the Dimension values & its significance

Occasionally, changes happen even to dimensions, though this is rare. So it is important to understand how the system should behave when a dimension value is modified; such dimensions are called slowly changing dimensions.

3 Different Types of SCD (Slowly Changing Dimensions)

  • Type 1

This is the simplest form of SCD and the easiest to implement. In principle, it just updates the current value with the new value. The disadvantage is that we lose the ability to see the previous (historical) data, because the update happens in place.

  • Type 2

This type tracks historical data as snapshots. For example, it can be implemented with versioning, using a start date and an end date, so that at any point in time reports can roll up the data based on the version. It is the best way to retain historical information, but it is more complex to implement.

  • Type 3

This type is easier than Type 2 while still keeping some history. For example, the old and new values can be kept side by side by adding an extra column for the changed dimension attribute. A T-SQL sketch of all three types follows.
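As a minimal sketch of the three types (table, column and key values are assumptions, not from the post), consider a customer whose city changes:

-- Type 1: overwrite in place; history is lost
UPDATE dbo.DimCustomer
SET    City = N'Berlin'
WHERE  CustomerKey = 42;

-- Type 2: expire the current row and insert a new versioned row
UPDATE dbo.DimCustomer
SET    EndDate = GETDATE(), IsCurrent = 0
WHERE  SourceCustomerId = 1001 AND IsCurrent = 1;

INSERT INTO dbo.DimCustomer (SourceCustomerId, City, StartDate, EndDate, IsCurrent)
VALUES (1001, N'Berlin', GETDATE(), NULL, 1);

-- Type 3: keep the previous value in an extra column on the same row
-- (the PreviousCity column is added once, e.g. ALTER TABLE dbo.DimCustomer ADD PreviousCity NVARCHAR(50);)
UPDATE dbo.DimCustomer
SET    PreviousCity = City, City = N'Berlin'
WHERE  CustomerKey = 42;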

If you enjoyed this blog post, feel free to share it with your friends!

Data Modelling & Key Concepts

Definition of Data Modelling

Data modelling is the process of identifying the business requirements of an organization and modelling the organization's data to meet them. Data modelling evaluates and measures the flow of data into and out of the database management system. It describes how data is treated and how the related objects connect to each other. Modelling the business data appropriately ultimately ensures a great customer experience.

Data modelling Types

There are 3 types of Data modelling:

  • Conceptual
  • Logical
  • Physical

Conceptual Data Modelling

Conceptual modelling is a method of describing information at a high level. It describes the semantics of an organization and its assertions, and usually hides the internal, low-level details of the data structures.

Logical Data Modelling

Logical modelling is a representation of the logical design of an information system: an abstract structure of the information domain that defines all the logical constraints applied to the stored data.

Physical Data Modelling

Physical data modelling is a representation of the implementation design of the business requirements. A complete physical schema includes the artifacts required to achieve performance goals or create relationships between data, such as indexes, linking tables, and constraint definitions. Analysts can use a physical schema to calculate storage estimates, which may include specific storage allocation details for an information system.

Put simply, an entity (in logical modelling) becomes a table (in physical modelling), and an attribute (in logical modelling) becomes a column (in physical modelling).
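For illustration only (the entity, attributes and index are assumptions, not from the post), a logical entity Customer with attributes Name and Email might be realized physically as:

CREATE TABLE dbo.Customer
(
    CustomerId   INT IDENTITY(1,1) CONSTRAINT PK_Customer PRIMARY KEY,  -- physical key definition
    CustomerName NVARCHAR(100) NOT NULL,                                -- attribute becomes a column
    Email        NVARCHAR(255) NULL
);

CREATE INDEX IX_Customer_Email ON dbo.Customer (Email);  -- physical artifact added for performance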

If you enjoyed this blog post, feel free to share it with your friends!

Exception: The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.

When we created a procedure and then tried to execute it, it threw the following error:

The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.

CREATE OR ALTER PROCEDURE GetAGNodes
WITH ENCRYPTION, EXECUTE AS OWNER
AS
    SELECT 1;   -- procedure body omitted in the original post; placeholder only

Although the user had been granted EXECUTE permission, the procedure uses EXECUTE AS OWNER, so the database also needs to be marked as trustworthy (or the module needs to be digitally signed, as the error message suggests). The script below resolves the issue.

ALTER DATABASE DBNAME SET TRUSTWORTHY ON
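To verify the setting before and after running the statement, you can check sys.databases (replace DBNAME with your database name):

SELECT name, is_trustworthy_on        -- is_trustworthy_on = 1 once TRUSTWORTHY is enabled
FROM sys.databases
WHERE name = N'DBNAME';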

Hope this helps! If there are any other related issues and resolutions, please feel free to share them as a comment.