Getting to know the data
The data quality reports for an analytics base table give an in-depth picture of the data. We should study the data we will work with in detail to get to know it. We must examine the central tendency and variation of each feature to understand the type of values that it can take. For categorical features, we first examine the mode, 2nd mode, mode %, and 2nd mode % in the data quality report. These show the most common levels within each feature and reveal whether any level dominates the dataset. The bar plot plays a vital role here: it gives a quick overview of all the levels in the domain of each categorical feature and the frequency of these levels.
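As a minimal sketch of the categorical checks described above, the following Python computes the mode, 2nd mode, and their percentages, then prints a rough text bar plot. The function name `categorical_summary` and the `plan_type` values are hypothetical, chosen only for illustration, and missing values are assumed to be represented as `None`.

```python
from collections import Counter

def categorical_summary(values):
    """Return the mode, 2nd mode, and their percentages for a categorical feature."""
    present = [v for v in values if v is not None]  # assumption: None marks a missing value
    total = len(present)
    summary = {}
    # most_common(2) returns up to two (level, count) pairs, most frequent first
    for label, (level, count) in zip(("mode", "second_mode"), Counter(present).most_common(2)):
        summary[label] = level
        summary[label + "_pct"] = 100.0 * count / total
    return summary

# Hypothetical feature values, for illustration only
plan_type = ["basic", "basic", "premium", "basic", "free", "premium", "basic", "basic"]

summary = categorical_summary(plan_type)
print(summary)

# A rough text bar plot: one row per level, one '#' per occurrence
for level, count in Counter(plan_type).most_common():
    print(f"{level:<8} {'#' * count} ({count})")
```

A mode % far above the 2nd mode % is the signal that one level dominates the feature.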
For continuous features, we must first examine the mean and standard deviation of each feature to get a sense of the central tendency and variation of its values within the dataset. To understand the range that is possible for each feature, we should examine the minimum and maximum values. The histogram for each continuous feature in the data quality report makes it easy to see how the values of that feature are distributed across the range they can take. There are a number of common, well-understood histogram shapes that we should look out for. These shapes relate to well-known standard probability distributions. Recognizing that the distribution of the values of a feature in an analytics base table closely matches one of these standard distributions can help us when building machine learning models. At this stage, we don't need to go any further than recognizing that a feature seems to follow a particular distribution, and this can be done by examining the histogram for each feature during data exploration.
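The continuous checks described above can be sketched with the Python standard library alone. The helper names `continuous_summary` and `histogram_counts`, the `ages` values, and the choice of five equal-width bins are all assumptions made for illustration; a real data quality report would normally be produced with a plotting library.

```python
import statistics

def continuous_summary(values):
    """Minimum, maximum, mean, and sample standard deviation of a continuous feature."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1 denominator)
    }

def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are identical
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum value goes in the last bin
        counts[idx] += 1
    return counts

# Hypothetical feature values, for illustration only
ages = [22, 25, 27, 30, 31, 33, 34, 35, 36, 38, 41, 44, 52]

summary = continuous_summary(ages)
counts = histogram_counts(ages, bins=5)
print(summary)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

The printed rows give a crude sideways histogram, which is enough to judge the overall shape during exploration.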
The following figures show a selection of histogram shapes that exhibit characteristics commonly seen when analyzing features and that are indicative of standard, well-known probability distributions.
The figure above shows a histogram exhibiting a uniform distribution. A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present. Sometimes a uniform distribution is indicative of a descriptive feature that contains an ID rather than a measure of something more interesting.
The figure above shows a shape indicative of a normal distribution. Features following a normal distribution are characterized by a strong tendency toward a central value and symmetric variation to either side of this central tendency. Histograms that follow a normal distribution can also be described as unimodal, because they have a single peak around the central tendency. Finding features that exhibit a normal distribution is a good thing, because many modeling and data preparation techniques work particularly well with normally distributed data.
The figures above show unimodal histograms that exhibit skew. Skew is simply a tendency toward very high (right skew) or very low (left skew) values. Skewed distributions are often said to have long tails toward these very high or very low values.
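Skew can also be checked numerically rather than only by eye. The sketch below uses one common measure, the third standardized moment (population form), which is positive for right skew, negative for left skew, and close to zero for a symmetric feature. The function name `skewness` and the sample values are hypothetical, and the text itself does not prescribe any particular skew statistic.

```python
import statistics

def skewness(values):
    """Third standardized moment: mean cubed deviation divided by the cubed standard deviation.

    Assumes the values are not all identical (otherwise the standard deviation is zero).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

# Hypothetical samples, for illustration only
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one very high value forms a long right tail
left_skewed = [-v for v in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```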
The figure above shows a feature following an exponential distribution, in which the likelihood of low values occurring is very high but diminishes rapidly for higher values. Features such as the amount of money a customer spends on one trip to the supermarket, or the number of miles a particular car can run before its battery wears out, tend to follow an exponential distribution. Recognizing that a feature follows an exponential distribution is another clear warning sign that outliers are likely. As the figure shows, exponential distributions have a long tail, so very high values are not uncommon.
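To make the long tail concrete, the following sketch draws simulated values from an exponential distribution using Python's `random.expovariate`. The mean of 50, the sample size, and the variable name `spend` are assumptions made purely for illustration and are not real data.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed mean of 50 per trip (hypothetical); expovariate takes the rate, which is 1 / mean
spend = [rng.expovariate(1 / 50) for _ in range(10_000)]

mean = statistics.fmean(spend)
median = statistics.median(spend)
share_below_mean = sum(v < mean for v in spend) / len(spend)

print(f"mean={mean:.1f} median={median:.1f} max={max(spend):.1f}")
print(f"share of values below the mean: {share_below_mean:.2f}")
```

In a sample like this, most values sit below the mean while the maximum lies many times above it, which is why such features tend to produce outliers.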
The figure above shows a multimodal distribution. A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated. The figure shows a bimodal distribution with two clear peaks, which can be thought of as two normal distributions pushed together. Multimodal distributions tend to occur when a feature contains a measurement made across a number of distinct groups.