In statistics, exploratory Data Analysis is a way to studying large data sets to summarize them, often in visual forms. A mathematical model is usually used, and generally EDA is used for seeing just what the data is able to tell us about the sample. It is important to be aware of the differences between descriptive and predictive EDA.
Descriptive data may include observations, like numbers of cows at a dairy farm, but they don’t indicate whether or not a cow will be fertile at any given time. If this were used as a predictive measure, then a prediction could be made by using the expected number of cows in the field at any given time. This is not a very precise way of estimating whether or not a cow will be fertile. A prediction would also depend on the characteristics of the cows, such as the cows’ health and feeding habits.
Predictive data, however, does allow an exact measurement of the probability of a cow being fertile at a given time. It is based on the known characteristics of the cow, including its health and feeding habits, and how it has behaved over time. This type of data, when combined with the previously mentioned descriptive data, provides a much more accurate measurement of the probability of a specific result.
One example of a predictive model for EDA can be a mathematical model that is used to evaluate the relationship between a person’s height and weight, or between a person’s age and his or her height. Using the model, we can determine whether a person is underweight or overweight based on the characteristics of his or her body. The model also allows us to measure a person’s growth rates over the years and see whether a person’s growth rate is changing over the years. As these data are collected and analyzed over time, the model helps us to determine whether the pattern of changes in a person’s growth rate is likely to continue.
Another example of a model is one that shows the effect of a change in a person’s height on his or her BMI (Body Mass Index). The model allows us to make predictions about how the person’s BMI is likely to change as he or she grows taller. If the predictions prove correct, then a person should have a smaller amount of excess body fat or be able to lose weight as his or her height increases. If the predictions prove incorrect, then the model may provide evidence that the person may need to start or stop exercise or weight loss programs, or perhaps even change diet.
A third example of a predictive model can be a model that looks at the relationship between a person’s smoking habit and his or her heart rate. The model looks at the rate at which the heart beats as the person is smoking, and then calculates how much of that rate will be contributed to smoking. Again, if the model indicates that the smoker will be healthier than someone who doesn’t smoke, then the smoker will benefit from quitting smoking.
As with any model, the model should be tested carefully. However, there is no scientific test or set of rules as to which model will work for each particular case. This is where EDA becomes valuable.
All models used in Data Analysis have been tested over time to show a pattern of consistency in results. The model is considered to be more reliable if it explains the sample better than other models, but not necessarily to the extent that a scientific test would. A model that provide the best general description of the data, and can be tested by many different types of data, may be called the best of all models.