Data analytics is more than analyzing data. It also means using the scientific process to answer questions and make important decisions. The process includes designing studies, gathering useful information, describing the data with figures and charts, exploring the data, and drawing conclusions. We will now examine each step in this process and discuss the critical role of analytics.
Planning a Study
Once the research question is established, it is time to design a study to answer that specific question. This requires deciding which methods you will use to obtain the necessary data. There are two main types of studies: descriptive studies and experimental studies.
With a descriptive study, data are gathered from people in a way that does not have an impact on them. The most widely used type of descriptive study is a survey.
Surveys are questionnaires that are given to people who are randomly selected from a target population.
Surveys are useful tools for gathering information. As with all methods of gathering data, however, improperly conducted surveys are likely to produce inaccurate information. Common problems include poorly worded questions that confuse participants, low response rates, and a lack of randomization in the selection process. Any of these problems can invalidate the results, so surveys must be carefully planned before they are implemented.
One limitation of the survey method is that it can only provide information on relationships that exist between variables, not on causes and effects.
If survey researchers observe, for example, that people who smoke cigarettes tend to work longer hours per day than those who do not, they are not in a position to conclude that smoking causes the longer work hours. A variable that was not part of the research design, such as the number of hours each person sleeps per night, might account for the relationship.
Experiments involve the application of one or more treatments to subjects in a controlled environment. The treatments are things that may or may not affect the subject under study. Some studies involve medical experiments, wherein the subjects are patients who undergo medical treatments. Other experiments might include students who receive tutoring, or exposure to a particular instructional tool as the treatment. Businesses engage in experiments that involve sample participants from the consumer market.
These participants may be exposed to a certain type of advertisement and asked how they were emotionally affected. Once the treatments are applied, the responses are systematically recorded.
For instance, to study the effect of drug dosage on blood pressure, one group of subjects may be administered 15 mg of a medicine. A different group may be administered 30 mg of the same drug. Typically, a control group is also involved, in which each subject receives a placebo treatment (i.e., a substance with no medicinal properties).
Experiments are often conducted in a controlled setting in order to reduce the number of unrelated variables and possible biases that might affect the results. Possible problems include researchers knowing which participants received which treatments; a circumstance or condition not factored into the study that may affect the results (e.g., other medications that a participant may be taking); and the absence of an experimental control group. However, when experiments are designed correctly, differences in responses found when the groups are compared allow the researchers to conclude that there is a cause-and-effect relationship. No matter what the study, it must be designed so that the original questions can be answered in a credible way.
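The group comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the readings below are invented for the example, not taken from any study, and the group labels simply mirror the 15 mg, 30 mg, and placebo groups mentioned in the text.

```python
from statistics import mean

# Invented, purely illustrative readings: change in systolic blood
# pressure (mmHg) for each subject after treatment. Not real data.
groups = {
    "placebo": [-1, 0, 2, -2, 1],
    "15 mg": [-6, -4, -5, -7, -3],
    "30 mg": [-11, -9, -10, -12, -8],
}

# Average response within each group.
group_means = {name: mean(values) for name, values in groups.items()}

# Difference between each treatment group and the control group.
baseline = group_means["placebo"]
differences = {
    name: group_mean - baseline
    for name, group_mean in group_means.items()
    if name != "placebo"
}

for name, diff in differences.items():
    print(f"{name}: mean change differs from placebo by {diff} mmHg")
```

A difference in group means is only a starting point. A real analysis would also check whether the difference is larger than what chance variation alone could produce.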
Once a research plan (whether descriptive or experimental) has been designed, the subjects must be selected and the data must be gathered. This stage of the research process is essential to generating meaningful data. The ways in which data are collected vary with the type of study. In experimental designs, the data should be collected in the most controlled manner possible in order to reduce the possibility of contaminated results. Some experiments require more stringent procedures than others. When gathering data on people's perceptions of a new business marketing strategy or on the effectiveness of a new teaching strategy, the consequences of inaccurate results are not as critical as they would be in a medical study. Therefore, in low-stakes experiments, it is sometimes preferable to use less rigorous data-gathering procedures in order to save time and money.
Selecting a Useful Sample
In analytics, as in computer programming, garbage in results in garbage out. If subjects are improperly chosen, for example by giving some a greater chance of being selected than others, the results will be unreliable and not useful for making decisions. For example, suppose John is researching the attitudes of individuals about a possible new tax. John stands in front of a local grocery store and asks passers-by to share their thoughts and attitudes. The problem is that John will only get the attitudes of individuals who a) shop at that grocery store; b) on that specific day; c) at that specific time; and d) actually choose to participate. Because of his limited selection process, the subjects in his survey are not representative of the entire population of the town. Likewise, John could design an online survey and ask people to enter their feedback on the new tax. However, only people who are aware of the website, have access to the Internet, and choose to participate will provide data. Typically, only people with the strongest attitudes are likely to participate. Again, these participants would not be representative of everyone in the town.
In order to avoid such selection bias, it is necessary to select the sample randomly, using a process that gives everyone in the population the same chance of being chosen. There are various methods for randomly selecting subjects in order to get valid and usable results.
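One such method is simple random sampling, sketched below in Python. The list of resident ID numbers is a hypothetical stand-in for a complete list of the town's population, and the function name `simple_random_sample` is an assumption made for illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each with the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)

# Hypothetical sampling frame: ID numbers for 1,000 town residents.
residents = list(range(1, 1001))

sample = simple_random_sample(residents, 50, seed=42)
print(f"Selected {len(sample)} of {len(residents)} residents")
```

Unlike standing outside a grocery store, this approach depends only on having a complete list of the population. No one's shopping habits, schedule, or Internet access affects the chance of being chosen, although people who are selected may still decline to respond.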
Avoiding Bias in a Data Set
If you were conducting a phone survey on political voting preferences and made your calls to people's home landlines between 8:00 a.m. and 4:00 p.m., you would fail to get feedback from individuals who work during those hours. Perhaps those who work during those hours have different preferences than those who are at home. For example, more business owners may be at home, and they may express voting preferences completely different from those of members of the working class. Surveys that are poorly designed may be too lengthy, resulting in some participants quitting before they finish. Participants may not be completely honest if the questions are too personal. If the list of choices is too limited, the survey will not capture valuable data that people would otherwise have provided. Many things can render survey data invalid. Experiments can be even more problematic in terms of gathering data. If you want to test how well people retain information when exposed to loud music, a variety of factors could affect the outcomes.
The experiment designer should consider whether everyone will listen to the same song, whether participants will be asked about the amount of sleep they got the night before, whether they have prior knowledge of the subject matter, how they feel about participating in the experiment, whether they use drugs or alcohol regularly, and a host of other factors that must be taken into account in order to control for outside variables.
Once data has been collected, it is time to compile it in order to get a view of the entire data set. Analysts describe data in two basic ways: with images, such as graphs and charts, and with numerical summaries, called descriptive analytics. Descriptive analytics are the most commonly used methods for describing data to the general population. When used effectively, a chart or graph can also explain volumes of data in a single snapshot.
Descriptive analytics are numerical representations of data that highlight the most important features of a data set. With categorical data, in which everything is sorted into groups (e.g., gender, ethnicity, or currency), the data are usually summarized by the number of units in each category. This is reported as a frequency or a percentage.
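As a minimal sketch of summarizing categorical data by frequency and percentage, consider the Python example below. The list of car types is made up for illustration, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to a (count, percentage of all observations) pair."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, 100 * count / total)
        for category, count in counts.items()
    }

# Made-up categorical observations: the type of car each respondent drives.
car_types = ["sedan", "truck", "sedan", "SUV", "sedan", "truck", "SUV", "sedan"]

table = frequency_table(car_types)
for category, (count, percent) in table.items():
    print(f"{category}: {count} ({percent:.1f}%)")
```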
Numerical data consist of literal quantities or totals (e.g., height, weight, or amount of money), in which the actual numbers are meaningful. When working with numerical data, more aspects can be summarized than just the number or percentage within each category. These include measures of center (i.e., the middle point of the data) and measures of variability (i.e., how widely spread or how tightly clustered the data are around the center). Another consideration is a measure of the relationship between different variables. Depending on the situation, certain descriptive analytics are more appropriate than others. For example, if you were to assign the codes 1 for men and 2 for women, it would not make sense to average those numbers when analyzing the data. Likewise, attempting to use a percentage to describe a single amount of time would not be useful.
Another type of data, ordinal data, is somewhat of a combination of the first two types. Ordinal data appear in categories, but the categories have a hierarchical order, such as rankings from 1 to 10 or student class ranks from freshman through senior. These data can be analyzed the same way as categorical data. Numerical data procedures can also be used when the categories represent meaningful numbers.
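The summaries of numerical data described above (center, variability, and relationship between variables) can be sketched with Python's standard library. The height and weight values are invented for illustration. The choice of the mean and median for center, the sample standard deviation for variability, and the Pearson correlation for relationship is an assumption, since the text does not name specific measures.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented measurements for five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 68, 72, 80]

center_mean = statistics.mean(heights_cm)      # measure of center
center_median = statistics.median(heights_cm)  # another measure of center
spread = statistics.stdev(heights_cm)          # sample standard deviation
relationship = pearson_r(heights_cm, weights_kg)

print(f"mean={center_mean}, median={center_median}, stdev={spread:.2f}")
print(f"height/weight correlation={relationship:.3f}")
```

A correlation near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none. As with survey results, a strong correlation does not by itself establish cause and effect.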
Charts and Graphs
Data can be presented visually with graphs and charts. Such graphs include pie charts and bar charts, which can be used with categorical variables like gender or type of car. A bar graph might present data about attitudes using, for example, a series of five ordered bars labeled from “Strongly Disagree” through “Strongly Agree.” Not all data, however, can be presented clearly with these types of charts. Numerical data, such as height, time, or dollars, that represent measures or totals require graphs that can either summarize the numbers or group them numerically. One such graph is a histogram, which will be discussed later in this book.
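To illustrate what grouping numbers numerically means, the sketch below counts invented height measurements in equal-width intervals and prints a rough text-only histogram. The function name `histogram_counts`, the 10 cm interval width, and the starting point of 150 cm are all assumptions chosen for the example.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each interval [low, low + bin_width)."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)
        low = start + index * bin_width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

# Invented height measurements in centimeters.
heights_cm = [152, 158, 161, 164, 167, 168, 171, 173, 176, 185]

BIN_WIDTH = 10
bins = histogram_counts(heights_cm, bin_width=BIN_WIDTH, start=150)

# One row per interval; the number of # marks is the count in that interval.
for low, count in bins.items():
    print(f"{low} to under {low + BIN_WIDTH}: {'#' * count}")
```

Each row plays the role of one bar in a histogram: the taller the bar, the more values fall in that interval.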
Once the data is collected and described with pictures and numbers, it is time to begin the process of data analysis. Assuming that the study was planned well, the research question can be properly answered by applying an appropriate data analysis. As with all previous steps in the process, selecting an appropriate analytical procedure determines the usefulness of the results.
We have discussed the foundations of data analytics. Using mathematical techniques and scientific procedures to collect, measure, analyze, and draw conclusions from data is what data analytics is all about.