This paper provides an introduction to the unglamorous, time-consuming, laborious, and sometimes dreaded “dirty work” of statistical investigations – data preparation. Thankfully, JMP users can perform wrangling operations within one software environment, without having to hop between different platforms.
Successful analyses require a disciplined process like the one shown here, wherein a question gives rise to fact-gathering, analysis, and implementation of a solution, which in turn is monitored and gives rise to further questions: an ongoing loop in which data plays a central role. In this image, though, the "Prepare" step is not drawn remotely to scale. Also, in any phase we may need to step back to an earlier phase. This paper deals mainly with the Prepare and Explore phases.
Though this paper focuses on data preparation, there are four other themes that we’ll keep in mind:
- Data projects occur within a work context. The questions that drive any study arise from that context and in turn affect the scope and nature of the data preparation required.
- Workflow efficiency is valuable. Because data preparation is so time-consuming and varies from project to project, it is vital to establish and follow procedures that are efficient.
- Reproducibility contributes to workflow efficiency. Some projects are one-time, unique tasks, but many occur repeatedly. We need to preserve a record of precisely how analysts transform raw data into the data used in the analysis.
- Your mind is your most valuable analytical software. As powerful and efficient as JMP is, it cannot determine whether you and your team have collected data about the right variables to answer your questions, or whether your models make sense within the work context.
This paper covers the following topics:
- Essential database operations.
- Detecting and addressing common data hygiene problems:
  - Messy data: Incorrect modeling types, inaccurate or obsolete data.
  - Outliers and other dubious observations.
  - Missing observations.
  - Suspicious patterns.
- Issues driven by the analytic approach:
  - Data requirements for different platforms.
  - Reshaping a data table.
- Feature engineering: Transformations and data reduction.
- Special consideration for time and date data.