Use this page as a reference when working with large, wide data sets.
Before getting into some of the detailed tips and suggestions, let us start with the first rule of big data:
•
Get a powerful desktop or remote PC or Mac with plenty of fast cores, large fast memory, and a large monitor.
JMP holds the entire data table in main memory, which is optimal for speed, but it means that you need a machine with several times the memory of the largest table that you expect to open.
The large number of variables in wide data analysis can also be computationally demanding, so a machine with fast multi-core processors speeds up those operations.
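As a rough illustration of the sizing rule above, the following Python sketch estimates the memory footprint of a purely numeric table and the RAM you might want available. It is not a JMP tool. The 8 bytes per cell and the headroom factor of 3 are assumptions chosen for illustration ("several times" is not quantified here), and the function names are hypothetical.

```python
def estimate_table_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Rough in-memory size of a numeric table (assumes 8 bytes per cell)."""
    return n_rows * n_cols * bytes_per_value


def suggested_ram_bytes(table_bytes: int, headroom_factor: float = 3.0) -> int:
    """Apply an assumed headroom factor for copies and intermediate results."""
    return int(table_bytes * headroom_factor)


# Example: 10,000 rows by 50,000 numeric columns.
table_bytes = estimate_table_bytes(n_rows=10_000, n_cols=50_000)
print(f"table size:    {table_bytes / 1e9:.1f} GB")
print(f"suggested RAM: {suggested_ram_bytes(table_bytes) / 1e9:.1f} GB")
```

Character columns and column metadata add to this estimate, so treat the result as a lower bound.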
Tips and Suggestions
1
Importing and Accessing Data
•
Use File > Open for importing individual data sets or opening existing JMP tables. Use the preview function to examine a data table before starting a lengthy import process. Refer to Importing Genomics Data to see examples.
•
The Multiple Import platform (File > Import Multiple Files) enables you to import many files in one operation. Refer to Import Multiple Files for more information.
•
Once you have imported the data, save it as a JMP file. All subsequent actions run faster, and the time that you invest in adding metadata and preparing the data is preserved.
•
Some database connections support block reads; use them when available.
2
Use the JMP Preferences platform (File > Preferences) to select time and space saving options.
•
Select General > Save Data Table Columns GZ Compressed to GZ compress the JMP table when saved.
•
Select General > JSL save column groups with group name.
•
Select Tables > Allow Short Numeric Format to allow storing integer-valued columns in a 1-4-byte format.
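To see why a short numeric format saves space, the Python sketch below picks the smallest byte width that can hold every value in an integer column. It only illustrates the idea and does not describe how JMP chooses a format. It assumes signed widths of 1, 2, and 4 bytes with an 8-byte fallback; JMP may reserve values (for example, for missing data), so its actual ranges can differ. The function name is hypothetical.

```python
def smallest_int_bytes(values):
    """Return the smallest assumed signed width (1, 2, or 4 bytes) that fits all values."""
    lo, hi = min(values), max(values)
    for n_bytes in (1, 2, 4):
        bound = 2 ** (8 * n_bytes - 1)  # e.g. 128 for 1 byte
        if -bound <= lo and hi < bound:
            return n_bytes
    return 8  # fall back to a full 8-byte value


print(smallest_int_bytes([0, 1, 5]))     # small codes fit in 1 byte
print(smallest_int_bytes([0, 300]))      # needs 2 bytes
print(smallest_int_bytes([0, 70_000]))   # needs 4 bytes
```

A column of small integer codes stored in 1 byte per cell uses one eighth of the memory of the same column stored in 8 bytes per cell.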
3
To reduce the burden of big data:
•
Routinely compress files wherever possible. Use JMP's Compress File when Saved option to compress your large data tables.
•
Group similar columns (Cols > Group Columns) that can be treated together when performing manipulations and analyses. This option is especially useful when you save scripts, because you can refer to the group name rather than enumerate many columns.
•
Compress selected columns using the JMP utility (Cols > Utilities > Compress Selected Columns).
•
Convert labels to codes using JMP's Labels to Codes utility (Cols > Utilities > Labels to Codes). JMP allocates a larger amount of memory for strings, and if long strings are repeated throughout a column, you can consume considerably more memory than is actually required. This utility recodes the labels to integers and adds a value label property, which is displayed in place of the integer code in the column and in analysis reports, resulting in reduced memory usage.
•
JMP's Data Filter panel (Rows > Data Filter) gives you a variety of ways to select, hide, or exclude subsets of data from plots and analyses.
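The idea behind the Labels to Codes utility described above can be sketched in Python: each distinct label is stored once in a lookup table, and the column itself holds only small integer codes. This is a conceptual illustration, not JMP's implementation. Assigning codes in order of first appearance is an assumption, and the function name is hypothetical.

```python
def labels_to_codes(labels):
    """Replace repeated strings with integer codes plus a code-to-label lookup."""
    code_by_label = {}
    codes = []
    for label in labels:
        # Assumption: codes start at 1 and follow order of first appearance.
        code = code_by_label.setdefault(label, len(code_by_label) + 1)
        codes.append(code)
    value_labels = {code: label for label, code in code_by_label.items()}
    return codes, value_labels


column = ["homozygous reference", "heterozygous", "homozygous reference"]
codes, value_labels = labels_to_codes(column)
print(codes)          # integer codes stored in the column
print(value_labels)   # lookup used to display the original labels
```

The long strings now appear once each in `value_labels` instead of once per row, which is where the memory saving comes from.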
4
Changing Attributes, Transforming and Recoding
•
Use the Standardize Attributes platform (Cols > Standardize Attributes) if you want to change attributes for one or more selected columns.
•
If you have many variables that you want to recode in the same way, select the columns, then go to Cols > Standardize Attributes and click the button there.
5
Practical Hints
•
Use progress bars and early stopping. Many platforms are iterative and use very strict convergence criteria. If less-precise fits are acceptable, several platforms let you stop early by using a button on the progress bar. See Nonlinear Platform Options for more information.
•
Do not be afraid to explore options available in individual reports.
•
If you have a large number of categories, consider using Recode to combine some of them into groups. Model fitting might need to create analysis columns for each category, so having many categories can slow down fits. Many categories also clutter the displays. Graph Builder has an option called Packed Bar that shows the categories with large counts as bars and collapses the remaining categories into packed tiles.
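Combining sparse categories, as suggested above, can be sketched in Python as follows. This illustrates the concept only and is not what Recode does internally. The count threshold, the "Other" label, and the function name are hypothetical choices.

```python
from collections import Counter


def lump_rare_categories(values, min_count, other_label="Other"):
    """Keep categories seen at least min_count times; merge the rest into one group."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


sites = ["A", "A", "A", "B", "B", "C", "D"]
print(lump_rare_categories(sites, min_count=2))
```

Fewer categories means fewer analysis columns for a model to create, at the cost of losing the distinction among the merged levels.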
6
General Strategies
•
Often when you have many columns, the goal is not to look at every single column, but only those few that have the biggest effect.
•
Initially, you are better served by generating summary graphs instead of viewing all the details.
•
Use False Discovery Rate (FDR)/LogWorth for screening thousands of tests. When you perform a large number of statistical significance tests for model effects, you still see many statistically significant results due to random variability alone, even if in reality there are no non-zero model effects. If there are no effects, the p-values are expected to follow a uniform distribution, so many small p-values occur by chance alone. Use the False Discovery Rate adjustments to get more realistic p-values. LogWorth is a transform of the p-value (LogWorth = -log10(p-value)), and examining small p-values on that scale shows greater separation in the values. The criterion for statistical significance should also be appropriate for the number of comparisons or model effects being calculated. For wide data problems it should, in general, be much smaller than the traditional p-value < 0.05 criterion that is often used in smaller studies.
•
In addition to p-value adjustments and criteria, effect sizes should also be an important consideration. A commonly used plot for examining both effect sizes and p-values is a Volcano Plot, which is a plot of FDR/LogWorth versus Effect Size. This plot generally has a “U” shape similar to the shape of a volcano’s crater, and effects that are both large and highly significant appear in the upper-right and upper-left quadrants of the plot.
•
For multivariate analyses in which missing values are scattered throughout the data, you might need to use imputation to avoid losing large numbers of observations through rowwise deletion. A number of multivariate platforms (Hierarchical Clustering and Principal Components, for example) have built-in imputation, which is strongly preferred when available. For other platforms, you can make an imputed version of your data using the Explore Missing Values platform.
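The FDR and LogWorth calculations discussed in this list can be sketched in Python. The example assumes the Benjamini-Hochberg procedure, a common FDR adjustment; this page does not state which method JMP applies, so treat it as an illustration rather than a description of JMP's computation. The function names and p-values are hypothetical.

```python
import math


def fdr_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def logworth(p_value):
    """LogWorth = -log10(p-value); larger values mean stronger evidence."""
    return -math.log10(p_value)


raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, fdr_adjust(raw)):
    print(f"p={p:<6} FDR p={q:.3f} FDR LogWorth={logworth(q):.2f}")
```

For a volcano plot, the FDR LogWorth values would go on the vertical axis and the effect sizes on the horizontal axis.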