Biostatistics Using JMP
271 pages

Analyze your biostatistics data with JMP!

Trevor Bihl's Biostatistics Using JMP: A Practical Guide provides a practical introduction to using JMP, the interactive statistical discovery software, to solve biostatistical problems. Providing extensive breadth, from summary statistics to neural networks, this essential volume offers a comprehensive, step-by-step guide to using JMP to handle your data.

The first biostatistical book to focus on software, Biostatistics Using JMP discusses such topics as data visualization, data wrangling, data cleaning, histograms, box plots, Pareto plots, scatter plots, hypothesis tests, confidence intervals, analysis of variance, regression, curve fitting, clustering, classification, discriminant analysis, neural networks, decision trees, logistic regression, survival analysis, control charts, and meta-analysis.

Written for university students, professors, those who perform biological/biomedical experiments, laboratory managers, and research scientists, Biostatistics Using JMP provides a practical approach to using JMP to solve your biostatistical problems.



Publication date: October 3, 2017
EAN13: 9781635262414
Language: English


Biostatistics Using JMP
A Practical Guide
Trevor Bihl
The correct bibliographic citation for this manual is as follows: Bihl, Trevor. 2017. Biostatistics Using JMP: A Practical Guide. Cary, NC: SAS Institute Inc.
Biostatistics Using JMP: A Practical Guide
Copyright © 2017, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-62960-383-4 (Hard copy)
ISBN 978-1-63526-241-4 (EPUB)
ISBN 978-1-63526-242-1 (MOBI)
ISBN 978-1-63526-243-8 (PDF)
All Rights Reserved. Produced in the United States of America.
For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
September 2017
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to .
To the memory of Gregory Boivin, DVM, MBA, who provided encouragement and much needed data for this endeavor
About This Book
About the Author
Chapter 1: Introduction
1.1 Background and Overview
1.2 Getting Started with JMP
1.3 General Outline
1.4 How to Use This Book
1.5 Reference
Chapter 2: Data Wrangling: Data Collection
2.1 Introduction
2.2 Collecting Data from Files
2.2.1 JMP Native Files
2.2.2 SAS Format Files
2.2.3 Excel Spreadsheets
2.2.4 Text and CSV Format
2.3 Extracting Data from Internet Locations
2.3.1 Opening as Data
2.3.2 Opening as a Webpage
2.4 Data Modeling Types
2.4.1 Incorporating Expression and Contextual Data
2.5 References
Chapter 3: Data Wrangling: Data Cleaning
3.1 Introduction
3.2 Tables
3.2.1 Stacking Columns
3.2.2 Basic Table Organization
3.2.3 Column Properties
3.3 The Sorted Array
3.4 Restructuring Data
3.4.1 Combining Columns
3.4.2 Separating Out a Column (Text to Columns)
3.4.3 Creating Indicator Columns
3.4.4 Grouping Inside Columns
3.5 References
Chapter 4: Initial Data Analysis with Descriptive Statistics
4.1 Introduction
4.2 Histograms and Distributions
4.2.1 Histograms
4.2.2 Box Plots
4.2.3 Stem-and-Leaf Plots
4.2.4 Pareto Charts
4.3 Descriptive Statistics
4.3.1 Sample Mean and Standard Deviation
4.3.2 Additional Statistical Measures
4.4 References
Chapter 5: Data Visualization Tools
5.1 Introduction
5.2 Scatter Plots
5.2.1 Coloring Points
5.2.2 Copying Better-Looking Figures
5.2.3 Multiple Scatter Plots
5.3 Charts
5.4 Multidimensional Plots
5.4.1 Parallel Plots
5.4.2 Cell Plots
5.5 Multivariate and Correlations Tool
5.5.1 Correlation Table
5.5.2 Correlation Heat Maps
5.5.3 Simple Statistics
5.5.4 Additional Multivariate Measures
5.6 Graph Builder and Custom Figures
5.6.1 Graph Builder Custom Colors
5.6.2 Incorporating Contextual Data
5.7 References
Chapter 6: Rates, Proportions, and Epidemiology
6.1 Introduction
6.2 Rates
6.2.1 Crude Rates
6.2.2 Adjusted Rates
6.3 Geographic Visualizations
6.3.1 National Visualizations
6.3.2 County and Lower Level Visualizations
6.4 References
Chapter 7: Statistical Tests and Confidence Intervals
7.1 Introduction
7.1.1 General Hypothesis Test Background
7.1.2 Selecting the Appropriate Method
7.2 Testing for Normality
7.2.1 Histogram Analysis
7.2.2 Normal Quantile/Probability Plot
7.2.3 Goodness-of-Fit Tests
7.2.4 Goodness-of-Fit for Other Distributions
7.3 General Hypothesis Tests
7.3.1 Z-Test Hypothesis Test of Mean
7.3.2 T-Test Hypothesis Test of Mean
7.3.3 Nonparametric Test of Mean (Wilcoxon Signed Rank)
7.3.4 Standard Deviation Hypothesis Test
7.3.5 Tests of Proportions
7.4 Confidence Intervals
7.4.1 Mean Confidence Intervals
7.4.2 Mean Confidence Intervals with Different Thresholds
7.4.3 Confidence Intervals for Proportions
7.5 Chi-Squared Analysis of Frequency and Contingency Tables
7.6 Two Sample Tests
7.6.1 Comparing Two Group Means
7.6.2 Paired Comparison, Matched Pairs
7.7 References
Chapter 8: Analysis of Variance (ANOVA) and Design of Experiments (DoE)
8.1 Introduction
8.2 One-Way ANOVA
8.2.1 One-Way ANOVA with Fit Y by X
8.2.2 Means Comparison, LSD Matrix, and Connecting Letters
8.2.3 Fit Y by X Changing Significance Levels
8.2.4 Multiple Comparisons, Multiple One-Way ANOVAs
8.2.5 One-Way ANOVA via Fit Model
8.2.6 One-Way ANOVA for Unequal Group Sizes (Unbalanced)
8.3 Blocking
8.3.1 One-Way ANOVA with Blocking via Fit Y by X
8.3.2 One-Way ANOVA with Blocking via Fit Model
8.3.3 Note on Blocking
8.4 Multiple Factors
8.4.1 Experimental Design Considerations
8.4.2 Multiple ANOVA
8.4.3 Feature Selection and Parsimonious Models
8.5 Multivariate ANOVA (MANOVA) and Repeated Measures
8.5.1 Repeated Measures MANOVA Background
8.5.2 MANOVA in Fit Model
8.6 References
Chapter 9: Regression and Curve Fitting
9.1 Introduction
9.2 Simple Linear Regression
9.2.1 Fit Y by X for Bivariate Fits (One X and One Y)
9.2.2 Special Fitting Tools
9.3 Multiple Regression
9.3.1 Fit Model
9.3.2 Stepwise Feature Selection
9.3.3 Analysis of Covariance (ANCOVA)
9.4 Nonlinear Curve Fitting and a Nonlinear Platform Example
9.5 References
Chapter 10: Diagnostic Methods for Regression, Curve Fitting, and ANOVA
10.1 Introduction
10.2 Computing Residuals with Fit Y by X and Fit Model
10.2.1 Fit Y by X
10.2.2 Fit Model
10.3 Checking for Normality
10.4 Checking for Nonconstant Error Variance (Heteroscedasticity)
10.5 Checking for Outliers
10.6 Checking for Nonindependence
10.7 Multiple Factor Diagnostics
10.8 Nonlinear Fit Residuals
10.9 Developing Appropriate Models
10.10 References
Chapter 11: Categorical Data Analysis
11.1 Introduction
11.2 Clustering
11.2.1 Hierarchical Clustering
11.2.2 K-means Clustering
11.3 Classification
11.3.1 JMP Data Preliminaries for Classification
11.3.2 Example Data Sets
11.4 Classification by Logistic Regression
11.4.1 Logistic Regression in Fit Y by X
11.4.2 Logistic Regression in Fit Model
11.5 Classification by Discriminant Analysis
11.5.1 Discriminant Analysis Loadings
11.5.2 Stepwise Discriminant Analysis
11.6 Classification with Tabulated Data
11.7 Classifier Performance Verification
11.8 References
Chapter 12: Advanced Modeling Methods
12.1 Introduction
12.2 Principal Components and Factor Analysis
12.2.1 Principal Components in JMP
12.2.2 Dimensionality Assessment
12.2.3 Factor Analysis in JMP
12.3 Partial Least Squares
12.4 Decision Trees
12.4.1 Classification Decision Trees in JMP
12.4.2 Predictive Decision Trees in JMP
12.5 Artificial Neural Networks
12.5.1 Neural Network Architecture
12.5.2 Classification Neural Networks in JMP
12.5.3 Predictive Neural Networks in JMP
12.6 Control Charts
12.7 References
Chapter 13: Survival Analysis
13.1 Introduction
13.2 Life Distributions
13.3 Kaplan-Meier Curves
13.3.1 Simple Survival Analysis
13.3.2 Multiple Groups
13.3.3 Censoring
13.3.4 Proportional Hazards
13.4 References
Chapter 14: Collaboration and Additional Functionality
14.1 Introduction
14.2 Saving Scripts and SAS Coding
14.2.1 Saving Scripts to Data Table
14.2.2 SAS Coding Functionality
14.3 Collaboration
14.3.1 Journals
14.3.2 Web Reports
14.4 Add-Ins
14.4.1 Finding Add-Ins
14.4.2 Developing Add-Ins
14.4.3 Example Add-In: Forest Plot / Meta-analysis
14.4.4 Add-In Version Control
14.5 References
Acknowledgments
I would like to thank my wife, Ji, and daughter, Talia, for their support and understanding while I worked on this book. Additionally, my wife was an excellent editor and subject matter expert on various biomedical topics. The education I received from working with Kenneth Bauer pushed me into statistics and thus this book. Motivation by Camilla Mauzy, David Smallenberger, and Michael Gibb also needs mentioning, since it helped move this project from ad hoc student reference materials to a completed book. This motivation was furthered by Bill Worley, of SAS Institute Inc., who was instrumental in getting me to submit a book proposal as well as being a sounding board for ideas and a source of knowledge on the finer points of JMP. Support and motivation by Stacey Hamilton, my editor at SAS, were also extremely helpful and appreciated.
Finding new and relevant data is always particularly challenging, and considerable thanks go to those who were willing and able to share their data. Gregory Boivin was very helpful in this regard and provided a very useful dataset on mouse tendon strength, which is used throughout the book; in addition to thanking him, considerable thanks also go to Hamish Simpson and Michelle Ghert, the editors of Bone and Joint Research, who gave me permission to reuse Greg's mouse tendon strength dataset. Similarly, Angie Brown, of the West Virginia Medical Journal, was very helpful in giving me permission to reuse data presented in their journal. A similar debt of gratitude goes to Teresa Hawkes, Otilia Banji, and Kranthi Kumar, who shared their own datasets with me. Finally, thanks go to my reviewers, including Teresa Hawkes, Amanda King, and Richard Zink, and my editor, Stacey Hamilton, who greatly helped improve the quality over the initial drafts.
About This Book
Rationale for This Book
This book focuses on the basics of statistical data analysis of biomedical/biological data using JMP. After both teaching and consulting in biostatistics, I saw a gap that existed between biostatistics books, which tend to be theoretical, and statistical software. To address this gap, I use statistical methods to analyze various biostatistics problems.
Importance of Statistical Analysis
Analytics, data mining, data science, and statistics are essentially synonyms, and describe finding meaning in data by developing mathematical models to find and describe relationships in the data. While many biostatistical applications are simple in nature, for example, a t-test to evaluate the mean differences in response due to a treatment, a wide variety of methods exists.
Biostatistics Focus
Biostatistics is the application of statistical methods to biological, or medical, data. While some methods see more frequent use in biostatistics, for example, survival analysis, these methods are not limited in use to just biostatistical problems. Essentially, all data is a matrix at the end of the day, and thus methods seen in biostatistical analysis can be applied to other domains.
The Power of JMP for Analytics
Familiarity with statistical methods enables one to analyze data via the methods found in textbooks. However, although many textbook examples are simple in nature, real-world data rarely is. Thus, applying methods from a textbook can be frustrating if you have to wrestle with both the data and the software.
This book was written with JMP due to the many advantages JMP has over other statistical software. JMP provides a GUI (graphical user interface) in which one can analyze data without coding algorithms. Additionally, the SAS underpinnings to JMP provide a wide and stable platform that can be trusted in its analysis. In total, JMP provides a tool that is easy to use and comes with a wide variety of built-in methods, the results of which can be trusted (something you can't say about all statistical software). And, for those who wish to code boutique algorithms, JMP supports this as well.
Who Should Read This Book
This book is written for several different audiences. Although biostatistics is the focus, and is in the title, this book has broader appeal.
Biological/Medical Researchers and Laboratory Managers
Researchers in the sciences, for example, biology and medicine, spend a large majority of their time performing experiments and a small fraction of their time analyzing data. Remembering how to use software that is only accessed a few times a year can be challenging. Thus, this book is aimed particularly at this group and provides a practical guide to analyzing collected biological/medical data.
Statisticians and Data Scientists
This group might be interested in a broad look at how to use JMP to solve various problems and analyze data. While theory is light in this book, this group could easily learn the steps and nuances of JMP. Additionally, they would see practical data analysis and experimental data analysis using various JMP capabilities.
Students in Biostatistics or Statistics Classes
Many biostatistical courses use excellent textbooks that cover the theory and examples for a wide variety of problems. However, these textbooks rarely discuss how to solve the problems, leaving students with the need to either code equations or learn various statistical software programs on the fly. This book is written from a general standpoint and can thus be combined with any biostatistical textbook. Additionally, since the statistical methods themselves can be used in many domains, this book can be combined with multiple statistics courses and textbooks.
Biostatistics Methods and JMP Functionality Covered in This Book
This Book Covers the Following Biostatistics Methods
Data Cleaning - Data Wrangling - Descriptive Statistics - Data Visualization
Rates - Proportion - Geographical Visualization - Epidemiology
Confidence Intervals - Hypothesis Tests
Linear Regression - Curve Fitting - General Linear Models
Analysis of Variance (ANOVA) - Analysis of Covariance (ANCOVA) - Remedial Measures for Regression and ANOVA
Cluster Analysis - Hierarchical Clustering - K-means
Classification Analysis - Logistic Regression - Discriminant Analysis
Survival Analysis - Meta Analysis - Control Charts - Neural Networks - Decision Trees
Structure of This Book
Chapter 1 introduces this book and mirrors some content in this section. Additionally, Chapter 1 introduces how to start using JMP. Chapters 2 and 3 introduce data-wrangling issues, such as data collection and cleaning. These chapters are very helpful when analyzing real-world data using JMP. The basics of descriptive statistics and data visualization are presented in Chapters 4 and 5.
After Chapter 5, the focus of this book is on developing statistical models to describe data. Chapters 6 through 13 present various approaches, and your data and goals will drive which chapter you should read. Chapter 6 discusses epidemiology and geographical data analysis. Chapter 7 discusses hypothesis tests and confidence intervals. Chapters 8 to 10 present models such as analysis of variance, regression, curve fitting, and model validation. Chapter 11 discusses classification and clustering methods. Chapter 12 presents advanced modeling methods. Chapter 13 discusses survival analysis. Finally, Chapter 14 presents collaboration methods and the incorporation of custom JMP tools, with meta-analysis as an example.
Additional Resources
For downloads of sample data presented in this book, please visit my author page at:
This site also includes downloadable color versions of selected figures that appear in this book. Since this book is printed in black and white, you might find that some color figures are easier to interpret and understand.
Please visit this site regularly, as I will provide updates on the content.
We Want to Hear from You
SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit to do the following:
Sign up to review a book
Recommend a topic
Request information on how to become a SAS Press author
Provide feedback on a book
Do you have questions about a SAS Press book that you are reading? Contact the author through or .
SAS has many resources to help you find answers and expand your knowledge. If you need additional help, see our list of resources at .
About the Author

Trevor Bihl is both a research scientist/engineer and an educator who teaches biostatistics, engineering statistics, and programming courses. He has been a SAS and JMP user since 2009 and provides various biostatistics and data mining consulting services. His background includes multivariate statistics, signal processing, data mining, and analytics. His educational background includes a BS and MS from Ohio University and a PhD from the Air Force Institute of Technology. He is the author of multiple journal and conference papers, book chapters, and technical reports.
Learn more about this author by visiting his author page at . There you can download free book excerpts, access example code and data, read the latest reviews, get updates, and more.
Chapter 1: Introduction
1.1 Background and Overview
This book evolved from personal experiences in both teaching and consulting in biostatistics. Although many biostatistics textbooks show computer outputs and results, they rarely show how to generate the results. Biostatistics instruction is also commonly theoretical and based on solving simple problems by hand. However, real-world data is usually more complicated than the simple examples, and I found that frequent collaborators (PhD-educated researchers who performed and managed experiments) needed more understanding of software to analyze their data. This is difficult because such researchers often spend a majority of their time performing experiments and use statistical software sparingly. They also often do not have access to a dedicated biostatistician in their office or are competing for the time of their office's single biostatistician. Although such researchers might know the mechanics of a statistical method, they might not know how to generate meaningful results using software.
Therefore, a practical, how-to guide to biostatistics was needed. There are many software applications available for statistics and biostatistics, so why JMP? As an educator, I found JMP an advantage in teaching. I could spend more time on theory and interpretation because JMP does not require scripts and syntax. As a collaborator and consultant, I found my colleagues would readily gravitate toward JMP and its results because of the graphical user interface (GUI) format and its ease of use. And finally, as a researcher, unless you want to code algorithms yourself, you will find JMP to be more user-friendly, correct, and developed than many competing packages. Incidentally, if you want to code, SAS programming abilities do exist in JMP. Thus, you can fully use JMP for analysis ranging from simple to complex and customized.
This book presents and solves problems germane to biostatistics with easy-to-reproduce examples. The book is also a general biostatistics reference that leverages the topics found in leading biostatistics books. This chapter introduces JMP, presents a general outline of the book contents, and provides a brief guide to using this book.
1.2 Getting Started with JMP
When you first run JMP, you will be greeted with a Tip of the Day (Figure 1.1). There are 62 tips of the day, and they show up whenever you start JMP. These tips can be useful to new JMP users in gaining familiarity with the software. However, if you don't want to see these tips further, you can do the following:
1. Clear Show tips at start-up.
2. Click Close.
Figure 1.1 Initial Tip of the Day

After you close the Tip of the Day, you are greeted with the primary JMP interface seen in Figure 1.2. Here, you can load data, create a new data table, or look for recently used files. If this is the first time you have opened JMP, there will be no recent files to consider. Thus, you must load or create a new data table.
To load a file:
Click File > Open.
Alternatively, click the third icon on the taskbar.
To create a blank data table:
Click File > New.
Alternatively, click the first icon on the taskbar.
You can also load one of the built-in JMP example data files. A variety of files are available.
To load example data files:
1. Click Help > Sample Data.
2. Select a data file under the method of interest.
Also, you can select individual or multiple data tables in the Window List and then close all of these files. This is advantageous if you inadvertently opened many files, such as through a mistakenly configured Internet open.
To close many open data tables:
1. Select the windows of interest.
2. Right-click and select Close .
3. You will then be prompted to save these files.
Figure 1.2 JMP Primary Interface

If you create a new data table, you will be presented with Figure 1.3. Here, you see a spreadsheet-like table, with Column 1 ready for you to start entering data. Also, when you have loaded and analyzed data, you can save these results to the JMP data table and instantly reload them at a later date, as will be discussed in Section 14.2.
Figure 1.3 New Data Table

1.3 General Outline
With this basic usability knowledge from Section 1.2, you are now ready to consider biostatistical data analysis. Biostatistics covers a wide variety of topics ranging from simple hypothesis tests to complex nonlinear algorithms. This book aims to cover the range of methods with varying levels of detail. To do so, this book is organized sequentially as outlined in Table 1.1.
Table 1.1 General Outline of Biostatistics Using JMP: A Practical Guide
Method                                                        Chapter
Introduction                                                  1
Data Wrangling: Data Collection                               2
Data Wrangling: Data Cleaning                                 3
Initial Data Analysis with Descriptive Statistics             4
Data Visualization Tools                                      5
Rates, Proportions and Epidemiology                           6
Statistical Tests and Confidence Intervals                    7
Analysis of Variance (ANOVA) and Design of Experiments (DoE)  8
Regression and Curve Fitting                                  9
Diagnostic Methods for Regression, Curve Fitting and ANOVA    10
Categorical Data Analysis                                     11
Advanced Modeling Methods                                     12
Survival Analysis                                             13
Collaboration and Additional Functionality                    14
1.4 How to Use This Book
Chapters 2 and 3 address data-wrangling issues, such as data collection and cleaning. Since upward of 80% of your time can be spent in making messy data usable (Lohr, 2014), learning the tools JMP has for assembling and cleaning data is key.
In Chapters 4 and 5, you can learn the basics of descriptive statistics and data visualizations in JMP. Following this, the primary focus is on modeling, which involves creating a mathematical representation (a model) of data or a system in order to make inferences about it.
After this, you have a few different paths available:
Chapter 6 discusses epidemiological and geographical interpretations.
Chapter 4 also discusses developing custom equations.
Chapter 7 discusses the various hypothesis test and confidence interval methods.
Chapters 8 to 10 discuss linear models such as analysis of variance (ANOVA), regression, and model validation.
Because of the interrelation of the underlying methods of regression (Chapter 9) and ANOVA (Chapter 8), diagnostic and remedial measures for these methods are discussed in Chapter 10.
Chapter 11 discusses classification methods, such as logistic regression, and clustering methods, such as k-means.
Chapters 7 to 10 largely deal with a continuous dependent variable (e.g., Y, the prediction). Methods to analyze a discrete dependent variable are presented in Chapter 11.
Chapter 12 presents advanced modeling methods (e.g., factor analysis, neural networks, and control charts).
Chapter 13 introduces the vast array of survival analysis methods in JMP.
Chapter 14 presents methods that facilitate collaborating in addition to sources of additional functionality.
If you are using a previously created data set, then it is advantageous to start with data-wrangling methods and then look at the various analytical tools this book discusses. However, if you are starting a new experiment and will be collecting data, then you should start looking at Section 8.4.1, which discusses experimental design considerations and how to develop and select factor levels for an experiment.
1.5 Reference
Lohr, S. (2014, Aug. 18). For big-data scientists, janitor work is key hurdle to insights. New York Times, p. B4.
Chapter 2: Data Wrangling: Data Collection
2.1 Introduction
Many examples in textbooks and manuals such as this book are very orderly and clean. However, real-world data is rarely orderly and clean (authors actually spend a lot of time searching for and fabricating good example data), and up to 80% of your time can be lost in wrangling to make messy data usable (Lohr, 2016). Medical and biological data is also becoming increasingly big in nature (Bihl, Young II, and Weckman, 2016), and thus wrangling the data can be of interest (Marx, 2013). To analyze messy real-world data, data wrangling comes into play.
Data wrangling, conceptualized in Figure 2.1, involves taking raw data, extracting and cleaning it, and developing data features for analysis. The boundary between where data wrangling ends and where statistical analysis begins is not always clear. Thus, some overlap exists between data wrangling and statistical analysis when you begin to select/extract/analyze data features (e.g., the columns of a JMP data table).
Figure 2.1 Data-Wrangling Overview, adapted from (Boehmke, 2016)
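To make the three stages concrete outside of JMP, the following is a minimal Python sketch of the flow described above: extract raw data, clean it, and develop a feature for analysis. It is only an illustration under stated assumptions. The data values, the column names (id, group, strength), and the helper functions are hypothetical and are not taken from the book's data sets or from JMP itself.

```python
import csv
import io

# Hypothetical raw data with two messy entries (an empty cell and "n/a").
RAW = """id,group,strength
1,control,12.5
2,treated,
3,treated,15.1
4,control,n/a
"""


def extract(text):
    """Extract: parse raw text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: keep only rows whose measurement parses as a number."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["strength"])
        except ValueError:
            continue  # skip missing or non-numeric measurements
        cleaned.append({"id": int(row["id"]), "group": row["group"], "strength": value})
    return cleaned


def add_features(rows):
    """Feature development: add a 0/1 indicator column for the treated group."""
    return [dict(row, treated=int(row["group"] == "treated")) for row in rows]


table = add_features(clean(extract(RAW)))
for row in table:
    print(row)
```

Dropping incomplete rows is only one possible cleaning choice, used here for brevity. In practice, the appropriate treatment of missing values depends on the data and the analysis.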

Data wrangling is not frequently discussed in conjunction with analysis because books, including this one, focus on developing knowledge in using the presented methods. However, it is important to mention data-wrangling issues since the real world is not full of clean, textbook-style data. Thus, this book considers two data-wrangling tasks, with Chapter 2 focusing on data collection in JMP and Chapter 3 focusing on data cleaning using JMP tools. Discussion of data analysis methods then makes up the majority of this book (Chapters 4 to 13). This chapter will present data collection methods built into JMP. JMP can load both JMP and SAS format files, in addition to loading Excel and CSV files. Moreover, data can be imported from text files and Internet webpages using the sophisticated data importation tools in JMP.
2.2 Collecting Data from Files
JMP supports opening a wide variety of data file formats, as detailed in Table 2.1, which notes whether JMP can load or save each format. By supporting a wide variety of data types, JMP facilitates work on older data sets, with users who do not use JMP, and with a wide audience. To aid in usability, this book will consider examples from some of these data types (*.txt, *.sas7bdat, *.csv).
Table 2.1 Data Files Supported by JMP
Type                                 Extension               Load  Save
Comma-separated                      *.csv                   X     X
Data files containing text           *.dat                   X     X
ESRI shapefiles                      *.shp                   X
Flow Cytometry versions 2.0 and 3.0  *.fcs                   X
HTML                                 *.htm, *.html           X
JSON files                           *.json                  X     X
MATLAB                               *.m, *.M                X
Microsoft Excel                      *.xls, *.xlsx           X     X
Minitab Portable Worksheet           *.mtp                   X
Plain text                           *.txt                   X
R                                    *.r                     X
SAS transport                        *.xpt, *.stx            X     X (*.xpt)
SAS versions 7 through 9             *.sas7bdat, *.sas7bxat  X     X (*.sas7bdat)
SPSS files                           *.sav                   X
Tab-separated                        *.tsv                   X     X
Teradata database                    *.trd                   X
Triple-S                             *.sss, *.xml            X
xBase data files                     *.dbf                   X
2.2.1 JMP Native Files
JMP uses a proprietary data format native to JMP software. As with any file format, this format describes how to handle the pieces of the file. However, not all software applications have the keys to unlock all data types. Thus, JMP knows how to assemble a data table from a *.jmp file, but other software applications likely do not. However, the JMP native format has various advantages. If you perform an analysis in JMP, you can save this analysis in the JMP file and reload it on a different computer exactly as you left it. Although you could save to a *.csv (or other formats, as shown in Table 2.1), the ability to reload an analysis would be lost by saving outside the *.jmp format.
JMP Sample Data
A wide variety of sample data sets are available in JMP to illustrate the application of specific methods and to provide data for analysis and demonstration. In JMP 13, 543 sample data files are included and available for use. Their applications range from simple to complex, and this enables you to become familiar with both JMP data and methods through exploring them.
To access the folder containing all sample data available in your copy of JMP:
1. Select Help Sample Data Library .
2. In the Explorer window that opens to the data directory, double-click on the file of interest to load it.
To access the sample data listed by method or domain of interest:
1. Select Help > Sample Data.
2. Select the method or domain of interest (e.g., click Analysis of Variance).
Note that each listing also indicates the type of method with which the file best corresponds.
3. Select an example file of interest (e.g., Blood Pressure).
4. A data table for this file is then opened.
2.2.2 SAS Format Files
SAS files are natively supported by JMP. However, a wide variety of SAS file extensions are in use (e.g., *.sas, *.ss2, *.sc2). Thus, some care is needed when loading a file from SAS into JMP. For example, to load a SAS data file into JMP:
1. Select File > Open.
2. Find the SAS file of interest and click once on it.
3. After you select a SAS file, JMP enables you to specify how to open the file. (See Figure 2.2.)
4. Click Open when you are satisfied with your selection.
5. You will then be greeted with a JMP data table.
As hinted at in Figure 2.2, JMP knows what to do with a SAS file, and thus JMP asks only whether you want the columns of your JMP data table to be named after the SAS variable names or the SAS variable labels.
Figure 2.2 Opening a SAS File

2.2.3 Excel Spreadsheets
Microsoft Excel is a spreadsheet software with worldwide use, and thus JMP readily handles the importation of data from Excel spreadsheets. Since Excel can contain data in multiple sheets of the same file, some care and user interaction might be necessary to import data from Excel. An example will be considered using a generic data file ExcelExample.xlsx, which contains nothing more than two columns of user-generated numbers.
For example, to load an Excel data file into JMP:
1. Select File > Open.
2. Find the Excel file of interest and click once on it.
3. After you select an Excel file, JMP enables you to specify how to open the file. (See Figure 2.3.)
a. The key consideration at this point is whether Row 1 values are column/variable names or are numbers.
4. Click Open when you are satisfied with your selection.
5. You will then be greeted with the Excel Import Wizard. (See Figure 2.4.)
a. At the top left, you can view the Excel data.
b. At the top right, select the Excel sheet of interest. (This example has only one sheet with numbers.)
c. Specify needed considerations using the options in the middle (e.g., where the data is located).
6. Click Import when you are satisfied that the view in the top left is what you want the data table to look like.
Figure 2.3 Opening an Excel File

Figure 2.4 Excel Import Wizard Dialog Box

2.2.4 Text and CSV Format
JMP can load text files as well. However, some care must be taken in this process since text files can be highly unstructured. To accommodate this, JMP offers a variety of options when opening a text file. For this example, a small file, TextExample.txt, was created with two tab-separated columns of data. Loading a *.csv file will be similar.
To load a text file:
1. Select File > Open.
2. Find the text file of interest and click once on it.
3. After a text file is selected, JMP enables you to specify how to open the file. (See Figure 2.5.)
A variety of options exist, as shown in Figure 2.5. Selecting Data, using Text Import preferences lets JMP use whatever preferences have been selected. Selecting Data, using best guess lets JMP examine the data and use what it believes to be the best approach to open the data. Selecting Plain text into Script window will not load the file as a data table, but as a script text file. Here you can select Data with Preview, which gives you more control over the text loading and shows additional functions that can assist in loading text files.
Figure 2.5 Opening a Text File

After you select Data with Preview and click OK, you are greeted with the Import dialog box, as shown in Figure 2.6. Here, you see the text file that you have loaded. For this text file, you see a blue line separating the Day and Meas columns, indicating which parts of the text file JMP will assign to data table columns. The text file under consideration has columns separated by tabs. JMP was able to detect this and its best guess is that the file is tab delimited. However, you could select a wide variety of delimited or fixed-width fields. In addition, you could have selected a subset of this file or a compatibility standard, as shown in the expanded sections at the bottom of Figure 2.6.
To continue loading:
1. Ensure that the lines for data columns at the top of the Import dialog box are as desired.
2. Click Next.
3. Change the names of the columns, if needed.
4. Click Import.
You are now greeted with the resultant data table, as shown in Figure 2.7.
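To make the idea of a delimited file concrete outside of JMP, the following is a minimal Python sketch of splitting tab-separated text into a header and rows. It is only an illustration and not how JMP performs its import; the sample text stands in for TextExample.txt, the column names Day and Meas are borrowed from the example, and the numeric values are invented.

```python
import csv
import io

# Hypothetical stand-in for TextExample.txt: two tab-separated columns.
# The numbers are invented for illustration only.
sample_text = "Day\tMeas\n1\t4.2\n2\t3.9\n3\t5.1\n"


def read_delimited(text, delimiter="\t"):
    """Split delimited text into a header list and a list of row lists."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[0], rows[1:]


header, records = read_delimited(sample_text)
print(header)   # column names taken from the first line
print(records)  # remaining lines, still stored as text
```

Every value is still text at this point, which mirrors why a data type must be assigned to each column after a text import.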
Figure 2.6 The Text File Import Dialog Box with All Fields Displayed

Figure 2.7 Data Table from Text Imported File

2.3 Extracting Data from Internet Locations
Data is not always conveniently in a spreadsheet or file. Frequently, data can be found on webpages, and thus you need to bring the data to the software. Various options exist to accomplish this task, including copying and pasting the data into a spreadsheet. However, this can be inefficient. To improve the process, you can directly load a webpage and extract its data using JMP. To collect data from the Internet in JMP, begin by selecting File > Internet Open.
The dialog box shown in Figure 2.8 will appear. Type or paste the URL into the field. For this example, consider the following URL:
This is a Wikipedia article that has data on road casualties in Great Britain from 1926 to 2015.
Figure 2.8 Internet Open

After you type the URL into the field, you must specify how it will be opened via the Open As drop-down menu. JMP supports three options, listed in Table 2.2. For this example, examine opening as data or as a webpage.
Table 2.2 Internet Open Data Options
| Option | Meaning |
|---|---|
| Data | JMP searches for tables on the URL and then presents options |
| Text | URL is opened as a text file |
| Webpage | URL opens in a built-in JMP browser; specific tables can then be selected |
2.3.1 Opening as Data
If you select the default option, JMP will open the webpage as data. A dialog box like that in Figure 2.9 will appear. This dialog box shows the various items JMP sees as tables:
1. Select the desired table or tables.
2. Click OK.
This example shows only the first available table selected. The resultant data table will appear, as shown in Figure 2.10.
Figure 2.9 Importing a Table When Opening as Data

Figure 2.10 Imported Data Table
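The notion of "searching a page for tables" can be sketched in Python with the standard library's HTML parser. This is a rough illustration under stated assumptions, not a description of what JMP does internally: the class name TableExtractor and the tiny page below are hypothetical, the cell values are placeholders, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> in an HTML string as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table being read, or None
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


# Hypothetical page with two small tables; the contents are placeholders.
page = """
<html><body>
<table><tr><th>Year</th><th>Value</th></tr>
<tr><td>A</td><td>10</td></tr></table>
<p>Text between the tables is ignored.</p>
<table><tr><th>Type</th></tr><tr><td>X</td></tr></table>
</body></html>
"""

parser = TableExtractor()
parser.feed(page)
tables = parser.tables
print(len(tables))
print(tables[0])
```

As in the JMP dialog box, the result is a list of candidate tables from which you would then choose the ones you want.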

2.3.2 Opening as a Webpage
After JMP opens the URL as a webpage, you can use the page (Figure 2.11) as you would in any browser. If you investigate this webpage in the JMP browser, you will find that there are two tables of possible interest on this webpage.
Figure 2.11 Wikipedia Page for Reported Road Casualties in the JMP Browser

After you become familiar with the webpage, perform the following steps to create a data table:
1. Select File > Import Table as Data Table.
2. A dialog box like that in Figure 2.9 will appear.
3. Select the desired table or tables.
4. Click OK.
If you select both options, JMP will instantiate two data tables: the table seen in Figure 2.10, and the table seen in Figure 2.12, which has additional information. However, you should not begin to investigate the data in Figure 2.12 yet. First, you should review the URL; if you examine the webpage as shown in Figure 2.11, you will find that the data in Figure 2.12 is for the year 2008 only. Second, the Ref. and Note columns of both Figure 2.10 and Figure 2.12 could be either very troublesome or very useful, depending on what you plan to do next.
Figure 2.12 Types of Casualties

2.4 Data Modeling Types
JMP data tables can contain columns with continuous values, categorical (nominal or ordinal) values, and additional types (such as pictures). How these columns are modeled within JMP relates to the data type and modeling type. A variety of available data modeling types and their associated symbols are presented in Figure 2.13, with Figure 2.13a showing modeling types for numeric data and Figure 2.13b showing additional modeling types. Basic data types include numeric, character (for text), row state (as for a column that shows the states of individual rows), and expression (such as pictures).
Figure 2.13 Data Modeling Types in JMP 13

The numeric data type should be used for continuous random variables, ordinal and nominal should be used for discrete values (both numeric and text), and none should be used for columns that fit none of these categories. Characters are frequently seen in categorical groups (e.g., a nominal column with gender) and with unstructured text (e.g., raw observational data from a lab technician). Expressions can be useful for contextual data (e.g., you want to include a picture from a stained slide to link the graphical data to the numeric results that you will analyze). Although the focus of this book will be on numeric data and nominal character data, an example of expression data appears in Section 2.4.1 using the built-in Big Class Families data set. Additional discussion of expression data is presented in Section 5.6.2 for the same data set. Discussion of editing and formatting column properties and data or modeling types can be found in Section 3.2.3.
2.4.1 Incorporating Expression and Contextual Data
You can increase the value of a data table by adding contextual information for each observation, such as a picture of each subject or a picture of the slide that is associated with a given sample. An example is presented in Figure 2.14 for the built-in Big Class Families data set, which includes pictures of the hypothetical students in the hypothetical middle school class under analysis.
Figure 2.14 Big Class Families Data Table

However, how to create such a data table might not appear straightforward at first. Two general approaches can be used to bring in such contextual information. The first is to open a webpage that includes figures associated with observations by using the Internet Open features (see Section 2.3.2). Alternatively, you might want to add pictures from a folder into the data table. To do this, you would create a data table and then add a column with the contextual data in the form of a picture. To add such contextual information:
1. Specify the column of interest as the Expression data type:
a. Right-click on the column.
b. Select Column Info.
c. Select Expression from the Data Type drop-down menu.
d. Click OK.
2. Drag and drop the desired figure to each individual cell.
Further editing and formatting of column properties and data or modeling type can be found in Section 3.2.3. In addition, incorporating such contextual expression data will be presented in Section 5.6.2 for the built-in Big Class Families data set.
2.5 References
Bihl, T. J., Young II, W. A., and Weckman, G. R. (2016). Defining, understanding, and addressing big data. International Journal of Business Analytics (IJBAN), 3 (2), 1-32.
Boehmke, B. (2016). Data Wrangling with R . Springer.
Lohr, S. (2016). For big-data scientists, janitor work is key hurdle to insights. New York Times . Retrieved from .
Marx, V. (2013). Biology: The big challenges of big data. Nature, 498 (7453), 255-260.
Chapter 3: Data Wrangling: Data Cleaning
3.1 Introduction
3.2 Tables
3.2.1 Stacking Columns
3.2.2 Basic Table Organization
3.2.3 Column Properties
3.3 The Sorted Array
3.4 Restructuring Data
3.4.1 Combining Columns
3.4.2 Separating Out a Column (Text to Columns)
3.4.3 Creating Indicator Columns
3.4.4 Grouping Inside Columns
3.5 References
3.1 Introduction
After data has been collected and imported into JMP, as described in Chapter 2, you would obviously like to analyze this data. However, raw data can have various issues that require cleaning. For example, a variable (a column in a JMP data table) might contain both haemaglobin and Haemaglobin. Although a person can tell that these are the same, computers pay attention to case and would treat them as separate groups. Thus, data cleaning (step 2 of the data-wrangling framework in Figure 2.1) must often be considered. This chapter will present various data-cleaning methods to sort data, combine vectors, and restructure data. JMP capabilities go beyond those presented, and thus this chapter is intended to be a springboard for helping you become familiar with these tools.
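The effect of case on grouping can be seen in a few lines of Python. This is only a sketch of the general problem, not a JMP procedure; the list of values is hypothetical.

```python
from collections import Counter

# Hypothetical column values; the two spellings differ only by case.
values = ["haemaglobin", "Haemaglobin", "haemaglobin"]

# Counting the raw strings treats the two spellings as different groups.
raw_counts = Counter(values)

# Lowercasing (and trimming stray spaces) before counting merges them.
clean_counts = Counter(v.strip().lower() for v in values)

print(len(raw_counts))    # number of groups before cleaning
print(len(clean_counts))  # number of groups after cleaning
```

Before cleaning there are two groups; after cleaning there is one group holding all three observations.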
3.2 Tables
All data that is processed in JMP is handled via tables. Tables can contain columns with continuous and categorical (nominal or ordinal) values. The best description of tables in JMP comes from an example. For the first example in this chapter, consider two variables: subject number and age for 366 subjects in the dermatological data set (Demiroz, Govenir, and Ilter, 1998), as available from the UCI Machine Learning Repository (Lichman, 2013). In this chapter, a subset of the data will be considered with two columns of data: subject number and subject ages. As originally organized, the ages are sorted based on the order number (1 to 366) in which they appeared in the original data set; however, organizing based on age makes it possible to more logically sort the data. To load this data set, begin by opening
Figure 3.1 shows the results of the data being loaded. Notice that there are four columns (data features) with the following names: SUBJ, AGE, SUBJ 1, and AGE 1. As initially considered, the SUBJ and AGE columns were split into two columns, each with 183 observations. For analysis, these columns need to be appropriately combined. When considering this data table, note that SUBJ is the subject number of each observation and ranges sequentially from 1 to 366. As initially considered, SUBJ's indices correspond with the row numbers in the table. Although this example was contrived, such examples can be seen in real-world data sets such as MNIST (LeCun, Cortes, and Burges, 1998) and in many data sets presented in another source (Hand, Daly, Lunn, McConway, and Ostrowski, 1994).
Figure 3.1 Example Data from Dermatological Data Set

On the left of the table are three small windows. The first, DermatologyAge, describes what files and scripts are associated with this JMP file. This currently considers only the data file itself and is described as a locked file.
The second window, seen in Figure 3.2, describes the types of data columns that are presented with symbols. These were discussed in Section 2.4. It currently indicates (4/1), which refers to there being four data columns and one column currently selected. If you were to select a column or single observation, this would change to indicate the number of associated columns selected. Inside this window are the column names with graphical annotation of the type of data feature. The SUBJ and SUBJ 1 columns are currently defined as nominal (with a symbol of red bars), and the AGE and AGE 1 columns are currently defined as continuous (with the symbol being a blue right triangle). If you right-click on either symbol, options to change the modeling type to nominal or ordinal appear. However, the SUBJ and SUBJ 1 columns cannot be changed to continuous as their initial data type is non-numeric. The next two sections cover concatenating columns and editing the modeling type.
Figure 3.2 Modeling Types of Data Columns in DermatologyAge Data

The final window contains properties for the rows of the data table. The total number of rows is the first quantity, 183. The subsequent quantities refer to the number of currently selected rows, labeled rows, and any hidden or excluded rows from analysis. Excluding, hiding, and labeling rows permits user-centric abilities such as considering the removal of selected points (e.g., outliers), sequestering points for training/testing, and labeling points for annotation. These abilities further allow you to use JMP for analysis without deleting rows from the data table.
3.2.1 Stacking Columns
To combine columns by stacking one on top of another, it is possible to copy and paste the desired cells. However, this would involve storing the copied cells on your computer's clipboard, which might be memory-intensive for a large data table. One alternative is to use the Stack tool available in the Tables menu:
1. Select Tables > Stack.
The dialog box shown in Figure 3.3 will appear.
Figure 3.3 Stack Controls in JMP

2. Click on SUBJ and SUBJ 1.
3. Select the Stack Columns button.
4. Clear Stack By Row (this would interleave the data).
5. Type DermatologyAge into Output table name.
6. Type SUBJ into the New Column Names field of Stacked Data Column.
7. Click OK.
The table should look like the data table in Figure 3.4. Notice that the title is DermatologyAge 2. This is because the original data table is already named DermatologyAge, and thus JMP will not use the name twice for opened files. It is also necessary to repeat the process exactly for AGE and AGE 1. Moreover, it is critical that you follow the steps identically for subsequent columns. For example, if you failed to deselect Stack By Row in the second iteration, you would inadvertently make the data table out of order.
Figure 3.4 Dermatological Data After Stacking SUBJ Columns

After following these directions, notice that the resultant data table, shown in Figure 3.5, includes a single column for SUBJ and a single column for AGE. However, there are two minor issues. First, there are additional columns named Label and Label 2. These are used to identify which SUBJ column each SUBJ number came from, and similarly for AGE. Second, the SUBJ column is still in a nominal modeling type format. The following sections show you how to delete columns and how to use the Column Properties platform to fix data property issues.
Figure 3.5 Dermatological Data After Stacking AGE Columns
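The difference between stacking end to end and stacking by row (interleaving) can be sketched in Python. This is an illustration of the two orderings only, assuming two columns of equal length; the function name stack_columns and the short lists are hypothetical and are not JMP code.

```python
def stack_columns(first, second, by_row=False):
    """Stack two equal-length columns into one list.

    by_row=False places all of `first` above all of `second`.
    by_row=True interleaves them: first[0], second[0], first[1], ...
    """
    if by_row:
        stacked = []
        for a, b in zip(first, second):
            stacked.extend([a, b])
        return stacked
    return list(first) + list(second)


# Hypothetical miniature of the SUBJ and SUBJ 1 columns.
subj = [1, 2, 3]
subj_1 = [4, 5, 6]

end_to_end = stack_columns(subj, subj_1)
interleaved = stack_columns(subj, subj_1, by_row=True)
print(end_to_end)   # [1, 2, 3, 4, 5, 6]
print(interleaved)  # [1, 4, 2, 5, 3, 6]
```

If SUBJ were stacked one way and AGE the other, each subject number would be paired with the wrong age, which is why the two passes must use identical settings.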

3.2.2 Basic Table Organization
Adding and reordering columns are two common tasks for spreadsheets. Although it is straightforward to do these in JMP, it is important to preserve the data structure since JMP is statistical software and not merely a spreadsheet. Although you can highlight, cut, and copy cells and columns, you can also initialize columns to random numbers and reorder as needed.
Reordering Columns
To reorder columns, drag and drop columns within the columns panel, as shown in Figure 3.2. Alternatively, use built-in tools:
1. Click on a given column name (e.g., AGE).
2. Select Cols > Reorder Columns > Move Selected Columns.
3. Click on To first.
4. Click on AGE.
5. Click OK.
Now AGE appears as the first column in the data table, as shown in Figure 3.6.
Figure 3.6 Dermatological Data After Reordering Columns

For the rest of this chapter, the data was reordered with SUBJ as the first column and AGE as the second column.
Deleting Columns
To delete an unwanted column, such as a blank one:
1. Right-click on a given column (e.g., Label).
2. Select Delete Column.
Figure 3.7 shows the result: the Label column has been removed.
Figure 3.7 Dermatological Data After Deleting Label Column

Alternatively, use the Cols menu:
1. Click on a given column or columns (e.g., Label 2).
2. Select Cols > Delete Columns.
The Label 2 column no longer appears in the data table.
Adding Columns
New columns can be added via a few different approaches in JMP. First, you can double-click on an empty column and it will be populated with blank (missing) values and a generic column name (e.g., Column 12). You can edit the column name and properties as described above.
You can also add columns with more control by using the Cols drop-down menu:
1. Select Cols > New Column.
2. A column properties box will appear, as shown in Figure 3.8, and you can specify the characteristics as above.
3. Change settings as needed (e.g., change the value in Number of columns to add if you want multiple columns).
4. Click OK.
One or more new columns will now appear, created with control over their location, types of values, and names.
Figure 3.8 New Column Commands

To add random values, possibly for sorting data randomly, in the New Column platform:
1. Select Initialize Data > Random.
2. Select the desired random number type (e.g., Random Integer was selected here to facilitate random sorting).
3. Change the random number settings as needed (e.g., 1 for minimum and 366 for maximum was used here since the data has 366 observations, as shown in Figure 3.9).
4. Make any other changes to the new column properties.
5. Click OK.
Figure 3.9 New Column Commands with Random Data Options

The resultant additions, shown in Figure 3.10, enable you to sort the data randomly, if you want. For such a process, follow the methodology in Section 3.3, The Sorted Array.
Figure 3.10 Dermatological Data with Added Random Column
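The idea of adding a column of random integers and then sorting by it can be sketched in Python as follows. This is a minimal illustration and not JMP code; the seed value, the ten-subject list, and the variable names are assumptions made for the example.

```python
import random

rng = random.Random(2017)  # fixed seed (arbitrary) so the sketch is repeatable

# Hypothetical miniature data set of 10 subject numbers.
subjects = list(range(1, 11))

# One random integer per row, drawn between 1 and the number of rows,
# analogous to a Random Integer column with minimum 1 and maximum n.
keys = [rng.randint(1, len(subjects)) for _ in subjects]

# Sorting the rows by the random key reorders the subjects.
# Python's sort is stable, so rows with tied keys keep their original order.
shuffled = [s for _, s in sorted(zip(keys, subjects), key=lambda pair: pair[0])]
print(shuffled)
```

Because random integers can repeat, tied rows stay in their original relative order, so this is not a perfectly uniform shuffle; in Python, random.shuffle is the usual tool when that matters.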

3.2.3 Column Properties
When loading data from spreadsheets, it is common to have formatting issues. Formatting issues, such as numbers having text properties when a table is read in, or placeholder values, such as NAN or MISSING, can make analysis difficult since one text string in a column will define the entire column as non-numeric. To view or change column properties, consider SUBJ first.
1. Right-click on SUBJ.
2. Select Column Info. Figure 3.11 shows the dialog box.
Figure 3.11 Column Details of SUBJ Column

Here you can change the name of the column (you can also do so by double-clicking on a column name), change the data type (choices include character, numeric, and row state), and change the modeling type (continuous, nominal, and ordinal). When you are handling numeric data that is misidentified as character data, change the data type to numeric and then change the modeling type to the appropriate setting (e.g., continuous for both columns of the DermatologyAge data table).
To change SUBJ to be continuous:
1. Click on the drop-down menu for Data Type.
2. Select Numeric.
3. Click on the drop-down menu for Modeling Type.
4. Select Continuous.
To make additional column properties available for viewing or changing:
1. Click on the Column Properties drop-down menu.
2. Select the desired property.
When you are finished with Column Info, click OK.
After following these directions, notice that the modeling type of SUBJ has changed to continuous and both data columns are continuous.
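The reason one placeholder string forces a whole column to be non-numeric, and what converting it involves, can be sketched in Python. This is a simplified illustration rather than JMP behavior; the placeholder set, the function name to_numeric, and the sample column are hypothetical.

```python
# Placeholder strings assumed to mean "missing" in this sketch.
PLACEHOLDERS = {"", "NAN", "MISSING"}


def to_numeric(column):
    """Convert a column of text to floats, using None for placeholder entries.

    Any other non-numeric text raises ValueError, which flags an entry
    that needs to be cleaned by hand.
    """
    converted = []
    for value in column:
        text = value.strip()
        if text.upper() in PLACEHOLDERS:
            converted.append(None)
        else:
            converted.append(float(text))
    return converted


# Hypothetical character column as it might arrive from a spreadsheet.
ages = to_numeric(["34", "NAN", "27", "MISSING", "8"])
print(ages)  # [34.0, None, 27.0, None, 8.0]
```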
3.3 The Sorted Array
The ordered array, or sorted array, refers to a data structure that has been organized in some manner. Common organization methods include sorting from smallest to largest of a specific variable or alphabetically (for categorical data). Sorting data is a primary task to help you become familiar with data. Although this is very helpful in understanding the data, you do not need to sort the data to compute statistical values or to produce a histogram.
For an example in sorting, consider the DermatologyAge data table, as above:
1. Open
2. Select Tables > Sort.
You are presented with the sorting dialog box shown in Figure 3.12.
Figure 3.12 The Sort Platform

3. Select AGE for By. This tells JMP which column of data to sort the table by.
4. Select Replace Table. This will replace the original table with an age-sorted table.
The sorting dialog box should appear as shown in Figure 3.13.
Figure 3.13 The Sort Platform, Set Up to Sort by AGE

5. Click OK.
After you've clicked OK, look at the data table, as shown in Figure 3.14. You can instantly tell that it has been reorganized because the SUBJ column is no longer organized sequentially starting at 1. In addition, all of the observations with missing AGE values have been sorted to appear first (rows 1-8); thus, the locations of the missing values were easily found. If you consider the nonmissing data, the first data value appears in row 9, which was SUBJ number 120, who has an age of 0. When JMP performs further analysis on the data, it ignores missing values and thus implicitly deletes them rather than imputing them (which is similar to Microsoft Excel).
Figure 3.14 AGE Sorted Dermatological Data
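The two behaviors described here, missing values sorting first and missing values being skipped in calculations, can be mimicked in a short Python sketch. It is illustrative only; the (SUBJ, AGE) pairs are invented and None stands in for a missing value.

```python
# Hypothetical (SUBJ, AGE) rows; None marks a missing age.
records = [(1, 55), (2, None), (3, 8), (4, 26), (5, None)]

# Sort key: rows with a missing age get (False, 0) and therefore come first;
# all other rows get (True, age) and are ordered by age.
by_age = sorted(
    records,
    key=lambda row: (row[1] is not None, row[1] if row[1] is not None else 0),
)
print(by_age)

# A summary statistic that simply skips the missing ages.
present = [age for _, age in records if age is not None]
mean_age = sum(present) / len(present)
print(mean_age)
```

The mean is computed from three ages rather than five rows, which is the sense in which missing rows are implicitly dropped from the calculation.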

3.4 Restructuring Data
In addition to the basic column operations, JMP enables you to heavily restructure a data table. For this example, consider the Consumer file included in JMP as a sample data set. This data set considers many fields. Of most interest will be the biomedically relevant columns related to teeth care for flossing and the columns related to brushing (which record when or whether a subject flosses and brushes). These columns are located near the middle of the data table.
3.4.1 Combining Columns
In some cases, numbers and variables are spread across columns after being loaded from the original file. To alleviate this, combine columns by selecting them and telling JMP how you want to combine them. Here is an example of combining columns:
1. Select the columns Floss After Waking Up, Floss After Meal, Floss Before Sleep, and Floss Another Time.
2. Select Cols > Utilities > Combine Columns.
3. Type in the new column name in the Combine Columns dialog box, as shown in Figure 3.15.
4. Determine the Delimiter you want.
a. The default of a comma will result in data that looks like 1,0,0,0, as shown in the Floss 2 column of Figure 3.16.
b. Having no delimiter will result in data that looks like 1000, as shown in the floss0 column of Figure 3.16.
5. Click OK.
The resultant column's appearance greatly depends on the delimiter used. For example, Figure 3.16 shows the result using a comma (consistent with a CSV file) or no delimiter. Although you might have a preference for one type (no delimiter might be helpful for some tasks), be careful because it would be difficult to separate a non-delimited column.
Figure 3.15 The Combine Columns Dialog Box

Figure 3.16 Combined Floss Columns
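Why a non-delimited column is hard to separate later can be shown with a small Python sketch. It is not JMP code; the function name combine and the example rows are hypothetical.

```python
def combine(row, delimiter=","):
    """Join the values of one row into a single string."""
    return delimiter.join(str(value) for value in row)


# Hypothetical row of four 0/1 flossing answers.
floss_row = [1, 0, 0, 0]
with_comma = combine(floss_row)        # "1,0,0,0"
without_delimiter = combine(floss_row, "")  # "1000"

# Without a delimiter, different rows can collapse to the same string
# as soon as a value has more than one digit.
ambiguous_a = combine([1, 10], "")
ambiguous_b = combine([11, 0], "")
print(with_comma, without_delimiter, ambiguous_a == ambiguous_b)
```

For single-digit 0/1 values the non-delimited form can still be split character by character, but that no longer holds once values vary in width.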

3.4.2 Separating Out a Column (Text to Columns)
The opposite action might be necessary. For example, you might load in a data table and see that a variable has many comma-separated fields. To break such a column out into separate columns:
1. Select the columns of interest (e.g., Floss 2 from Section 3.4.1).
2. Select Cols > Utilities > Text to Columns.
3. Specify the delimiter.
4. Click OK.
In this example, four columns will now be extracted. These are currently text columns; if the underlying data is numeric, this could easily be changed by changing the column properties.
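The reverse operation, splitting a delimited column into separate columns and then converting the text to numbers, can be sketched in Python. This is an illustration under the assumption of comma-delimited text; the function name text_to_columns and the sample values are hypothetical.

```python
def text_to_columns(values, delimiter=","):
    """Split each delimited string and return the pieces as separate columns.

    Rows with fewer pieces are padded with empty strings so that every
    output column has one entry per input row.
    """
    split = [value.split(delimiter) for value in values]
    width = max(len(parts) for parts in split)
    padded = [parts + [""] * (width - len(parts)) for parts in split]
    return [list(column) for column in zip(*padded)]


# Hypothetical combined column with two rows.
text_columns = text_to_columns(["1,0,0,0", "0,1,0,1"])
print(text_columns)  # four columns, still stored as text

# Converting the text to integers, analogous to changing the data type.
numeric_columns = [[int(cell) for cell in column] for column in text_columns]
print(numeric_columns)
```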
3.4.3 Creating Indicator Columns
For methods where a numeric value is easier to interpret than a categorical or text value, you might want to create an indicator column (a column of 1s and 0s). Here is an example of creating an indicator column:
1. Select the columns of interest (e.g., I come from a large family).
2. Select Cols > Utilities > Make Indicator Columns.
3. Select the indicator column features needed, as shown in Figure 3.17. Append Column Name is very useful and will include the original column name to help bookkeeping later.
4. Click OK.
Figure 3.17 The Indicator Column Dialog Box

How the resultant columns will look depends on the options selected, as shown in Figure 3.17. For example, Figure 3.18 shows the resultant two indicator columns (one where a 1 indicates the original column said Agree, the other where a 1 indicates the original column said Disagree). These can now be used when one needs a numeric indicator rather than a categorical indicator. After you select Append Column Name in Figure 3.18, the data table appears as seen in Figure 3.19, with indicator columns for both agreement and disagreement (binary values of 0 or 1) with the I come from a large family variable.
Figure 3.18 The Make Indicator Dialog Box

Figure 3.19 Data Table with Indicator Columns
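What an indicator column contains can be sketched in Python. This is an illustration only; the function name make_indicators, the sample answers, and the underscore naming pattern for the new columns are assumptions, and JMP's own column naming may differ.

```python
def make_indicators(values, column_name):
    """Build one 0/1 list per distinct level of a categorical column.

    The dictionary keys prepend the original column name to each level,
    in the spirit of an "append column name" option (naming is hypothetical).
    """
    levels = sorted(set(values))
    return {
        f"{column_name}_{level}": [1 if value == level else 0 for value in values]
        for level in levels
    }


# Hypothetical answers for three respondents.
answers = ["Agree", "Disagree", "Agree"]
indicators = make_indicators(answers, "I come from a large family")
for name, column in indicators.items():
    print(name, column)
```

Each row has a 1 in exactly one of the indicator columns, so with two levels the two columns carry the same information and often only one is needed in a model.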

3.4.4 Grouping Inside Columns
In some cases, case (upper/lower) issues, similar responses, and other differences result in there being many similar values in a data column. For example, you can examine the last column, Reasons Not to Floss, of the Consumer data. If you examined the data table, you would see that this column has a modeling type called Unstructured Text and that there are 398 unique entries in a data set that has 448 rows. To condense this data column into a more parsimonious set of outputs, examine the data with the Recode platform:
1. Select the columns of interest.
2. Select Cols > Recode.
3. The Recode dialog box will appear, as shown in Figure 3.20.
At the top of Figure 3.20, notice that there are counts of each unique response in both the Original (Old Values) column and the New Values column. Initially, both columns have the same count of unique values, 398 for this data.
Figure 3.20 The Recode Dialog Box

To analyze the data:
1. Click the red triangle next to Reasons Not to Floss.
2. Select Group Similar Values.
3. The dialog box as shown in Figure 3.21 will appear.
4. Depending on the nature of the data, you might want to change some settings. For this example, you can safely leave all options checked.
5. The difference ratio is the maximum proportional difference between responses that JMP is allowed to group. The default is 0.25, which means that JMP will group responses that differ by up to 25%.
6. Click OK.
Figure 3.21 The Grouping Options Dialog Box

At this point, notice that very similar responses have been grouped together, as shown in Figure 3.22, and that you have 345 unique responses, with the grouped responses highlighted in gray. If you were to change the difference ratio to 50%, you would have 293 unique responses. Continuing this process, you could find a good balance of difference ratio and grouping options. Thus, you would want a reasonable difference ratio, one that benefits the final analysis (by leaving a sufficiently small number of responses to understand) while not losing resolution in the data (by keeping a sufficiently distinct set of responses).
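The trade-off controlled by a difference threshold can be explored with a rough Python sketch. This is not JMP's algorithm: it uses the similarity ratio from Python's difflib as a stand-in measure of difference, a simple first-match grouping rule, and hypothetical responses, so the numbers it produces are not comparable to the counts reported by Recode.

```python
from difflib import SequenceMatcher


def difference(a, b):
    """Return a 0-to-1 difference between two strings, ignoring case.

    This is 1 minus difflib's similarity ratio (an assumed stand-in,
    not the measure JMP uses).
    """
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(values, max_difference=0.25):
    """Map each value to the first earlier value that is close enough."""
    representatives = []
    mapping = {}
    for value in values:
        for representative in representatives:
            if difference(value, representative) <= max_difference:
                mapping[value] = representative
                break
        else:
            # No close match found, so this value starts a new group.
            representatives.append(value)
            mapping[value] = value
    return mapping


# Hypothetical free-text responses.
responses = ["no time", "No time", "no time.", "hurts my gums"]
mapping = group_similar(responses)
print(mapping)
print(len(set(mapping.values())))  # number of groups after recoding
```

Raising max_difference merges more responses into fewer groups, while lowering it keeps more distinct responses, which is the balance described above.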
To complete the example:
1. Click Done.
2. Select the type of output. In general, you would want one of the following:
New Column to place a new column next to the original column
In Place to overwrite the original column
3. This column will now appear in the data table.
Figure 3.22 Initial Recoded Results

3.5 References
Demiroz, G., Govenir, H. A., and Ilter, N. (1998). Learning differential diagnosis of Erythemato-Squamous diseases using voting feature intervals.
