Business Statistics Made Easy in SAS
364 pages

Learn or refresh core statistical methods for business with SAS® and tackle real business analytics issues and techniques using a practical approach that avoids complex mathematics and instead employs easy-to-follow explanations.
Business Statistics Made Easy in SAS® is designed as a user-friendly, practice-oriented, introductory text to teach businesspeople, students, and others core statistical concepts and applications. It begins with absolute core principles and takes you through an overview of statistics, data and data collection, an introduction to SAS®, and basic statistics (descriptive statistics and basic associational statistics). The book also provides an overview of statistical modeling, effect size, statistical significance and power testing, basics of linear regression, introduction to comparison of means, basics of chi-square tests for categories, extrapolating statistics to business outcomes, and some topical issues in statistics, such as big data, simulation, machine learning, and data warehousing.
The book steers away from complex mathematical-based explanations, and it also avoids basing explanations on the traditional build-up of distributions, probability theory and the like, which tend to lose the practice-oriented reader. Instead, it teaches the core ideas of statistics through methods such as careful, intuitive written explanations, easy-to-follow diagrams, step-by-step technique implementation, and interesting metaphors.
With no previous SAS experience necessary, Business Statistics Made Easy in SAS® is an ideal introduction for beginners. It is suitable for introductory undergraduate classes, postgraduate courses such as MBA refresher classes, and for the business practitioner. It is compatible with SAS® University Edition.



Publication date: October 30, 2015
EAN13: 9781629600444
Language: English


The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. Business Statistics Made Easy in SAS®. Cary, NC: SAS Institute Inc.
Business Statistics Made Easy in SAS®
Copyright © 2015, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-62959-841-3 (Hard copy)
ISBN 978-1-62960-044-4 (Epub)
ISBN 978-1-62960-045-1 (Mobi)
ISBN 978-1-62960-046-8 (PDF)
All Rights Reserved. Produced in the United States of America.
For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
October 2015

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
Last updated: April 18, 2017
Introduction to the Book

Welcome to Business Statistics Made Easy in SAS by Professor Gregory John Lee. This book is a breakthrough in business statistics learning, with a fresh new approach to explaining and teaching the exciting area of data analysis.
This book is designed as a user-friendly, practice-oriented text to teach businesspeople, students and others core statistics concepts and applications.
Business Statistics Made Easy in SAS steers away from complex mathematical-based explanations, and also avoids basing the explanations on traditional concepts such as distributions, probability theory and the like, which tend to lose the practice-oriented reader.
Instead of these traditional approaches, this book employs many features that have proved successful in a great number of MBA and other classes, some of which are completely innovative. These features include the following:

Unique templates for understanding statistics : There are several chapters that present a process template overview of statistical thinking; in other words, they attempt to show the reader how the general process of statistical thinking works. These chapters are a completely fresh way to teach people how statistics works as a whole. Chapter 2 gives a process overview of statistics in general; Chapter 11, an overview of statistical analysis as a process; and Chapter 12, an overview of how to analyze a given statistic after it is generated. These chapters are mostly innovative, and enable readers to grasp the statistical method as a whole rather than learning it in the traditional way, in which a person is expected to learn individual techniques and concepts, and then piece them together to form the bigger concept of the statistical method.

Extrapolation of statistics to business strategy, financial impact, and problem solving : I believe that one of the weaknesses of traditional statistics texts, from the point of view of the average business reader, is the lack of application beyond the statistics itself. I attempt to overcome this through two completely innovative chapters that take statistical outcomes and extrapolate findings to much broader implications, such as the profitability of business cases. This allows readers to answer the question “Why should I care?” that so often plagues statistics courses. I also take this approach extensively in practice questions.

Binding case : At the beginning, in Chapter 1, I build a central, illustrative statistical case that most of the rest of the text builds on and uses throughout for illustration. This helps to focus the discussion and ground readers in a well-understood context.

Practicality through SAS® Studio or SAS®9 : The core idea behind the text is to focus on practical implementation through a specific statistics package, in this case SAS. The book has been written to work well for both the exciting new SAS Studio as well as SAS 9 (specifically I use SAS 9.4 or SAS 9.3). Extensive screenshots of each major step are provided to guide the user carefully through using the package, and the data and prewritten code used for every example are given with the textbook, which the reader can simply open and run to get the output. The reader can then usually change dataset and variable names to run the same programs on other data situations.

Non-technical exposition : There is very little mathematical development, aside from some limited development in appendices for the interested reader. The text is developed verbally in a logical and clear way, often using metaphorical analogies that help the reader to connect the statistics concepts to life examples with which he or she is familiar. I avoid starting off with hard-to-understand terminology; instead I ease the reader into it from the perspective of what the terminology is really trying to achieve.

Pictorial and metaphorical explanations : There is extensive use of unique pictorial and metaphorical explanations – not just diagrams but various different figures that help to explain the concepts. This proves exceptionally useful for teaching the less technical user, and adds completely new pedagogical features to many difficult sections, such as the one on the concept of power. I also include decision-making flowcharts for some techniques that have been hugely popular with readers.

Case vignettes : I also have other illustrative cases of actual statistics applications in real life to help readers see the applicability of the techniques.

Practice questions and datasets : I include a large number of practice questions and datasets. I also run online assessments that readers can take to self-assess or instructors might use to assess; these may be adjusted for the text.
The Book’s Use of SAS Programming
There are several analysis and statistics programs available, each with substantial merit, including but certainly not limited to SAS, SPSS, STATA, EViews, R, STATISTICA, and NCSS. Different disciplines tend to develop their own favorites. This book specifically uses SAS because I believe it to have enormous and world-leading merit.

A Quick Introduction to SAS

SAS is not just one program. Instead it is a family of programs that mostly draw on a powerful central program. In this book we will use SAS Studio or the latest SAS 9 releases, which are based on keyword-entry programming. Two of SAS’s other programs (SAS® Enterprise Guide® and JMP®) are point-and-click interfaces. However, SAS has many other programs as well, ranging from specialist time series analysis to big corporate data analytics to matrix programming to geographic information systems.

Why SAS?

Here are several reasons for using and choosing SAS.

SAS is the most powerful statistics package : There is literally no statistical or analytical technique that SAS cannot accomplish. First, SAS already has programs for most of the analytical techniques that students and practitioners can possibly need. Even if SAS lacks something, it can technically be programmed in directly by the user using modules such as Interactive Matrix Language (IML), although admittedly this takes a lot of technical ability. (I do it all the time.)

Specifically, compared to SAS, other packages lack too much of what we need : I like other packages too, and respect the popularity, tradition and offerings of products named earlier. However, these packages lack too many things that business students, professors and practitioners need. Business is widely multi-disciplinary, ranging from the financial and economic-type disciplines to more social science disciplines such as industrial psychology or marketing. Business requires wide-ranging statistical techniques. In addition, more mathematical types of techniques like linear programming should be available, along with data analysis, if disciplines like operations and logistics are to be served well. SAS has it all, or the ability to make it, whereas I know of no other package that integrates such complete data manipulation (including major functionality for accessing databases and manipulating datasets), so many statistical techniques (including cross-sectional techniques, full time series modules, structural equation modeling, classification models and many others), and such powerful mathematical techniques (such as matrix programming, operational research, etc.). I should also say that other programs inevitably have certain things that are more conveniently provided than in SAS. For instance, SPSS has bootstrapping at the click of a button. However, the overall loss from using other packages is simply too high.

SAS caters for all levels and types of users : There is a misconception that SAS is difficult to use, which stems from its programming roots. However, in reality, anyone can use the point-and-click interfaces such as SAS Enterprise Guide and JMP, which approximate the SPSS-type environment. The more serious users can try their hands at programming at various levels of complexity which, after all, is just a few keywords. In this book, I provide code files that eliminate the need for readers to learn programming from scratch anyway.

SAS is deeply embedded in the organizational world : On balance, SAS is the package of choice for serious corporate analysts, as well as those in other organizations such as government. There are various reasons for this: aside from the above, SAS is unparalleled in being able to draw on and work with database technologies; it has serious scalability; it is very stable; and it offers specialized business analytics solutions. If you are using this book as a teaching tool, part of your concern should be the employability of your graduates, in which case you want to align your teaching with the practices of external organizations. Other packages have powerful organizational penetration too, but SAS is simply more widespread and geared up for the big data era.

SAS skills are seriously marketable : A related note is that SAS skills add marketability to a CV. There is a thriving market for SAS analysts and, while I do not see all students or practitioners as being specialists, marketability nonetheless accrues on a level I have not seen with other packages.

SAS has great support : SAS has fabulous support structures including:

Massive reading resources : SAS has a large official knowledge library, both online and in print, supplemented by multitudinous unofficial online papers. This means that readers who need help can easily find a variety of examples, perspectives and the like. The SAS publishing arm provides by far the best practical set of books and manuals, eclipsing any other provider.

Great helpfiles : SAS has what I think are the best helpfiles in many cases, including many worked examples for every technique.

Serious e-learning : SAS has major online e-learning streams. This is a unique resource that allows professors to co-teach the more technical aspects of certain courses, and get external examination.

Great teaching resources : SAS provides fantastic teaching resources (e.g. slides, manuals, case studies, etc.).

Online and other communities : SAS hosts fantastic online communities that can help solve problems, as well as frequent high-profile conferences and events.
I note as an aside that it is the lack of such resources that hampers many other providers. The stark example here is the freeware market: these programs are free and popular with academic statisticians as a result, but there is still lamentably poor support around their use. SAS itself has now released a major freeware product, SAS® University Edition, which is backed by its formidable support. Readers need all the support we can muster.

A Note on Learning Statistics through SAS Programming

There are two predominant ways to run SAS:

You can use various point-and-click windows to perform tasks in SAS. This is relatively simple to use, and favored by most non-technical people. If you were using the point-and-click options, you could, for instance, open and use SAS Enterprise Guide instead of programming in SAS.

SAS, notably Base SAS®, often uses programming code to input keywords that tell it what dataset to analyze, which variables to analyze, and what statistical analysis to do on these variables.
While point-and-click options are easier at first and more popular, there are several reasons why I base this book on the programming code of SAS:

Point-and-click is very limiting : All point-and-click programs, from SAS Enterprise Guide to SPSS, are very limited in what they have been programmed to do, with a large amount of the statistical universe either left out altogether or constrained to simple versions. While this may seem alright to a beginner for whom the starter topics are well covered in point-and-click programs, as soon as you or your organizational colleagues start to get more advanced, the limited versions become constricting and problematic. The simple fact is that users with expanded needs often then have to start accessing the programming back-ends of such programs anyway! While this may change in the future, starting your journey in SAS with easy-to-use programming code that has been written for you is the best way to make sure that later you are ready to go further on your own.

Programming is efficient : The programming code input method is very efficient and advantageous. It is far quicker than using point-and-click and takes far less memory, and also you can save programming code for later use. Finally, point-and-click takes a lot of time to go through if you are in a classroom teaching situation, whereas opening and running a programming code file is quick.

Saving and re-using programming code : You can save the keywords you used in a programming code file and re-use them time and time again (see for instance the programming code files in the “SAS Code” folder). Generally, once you have the keywords you like to use, the only thing you have to do is change the names of the variables.
This book mostly uses programming code. Because of the advantages of programming code, it is this type of input method I will mostly use and teach in this book. You will not have to learn what keywords to use: I will give you programming code files (see the “Textbook SAS Materials” folder), and each time we run an analysis you will be directed to open and run a pre-existing file.
If these reasons for using the programming-based SAS do not seem as attractive as using point-and-click, then consider looking at SAS Enterprise Guide as an alternative option.
Additional Book Resources

This book comes with substantial additional resources available through the book’s website, including:

Microsoft PowerPoint slides : Well-formulated and animated slides are provided for each chapter.

Textbook datasets and code files and programs : The textbook comes with pre-prepared datasets and program files. Each chapter is accompanied by prewritten code files which the reader can simply open and run to get the output. The reader can then usually change dataset and variable names to run the same programs on other data situations. In addition, there are two types of programs provided:

Bespoke programs and macros written for the book : Some of the code files provided are an exciting set of programs that I wrote exclusively for this book to help readers get directly to the types of outputs they really want. For instance, in the regression chapters I provide macros in which you have to put only the names of the dataset and variables and a few other details to get advanced, formatted regression outputs (including nifty additions like p-value stars), robust regression including comparisons to the normal model, and bootstrapped regressions, again including comparisons to the normal model.

Usual SAS programs in which the reader can simply manipulate the dataset and variable names.

Extensive exercise and exam questions : There is a large bank of extensive and mixed-format questions for most chapters, including multiple choice, theory questions, longer and shorter questions based on SAS outputs, and longer or shorter questions based on datasets and instructions for SAS analysis.

Additional online chapters : There are several extra chapters that are not printed with the book but that are available online with the book materials, including advanced topics on regression (hierarchical regression, mediation, moderation, and non-linear regression).
About the Author

Professor Gregory Lee is currently the Research Director and an Associate Professor in Research Methodology and Decision Sciences at the AMBA-rated Wits Business School.
He has published prior books on human resources (HR) metrics, and has many article publications in the international arena such as the Human Resource Management Journal, European Journal of Operational Research, Scientometrics, Journal of Business-to-Business Marketing, The International Journal of Human Resource Management, International Journal of Manpower, Review of Income & Wealth, Journal of Human Resource Costing & Accounting and many others.
He focuses on issues in human resource management, notably HR metrics (in which he has established himself as a leading expert) and other areas such as training, employee turnover and the employee-customer link.
He has served in many capacities within the international academic field. He has sat on the Graduate Management Admissions Council (GMAC©) advisory council and the editorial board of the Journal of Organizational and Occupational Psychology, and he reviews frequently for many journals.
In addition, he is a well-known consultant, writer and speaker in the corporate and practical management arenas, notably in the area of HR metrics, but extending to other areas such as human resources strategy and foresight.
Thank you to my wonderful wife Sayora, whose steadfast patience and support make her a midwife to this project. Further thanks go to Murray de Villiers of SAS South Africa for brokering the book and supporting the courses through which the material arose. As with all such books, the constant probing and constructive inputs of my students have been the book’s canvas. Further thanks go to my editorial team at SAS Publishing, and my family who helped in many ways.
Chapter 1: Introduction to the Central Textbook Example

The Company
Current Research Needs of the Company
Your Brief for the Case Example
Extended Analytical Skills Needed in the Project

This book begins with a fictional example that will be used as an illustration throughout the text. The following sections discuss the case and introduce some of the data, which will become familiar to the reader as we navigate many different business analysis examples through the book.
The Company

Your company is Accu-Phi, a new but burgeoning player in the corporate accounting software space.
Accu-Phi started out with a software offering that launched with a widely popular freeware version which attracts advertising revenue. Later, the company produced a premium version for an annual fee, which offers enhanced features and support. Initially, marketing the software and running a sales and support call center was a big focus for the firm.
The latest evolution in the company is a plan to offer consulting and training to existing customers, which we will generally call “services.” So far, the company has instituted a pilot services offering in one of its territories. The intention of the pilot program is to assess the potential of making consulting and training a major part of the overall business plan, with a nationwide rollout. This strategy comes with a substantial increase in costs, taking the company from a relatively small development firm with centralized staff to a multi-city, full services offering.
Various research projects will help the company in the coming period. One of these is explained next.
Current Research Needs of the Company

The Company’s Current Research Project

Introduction to the Company’s Project

In the book, we will focus on one major research project for Accu-Phi. Accu-Phi has now been selling services to the pilot group for over a year. Some 279 customers have bought services from the pilot territory. Among this group, you are interested in explaining and even predicting first year services sales in dollars, largely because as a business you want to maximize sales.
Based on your studies or business background, you have used theory and experience to choose other things that you think might explain, predict or at least associate with sales. The next sections discuss these.

Other Variables that Explain, Predict or Associate with Sales

We also have data on the following:

License - is a description of what license the customer has, where the data would either be “Freeware” or “Premium.” You believe that premium software customers should buy more services.

Size - is a description of the size of the customer by turnover. You categorize this description into three broad levels, namely “Small,” “Medium,” or “Big” based on certain thresholds. Your initial belief is that bigger customers will buy more services.

Trust - refers to the trust the customer has in your product and company. You have measured trust through four questions in an online survey that you sent to the key account holder at the client. Various areas of management theory, past research and your experience suggest greater trust may lead to greater sales, all else being equal.

Customer satisfaction - is also seen as important, and is also a core concept in previous theories and studies. You also measure satisfaction through four questions in a paper-based customer survey. Again, various areas of theory, research and your experience suggest greater satisfaction may lead to greater sales, all else being equal.

Enquiries - refers to the average number of enquiries about the core software product logged with the call center or online help by customers, per month, since starting use of the product. This data is provided by your in-house customer relationship management (CRM) data system. You believe that more enquiries may associate with higher sales, because customers who make more enquiries are more engaged with the software and perhaps more needful of consulting and/or training to help them with their requirements.
Figure 1.1 shows the first twenty rows of an initial dataset, as it might appear in a data sheet (we will use this data in SAS analyses throughout the book). Here, we have limited the data to customers who have bought services.

Figure 1.1 First lines of initial dataset

Notes: This table is not all the data: I have omitted most of the Trust and Satisfaction variables to enable the table to fit the page. You will see the whole dataset when we open the SAS data file later in the text. Note also that there are mistakes in the data that we will detect and clean later.
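To make the shape of one row in this dataset concrete, here is a minimal sketch of a single customer record. It is written in Python purely for illustration and is not one of the SAS files supplied with the book. The field names, the idea of summarizing the four Trust and Satisfaction questions by a simple average, and all the values shown are assumptions made for this example, not details taken from the book's data.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical layout for one Accu-Phi services customer.
# Field names and the survey scoring are illustrative assumptions.
@dataclass
class Customer:
    license: str               # "Freeware" or "Premium"
    size: str                  # "Small", "Medium" or "Big"
    trust_items: list          # answers to the four trust questions
    satisfaction_items: list   # answers to the four satisfaction questions
    enquiries: float           # average enquiries per month
    sales: float               # first-year services sales in dollars

    def trust_score(self):
        # Assumption: summarize the four items by their simple average.
        return mean(self.trust_items)

    def satisfaction_score(self):
        # Same assumed scoring rule as for trust.
        return mean(self.satisfaction_items)


# Invented example values, not rows from the book's dataset.
example = Customer("Premium", "Medium", [4, 5, 4, 3], [3, 4, 4, 5], 2.5, 81000.0)
print(example.trust_score(), example.satisfaction_score())
```

Keeping the individual survey answers alongside the summary scores mirrors the note above that the full dataset holds several Trust and Satisfaction variables rather than a single column for each.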
Because sales is your key focus in this study, these other variables will mostly be seen as issues that can help explain or predict sales.
Figure 1.2 shows a simple model in which all the other explanatory variables mentioned above explain or predict sales.

Figure 1.2 The variables in the case study explaining/predicting sales

There are other issues that the case will consider, such as:

Analyzing the customer base by types of licenses and sizes

Relating the other variables (for instance, does higher satisfaction lead to higher trust?)

Focusing on differences in core variables (like sales) between sizes and licenses.
Your Brief for the Case Example

Let us say that your CEO calls you in one day and says the following to you:
“I want you to analyze the data you have on the first year of the pilot service offering. I have a few questions in mind that I think will help guide us towards what to do next. Could you write a report for me on them? They are:
First, how did the first-year sales go? The board suggested a good benchmark for successful first-year sales would be $75,000 per customer. How did we do against that?
Second, are our customers satisfied with the product? I know you have some recent data on this. I want us to aim for an average customer who indicates satisfaction.
Third, to what extent do they trust us? Again, I want to aim for pretty high trust levels, say 75% trust levels.
Fourth, how many enquiries did customers make? I’m wondering about whether to spin off a separate enhanced customer relationship management project, and this data will help me to know what to do about that.
Fifth, do sales, satisfaction, trust or enquiries differ depending on whether the customer has a premium or freeware contract? I’m pretty sure premium customers buy more, and it would help us to guide our marketing efforts if we knew the high-value segments.
Sixth, do sales, satisfaction, trust or enquiries differ depending on customer size? Once again, we’re thinking it’ll help with market segmentation.
Seventh, what is the distribution of licenses between the levels of size? I’m pretty sure the bigger companies buy more premium licenses, but I want the most up-to-date data.
Finally, and most importantly, are services sales seemingly substantially associated with any of the other variables? I specifically want to know what the drivers of sales are. Can we maybe move towards predicting sales levels?”
Throughout the rest of this book, we will discuss business statistics techniques that can help us to answer the CEO’s questions.
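Several of the CEO's questions reduce to simple descriptive summaries. As a rough sketch only, the Python snippet below compares average sales with the $75,000 benchmark, breaks average sales down by license, and counts licenses within each size level. It is not the book's SAS code, the four customer rows are invented for illustration, and the helper name `mean_sales_by` is hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

BENCHMARK = 75_000  # first-year sales benchmark per customer, from the CEO's brief

# Invented rows: (license, size, first-year services sales in dollars).
customers = [
    ("Premium", "Big", 98_000),
    ("Premium", "Medium", 82_000),
    ("Freeware", "Small", 51_000),
    ("Freeware", "Medium", 65_000),
]


def mean_sales_by(rows, key_index):
    """Average sales for each level of the chosen grouping column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {level: mean(values) for level, values in groups.items()}


# Question 1: how did average sales compare with the benchmark?
overall = mean(row[2] for row in customers)
gap = overall - BENCHMARK

# Question 5 (sales only): do average sales differ by license?
by_license = mean_sales_by(customers, 0)

# Question 7: how are licenses distributed across size levels?
license_by_size = Counter((size, lic) for lic, size, _ in customers)

print(overall, gap, by_license, dict(license_by_size))
```

These summaries only describe the sample. Judging whether a gap or a group difference is large enough to matter calls for the effect size and significance ideas that the book introduces in later chapters.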
Extended Analytical Skills Needed in the Project

Extending your Skills

Merely analyzing the above data on its own is not necessarily enough. Such analysis will give us intermediate technical, statistical findings. However, these lack meaning without further assessment, interpretation and reporting, as discussed next. There are two major subsidiary tasks needed by the successful and well-rounded business analyst.

Reporting Skills

First among the additional skills is excellent reporting and packaging skills. The consumers of business intelligence and analytics are often non-technical managers, directors, teams, co-workers, or the like. These stakeholders may not be able to digest technical statistical findings, or may not have the time. Your CEO in the above example, for instance, might only give you 5 minutes at an executive or board meeting to discuss your findings. You’ll need to reduce and package them appropriately.
One of your essential tasks is therefore to learn how to package data findings in a way that is most appropriate to the reader. This may require additional skills over and above merely analyzing data, such as graphing, dashboarding, learning to produce HTML or other formatted reports, and many others. Luckily, SAS – the analytics suite this book is based on – has a “black belt” in all these skills. See Chapter 9 and Chapter 16 for examples of more intuitive analysis.

Interpretation of Business Statistics

This supplementary skill speaks to the ability to interpret statistical results so that they take on greater business meaning, perhaps helping to guide strategy, financial decisions, and the like.
I have a saying for business statistics, which goes something like:
Great business analytics is not about numbers. It is about words: the explanations, interpretations, strategies and decisions that spring from the technical data analyses.
Unfortunately, too many business analytics projects stop at the technical statistical findings, and the analysts lack the skills or confidence to extrapolate the findings further to business implications.
Many data analyses can have varied practical interpretations depending on the way they are presented. Adding a benchmark, as the CEO did a few times earlier in this chapter, immediately places a statistical finding in the context of that benchmark.
This book has a dedicated chapter on extrapolating business analytics to business outcomes, specifically financial outcomes. See Chapter 17 for more on this. Such skills transform average business analysts who can speak “statistics” but cannot speak “business”, into indispensable strategic powerhouses.

Data Architecture Skills and Knowledge

Data gathering and storage is often a significant challenge in a company. Data can come from different sources, be dispersed around the company or outside the firm, exist in varied formats, arrive in huge volumes or at high speed, and so on.
Various solutions have been developed to help guide and organize the information architecture of the firm. The well-rounded data analyst will know about and be able to work with various data sources and solutions. The final chapter in this book talks about such issues and skills, including big data, data warehousing, machine learning and algorithms, and others.
Now that you understand the core textbook case, turn to the next chapter for a broad overview of the general statistics process.
Last updated: April 18, 2017
Chapter 2: Introduction to the Statistics Process

Introductory Case: Big Data in the Airline Industry
Introduction to the Statistics Process
Step 1: Your Needs & Requirements
Step 2: Getting Data
Step 3: Extracting Statistics from the Data
Step 4: Understanding & Decision Making
Summary: Challenges in the Statistics Process
Advice to the Statistically Terrified
Introductory Case: Big Data in the Airline Industry

The airline industry is an extraordinarily difficult one in which to operate. Competition is high, profit margins are tight, and there are constant risks to revenue (ranging from fuel price fluctuations to terrorism to volcanic ash!). Any edge in conducting business can add substantial value to an airline.
Optimization of certain processes using data and statistics has become a substantial competitive tool in the airline industry. For example, statistics have been used to build models that can:

Improve the efficiencies of flight plans, allowing the firms to cut costs (e.g. Alaska Airlines, 2013).

Predict the future value to the company (customer lifetime value) of passengers. For example, Malthouse & Blattberg (2005) used SAS® software to try to predict the 20% most valuable customers, which would allow airlines to target marketing campaigns.

Predict no-shows on tickets, allowing airlines to overbook efficiently. I do note that this is not popular with us passengers when it goes wrong, and may fall afoul of some laws in certain countries. However, in one example, Lawrence, Hong and Cherrier (2003) found that their predictive model of no-shows could increase revenues in one test case by 0.4% to 3.2%.

Predict and improve the use of frequent flier programs (e.g. Berengueres & Efimov, 2014; Liou & Tzeng, 2010).
Frequent flier programs can be substantial tools for airlines in attracting and retaining customers. Having said this, they can be expensive, so assessing the relative gain/loss ratios of these programs is key.
In a recent example of predicting the value proposition of frequent flier programs, Berengueres & Efimov (2014) worked on Etihad Airlines data from 2008 to 2012, comprising approximately 1.8 million unique passenger flight records, demographics, frequent flier program transactions, and the like. This is already “big data” in that it is a dataset of substantial size, which grows and expands rapidly as data is added every day and with every flight. See Chapter 18 for more on the notion of big data.
In their study, Berengueres & Efimov (2014) demonstrate statistical methods that Etihad can use to predict key issues in the frequent flier program, such as whether new passengers are likely to join the program soon, and whether members are likely to upgrade to higher levels in the scheme, giving them more privileges. Etihad incorporated these predictions into Customer Relationship Management software, which guided interactions with the passenger.
In a massive improvement over previous predictions, the model accurately predicted 97% of new or promoted members when three months of data were used.
The model was finally used to create a computer app, which guides airline personnel in decisions such as whether to grant a given passenger an upgrade.
It is such clever links between data and real on-the-ground business decision making that can turn statistics from a boring subject to a serious strategic edge for organizations in the 21st century.
Introduction to the Statistics Process

Statistics often seems like an impenetrable jungle full of dangers and dinosaurs (statistics professors?). However, at the heart of most statistics is a relatively simple process, which anyone can understand and learn to apply so that the world of statistics becomes easier to understand.
Figure 2.1 (The fundamental process of statistics) displays this process in a pictorial format, and the following sections briefly describe each step.

Figure 2.1 The fundamental process of statistics

Step 1: Your Needs & Requirements

You start off with a need to do statistics, which may seem like a weird statement, but then again you’re reading this book... Perhaps your need arises from:

A specific problem or opportunity, say a business problem, like the need to test a potential pharmaceutical drug or to exploit a new opportunity. The Accu-Phi company, introduced in Chapter 1, has various business problems such as identifying and understanding their services sales numbers.

You may have a more general need to understand your particular world better (perhaps you are an insurance company that wants to analyze its claims data for any patterns it can find).

Perhaps you have a research report for a university degree or other need.
Often, your requirements help you understand the type of data you require, the statistics you need, and the like. All too often, businesspeople or readers sabotage their own research because they do not adequately understand what they want to achieve in the first place, causing them to gather the wrong data or do the wrong analyses. Therefore, try to have a really good idea of your needs prior to going any further. Identifying the right problem is a major up-front challenge.
Note that I am not saying your intended outcomes from a statistics project must be exactly defined. It is fine to say that you just want to explore your company’s data for useful patterns, or the like. However, knowing what you do and do not want to achieve will invariably help guide your efforts.
In addition, always think deeply about the context of the data situation. By context, I mean the environment of the situation (such as the culture of the people involved), the physical environment, and the like. Consider how the context affects the data involved, and whether this context gives you extra information about what or who to study.
For instance, say you are studying financial investment portfolios. Your study and expectations should be heavily affected by the broader financial or economic environment at the time. If there is a major financial and economic crash in the middle of the time period of your study, your entire thinking about the problem and possibly your data choices may change. In the case of Accu-Phi, you have already made some choices about what might affect sales – are there other aspects of the sales process or its environment that you have omitted?
Step 2: Getting Data

The Importance of Data in Statistics

Data is at the heart of statistics. Basically, data is information. It is any type of information about the things you are trying to analyze. It may be information about customers, or companies, or shares, like the Accu-Phi sales and other customer data. It may come to you in numbers, words, phrases, sentences, pictures, or other formats. If you can record it in a consistent and retrievable way, it is data.
For instance, say you are a manager of an automobile manufacturing plant. You might want to understand your production efficiencies better. You need information to do so, perhaps speed of production of each car produced, number of defects of each car, and the like. This is raw data.
Data gathering and cleaning is a huge step, because gathering the wrong information means you will get the wrong answers. (You have doubtless heard the expression “GIGO,” which stands for “Garbage in, Garbage out.” This is especially true in statistics, where wrong data means your study may well be nonsense.)
Continuing with the automobile manufacturing example, if you get inaccurate data on the speeds of production then any further analysis will have the wrong answers and your decisions will be made on this wrong information.
I discuss the critical data step in far more detail in Chapter 3 and Chapter 4. For now, I summarize the major data challenges as follows:

Data challenge 1: Focusing on the right observations (your population and samples). For instance, who or what are you studying? Which people, companies, and the like?

Data challenge 2: Choosing issues to analyze (constructs). Are you interested in demographics of people, profitability of companies, economic variables of countries? It’s important to pick the right constructs and constructs that really matter.

Data challenge 3: Once you have gathered your data, making sure it has been cleaned, that is, it has no major faults that could derail your analysis.
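To make data challenge 3 a little more concrete, here is a minimal sketch of a cleaning check. It is written in Python purely for illustration (the book itself works in SAS), and the car records, the field names build_hours and defects, and the rules for what counts as a fault are all hypothetical assumptions, loosely based on the automobile plant example above.

```python
# Hypothetical production records, invented for illustration only.
cars = [
    {"car_id": "A1", "build_hours": 18.5, "defects": 2},
    {"car_id": "A2", "build_hours": None, "defects": 0},   # value never captured
    {"car_id": "A3", "build_hours": -4.0, "defects": 1},   # impossible value
    {"car_id": "A4", "build_hours": 19.2, "defects": 3},
]


def problems(record):
    """Return a list of faults found in one record (assumed rules)."""
    issues = []
    hours = record["build_hours"]
    if hours is None:
        issues.append("missing build_hours")
    elif hours <= 0:
        issues.append("implausible build_hours")
    if record["defects"] is None or record["defects"] < 0:
        issues.append("invalid defects")
    return issues


# Records with at least one fault, keyed by car, and the records that pass.
flagged = {r["car_id"]: problems(r) for r in cars if problems(r)}
clean = [r for r in cars if not problems(r)]

print("Flagged:", flagged)
print("Clean cars:", [r["car_id"] for r in clean])
```

A check like this does not decide what to do with the faulty records; it only makes sure you see them before any statistics are extracted.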

Data Are Not Statistics

One thing that people often fail to understand at first is that data itself is not statistics, and data has limited use without statistics.
In the Chapter 1 Accu-Phi example, the company has a customer dataset describing demographics, sales, and other features of each customer. This is useful for sure: customer representatives can access a particular customer’s details when meeting him or her, which may lead to better interactions.
However, individual raw data does not really help us to answer the bigger and broader questions about our situation. As we’ll see in the next section, statistics are really summaries of the data. Statistics help to explore and describe many data points simultaneously, as a whole. This in turn allows for a more general view of whatever you are studying. For example, Accu-Phi wants a general view of what the data tells it about the drivers of sales as a whole, which could help it create more targeted and effective marketing initiatives. The statistics would be a few summary numbers that inform us about this concept or question.
Step 3: Extracting Statistics from the Data

Brief Introduction to Statistics

As discussed in the previous section, a dataset is not usually helpful on its own. The process is actually to extract from your data just a few representative numbers – we call these statistics – that tell us something about the data and therefore give us condensed information about the piece of world we are studying. Here are some examples:

An average: The simple average (say the average age of your employees) is a good example: the average is a single number summarizing all age data of all your employees.

Statistics that measure relationships between concepts: Other types of statistics summarize relationships between concepts. Figure 2.1 (The fundamental process of statistics) gives an example, where we wish to test to see if employee engagement (one concept measured by data) is related to customer retention (another concept and set of data). Perhaps you wish to see if higher employee engagement leads to higher customer retention. Just a single statistical number (or perhaps a few) can summarize this relationship between these two concepts.

Patterns over time: Perhaps you are interested in the movement of a single data variable over time, like the sales of a product over many months. All the sales figures may be boiled down to a few statistics that tell us the extent to which the sales figures have been growing, shrinking or staying stable, or perhaps statistics that summarize whether there are seasonal highs or lows in the data.
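The three kinds of statistics listed above can each be boiled down to a single number. The sketch below shows one way this might look, using Python rather than SAS purely for illustration. All the data values are invented, and the choice of a Pearson correlation for the relationship and a least-squares slope for the trend is my own assumption about suitable summary statistics, not a method prescribed by the text.

```python
from statistics import mean

# Hypothetical data, invented purely for illustration.
ages = [25, 32, 41, 38, 29]                        # employee ages
engagement = [3.1, 3.8, 4.2, 2.9, 4.5, 3.5]        # engagement score per branch
retention = [0.71, 0.78, 0.85, 0.69, 0.88, 0.76]   # customer retention per branch
monthly_sales = [100, 104, 109, 115, 118, 124]     # sales over six months


def pearson(x, y):
    """Pearson correlation: one number summarizing a linear relationship."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def slope(y):
    """Least-squares slope of y against time periods 0, 1, 2, ..."""
    x = list(range(len(y)))
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )


average_age = mean(ages)              # an average
r = pearson(engagement, retention)    # a relationship between two concepts
trend = slope(monthly_sales)          # a pattern over time (change per month)

print("Average age:", average_age)
print("Engagement-retention correlation:", round(r, 3))
print("Average monthly change in sales:", round(trend, 2))
```

In each case many data points go in and one number comes out, which is exactly the sense in which statistics are summaries of data.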
This is not the end of the process. Once you have extracted these statistics you need to assess two things about them.

Do We Trust the Statistics? Are They Accurate?

Just because you extracted a representative number from a dataset – such as an average – does not mean you should believe or trust the result. An important initial step in the statistical universe is scrutinizing the results for accuracy, even before we understand what the statistics mean. This is a topic that Chapter 12 addresses in detail.

What Do the Statistics Mean?

Having established that the statistic is accurate enough to trust, the major step is to establish what it means. Getting a representative statistic does not necessarily mean that you understand what it is telling you!
The skill of understanding what a statistic actually means is one many people lack even if they are good at the process of deriving the statistics. So what if you can find some statistical number that represents the effectiveness of your new drug on a disease? Do you really understand what the statistic is saying about the effectiveness? Can you think further about the potential demand for the drug, its profitability, its comparisons to other remedies, and what effect the drug would then have on people’s broader lives? Therefore, really understanding the meaning and impact of a statistical number is perhaps the most crucial skill of all.
Based on the above, there are two levels of understanding a statistic:

Understanding a statistic in its context: Every statistic needs to be assessed for meaning and impact in its own context. By this I mean asking questions like whether the statistic is big or not. This would have to be decided in comparison to some benchmarks. For instance, if you have a statistic informing you about how customers react to an advertisement, you need to ask whether the average reactions are good enough (big enough) or not. What is your benchmark? Is it no (zero) customer reaction to the ad? Is it reactions to your past ads, to your competitors’ ads? Your benchmark is highly contextual, and different benchmarks can affect your judgment of what the statistic means. Chapter 11 further addresses this skill of assessing the meaning of statistics.

Extrapolating the meaning of the statistic to broader contexts: Really good business statisticians are able to go beyond the statistic on its own. They are able to extrapolate the meaning to further contexts. For instance, continue with the example above in which statistics measure the impact of an ad on customers. Showing you have impacted customers is one thing. However, can you then take the changes in the customer and show how much profitability or return on investment the ad can then generate overall? If you are testing a drug, can you combine statistics on its healing efficacy with other information to extrapolate the ultimate impact on the buyer’s lifespan or quality of life? Can you again estimate return on investment? Chapter 17 discusses this in more detail.
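Both levels of understanding can be illustrated with a small worked example based on the advertisement case above. The sketch is in Python purely for illustration, and every figure in it (the response rate, the three benchmarks, the number of customers reached, the profit per response, and the ad cost) is a hypothetical assumption rather than data from the book.

```python
# Level 1: the same statistic judged against different benchmarks.
# All figures are hypothetical.
response_rate = 0.045  # share of customers who responded to the new ad

benchmarks = {
    "no reaction": 0.0,
    "our previous ad": 0.040,
    "competitor's ad": 0.050,
}

# Difference between our statistic and each benchmark.
lifts = {name: response_rate - value for name, value in benchmarks.items()}

for name, lift in lifts.items():
    verdict = "better" if lift > 0 else "worse"
    print(f"Versus {name}: {lift:+.3f} ({verdict})")

# Level 2: extrapolating the statistic to a financial outcome.
customers_reached = 200_000
profit_per_response = 12.0   # assumed profit from each responding customer
ad_cost = 60_000.0           # assumed total cost of the campaign

responses = customers_reached * response_rate
profit = responses * profit_per_response
roi = (profit - ad_cost) / ad_cost   # return on investment

print(f"Estimated return on investment: {roi:.0%}")
```

Notice that the same response rate looks good against two benchmarks and poor against the third, which is why the choice of benchmark deserves as much thought as the statistic itself.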
In addition, if you are a complete beginner in statistics, note the following.

The Process of Generating Statistics: Math Versus Computers

For centuries prior to the advent of computerization, if you wanted to take raw data and get statistics you needed to use mathematics and hard work. Underlying most statistics is some form of mathematics. I discuss this further in Chapter 11.
These days, luckily, we usually use computer programs to do the hard number crunching for us. Examples of such programs are SAS, SPSS, STATA, Statistica, NCSS, Microsoft Excel, and a great many more. These programs look at your raw data and – based on your telling them what you want – they tell you what the estimated statistics are, from averages to complex relational statistics.
As discussed in the preface to the book, this book uses SAS, which is one of the world’s most used statistics suites and is the leading player in business analytics particularly.
It is crucial to note that many statistics mistakes occur because people either give the computer program bad data or the wrong instructions, or because people do not adequately scrutinize the computer results. Please understand that the fact that the computer has given you a statistical result does not mean that it is the right output or in fact that the result should be taken seriously! The computer is just a tool, and cannot do the real work for you, like making sure your data is right, understanding which statistical tests are appropriate, and understanding the results and what to do about them. One of my favorite acronyms is “PICNIC,” which stands for “Problem In Chair Not In Computer.” So, try not to rely on the computer for more than it can do.
Step 4: Understanding & Decision Making

The final step is to decide what the statistics mean, to learn new things about the world in which you are interested, and to apply this knowledge to your decisions if necessary. Statistics based on share returns data might help you understand the stock market better, as well as help you make better investment decisions. Statistics about customers might help you to sell more and be more profitable. Examples abound.
One key is to remain humble and hard working in statistics. No single statistical study is definitive: the next time you try to study the same phenomenon you might get very different data and therefore very different results. You should therefore scrutinize your results for trustworthiness; be realistic about how much your results can generalize to future samples, timeframes and contexts; and perhaps participate in testing them again to see whether you get the same results again. Never make overly strong assumptions about what you have found, but be brave and take leaps where it seems appropriate. Chapter 17 discusses more on this topic of what to make of your findings.
Summary: Challenges in the Statistics Process

Figure 2.2 (Fundamental statistics process with challenges) below summarizes the above discussion regarding the essential steps and challenges in the basic process of statistics.

Figure 2.2 Fundamental statistics process with challenges

Advice to the Statistically Terrified

Some subjects of study are like eating a pie. You do it one bite at a time, each bite takes you a bit further, and eating half the pie means that you have absorbed half of it. Personally, I think business administration subjects are a bit like this. If you are studying the “four Ps” of marketing (Product, Price, Placement, Promotion) then you could read or understand only half of it and still have picked up various strands and bits that you could apply.
I don’t think mathematics and statistics are like eating a pie. It’s sometimes hard to feel like you’ve gotten anywhere at all even after you’ve been studying and reading for a while. My metaphor for studying these subjects is vending machines. You know how, when you wish to get something from a vending machine, your coin sometimes just won’t drop into the machine? Instead it keeps falling through into the return tray so that you have to either abandon the attempt or keep trying to put in the coin. Grrr …
Some people persevere. They keep trying to get the machine to accept the coin. Sometimes they try something different, like standing on the coin, licking it (hopefully not in that order), or scratching the coin against some metal. Some go through all the coins in their wallet or purse hoping for success.
I think math and stats are “penny dropping” subjects, like the coin in the vending machine. Studying an area in these subjects often draws a complete blank on the first attempt: you simply don’t understand anything at all despite hours of reading and a lecture. Worse: often the second and third attempts still leave you feeling completely lost. These subjects are not like pie-eating, where it’s a linear process and you can get it one little bit at a time.
However, the news is not bad! I believe that if you persevere and keep trying, eventually the “penny will drop” like the coin eventually being accepted by the vending machine. I think statistics and mathematics are subjects in which you can go from completely not understanding to completely understanding in one single “Eureka!” moment. The penny has dropped, and you finally get it.
Obviously not everyone operates like this – some people can still understand stats one bite at a time. However, I strongly believe that for the rest of us mortals the secret is to keep reading and studying – perhaps different views on the same subject – until the penny has dropped!
Chapter 3: Introduction to Data

Introductory Case: Royal FrieslandCampina
Brief Introduction to Samples, Populations & Data
Basic Characteristics of Variables
Introductory Case: Royal FrieslandCampina

Royal FrieslandCampina is a Dutch corporation specializing mainly in dairy foodstuffs. The group is one of the biggest and oldest dairy cooperatives in the world. With roots stemming back to 1879, the current company is a merger of two dairy giants (Royal Friesland and Campina), offering a wide range of products in more than 100 countries. As of 2013 the group enjoyed revenues of 11.4 billion Euros, and employed over 21,000 people.
A challenge for the company in the late 2000s was succession planning for key staff, especially since a large proportion of the workforce – in the order of two thirds – worked overseas. Succession planning (that is, planning who can replace key figures in a company should their jobs become vacant) requires various pieces of HR data. These include:

Employee records, which capture each employee’s position, experience, qualifications, and the like

Performance reviews of individuals over time and various key performance areas (KPAs)

Job information, such as job descriptions (what the job entails) and job specifications (profiles of the hypothetical person required to do the job)

Extra skills inventories related to individuals, such as training completed and the like
However, data capturing and collation was problematic for the company. Managers would capture information in different, unlinked sources, using software (such as Microsoft Word and Microsoft Excel) that is not suited to an integrated and multinational exercise of this sort. Information was often missing. This led to wholly inadequate data for succession planning, making it difficult and slow to accomplish human resource planning.
Elsbeth Janmaat, the management development manager, explained:
“We had a lot of information, but it was held in several sources. This made it difficult to use, and complex to update. Also, having disparate sources meant the data were fragmented, so we couldn’t see the complete picture.” (Pollitt, 2007: 21)
The group’s HR specialists realized that they required a specialized and integrated information system to collate and gather the required information. They settled on Cezanne, a tailored database system that they believed would be suitable for their needs.
Friesland wanted to look at career planning from both their own point of view (what important roles might need filling, and who would be best from their perspective) as well as the employees’ view (where did employees see themselves going, and what did they see as motivational movements?). Dave Pollitt commented:
“... the Friesland Foods HR team is able to record information about both people and positions, and look at them side by side, so that career and succession plans can be aligned with the ambitions of employees as well as the needs of the business.” (Pollitt, 2007: 22)
The software also facilitated reviews of managerial advancement, as well as reviews of the performance and competencies of individuals. The web-based nature facilitated easy access, and plans at the time were to extend the database across wider uses in the future. Overall, the success of the system demonstrates the power of well-designed and useful databases as a support for management decisions.
We see in the vignette above that data and data analysis can make a big difference in many business situations. This chapter will introduce the core types of data and the basic features important to using and collecting data.
Brief Introduction to Samples, Populations & Data

Initial Concepts

When you do a statistical analysis, you are gathering and analyzing information about things – specifically you generally gather information on a certain set of issues about some group of objects:

Group of objects (observations): We usually study some group of objects, such as a group of people (perhaps consumers, employees, hospital patients), companies (such as all the company customers of Accu-Phi in the central textbook example), countries (perhaps all the South American countries), and the like. We want our analysis to reflect accurately and usefully on this group of objects. We call the group of objects observations.

Set of issues (constructs & variables): The issues involved are usually important attributes of the group of objects. For example, in our case example from Chapter 1, our group of interest is customers, and we are interested in multiple attributes about those customers such as their size, satisfaction, trust, enquiries, sales, and so forth. If it is employees, perhaps we wish to know about their productivity, satisfaction, and turnover. We call these issues in which we are interested constructs and variables. A construct is the idea of the thing we are trying to analyze (e.g. business profitability is an idea about something that companies have). A variable is an actual measurement of a construct, for instance, taking actual accounting measurements of business profitability.
The following sections give initial points on each of these facets of statistical research, before we start talking about actual data.

The Observations We Study: Samples & Populations

We have stated that when we do statistics we usually study some group of interest, called our observations. Employees, customers, firms and the like can be observations. There are a few options open to us when we decide to study a group:

Studying a sample as a representation of a broader population: Sometimes, it is not an option to study the entire group of interest, because the entire group is simply too big – and therefore expensive and hard – to study all at once. Imagine trying to study every single finance firm in the country, or every employee, or every consumer of bread! We call the complete group of interest the population. Instead of trying to study the complete population, we often draw a sample of the complete group of interest, which is a smaller group that is manageable and that we believe adequately represents the overall population. For instance, instead of studying all consumers of bread we may study 25,000 consumers of bread split up over all the major geographical zones to get a group (sample) that is roughly representative of the overall population. Once we find certain things about our sample (for instance that bread consumption differs in certain ways by geography), we then infer that our sample-based statistical results are reflections of what happens in the general population. As another example, we might study hospital patients and infer that what we find in the sample we studied applies to all people with similar conditions and of similar backgrounds. This generalizability to the general population of interest is necessary so that your statistical findings are meaningful in the wider setting of your research.

Studying the entire group of interest: Sometimes we get to study the entire group of interest, although this is rare. Having said this, the big data era, driven by incredible computing power and collection methods, can increasingly allow you to study all observations. This is called a census. However, even if your group of interest is small enough to study as a complete set (perhaps it’s every diamond dealer and you can contact them all), you cannot currently study the group as it will exist in the future, and you may want your statistics to reflect truths that will remain true for the population in the future, so, in a way, the current group is still treated as a sample.
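The idea of a sample standing in for a population can be shown with a small simulation. This sketch uses Python purely for illustration, and everything in it is an assumption made for the example: the population of 100,000 bread consumers is simulated rather than real, weekly consumption is assumed to follow a bell-shaped distribution, and the sample is a simple random one rather than the geographically split sample described above.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: weekly loaves of bread bought by each of
# 100,000 consumers (simulated, not real data).
population = [random.gauss(3.0, 1.0) for _ in range(100_000)]

# Draw a much smaller simple random sample from that population.
sample = random.sample(population, 1_000)

population_mean = mean(population)   # what a full census would tell us
sample_mean = mean(sample)           # what the sample tells us

print("Population mean:", round(population_mean, 3))
print("Sample mean:    ", round(sample_mean, 3))
print("Difference:     ", round(abs(sample_mean - population_mean), 3))
```

In practice you would not have the population mean to compare against, which is exactly why the sample has to be drawn carefully enough that you can trust it as a stand-in.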

Forming Data Tables

Basics of Data Tables

As stated above, each person / customer / firm (or the like) on which you gather data is an observation. You gather multiple measurements from each observation, and each commonly-measured attribute is a variable. For instance, in the case study from Chapter 1, for each customer Accu-Phi has measured license type, size, sales, various trust measures, and the like.
Generally, if we are doing quantitative statistical analysis we seek to organize this data into a data table (“dataset”), in which each row is an observation and each column is a commonly-measured variable. Figure 3.1 (Example of a data table) shows a piece of such a data table.
The aim of most of this chapter is to discuss the characteristics of such data and the contents of such data tables. The next few sections briefly discuss generating datasets from the basic unit of a single piece of data to final datasets.

Figure 3.1 Example of a data table

Raw Data Records and Single Data Points

In its simplest format, data exists in what we call “raw” format. Simply, this is information that you have gathered from observations on the variables of interest, before that information is gathered into a data table where you can compare between individuals.
Take surveys for example. You may have sent out questionnaires asking people about certain issues. Say that you get a stack of completed surveys. This is data (information) but in raw format. It has yet to be captured and placed together in a data table. At present, it is just a collection of single data points.
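The step from a stack of raw records to a data table can be sketched as follows. The example is in Python purely for illustration, and the survey responses and the variable names respondent, age and satisfaction are hypothetical.

```python
# Hypothetical raw survey responses, each captured on its own.
# Note that fields arrive in no fixed order and one answer is missing.
raw_records = [
    {"respondent": 1, "age": 34, "satisfaction": 4},
    {"respondent": 2, "satisfaction": 5, "age": 51},
    {"respondent": 3, "age": 27},  # skipped the satisfaction question
]

# The columns (variables) we want in the finished data table.
variables = ["respondent", "age", "satisfaction"]

# Each row is one observation; each column is one variable.
# A missing answer is recorded as None rather than silently dropped.
data_table = [[record.get(v) for v in variables] for record in raw_records]

print(variables)
for row in data_table:
    print(row)
```

Once the records are lined up like this, with the same variables in the same order for every observation, you can start comparing between individuals.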

Numerical Versus Text (Character) Data

When you capture a single piece of information about something, you have a choice between capturing it as a number or a word.
Imagine, for example, that you want to capture performance appraisal scores for employees. Your performance appraisal system gives the supervisor a choice between the following: “Below average,” “Acceptable,” “Above average” and “Exceptional.” In 2015, you gave Doris an “Above average.” You could type “above average” into the cell. But, instead of doing that, you could decide that “Below average” = 1, “Acceptable” = 2, “Above average” = 3 and “Exceptional” = 4. Then you would type “3” into the cell for Doris. As long as you know what the “3” means, you can keep it as a number.
It is important to note that just because a piece of data is captured as a number does not mean it has numerical (i.e. mathematical, algebraic) properties. There is certain number-based data where the number used has no such numerical properties. For example, a credit card number has no true numerical properties (e.g. it cannot sensibly be added or subtracted from another credit card number). It is in fact more like character data than numerical – it is almost like a name.
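The appraisal coding and the credit card point can both be shown in a few lines. The sketch is in Python purely for illustration. The coding scheme and Doris’s rating come from the example above, while the second employee and the card number are made-up placeholders.

```python
# The coding scheme from the appraisal example above.
APPRAISAL_CODES = {
    "Below average": 1,
    "Acceptable": 2,
    "Above average": 3,
    "Exceptional": 4,
}

# Ratings as originally captured in words ("Sam" is a hypothetical addition).
appraisals = {"Doris": "Above average", "Sam": "Exceptional"}

# The same ratings captured as numbers instead.
coded = {name: APPRAISAL_CODES[label] for name, label in appraisals.items()}
print(coded)

# A credit card number is made of digits but behaves like a name, so it is
# stored as text. The value below is a made-up placeholder.
card_number = "4111111111111111"
print("Card ending in", card_number[-4:])
```

Storing the card number as text is a deliberate choice: it stops anyone from accidentally averaging or adding it, operations that would run without error on a number but mean nothing.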

Datasets: Primary and Secondary

Ultimately for a statistical analysis you want data in table format as discussed previously, which gathers together all observations and variables.
Broadly, there are two types of datasets: primary and secondary.

Primary datasets are those you have built yourself from raw data records. For example, if you design and distribute a survey, gather the answers, and capture the data into a dataset, you have built a primary dataset.

Secondary datasets are those that someone else has already built, and that you have gained access to. Three examples include:

In the Chapter 1 Accu-Phi example, your statistical analysis will probably draw on accounting records for sales data.

An employee database is probably secondary data because it already exists in a company’s database system. Each row is an employee, and each column is then a piece of data corresponding with that employee.

In finance you get share price or similar data from databases such as Bloomberg, with the companies in rows and their various pieces of financial information in columns.
Note that the way particular types of data are captured depends entirely on the software system being used. Although the end dataset is always a table such as that in Figure 3.1 Example of a data table above, there are two broad options for actually capturing the data:

You can capture the data directly into a table format, as you might do in Microsoft Excel;

Alternatively, you can capture data through a separate entry form (for example, a pop-up box) for each observation, as database programs do. Microsoft Access is an example of such a database program. Such programs still ultimately give you a data table.

Integrating Datasets

When you wish to analyze data from more than a single source, you need to gather together the data from your observations and variables into a table. For example, Chapter 1 shows the Accu-Phi case study dataset, which integrates survey data, sales data, enquiries data (likely gathered from a Customer Relationship Management database), customer demographic data, etc. To enable this integration, you would need to have a unique customer number that allows you to match each observation from each different source. Luckily, SAS is particularly good at data manipulation such as matching up datasets, as well as everything else. See Chapter 5 and Chapter 18 for more on data integration issues.
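The matching idea can be sketched in a few lines of Python. The customer numbers, field names and values below are invented for illustration and are not the Accu-Phi data. In SAS, this kind of match is typically done with a MERGE statement in a DATA step or a join in PROC SQL.

```python
# Made-up records from two hypothetical sources, keyed by customer number.
survey = {101: {"trust": 4}, 102: {"trust": 2}, 103: {"trust": 5}}
sales = {101: {"sales": 2500.0}, 103: {"sales": 900.0}, 104: {"sales": 1200.0}}


def merge_by_key(left: dict, right: dict) -> dict:
    """Combine the fields of customers that appear in both sources."""
    merged = {}
    for customer_id in sorted(left.keys() & right.keys()):
        merged[customer_id] = {**left[customer_id], **right[customer_id]}
    return merged


combined = merge_by_key(survey, sales)
for customer_id, row in combined.items():
    print(customer_id, row)
```

Customers 102 and 104 drop out here because each appears in only one source. Whether to keep or drop such unmatched observations is a decision the analyst has to make.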
Appendix A at the end of this chapter deals with some extra complications in data and datasets.
Having introduced the “destination” for which we are aiming, namely the generation of datasets, the next section discusses the various data-related issues and possibilities. In datasets, the columns are variables, and these describe commonly-measured attributes of the observations.
Last updated: April 18, 2017
Basic Characteristics of Variables

Introduction to Variable Characteristics

Once you have data, there are various things you may want to analyze about it. One vital reason for wanting to use numerical data is that it is easier to analyze, and there is more you can do with it.
For example, Figure 3.2 Sample spreadsheet of performance appraisal data shows performance appraisal scores for the period 2003 to 2006, using a system that scores employees from 1 to 10.

Figure 3.2 Sample spreadsheet of performance appraisal data

Imagine you want to analyze the 2006 performance appraisal data from Figure 3.2 Sample spreadsheet of performance appraisal data. You could:

Analyze how many of each category of scores there are (e.g. how many 6s).

Ask what the score of the average employee is.

Ask how much spread there is in performance between the different employees.

Compare relationships between different bits of data.
Take a look at the 2006 scores from Figure 3.2 Sample spreadsheet of performance appraisal data. If we look at the column of data, then we see that we have a range of data from 2 (the lowest in the list) to 10 (the highest). If we order it from lowest to highest then it is:
2, 3, 5, 5, 6, 6, 6, 7, 8, 10
We see three major things about the data already:

It is data that seems to run from low to high with a fair number of options.

It has a center; that is where the middle of the data is situated. With the naked eye you would probably place the center at about 6.

It has a spread; in other words, it is spread out on either side of the center.
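These first impressions can be checked with a short Python sketch that uses only the ten ordered 2006 scores listed in the text. The variable names are chosen for this example.

```python
from collections import Counter
from statistics import mean, median

# The ten 2006 scores, ordered from lowest to highest as in the text.
scores_2006 = [2, 3, 5, 5, 6, 6, 6, 7, 8, 10]

counts = Counter(scores_2006)        # how many of each score there are
center_mean = mean(scores_2006)      # arithmetic average
center_median = median(scores_2006)  # middle value of the ordered list
score_range = max(scores_2006) - min(scores_2006)  # crude measure of spread

print("Number of 6s:", counts[6])
print("Mean:", center_mean, "Median:", center_median)
print("Range:", score_range)
```

Both the mean and the median land at or near 6, which agrees with the naked-eye estimate of the center.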
There is more we could ask: what kind of data is it (compared, say, to an appraisal system where employees are ranked only on a system of “Below Average,” “Acceptable” and “Above Average”)?
There are therefore multiple aspects or characteristics of numerical data that we need to understand – these characteristics and our understanding of them have a fundamental impact on our ability to analyze them as well as our understanding of the analysis.
As we have seen, each variable is measured across multiple observations, such as a survey question measured across many employees. Because the variable is measured across many observations, which can differ in their response to the variable, there is a range of measured responses to each variable. This brings up three very important characteristics of variables:

Type of variable: This depends on what was being measured, and fundamentally affects the type of data analysis you can use the variable in.

Centrality: This is the most representative response on the variable across the whole sample.

Spread: This represents the range of responses across the sample.
Let us explore these three characteristics of data more thoroughly.

Variable Characteristic 1: Type

Basics of Variable Types

Figure 3.3 Types of variables represents the four types of variables on a scale that increases from left to right in level of statistical information. As seen there, the four types of variables are categorical, ordinal, interval and ratio data, although we can often place interval and ratio together and call them “continuous data.”

Figure 3.3 Types of variables

The following sections expand on the differences in the types of variables.

Ratio Data

Ratio data has a natural zero point and can take any value upwards from it. For example, take the age of your employees:

Zero (0) days old is an absolute zero.

The difference between someone 100 days old vs. 1000 days old is meaningful. This is a characteristic of ratio data: the difference between two scores is a meaningful piece of data.

In addition, the ratio between any two scores in such data is meaningful. For example, 30 years old is half as old as 60 years old. This is another characteristic of ratio data: the ratios between scores are meaningful.

Interval Data

This form of data is similar to ratio data in that technically one can have any value in a range, but there is no absolute zero.

In interval data, the magnitude of difference between two data points is mathematically meaningful, but since there is no natural zero point, ratios do not make sense.

For example, take time. The difference between 1 January 2009 and 30 September 2007 is 459 days, but based on what zero point? The Big Bang? Birth of Jesus? SAS uses 1 January 1960 as the zero point for time, while Microsoft Excel uses 1 January 1900. These are arbitrary zero points. In such cases, the ratio between scores is meaningless; for example, 1 January 2009 divided by 30 September 2007 does not make mathematical sense!
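The point about arbitrary zero points can be demonstrated with a small Python sketch. It uses plain calendar arithmetic with the two zero points named in the text; it does not reproduce the internal date serial numbers of any particular package. The helper `days_since` is a name invented for this example.

```python
from datetime import date

later = date(2009, 1, 1)
earlier = date(2007, 9, 30)

# The difference in days does not depend on any zero point.
days_between = (later - earlier).days


def days_since(epoch: date, d: date) -> int:
    """Number of days from an arbitrary zero point (epoch) to date d."""
    return (d - epoch).days


epoch_1960 = date(1960, 1, 1)  # zero point the text gives for SAS
epoch_1900 = date(1900, 1, 1)  # zero point the text gives for Excel

# Differences agree whichever zero point is used.
diff_1960 = days_since(epoch_1960, later) - days_since(epoch_1960, earlier)
diff_1900 = days_since(epoch_1900, later) - days_since(epoch_1900, earlier)

# Ratios change with the zero point, so they carry no real meaning.
ratio_1960 = days_since(epoch_1960, later) / days_since(epoch_1960, earlier)
ratio_1900 = days_since(epoch_1900, later) / days_since(epoch_1900, earlier)

print(days_between, diff_1960, diff_1900)
print(ratio_1960, ratio_1900)
```

The three differences are identical, while the two ratios disagree. That is the behavior expected of interval data.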

Ordinal Data

This type of data indicates order (i.e. rank) in a series, but the difference between scores is not mathematically meaningful:

For instance, say you ask someone to tick age groups in a survey where 1 = 0-16, 2 = 17-25, 3 = 26-35, 4 = 36-45, etc.

A score of 1 (age 0-16) indicates a lower age than 2 (17-25), but the difference between a score of 1 and 2 in this example is not mathematically meaningful!

Categorical Data

Here, a data point is a number that represents membership in some category:

For example, you could be asking people to indicate their marital status in a survey.

You could give a list of possible statuses, each one of which would be scored. In one configuration, you could score 0 = Married, 1 = Single, 2 = Divorced. But these numbers are arbitrary, and could be in any order!

For example, an alternate scoring system could be 0 = Single, 1 = Divorced, 2 = Married. Do you see that the numbers are not numerical; they are merely tags? The actual data has no numerical value and could just as well be words.
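A brief Python sketch shows why such codes are only tags. It applies the two scoring systems from the text to a handful of invented survey responses. The "average" of the codes changes with the scoring system, while the category counts do not.

```python
from collections import Counter
from statistics import mean

# Invented responses, for illustration only.
responses = ["Married", "Single", "Married", "Divorced", "Married", "Single"]

# The two scoring systems described in the text.
scheme_a = {"Married": 0, "Single": 1, "Divorced": 2}
scheme_b = {"Single": 0, "Divorced": 1, "Married": 2}

# "Averaging" the codes gives a different answer under each scheme.
mean_a = mean(scheme_a[r] for r in responses)
mean_b = mean(scheme_b[r] for r in responses)

# Counting category membership gives the same answer under any scheme.
category_counts = Counter(responses)

print(mean_a, mean_b)
print(category_counts)
```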

The Importance of Variable Type

There are good reasons for covering this material. When it comes to statistical analysis of this data, interval and ratio (i.e. continuous) data have the best statistical qualities. As we will see later, when it comes to relating sets of data through techniques such as regression, the type of the main (dependent) variable you are trying to analyze fundamentally affects the type of statistical analysis you do.
For instance, in the case of the Chapter 1 case study on Accu-Phi, the main focal/dependent variable is sales. This is a ratio variable (it runs on a fine scale from low to high; differences between sales levels are meaningful; it has a natural zero) and therefore continuous. Because it is continuous, certain types of statistics can be employed on it.
Generally, when measuring variables, a good guide is to try measuring them as interval or ratio data. More on this is discussed in the next chapter.

Variable Characteristic 2: Centrality

As mentioned earlier, when looking at a variable, centrality is the most representative data point, the score most of the sample tends towards. The average or mean is an example of a specific centrality statistic, although you cannot use the average for all data: the appropriate centrality measure depends on the type of variable. I discuss the different types of centrality statistics in Chapter 7.
A quick note: measures of centrality are by far the most used and important measures of practical statistics, especially in business. Market research, for instance, is principally concerned with average levels of variables such as spending or customer attitudes, especially between geographical or demographic segments. Financial research might look at average returns in various investment portfolios. HR research wants to know about variables such as average time to fill a vacancy.
However, measures of centrality are only a start to really understanding data. The spread of the data is also crucial.

Variable Characteristic 3: Spread

Remember that different observations vary in their scores on a given variable. If you measure age across a sample of multiple people, their ages will differ. Spread expresses this crucial characteristic of a variable, namely that there is a range of data on either side of the middle. The larger the spread of the data away from the middle, the less well the central score represents the whole dataset and the more the observations differ from one another.
As an illustration, take a look at two different sets of data in Figure 3.4 Examples of different spreads in data below.

Figure 3.4 Examples of different spreads in data

In Figure 3.4 Examples of different spreads in data we see that:

In the top set of data, data points of the same value have been stacked on top of each other to illustrate density – see that three people had scores of 6. Rough spread is indicated by the dotted arrows.

The data at the bottom is a dataset with less spread (the dotted arrows are smaller). Because we have taken away the 2 and 10, there is less data far away from the center point. Therefore, on average the data is more clustered around the center point and less spread out.
Data that has a natural order from low to high (interval, ratio, and even ordinal data) inevitably has a range: that is, such variables range from some low to some high. The ages of workers in the company, for instance, might range from an absolute low of 17 to a high of 67. Categorical data also has spread, but this is harder to think about on a low-to-high basis; for such data, you are thinking about the relative distribution of observations between categories.
Minima and maxima of sets of data can be useful to know as an absolute measure of spread, because they tell you the absolute lowest score to the absolute highest. They do not, however, tell you anything about where in the range the center of the data lies, or where the majority of the data lies. The employee age endpoints of 17 and 67 might be extreme outliers (there might be very few employees with age as low as 17 or as high as 67). Statistical measures of spread that capture the majority of observations without also taking in the extreme outliers at the far ranges are better measures. There are several such spread-based statistical measures, such as the standard deviation and inter-quartile range, as discussed in Chapter 7.
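As a rough sketch, the measures named here can be compared in Python on the two sets of data from Figure 3.4. This assumes that the top set is the ordered list of 2006 scores given earlier and that the bottom set is the same list without the 2 and the 10. Different packages compute quartiles by slightly different rules, so the inter-quartile range shown here may not match other software exactly.

```python
from statistics import quantiles, stdev

wide = [2, 3, 5, 5, 6, 6, 6, 7, 8, 10]  # assumed top set of Figure 3.4
narrow = [3, 5, 5, 6, 6, 6, 7, 8]       # same data without the 2 and the 10


def describe_spread(data):
    """Return the range, sample standard deviation and inter-quartile range."""
    q1, _, q3 = quantiles(data, n=4)  # quartile cut points (Python's default method)
    return {
        "range": max(data) - min(data),
        "stdev": stdev(data),
        "iqr": q3 - q1,
    }


spread_wide = describe_spread(wide)
spread_narrow = describe_spread(narrow)

print(spread_wide)
print(spread_narrow)
```

All three measures are smaller for the second set, matching the shorter dotted arrows in the figure.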
Spread might be the most important thing in statistics. I encourage you to read the spread sections later in the book carefully until you understand them. Later, we will see why this concept is so critical.

Choosing the Right Variables and Measures

Business statistics succeeds or fails depending on whether you pick the right data (observations and variables) to study, and whether you measure them in a valid and reliable way. Chapter 4 discusses these issues in more depth, as well as the broad challenges around actually gathering, capturing and cleaning datasets.
Chapter 4: Data Collection & Capture

Correct Sampling
Choose Constructs and Variable Measurements
Initial Data Capture: Which Package?
Dealing with Data Once It Has Been Captured
Database & Data Analysis Software
Some Complications in Datasets
End Notes

Using data obviously implies collecting/measuring it first, and then capturing these measurements into a database or spreadsheet. I cannot cover a massive amount of research methodology in this book, so the enthusiastic researcher should also study a complete book on methods. This chapter makes a few critical points about data collection and capture for statistics.
Have another look at Chapter 2, notably the process of statistics diagram. The major data challenges discussed there are choosing the correct samples and populations, choosing the correct constructs and variables, and asking the right questions. I cover each of these topics here. If there is one major point that can be made about getting research methodology right it is this:
There are two cardinal errors that can be made in research; all other methodology issues are secondary to these. The two cardinal errors are measuring the incorrect variables or having an incorrect (wrong or non-representative) sample or set of observations. If your sample is poor then your analysis will not be meaningful to the population you wish it to represent, and you will be unlikely to replicate your findings. If you measure the wrong set of variables then you have missed the focus of the study. Even if you have measured some correct variables but you have an inadequate variable set, your study is being conducted without information in its correct context. We call this specification error. Having good methodology and doing good analysis on the wrong variables or wrong sample is often problematic at best and frequently meaningless.
Having said this, it is also important to do a good job on the other methodology details, such as how you have measured and analyzed data. I make some further comments in the next few sections.
Correct Sampling

As discussed briefly in the previous chapter, if you wish to make inferences about a certain population of observations, then gathering all the data from the entire population is obviously best. However, you often cannot do this for practical reasons such as difficulty and cost. For example, you might need to survey only a sample of your employees to ascertain job satisfaction and similar constructs, with the task of using these sample-based results to make inferences about the general population.
Choosing the correct sample that will do a good job of reflecting the broader population is important and desirable. You need to know many aspects of research methodology in order to do sampling really well. The following are perhaps the most important points:

Representivity: You need to ask whether your sample really represents the population, that is, whether it has the right profile, size, etc. For example, if the population is 40% white and 35% male, it is best if your sample also reflects this. If your sample is not representative, there are corrective procedures, such as weighting the results, but there is never a substitute for sampling well in advance to represent the population. In Chapter 1, does Accu-Phi’s pilot services group adequately reflect the national population of potential customers?

Sample size: Is your sample big enough to represent the population and do the required statistical analysis (if you are doing statistics)? The following applies:

In some forms of analyses (e.g. regression), there are cut-offs for how small the sample can be (for example, the central limit theorem suggests that with about 30-40 observations or more, the distribution of the sample mean can be treated as roughly normal, so that certain forms of statistics are stable if not strong).

In statistics, there are often formal sample size calculations you can do in advance to estimate necessary sample sizes needed to do certain statistics.

Generally, bigger is better, although there are caveats to this too.
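As an example of what a formal sample size calculation can look like, the Python sketch below applies one common textbook formula for estimating a mean to within a chosen margin of error, n = (z × σ / E)², rounded up. This formula is not taken from this book, and it assumes simple random sampling and a rough prior guess at the standard deviation σ. The function name and the input values are invented for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma: float, margin: float, confidence: float = 0.95) -> int:
    """Sample size needed to estimate a mean to within +/- margin.

    sigma is a prior guess at the population standard deviation;
    the result is rounded up to the next whole observation.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical inputs: satisfaction scores with a standard deviation of about 10,
# to be estimated to within 2 points at 95% confidence.
n_needed = sample_size_for_mean(sigma=10, margin=2)
print(n_needed)
```

Halving the margin of error roughly quadruples the required sample, which is one reason "bigger is better" quickly becomes costly.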
For more, once again, please do consult research methodology books.
Choose Constructs and Variable Measurements

Introduction to Constructs & Variables

Having thought about who or what you are studying, the next step is to think about what you are studying about your observations (constructs and variables).
Recall that in the previous chapter I defined a construct as the underlying concept that you are trying to measure and a variable as the actual measurement that gives you data. These are not identical because you can use different measures (variables) of the same concept (construct). For instance, I might want to measure the construct “profitability of businesses,” but I might generate very different variables reflecting this construct depending on whether I get data on their share prices or their internal accounting data. There are statistical techniques that explicitly examine the construct-measured variable link, notably factor analysis; however, these techniques fall outside of the scope of this book.
Therefore, another critical first step in any analysis is to choose which constructs to analyze (the basic concepts involved) and then to gather actual data on the constructs. Here are a few considerations.

Choosing Constructs & Associated Data

Importance of Construct Choice

How do you choose which constructs to include in a data analysis? This is an important question, as it guides the kind of data you need. The most desirable situation is to know already what you are interested in studying, and why it is important. It is true that, sometimes, you have data but no preconception of the real use or even meaning of the data. This sometimes happens in exercises such as data mining, which we discuss briefly later.
However, if you take a more systematic view, there are various options for the constructs and associated data that you choose.

Focal Constructs & Data

You will always have one or more constructs on which you wish to focus, such as revenue (proxied by Sales data in the Accu-Phi case). Focal constructs are the key concepts of interest in your study, the things you care about and that stimulated you to start a research project in the first place. There are several considerations here.

Importance of the focal constructs: In choosing focal constructs consider first and foremost their importance: Your focal constructs should be chosen out of necessity. You know why you chose the focal constructs (I hope!). You chose them because they are important to you or because they could make a difference if they could be understood, explained or predicted. Or perhaps you chose them because you have a research question about them that could change your business, the industry or your academic community.

Aims behind focal constructs: There are various possible aims when analyzing a focal variable. Sometimes you only wish to understand how it is distributed. For instance, you may just wish to examine the relative sizes of companies in a certain industry. However, very often you wish to go further and explain or predict one or more focal variables. For instance, you may wish to explain employee turnover in your company, or predict share value in a stock exchange.

Feasibility of measuring the focal constructs: While you may wish to research certain core ideas or constructs because they are important, obviously you need to be able to measure them feasibly. This is crucial and brings us back to the research methods imperatives. Issues to consider here include:

Sufficient observations: Whether you can get enough observations of the focal measure and any predictor constructs.

Reliability and validity of measures: Whether you have (at least) reliability of the measures underlying focal constructs. In addition, you aim for maximum validity. Basically, if you can’t really get data that measures your construct (e.g. if you wanted to measure company success but realize that your data isn’t really measuring it reliably or at all) then you can’t do good statistics.

Cost, difficulty, and ethicality of gathering focal construct data: Obviously some data is harder or more costly to gather, and some data is less or more ethical to gather. I leave these issues to research methodology books and courses, but you must consider whether you actually can gather data for your focal variables.

Predictor Constructs (Optional)

Predictors are those constructs that you believe might explain or predict your focal constructs. You do not always need predictors. However, you do need them when you seek to explain or predict one or more focal constructs using their relationship to other variables.
For example, in the Chapter 1 Accu-Phi case, constructs such as trust, satisfaction, and enquiries might be thought of as predictors.

Control Constructs & Data

Often you also seek to include constructs that are not in themselves of primary interest to you, and that you do not necessarily believe explain or predict the focal constructs, but that do express the relevant environment of your focal constructs.
Demographic descriptors and differentiators of your observations (such as individuals or firms) are often a case in point. Say you want to explain or predict a certain behavioral concept such as employee turnover. While you might have good reason to position some demographic constructs as direct predictors of turnover (e.g. older and longer-tenure staff are less likely to resign), you may not have strong reason to believe that others (perhaps those such as gender and race?) would actually cause turnover. Instead, these other constructs may be included as environmental factors in a way that is not quite cause-and-effect. Such control variables create a proper data context for the focal variables.
Understand that if you do not include measures of crucial control constructs, you are analyzing your core focal constructs out of their proper context and therefore you will not quite get the right statistical results in many types of analyses. When you do analyses that involve balancing multiple focus and predictor variables against each other, if you do not also balance these in the context of their environment (as represented by control constructs), then you may get the wrong picture.
As an example of the usefulness of controls, I was involved in helping a group of scientists study Fetal Alcohol Syndrome (damage to babies in the womb due to mothers drinking). Fetal Alcohol Syndrome damages many things, including growth (in this case, the focus variable). In this particular study they injected all the pregnant mice with alcohol and some of the mice with methanol. The idea was to test whether the growth in the babies, once born, was better among mice injected with or without methanol. The basic data showed no difference; however, the data did not include the genders of the babies! Obviously male mice grow bigger and faster than female mice. So, when I included gender as a control variable, the data then showed an effect for the methanol. Gender was an important control variable in the context of babies’ growth; without it, the patterns in the data were obscured.
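The Python sketch below shows one way a missing control variable can hide an effect. The weights are invented purely for illustration and are not the study's data. They assume that the two groups happen to have different mixes of male and female babies, and that methanol-group babies are slightly heavier within each gender.

```python
from statistics import mean

# (methanol_group, gender, weight): invented numbers, NOT the study's data.
babies = [
    (False, "M", 11), (False, "M", 11), (False, "M", 11), (False, "M", 11),
    (False, "F", 8), (False, "F", 8),
    (True, "M", 12), (True, "M", 12),
    (True, "F", 9), (True, "F", 9), (True, "F", 9), (True, "F", 9),
]


def mean_weight(methanol, gender=None):
    """Average weight for a group, optionally within one gender only."""
    return mean(
        w for m, g, w in babies
        if m == methanol and (gender is None or g == gender)
    )


# Ignoring gender, the two groups look identical.
raw_difference = mean_weight(True) - mean_weight(False)

# Comparing like with like (within each gender), a difference appears.
difference_males = mean_weight(True, "M") - mean_weight(False, "M")
difference_females = mean_weight(True, "F") - mean_weight(False, "F")

print(raw_difference, difference_males, difference_females)
```

In a real analysis the control would usually enter a statistical model, such as a regression, rather than a manual split like this one.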

Conclusion on Construct & Data Choice

The basic lesson of the very first step is that you will always have constructs on which you want to focus (focal constructs), sometimes you wish to explain or predict these using other predictor constructs (not always), and sometimes you wish to analyze your focal constructs in their relevant context as represented by control constructs (not always).

Gathering Data: Question & Answer Formats

Question & Answer Format Issues

Each time you gather a piece of data to capture into a database, in some way you are implicitly asking a question, even if it is not expressed verbally. There are three major considerations when deciding how to gather data:

What are we asking for? This involves the question formats: the content and format of the questions we are asking.

Who are we asking? This involves from whom we are trying to get answers.

How are we asking? This involves answer formats: the way we are inviting answers to be given.

Question Formats & Data Sources

With regard to question formats and who we are asking, there is a lot to think about.
First, there are various ways to gather data, that is, various ways to ask the questions. Some of these options include:

Using existing data: In business, a lot of data already exists in databases inside or outside the business, and the data is at least partially ready to be accessed. Examples here include existing customer databases, stock lists, and the like. We usually gather existing data from databases through business intelligence software queries. Chapter 18 discusses some challenges around data storage and access in businesses.

Real-time data from sensors or other recording devices: Although already-existing datasets are common in firms, some data may be gathered in real time through computer-based recordings, such as recording website clicks by customers. Some may also come from physical sensors in various locations and processes, such as GPS sensors that measure the location and movement of your logistics fleet. You can place this data into storage in datasets, but increasingly we seek to analyze it as it is generated. Chapter 18 discusses this in more detail.

Speaking to people face-to-face and recording their answers: This can be a good method as it encourages communication about the meaning of questions and answers, and therefore improves clarity and focus. Interviews and focus groups fall into this category. However, this data gathering method is time consuming and expensive.

Asking respondents to fill in a survey or form of some kind: For example, an employment application form is a type of data gathering.

And so on – there are many data gathering methods!
The main point to consider here is which data gathering method best suits your needs. GPS data on your logistics fleet may exist, but it may not be the best data for answering questions about why drivers stop.
Second, each question can have more or less accuracy and specificity about exactly what it is you are asking for. A specific question or data query can be expected to give more precise answers, and therefore better data. But beware: you need to be very sure of whether you are being as specific as you think you are. For example, how do you ascertain a customer’s age?

If you just ask for an age, some will give you an age based on an upcoming birthday and some will base the answer on the most recent birthday. Being more specific and asking for age based on the most recent birthday, for instance, would be more accurate.

Even more accurate, you could ask the person for a birth date. This will be accurate to the day as long as the person tells you the truth and you record it correctly.
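Once a birth date has been captured, age at the most recent birthday can be derived consistently for everyone. The Python sketch below shows one way to do this. The function name `age_at` and the example dates are invented for illustration.

```python
from datetime import date


def age_at(birth_date: date, on_date: date) -> int:
    """Age in completed years (age at the most recent birthday) on on_date."""
    # Has this year's birthday already happened on or before on_date?
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)


# Hypothetical customer born on 15 June 1980, measured on two different days.
age_before_birthday = age_at(date(1980, 6, 15), date(2017, 4, 18))
age_on_birthday = age_at(date(1980, 6, 15), date(2017, 6, 15))

print(age_before_birthday, age_on_birthday)
```

Storing the birth date rather than the age also means the age never goes out of date in the dataset.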
As stated previously, always try to think of a way to get continuous (ratio or interval) data, so long as this is feasible. Sometimes this is not possible: gender, for instance, will always be categorical. But at other times, asking things in ordinal categories, such as age bands, provides inferior data to asking for more finely tuned data (as in the example of age question formats above). In this regard, there is a trade-off sometimes in terms of quality of data vs. complexity, for which see the answer format material in the next section.
Third, some question formats involve broad, open-ended questions, such as “where do you see yourself in five years’ time?”. Such query formats can obviously be expected to render very diverse answers dealing with lots of issues. You may need to do what is called content analysis on such data, where you break it down and do intermediate analysis on the content of the answers before it can be put into an analytics process. This relies on the analyst to have certain qualitative research skills.
In a similar vein, more subjective questions, such as people’s mental states or conditions (e.g. personality assessments), may need careful psychometric development before the questions are usable. In such cases using previously-developed question formats or sets is a good idea. Also, when assessing more subjective or hard-to-measure variables, it is often a good idea to ask more than one question about the variable. This would require a multi-item scale; see the next section for more on this.

Answer Formats


Even with the same question, you can invite people to answer in different ways. For example, think of the question “Do you trust Accu-Phi’s cloud storage to keep your accounting data safe?” You could ask this in an informal interview or a focus group, and invite open-ended answers. Alternately, you could re-cast it as an invitation in a customer survey for people to respond to the statement “I trust Accu-Phi’s cloud storage to keep our accounting data safe,” and invite answers on a very specific answer format, which could then take various forms. The way you invite people to answer questions is important.
Several issues arise when thinking about answer formats. Again, specificity can be important. For example, you might ask in an employee form, “How long did you work in your last job ______________”. When thinking about this example, consider the following points:

This question could result in answers that are expressed in years or months, or both. People who worked in their last job for some years and some months (e.g. one year and four months) might round it off to the nearest year, which would weaken the accuracy.

Even worse, someone could put a number (e.g. “6”) without stipulating whether he or she means months or years.

Instead, asking “How long did you work in your last job (answer as accurately as possible to the nearest month): ___ years ___ months” would result in much better data.
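With the two-part answer format, the years and months can be combined into a single continuous variable at capture time. The Python sketch below is a minimal example, and the function name `tenure_in_months` is invented for illustration.

```python
def tenure_in_months(years: int, months: int) -> int:
    """Combine a '___ years ___ months' answer into total months."""
    if years < 0 or not 0 <= months <= 11:
        raise ValueError("expected years >= 0 and months between 0 and 11")
    return years * 12 + months


# The example from the text: one year and four months in the last job.
last_job_months = tenure_in_months(years=1, months=4)
print(last_job_months)
```

The range check catches entries such as "0 years 18 months", which usually signal that the respondent misread the form.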
When providing people with a specific answer scale, you need to decide how to structure it. For example, you could ask people to respond to the statement “I am satisfied with my career development options here.” You could offer them an answer scale as seen in Figure 4.1 Example of a question with a Likert scaled answer format (a Likert-type scale):

Figure 4.1 Example of a question with a Likert scaled answer format

Or, you could use something like the format seen in Figure 4.2 A semantic differential answer format with an online slider scale (a semantic differential scale that is completed online using a slider that the respondent can slide anywhere between the two extremes):

Figure 4.2 A semantic differential answer format with an online slider scale

Once again, see methodology books for various other response formats. There are many such formats. One mistake beginners make in research using surveys or similar instruments is to ask too many questions using the same response format, for example, fifty questions on five issues all of which are assigned the classic five-point Likert response format from Strongly Disagree to Strongly Agree. This is monotonous and often leads to poor respondent engagement with the survey, whereas mixing up the answer format can help.
With regard to such formally-structured answer scales, several considerations arise. One is how to capture the data. You would often capture the answer as a number in the datasheet (e.g. if a respondent ticks the third option in the alternate semantic differential scale below, perhaps capture this as a “3”, as in Figure 4.3 Correspondence between survey answer data entry below).

Figure 4.3 Correspondence between survey answer data entry
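
The correspondence between a ticked answer and the number entered in the datasheet can be sketched as a simple lookup. The Python below is illustrative only: it uses a five-point Likert format, and the three middle labels ("Disagree", "Neutral", "Agree") are assumed, since the text names only the two end points.

```python
# Assumed labels for a five-point Likert format; only the two
# extremes are named in the text, the middle labels are illustrative.
LIKERT_CODES = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def code_answer(label):
    """Return the numeric code for a ticked label, or None for a blank."""
    if label is None or label.strip() == "":
        return None  # keep blanks as missing rather than inventing a score
    return LIKERT_CODES[label.strip()]


raw_answers = ["Agree", "Neutral", "", "Strongly Agree"]
coded = [code_answer(a) for a in raw_answers]
print(coded)  # [4, 3, None, 5]
```

Blank answers are deliberately kept as missing values so that they can be handled later as part of the missing data analysis.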

How many response options should you give? In the Likert example above there were five options; in the semantic differential scale there might be a wide range, depending on how the sliding scale works. This question is a hotly debated one in the world of research methodology:

For the purposes of doing any serious analysis where the data runs from low to high, at least five answer options is good, seven is generally better, but exceeding ten can be too complex for the respondent.

Computer-based answer formats, where the person answers on the computer screen, can allow for more accurate answer formats (e.g. where the person slides a marker between two points on a scale, as in Figure 4.2). The computer can measure the exact position of the slider from the far left edge.

If the question regards picking one or more options in an arbitrary list (e.g. a list of recruitment sources from which the person was recruited, or a list of office supplies the person uses regularly) then the number of response items is not really limited.
As stated above, in some cases we need to ask more than one question about a variable to start measuring it accurately. This is dealt with next.

Designing Multi-Item Measures

In data analysis, we often have only one measurement of a variable. In employee records, for example, we may only have one data entry for employee age, start date, etc. This is fine when the data is a measurement of something very objective and when there is little scope for measurement error.
Sometimes, however, we are measuring something that is very subjective or that has high scope for measurement error. Psychometric measures, for example, where we survey people about subjective mental constructs such as satisfaction, are far trickier. In such cases, there is a greater risk of measurement error, since people understand survey questions differently. For instance, asking employees in a survey, “Are you satisfied with your job?” might be construed as:

“Do you like your job, that is, just the tasks that you do?”

“Are you satisfied with all the job elements, that is, not just the tasks but also the pay, working conditions, etc.?”

“Are you happy with the whole company, including elements like their corporate ethics?”
These differences in interpretation or understanding make the answers less reliable. In addition, a given single-item measure might be biased by other things, such as an employee rushing to fill in the survey and not really thinking hard about a question. Even more objective assessments such as performance appraisals by supervisors are open to measurement error of various kinds.
In cases where the interpretation is in the control of the person organizing the measurements, it is generally better to gather data on a construct using more than one measurement. This is referred to as “multi-item assessment.” In our Chapter 1 Accu-Phi example, two of the constructs measured (trust and satisfaction) are measured with multiple questions, each asking about a different aspect of the construct. For example, we might have constructed a survey using the questions seen in Figure 4.4 for each variable.
You will generally be able to find multi-item scales on most business, psychological, sociological and other constructs in academic journal articles, which you can often fairly safely adopt and adapt to your uses. These will have been previously validated through careful scientific methods. If you wish to use a previously designed scale – and you probably should, if possible, since these have been tested for validity – try to find scales tested and adapted for your local context (geography, industry, and the like). You often will.

Figure 4.4 Possible multi-item measures of trust and satisfaction

There are two issues you may wish to consider when designing or choosing multi-item scales:

Reverse-worded items. Reverse-worded question items are those that run in the opposite direction to the flow of logic used by the majority of items in the list. For example, in Figure 4.4, there are four items for satisfaction. The first three are items where a high score indicates high satisfaction. However, the fourth is an item where a high score indicates low satisfaction. These are called “reverse-worded questions.” There are two things to understand about reverse-worded items.

The perceived benefits of reverse-worded items. Why would you use reverse questions in the mix? It has traditionally been believed that they work to deter respondents from answering the questions in the list in a less than thoughtful manner, perhaps just answering all of the items in a generally positive manner but without thinking hard about differences in the answers.

The downside of reverse-worded items. Unfortunately, the data from reverse-worded items tends to not mix well with the data from the rest of the list, which jeopardizes the reliability of the overall scale. I suggest trying to avoid their use for this reason. However, if you use previously designed scales (especially older ones), they will sometimes include reverse items.

The ordering of multi-item scales in a bigger research instrument. The second question is whether to keep the items from a multi-item scale together in a research instrument (for instance, in a survey, whether to ask all the trust questions together, one after the other, in a section of the survey). If at all possible, I suggest not keeping the items together, but rather interspersing them between other questions. This keeps respondents from falling into a pattern.
Finally, having designed your question and answer protocols for data collection, and having captured the data, you now have to deal with various issues in the raw data. I discuss these issues next.
Last updated: April 18, 2017
Initial Data Capture: Which Package?

Advanced statistical analysis packages such as SAS, SPSS, STATA and others all allow you to enter your data straight into them. However, I do not always advocate this, because each has its own format and you cannot always easily translate a spreadsheet saved in a given program into another statistical package.
Intermediate database and spreadsheet hardware and software (such as data storage systems or even – in the case of very small and stable datasets – Microsoft Excel) are sometimes a better starting point for data capture, at least where data volume is low or the speed of arrival is moderate. Once the raw data is captured in such systems, you can import the data into any of the major statistical analysis suites.
Chapter 18 does discuss big data situations in which alternatives to traditional data storage might be required.
Dealing with Data Once It Has Been Captured

Post-Capturing Issues

Once you have gathered and captured raw data into your spreadsheet or database, you are ready to do further cleaning and organization of the data. The major things you might wish to deal with are (a) checking for mistakes, (b) dealing with the multi-item scales, and (c) dealing with missing data. The next sections deal with each of these issues.

Checking for Mistakes in Your Data

When doing data analysis, it is crucial that you prevent mis-entered data from corrupting the entire analysis. For example, if we manually capture data on a 1- to 5-point answer scale, it is possible to enter a “55” by mistake. If this happens, even simple averages and correlations might be badly affected by the mis-entered data point.
To check data for each column, you can at least check the minimum and maximum values in the range to ensure that there are no major errors. I like to also check the frequencies of all answers; we will see later how to do this. If you do run across a mis-entered data point, then simply correct it.
Chapter 9 will tell you how to use SAS descriptive and associational statistics to check, clean and prepare data.
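
Ahead of the SAS procedures in Chapter 9, the minimum, maximum and frequency checks described above can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `check_column`, the example answers and the valid range of 1 to 5 are all hypothetical.

```python
from collections import Counter


def check_column(values, low, high):
    """Summarize one column: range, answer frequencies, suspicious rows."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "frequencies": dict(sorted(Counter(present).items())),
        # Row positions (0-based) whose value falls outside the valid range.
        "out_of_range_rows": [
            i for i, v in enumerate(values)
            if v is not None and not low <= v <= high
        ],
    }


# Hypothetical answers on a 1-5 scale with one mis-entered "55".
answers = [3, 4, 55, 2, 5, 4, 1]
report = check_column(answers, low=1, high=5)
print(report["min"], report["max"])   # 1 55
print(report["frequencies"])          # {1: 1, 2: 1, 3: 1, 4: 2, 5: 1, 55: 1}
print(report["out_of_range_rows"])    # [2]
```

The maximum of 55 flags the problem immediately, and the row position tells you which entry to check against the original survey form.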

Dealing with Missing Data

Missing data, where a variable score has not been gathered for an observation (for example where various people have not answered a survey question), can be a serious issue in most data analysis situations.
In many data analysis situations, missing data is especially problematic. If a single variable involved in the analysis is missing its data point, then most statistical programs will throw out the entire observation! Since missing data is common in many datasets, especially survey-type data, you can lose substantial amounts of data this way. Figure 4.5 illustrates this problem.
Missing data can severely affect the accuracy of statistics, so you must think about and analyze the missing data issue carefully before starting statistical analysis. Note that missing data can be a problem in a variable (if many observations do not have data for a specific variable column) or for a given observation (if a given observation row is missing data for a lot of the variables).

Figure 4.5 Example of the missing data problem in data analysis
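
The way a few scattered blanks can remove whole observations can be sketched as follows. The Python below is illustrative only: the five rows, the variable names and the helper `complete_cases` are hypothetical, and the logic simply mimics dropping every observation that has any missing value among the analysis variables.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"id": 1, "trust": 4, "satisfaction": 5, "age": 31},
    {"id": 2, "trust": None, "satisfaction": 3, "age": 45},
    {"id": 3, "trust": 2, "satisfaction": 2, "age": None},
    {"id": 4, "trust": 5, "satisfaction": None, "age": 28},
    {"id": 5, "trust": 3, "satisfaction": 4, "age": 39},
]


def complete_cases(rows, variables):
    """Keep only the rows that have a value for every analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


analysis_vars = ["trust", "satisfaction", "age"]
kept = complete_cases(rows, analysis_vars)
kept_ids = [r["id"] for r in kept]
print(len(rows), len(kept), kept_ids)  # 5 2 [1, 5]
```

In this made-up example only 3 of the 15 data points are missing, yet 3 of the 5 observations are lost.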

I suggest several practical steps and a process to use when performing missing data analysis.
First, always try to pre-design the study to minimize the missing data problem. Sometimes you can plan your research methodology in such a way that missing data is minimized. For instance:

Multi-item scales (where each variable is measured through multiple measurements as discussed previously) can help. If most of the multiple items have data, then when you aggregate the items into the final total variable score, you can smooth over the missing pieces. For instance, if your Stress index is made up of 10 survey questions and someone fails to answer two of them, the average of the other eight survey items should be a reasonable approximation of the overall Stress score. This does not work for those specific individual observations where most of the data is missing for a given set of multi-item scales. (If all of the Stress items are missing, then you have nothing to work with; if most of the Stress items are missing, the combined index is possibly inaccurate.) In the case of these individuals, you could either replace missing data or delete the individuals from the analysis, if necessary. (See below for more on both these options.)

If you use an online or computerized data collection method, it can be programmed to not allow missing data, although this can discourage participation. (Sometimes people leave out data for good reason, especially if you ask silly questions that don’t apply to everyone.)
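
The Stress index example above can be sketched in Python as an average over whichever items were answered. This is a minimal sketch under assumptions: the function name `scale_score`, the example answers and the cut-off (at least half of the items must be answered) are hypothetical; the text leaves the acceptable proportion to your judgment.

```python
from statistics import mean


def scale_score(items, min_answered_share=0.5):
    """Average the answered items, or return None if too few were answered."""
    answered = [x for x in items if x is not None]
    # Assumed rule: require at least half of the items to have data.
    if not items or len(answered) / len(items) < min_answered_share:
        return None
    return mean(answered)


# Ten hypothetical Stress items with two unanswered questions.
stress_items = [3, 4, None, 2, 5, 3, None, 4, 3, 4]
score = scale_score(stress_items)
print(score)  # 3.5

# An observation with most items missing gets no score at all.
mostly_missing = [None] * 8 + [2, 4]
print(scale_score(mostly_missing))  # None
```

Returning no score for the second respondent reflects the caution above: when most items are missing, the combined index is possibly inaccurate.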
Second, assess missing data within observations: Here you assess how much missing data exists in each row. If an observation is missing any data at all, it is often (though not always) left out of analyses, although we’d like to avoid this. Usually we try to save data. However, some observations have so much missing data (such as people who stop answering surveys half-way through) that they’re not worth saving. Chapter 9 discusses how to do this assessment practically in SAS. You can then decide whether to try to save the observation. In the following two situations, saving an observation might not be worth the trouble:

Missing dependent variables: If you are doing a study with definite dependent variables that you are trying to explain or predict, then I do not recommend trying to retain any observations that are missing the data for the dependent variables. Delete them, instead. Any attempt to guess or smooth over the main dependent variables is likely to throw out the whole analysis.

Large section missing: If an observation is missing data for a lot of the final main variables, then consider deleting it.
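
The two deletion rules above (a missing dependent variable, or a large share of missing values) can be sketched as a row-level check. The Python below is illustrative only: the variable names, the example rows and the 50% cut-off are assumptions, since the text does not fix a threshold for "a lot".

```python
def missing_share(row, variables):
    """Share of the listed variables that have no value in this row."""
    return sum(row[v] is None for v in variables) / len(variables)


def keep_observation(row, variables, dependent, max_missing_share=0.5):
    """Apply the two rules: dependent present, and not too much missing."""
    if row[dependent] is None:
        return False  # never guess the dependent variable
    return missing_share(row, variables) <= max_missing_share


variables = ["performance", "trust", "satisfaction", "age"]
survey = [
    {"id": 1, "performance": 7, "trust": 4, "satisfaction": None, "age": 31},
    {"id": 2, "performance": None, "trust": 3, "satisfaction": 3, "age": 45},
    {"id": 3, "performance": 5, "trust": None, "satisfaction": None, "age": None},
]

kept_ids = [r["id"] for r in survey
            if keep_observation(r, variables, dependent="performance")]
print(kept_ids)  # [1]
```

Respondent 2 is dropped because the dependent variable is missing, and respondent 3 because three of the four variables are blank.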
Third, assess and deal with missing data in variables: Once you are content that each observation has most of its relevant data and variables, start analyzing the variables (columns). Again, Chapter 9 shows you how to do this practically in SAS. Differentiate between those variables that stand alone as single-item variable measures (such as a person’s age) and those that are part of bigger multi-item scales:

Variables that stand alone as single-item variable measures: Here you are concerned with the proportion of missing data and whether it is small enough to either leave alone (you will generally lose all observations for which it is missing) or do something about it.

If the proportion of data missing in the variable is extremely large you may have no other option than to delete the variable, but this depends on the importance of the variable.

If the proportion missing is moderate then you might consider remedial action. There are various options:

Simple imputation: Under some (but not all) circumstances you can replace the missing data with a score that reflects the usual answers for that question. For example, if respondent #5 left out a trust question, we could replace the blank with an answer reflective of the rest of the group’s answers on that question. For answer scales using a 1-5 or 1-7 type scale (e.g. the Likert scale), you could replace the blanks with the median of each question (not the average). For answer scales with truly continuous data, for example, percentages, we could replace with an average.

More complex solutions: There are more complex missing data protocols through which you can deal with the issue (notably multiple imputation, full information maximum likelihood, and bootstrapping). I cannot go into much more detail here; the interested reader should consult more specialist texts [1]. If at all possible, use these.

If the proportion missing is very small (say less than 5%, but you can decide), consider leaving it alone. However, how can you be sure that this is the right choice? One way to check is to run the analysis with each option (leaving it alone, replacement, variable deletion, etc.) and see if there is a big difference in results.

Variables that are items in a multi-item scale: Most of the above considerations for single-item measures also apply to these, except that you have one extra option. With multi-item scales we can sometimes smooth over the missing items when we aggregate. For instance, say that for a given observation you have ten survey items for a scale and without missing data the answers are 2, 3, 3, 4, 2, 3, 2, 4, 1, 2. When we aggregate these items into a variable score by averaging we get 2.60. Now, if the person is missing one of the scores, the average of the other nine will range between 2.44 and 2.78 depending on which is missing. Is this an acceptable approximation? The smaller the proportion of missing items the better; the choice is yours as to whether you are happy to take this route.
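
The arithmetic in the ten-item example can be reproduced with a short Python sketch, which drops each answer in turn and recomputes the average of the remaining nine. Only the ten answers come from the text; the variable names are illustrative.

```python
from statistics import mean

# The ten answers from the example above.
answers = [2, 3, 3, 4, 2, 3, 2, 4, 1, 2]
full_average = mean(answers)

# Average of the other nine items when each item in turn is missing.
leave_one_out = [
    mean(answers[:i] + answers[i + 1:]) for i in range(len(answers))
]
lowest = round(min(leave_one_out), 2)
highest = round(max(leave_one_out), 2)

print(full_average, lowest, highest)  # 2.6 2.44 2.78
```

The lowest value arises when one of the 4s is missing and the highest when the 1 is missing, so a single missing item moves the score by less than 0.2 in either direction here.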
Chapter 9 discusses how to analyze and deal with missing data in SAS.
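
Before turning to SAS, the simple imputation option described above (median for 1-5 type scales, average for truly continuous data) can be sketched in Python. This is a sketch under assumptions: the function name `impute` and both example columns are hypothetical.

```python
from statistics import mean, median


def impute(values, method):
    """Replace missing values with the column median or mean."""
    if method not in ("median", "mean"):
        raise ValueError("method must be 'median' or 'mean'")
    present = [v for v in values if v is not None]
    fill = median(present) if method == "median" else mean(present)
    return [fill if v is None else v for v in values]


# Hypothetical trust question on a 1-5 scale; respondent #5 left it blank.
trust_q1 = [4, 5, 3, 4, None, 2, 4, 3]
print(impute(trust_q1, "median"))  # [4, 5, 3, 4, 4, 2, 4, 3]

# Hypothetical continuous variable (a percentage) with one blank.
pct_scores = [90.0, 80.0, None, 85.0]
print(impute(pct_scores, "mean"))  # [90.0, 80.0, 85.0, 85.0]
```

Note that with an even number of answered values the median can fall between two scale points, which is one more reason to treat simple imputation as a fallback rather than a first choice.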

Dealing With Multi-Item Scales

The Multi-Item Scale Issue

In the previous section on research methodology around gathering data, I highlighted the usefulness in many situations of using multi-item scales (multiple measurements of a single construct). Once you have gathered this raw data, you need to undertake various analyses of the multi-item scales.
The major question when you have a multi-item scale is “do these different variables actually form part of the same thing – in which case I can potentially aggregate them into a summary variable – or not?”
Figure 4.6 summarizes this process pictorially: some initial Satisfaction items and some initial Trust items may or may not be eligible to be gathered together into aggregated, summary variables.

Figure 4.6 Pictorial summary of collapsing multi-item scales into aggregate scores

Short Summary of the Main Tasks in Preparing Multi-Item Scales

Given the above discussion, there are two main steps when you have multi-item scales.

Establish whether the set of variables in a multi-item scale really do belong together. Just because you have a set of items that you think indicate one underlying idea – like Trust – does not mean that the data agree. As a technical issue, you would have to deal with a situation where some of your items address trust and some measure distrust – we call these reverse items. More importantly, there are statistical tests to establish whether or not the final set of trust sub-variables seems to form a consistent common measure. Chapter 9 deals with such tests.

If the items are eligible to group together, then aggregate them. There are different ways to aggregate sub-items into one main item. You might average the sub-items, sum them (if no missing data exists), or use various, more complex techniques. Your final step is to choose your aggregation method.
Chapter 9 deals with the basic practical SAS steps for dealing with multi-item scales.
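
As a preview of the two steps, the Python sketch below re-aligns a reverse-worded item and then averages the items into one score. It assumes a 1-5 answer scale, four satisfaction items with the fourth reverse-worded, and that the items have already been found to belong together; the function names and the example answers are hypothetical, and the consistency tests themselves are not shown.

```python
from statistics import mean


def reverse_score(value, low=1, high=5):
    """Flip an answer on a low-high scale, e.g. 2 becomes 4 on a 1-5 scale."""
    return None if value is None else (low + high) - value


def aggregate(items, reverse_positions=()):
    """Align reverse-worded items, then average the answered items."""
    aligned = [
        reverse_score(v) if i in reverse_positions else v
        for i, v in enumerate(items)
    ]
    answered = [v for v in aligned if v is not None]
    return mean(answered) if answered else None


# Hypothetical answers; position 3 (the fourth item) is reverse-worded.
satisfaction_items = [4, 5, 4, 2]
satisfaction = aggregate(satisfaction_items, reverse_positions={3})
print(satisfaction)  # 4.25
```

Averaging rather than summing is used here because, as noted above, a sum is only comparable across respondents when no items are missing.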
