Announcement

Creating Data Files

In 1994, this is how I began a review of of the program Epi Info version 5.01 for Chance magazine:

One semester, the data sets I wanted to use in class were already in machine-readable form. I placed them on the [then mainframe!] computer for my students as a matter of convenience to spare them the chore of data entry. Around the middle of the term, this prompted a student to ask, "Where do data files come from?"

Where do data files come from? With any luck, they will come from someone else, but eventually we all end up helping others deal with putting data files together.

Methods for recording data often appear to be selected as afterthoughts, and they often use tools designed for other purposes.

In 2000, this note began
Things haven't changed much over the last six year. Data continue to arrive from a variety of sources, entered in ways suggesting they weren't necessarily recorded with an eye toward future analysis.
Today, it's fair to say that things are different. The world needed a standard, so it chose the Excel spreadsheet. With Microsoft's dominance of the spreadsheet market, data are often collected directly into Excel spreadsheets. However, this is only part of the reason why Excel has become the standard. Users demand the ability to transfer data between programs. Software companies cannot ignore this and remain in business. Every program has its "Save As" options. Different programs may provide different options, but Excel is always on the list because it can be found on so many computers. Those who design program that import data, such as statistical program packages, recognize this reality and see to it that their programs can read Excel spreadsheets.

Excel has become analogous to an airline's hub. Just as getting from point A to point B involves passing through the hub, getting data from program A to program B often involves passing through Excel. There are still many cases where one program can read another program's data files directly, but if I were asked today "Where do data files come from?" my response would be, "Probably from someone's Excel spreadsheet."

Still, it is not good enough merely that data be entered into a spreadsheet. There are ways to enter data so that they are nearly unusable. Many spreadsheets require considerable massaging before they are suitable for analysis. This note provides some guidelines to minimize the work of importing Excel spreadsheets into standard statistical packages.

The standard data structure is a table of numbers in which each row corresponds to an individual subject and each column corresponds to a different variable or measurement. If for example, a study recorded the identification number, age, sex, height, and weight of 10 subjects, The resulting dataset would be composed of 10 rows and 5 columns.

Some recent programs use a different structure for repeated measurements on the same individual. The number of rows is determined by the total number of measurements rather than the number of subjects. Consider a study where 5 weekly blood pressure readings are made on each of 20 subjects. Until recently, the dataset would be composed of 20 rows and 6 columns (5 blood pressures and an id). Today, some programs want the data entered as 100 rows (5 rows for each of the 20 subjects) of 3 columns (id, week number(=1,2,3,4,5), blood pressure). The process of converting a dataset from one format to the other should be straightforward. The details will depend on the statistical program package.

Once the data structure is understood, the following rules should be observed to achieve a smooth transfer between a spreadsheet and a statistical program package.

The importance of including subject identifiers cannot be overstated. There are lots of ways to get into trouble very easily.

Inadequate identifiers make it difficult to merge files for analysis. It is highly probable that errors will result due to mismatching. The mismatching can be of two forms--either data from two subjects is combined or one subject's data gets split into multiple incomplete records. Regardless of the cause or the form of the mismatching, weeks, months, or even years of otherwise solid work will be compromised.


Copyright © 2000 Gerard E. Dallal