Data - Jose L. Duarte

Data handling and release

All of my data from published papers will be freely available to all, posted on this page.

I've developed a data handling procedure. Here's what you can expect:

When data collection ends for a given study, the raw data will be immediately sandboxed into a zip archive before anything else is done. I'll make a copy of that raw data, and all subsequent operations on the data -- cleaning, labeling variables, etc. -- will be on the copy. The raw data in the zip archive will never be touched again.
I will always post the sandboxed raw data here, along with the more usable cleaned and labeled dataset. This will ensure that anyone can refer to the raw, virgin data, compare it to the cleaned data, and see exactly what was done at the most granular level.
The data will always be in widely supported formats, such as .csv or Excel (.xls) files.
A note on the raw data: It will generally be a download from an online research platform like Qualtrics or SurveyMonkey, or from in-person research software like MediaLab. The downloads are either .csv or .xls files. In the past I've sometimes seen them include IP addresses, which can in theory compromise a participant's anonymity -- IP addresses can sometimes be traced to specific people, homes, or businesses. Last time I checked, there was no option in Qualtrics or SurveyMonkey to turn off IP address collection; I'm not sure whether that is still the case. If a raw dataset includes IP addresses -- or any other personally identifiable information -- I'll delete those columns before posting it here. This is the one exception to the promise of raw, virgin data -- participants' anonymity is more important. If I ever have to do that, it will be noted in the name of the zip file (e.g. DuarteEnvyStudy11B Raw Sandbox - IP addresses deleted.zip).
Whenever possible, my cleaned data files will include a lot of metadata. Metadata is data about data. In this case, it would be information about the study and the dataset. This is not yet a common practice in social psychology, but I expect it will be in the future. Here's the kind of metadata I'll try to include:

a. Design of the experiment, e.g. experiment vs. survey study vs. quasi-experiment; between-subjects or within-subjects; and details like 3 X 3 ANOVA
b. Number of groups
c. Number of participants
d. The nature of the participants: college students, community sample, employees at a particular workplace, etc.
e. The locale of the participants: Country, state, city, area codes, etc.
f. Variable mappings: e.g. RSE is Rosenberg Self-Esteem
g. Dates when data was collected
h. Disposition of outliers
i. Names of predictors
j. Names of dependent variables
k. Measures used: probably based on standardized abbreviations
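
To make the list above concrete, here is a minimal sketch in Python of one way to record those items: a JSON sidecar file saved next to the cleaned dataset. This is not a standard, and it is not the XML schema mentioned elsewhere on this page. The field names, the write_metadata helper, the sidecar file naming, and every value in EXAMPLE_METADATA are hypothetical placeholders, not details of any real study.

```python
import json
from pathlib import Path

# Hypothetical field names, one per item (a-k) in the list above.
REQUIRED_FIELDS = [
    "design", "n_groups", "n_participants", "participants", "locale",
    "variable_mappings", "collection_dates", "outlier_disposition",
    "predictors", "dependent_variables", "measures",
]

# Placeholder values for illustration only; not data from a real study.
EXAMPLE_METADATA = {
    "design": "experiment, between-subjects, 3 X 3 ANOVA",
    "n_groups": 9,
    "n_participants": 180,
    "participants": "college students",
    "locale": {"country": "US", "state": "placeholder"},
    "variable_mappings": {"RSE": "Rosenberg Self-Esteem"},
    "collection_dates": {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"},
    "outlier_disposition": "none removed",
    "predictors": ["condition_a", "condition_b"],
    "dependent_variables": ["RSE"],
    "measures": ["RSE"],
}


def write_metadata(dataset_path, metadata):
    """Write metadata as a JSON file next to the dataset and return its path.

    Raises ValueError if any of the REQUIRED_FIELDS is missing.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    dataset_path = Path(dataset_path)
    # e.g. study_clean.csv -> study_clean.metadata.json
    sidecar = dataset_path.with_name(dataset_path.stem + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Calling write_metadata("study_clean.csv", EXAMPLE_METADATA) would create study_clean.metadata.json in the same folder. A sidecar file works with .csv data, which has no place to embed metadata of its own.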

You get the idea. I'm still thinking about metadata frameworks and the best approach, so expect updates to this page in the future. I'm also thinking about online repositories, a Dropbox for social psychology data, better tools to ensure non-tampering, etc. I think it's ridiculous that people are allowed to make sweeping claims about human nature and behavior -- under the august banner of science -- based on a sequestered Excel file that only one or two people in the world have ever seen, and on analyses that have never been checked or audited.
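
As a rough illustration of the sandboxing steps described at the top of this page, plus one simple tamper check, here is a minimal Python sketch. It assumes the raw download is a UTF-8 .csv file with a single header row (real platform exports may differ). The column names in IDENTIFYING_COLUMNS and the function names are hypothetical, and the SHA-256 checksum is my own suggestion for illustration, not an established part of the procedure.

```python
import csv
import hashlib
import io
import zipfile
from pathlib import Path

# Hypothetical header names; check what your platform's export actually uses.
IDENTIFYING_COLUMNS = {"IPAddress", "IP Address"}


def strip_identifying_columns(csv_text, identifying=IDENTIFYING_COLUMNS):
    """Return (csv_text_without_those_columns, list_of_removed_headers)."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return csv_text, []
    header = rows[0]
    keep = [i for i, name in enumerate(header) if name.strip() not in identifying]
    removed = [name for name in header if name.strip() in identifying]
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Guard against rows shorter than the header.
        writer.writerow([row[i] for i in keep if i < len(row)])
    return out.getvalue(), removed


def sandbox_raw_data(raw_csv_path, study_name, out_dir):
    """Zip the raw file once and return (zip_path, sha256_hex_digest).

    If IP address columns are found they are dropped and the zip name says so.
    Mode "x" refuses to overwrite, so an existing sandbox is never touched.
    """
    raw_csv_path = Path(raw_csv_path)
    text = raw_csv_path.read_bytes().decode("utf-8")
    cleaned, removed = strip_identifying_columns(text)
    suffix = " - IP addresses deleted" if removed else ""
    zip_path = Path(out_dir) / f"{study_name} Raw Sandbox{suffix}.zip"
    with zipfile.ZipFile(zip_path, "x", compression=zipfile.ZIP_DEFLATED) as zf:
        if removed:
            zf.writestr(raw_csv_path.name, cleaned)
        else:
            # Nothing to strip: store the original bytes unchanged.
            zf.write(raw_csv_path, arcname=raw_csv_path.name)
    digest = hashlib.sha256(zip_path.read_bytes()).hexdigest()
    return zip_path, digest
```

Posting the digest alongside the zip lets anyone confirm that the archive they downloaded matches the one created when data collection ended. A checksum only detects changes made after the archive exists, so it is a partial answer to the non-tampering problem at best.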

Metadata tools:

How to include metadata in your data files? That was my question, and I've been working on a standardized XML schema for social psychology data. Along the way, I've discovered some other options. First, Excel already allows for some metadata. I had no idea. Check out your File Properties. Here's a good overview of what to do from Ekaterina Bespalaya over at AbleBits.

You might find that the built-in Excel features aren't quite enough to include comprehensive and rich metadata. In that case, we turn to an awesome tool called Colectica. They offer a dedicated plug-in for Excel, and it's free. It embeds metadata into your Excel file, and allows much more flexibility and customization than the built-in Excel features. It also includes a variable editor, like you'd see in SPSS.

Moving to an even larger scale approach, check out the Data Documentation Initiative. They're working on a metadata standard for the social sciences, based on XML and possibly RDF. Also check out their Tools page. Discovering the DDI made me stop working on an XML schema for data and metadata -- when I have time, I'll dig into their framework and decide whether or not it fills the needs I was thinking of.

For more on Open Science, see the Open Science Framework.