South African Data Archive

Contact us:
South African Data Archive
National Research Foundation
P.O. Box 2600
Pretoria 0001
South Africa

Tel: +27 (0)12 481 4120
or +27 (0)12 481 4016
Fax: +27 (0)12 481 4231


Data Deposit Information and Forms

The following guidelines aim to set out the minimum standards required for the deposit of data at the South African Data Archive.

When depositing data with SADA, the Depositor must provide the following:

  1. A depositor's form, providing information about the principal investigator and depositor, the study description and technical issues and specifications.
  2. A copy of the codebook or code lists.
  3. Copies of the data files in a machine-readable format (e.g. ASCII or SPSS), and the data definition files detailing the characteristics of records and variables assigned.
  4. A copy of the data collection instrument(s) (e.g. questionnaire and interviewer manual, if available).

Ownership and Copyright

It is important to note that the depositor always retains full copyright and ownership of the data. The data do not in any way belong to SADA. SADA stores, preserves and administers the data, and controls access to them. In addition, the data are available to other researchers only subject to conditions laid down by the depositor. Depositors may provide unrestricted access to their data through SADA or may specify restrictions, to which SADA adheres.

Depositor's form information

Part 1: Information about the investigator(s) and depositor

  • The names, contact details and institutional affiliations of the principal investigators, co-investigators and depositor of the study are required in this section.

Part 2: Study description

The study description should include the following:

  • The title of the study.
  • The time period covered, that is, the start and end date of the field work period.
  • Geographic coverage of the data, for example, national.
  • The name(s) of organisation(s) responsible for data collection.
  • The details of any funding organisation(s).
  • The type of data collection, for example, survey or census.
  • The units of observation, for example, individuals, households or groups.
  • The number of observations or cases.
  • The number of variables.
  • The overall response rate.
  • Weighting procedures (if applicable).
  • Time dimensions, for example, cross sectional, longitudinal, panel or trend.
  • A brief study description, listing the objectives of the study.
  • The original language(s) employed in the study.
  • The data collection methods used, for example, face-to-face interviews or telephone surveys.
  • The type of questionnaire used (if applicable), for example, open-ended or structured.
  • The sampling method, for example, random, cluster or quota sampling.
  • A list of both published and unpublished papers or reports.

Part 3: Technical issues and specifications

This section should include:

  • The number of data files included.
  • Details on whether the data files are compressed or uncompressed, zipped or unzipped, and whether they can be merged.
  • Storage media, such as disks and CDs, should be clearly labelled, ensuring that the external labels and filenames correspond.
  • Variable names and value labels, where possible.
  • Information on derived variables and other recoding. The depositor should specify the following:
    • Source variables
    • The question numbers and names to which the original variables relate
    • The new variable label and value labels
    • If a derived variable is created for only part of a sample, this should be made clear
  • Procedures taken to correct errors. The documentation accompanying a dataset should describe how error checking was done. If different versions of the data were produced, this must also be adequately described.
  • A brief description of all data and documentation files.

Please complete the electronic depositor's form

Codebook or codelist information

A codebook can be defined as documentation for a study which includes the complete technical description of each question or variable, as well as the actual location of the question in the data record.

The following information must be included in the codebook:

  • Identifying names or numbers of variables: These are included when the data are prepared with certain software systems (for example, SAS, SPSS or Excel). A variable name is an abbreviation or summary for each question. Variable numbers are usually assigned sequentially to each question.
  • Location of variables: Each variable must have a data location. If a card format is used, card and column numbers must be assigned. If a logical record format is used, only column numbers are given.
  • Questionnaire text: The complete text of each question should be recorded. The use of abbreviated names may cause confusion since the name may not adequately convey what was asked of the respondent.
  • Explanatory text: The coding conventions employed and any interviewer instructions should be included with the codebook. Information contained on flash or show cards should also be part of the explanatory text.
  • Code categories: All coded fields of information, together with a description of each coded value, must be recorded. If abbreviations or other standardisations are systematically used, they should be defined in the codebook. Wild codes should be documented as wild codes.
  • Missing data: Missing data values for each variable should be defined clearly. If certain questions are applicable only to a subset of the population, that subset needs to be described in appropriate text or code description. There are two types of missing data:
    • Item non-response: The documentation should outline the reasons why specific items are missing. It should note whether specific conventions, such as blanks, were used for "not applicable". If new values were estimated for missing codes, this should be detailed, together with how they were calculated. If estimated values were flagged for identification purposes, this should also be discussed.
    • Case non-response: The documentation should state whether reasons were recorded for missing cases and whether cases with only partial information were retained. It should also discuss whether weights were used to compensate for non-response or sample design, what types of weights were used and, if weights were used, whether weighted data can be distinguished from non-weighted data.
  • Derived variables: These should be clearly marked and documented.
  • Confidentiality procedures: It is essential that the confidentiality and anonymity of data subjects be maintained. This may be difficult, especially when geographic identifiers can be used to breach confidentiality. Thus, a full statement of confidentiality procedures (for example, excluding explicit references to persons, households or institutions) should be included.
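
To make the codebook requirements above more concrete, the following Python sketch shows how a variable's column location, question text, code categories and missing-data codes might be used together to read a logical record and flag wild codes. It is a minimal illustration under stated assumptions, not a SADA tool: the `CodebookEntry` class, the `SEX` variable, its codes and the sample records are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class CodebookEntry:
    """Hypothetical codebook entry for one variable in a logical record."""
    name: str
    question_text: str
    start_col: int                 # 1-based first column
    end_col: int                   # 1-based last column, inclusive
    categories: dict[str, str]     # valid code -> description
    missing_codes: dict[str, str]  # missing-data code -> reason

    def extract(self, record: str) -> str:
        """Return the raw code stored at this variable's column location."""
        return record[self.start_col - 1:self.end_col]

    def classify(self, record: str) -> str:
        """Classify a record's value as 'valid', 'missing' or 'wild'."""
        code = self.extract(record)
        if code in self.categories:
            return "valid"
        if code in self.missing_codes:
            return "missing"
        return "wild"  # a code that the codebook does not define


# Invented example: columns 1-4 hold a case number, column 5 holds SEX.
sex = CodebookEntry(
    name="SEX",
    question_text="What is your sex?",
    start_col=5,
    end_col=5,
    categories={"1": "Male", "2": "Female"},
    missing_codes={"9": "Not stated"},
)

records = ["00011", "00022", "00039", "00047"]
for rec in records:
    print(rec, sex.extract(rec), sex.classify(rec))
```

In this invented data the last record carries the code 7, which is neither a defined category nor a defined missing-data code, so it is reported as wild and would need to be documented as such.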

Information on data files

  • Data should be adequately documented. The ideal format of the data is one in which the data are written in a standard format (e.g. ASCII or SPSS), and accompanied by a data definition file for the software used. SADA accepts data in ASCII, SPSS, SAS or Excel formats.
  • A separate code should be assigned to missing data, and this should always be made clear in the accompanying documentation. Blanks, or assigning zero as a code, should be avoided as far as possible, as this often causes confusion.
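
The advice on missing data can be sketched in Python as follows. The example assumes a numeric variable (number of children) for which zero is a genuine answer, so blanks are replaced with explicit negative codes rather than zero. The code values, the `MISSING_CODES` table and the `encode_response` function are hypothetical choices made for illustration; SADA does not prescribe particular codes.

```python
# Hypothetical missing-data codes, chosen so they cannot collide with real values.
MISSING_CODES = {
    "refused": -7,
    "dont_know": -8,
    "not_applicable": -9,
}


def encode_response(raw: str, reason_if_blank: str) -> int:
    """Convert a raw field to an integer, replacing blanks with an explicit code."""
    text = raw.strip()
    if text == "":
        return MISSING_CODES[reason_if_blank]
    return int(text)


# Number of children: 0 is a real answer, so it must not double as "missing".
raw_fields = ["2", "0", " ", ""]
reasons = ["", "", "refused", "not_applicable"]
encoded = [encode_response(r, why) for r, why in zip(raw_fields, reasons)]
print(encoded)  # [2, 0, -7, -9]
```

Whatever codes are chosen, the table that defines them belongs in the accompanying documentation so that secondary users can tell a true zero from a missing value.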

