CS 257: Software Design

Prof. Jean Salac - Winter 2025

Be Sure You Know These Things About Your Data

With datasets, it's even more important than usual to keep careful records of what you find as you find it. Datasets change over time, and they can be stored in ways that make it very difficult to retrace your steps to something you found days ago.

Before you leave a webpage where you found interesting data, be sure you know:

  1. The search terms and search tool you used to find the data
  2. The URL for the dataset
    AND ALSO BE SURE TO DOWNLOAD A LOCAL COPY
  3. The date you downloaded the data
  4. The authorship (scholar, PI, agency, etc)
  5. The exact name and version of the dataset
  6. Time period and/or geography covered
  7. Location of dataset overview/description information
    AND ALSO BE SURE TO DOWNLOAD A LOCAL COPY
  8. Location of technical documentation (code book, user guide, metadata, documentation, terms of use, etc)
    AND ALSO BE SURE TO DOWNLOAD A LOCAL COPY
  9. Data formats present
  10. Terms of Use
  11. Suggested Citation if provided
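
The checklist above can also be kept as a structured record alongside your local copies. The following is a minimal sketch, not a required format: the `DatasetRecord` class, its field names, and the `sha256_of` helper are hypothetical names chosen for illustration. The checksum is an extra that is not on the list; it is one way to notice later that a re-downloaded file differs from the copy you saved.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class DatasetRecord:
    """One record per dataset; field names are hypothetical."""
    search_terms: str              # item 1
    search_tool: str               # item 1
    url: str                       # item 2
    local_copy_path: str           # item 2 (the downloaded copy)
    download_date: str             # item 3, e.g. "2025-01-15"
    authorship: str                # item 4
    name_and_version: str          # item 5
    coverage: str                  # item 6 (time period and/or geography)
    overview_location: str         # item 7
    documentation_location: str    # item 8
    data_formats: List[str]        # item 9
    terms_of_use: str              # item 10
    suggested_citation: Optional[str] = None  # item 11, if provided
    local_copy_sha256: Optional[str] = None   # extra: not in the checklist

    def missing_fields(self) -> List[str]:
        """Names of required fields that were left empty."""
        optional = {"suggested_citation", "local_copy_sha256"}
        return [name for name, value in asdict(self).items()
                if name not in optional and not value]

    def to_json(self) -> str:
        """Serialize the record so it can be saved next to the data."""
        return json.dumps(asdict(self), indent=2)
```

Calling `missing_fields()` before you close the browser tab gives a quick reminder of anything you have not yet written down, and `to_json()` produces text you can save in the same folder as the downloaded files.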

Here's a Google Doc to help you keep your records.

Factors to Consider When Evaluating Statistics

Source

  • Who collected it?
  • Was it an individual or organization or agency? 
  • The data source and the reporter or citer are not always the same. For example, advocacy organizations often publish data that were produced by some other organization. When feasible, it is best to go to the original source (or at least know and evaluate the source).
  • If the data are repackaged, is there proper documentation to lead you to the primary source? Would it be useful to get more information from the primary source? Could there be anything missing from the secondary version?

Authority

  • How widely known or cited is the producer? Who else uses these data?
  • Is the measure or producer contested?
  • What are the credentials of the data producer?
  • If an individual, are they an expert on the subject?
  • If an individual, what organizations are they associated with? Could that association affect the work?

Objectivity & Purpose

  • Who sponsored the production of these data?
  • What was the purpose of the collection/study?
  • Who was the intended audience for or users of the data?
  • Was it collected as part of the mission of an organization? Or for advocacy? Or for business purposes?

Currency

  • When were the data collected? This is not always close to when they were released or published; there is often a time lag between collection and reporting because of the time required to analyze the data.
  • Are these the newest figures? Sometimes the newest available figures are a few years old. That is okay, as long as you can verify that there isn't something newer.

Collection Methods & Completeness

  • How are the data collected? Count, measurement or estimation?
  • Even a reputable source and collection method can introduce bias. For example, crime data come from many sources, from victim reports to arrest records, and each source can give a different picture.
  • If a survey, how many people were surveyed, and how does that compare to the size of the population the survey is supposed to represent?
  • If a survey, what methods were used to select the people included? How was the total population sampled?
  • If a survey, what was the response rate?
  • Which populations were included? Which were excluded?
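
Two of the survey questions above come down to simple ratios. The sketch below assumes the most basic definitions: response rate as completed responses divided by people invited, and sampling fraction as sample size divided by population size. Survey organizations define response rate in several ways, so check which definition a source uses. The function names and the numbers in the usage note are hypothetical.

```python
def response_rate(completed: int, invited: int) -> float:
    """Share of invited people who completed the survey (simplest definition)."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    if not 0 <= completed <= invited:
        raise ValueError("completed must be between 0 and invited")
    return completed / invited


def sampling_fraction(sample_size: int, population_size: int) -> float:
    """Share of the target population that ended up in the sample."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must be between 0 and population_size")
    return sample_size / population_size
```

With made-up numbers, a survey that invited 1,200 people and received 300 completed responses has `response_rate(300, 1200)` equal to 0.25. If the population it is meant to represent is 48,000 people, `sampling_fraction(300, 48000)` is 0.00625, or about 0.6%. Neither number alone says whether the survey is trustworthy, but both should be findable in the documentation.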

Consistency / Verification

  • Do other sources provide similar numbers?
  • Can the numbers be verified?
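
One rough way to check whether two sources "provide similar numbers" is to compare their figures by relative difference. This is only a sketch: the function names are hypothetical, and the 5% default tolerance is an arbitrary assumption, not a standard. What counts as similar depends on the measure and on how each source collected it.

```python
def relative_difference(a: float, b: float) -> float:
    """Absolute difference as a fraction of the larger magnitude."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0  # both figures are zero, so they agree exactly
    return abs(a - b) / largest


def roughly_agree(a: float, b: float, tolerance: float = 0.05) -> bool:
    """True if the two figures differ by no more than the tolerance."""
    return relative_difference(a, b) <= tolerance
```

For example, with invented figures of 1,000 from one source and 1,050 from another, `roughly_agree(1000, 1050)` is true under the 5% tolerance, while `roughly_agree(1000, 1200)` is false. A large gap is a prompt to go back to the collection methods and definitions, not proof that either source is wrong.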