Some recent input from a Data Planet user stressed that we need to be clear and consistent in how we use terminology related to statistics and data. We have some information on this in our help materials and we’ll highlight certain terms in this short post.
Data Planet publishes statistical datasets that are sourced from a wide range of public and private organizations. We call Data Planet an “aggregation” of datasets – a collection of datasets served up together. It’s the same concept as the big journal databases available through your library that bring together journals from different publishers – we do something similar but with statistical datasets as the content.
Data is really just information. (At Data Planet, we usually use the word data to cover both plural and singular cases. While datum is the precise term for the singular, usage varies and it just gets awkward: data is vs. data are?) Data are often distinguished as quantitative, which refers to numeric data, and qualitative, which refers to some quality of the subject under investigation. Virtually all the data in Data Planet would be considered quantitative, although there are some exceptions, such as recession bars, which indicate whether or not the US economy was in a recession at a given time.
Data Planet publishes statistics: numbers that describe some characteristic, or status, of a variable, e.g., a count or a percentage. Some of these are derived statistics, which are calculated on the basis of other statistics. For example, the crime rate is derived from the count of crimes committed relative to the population of the area under investigation. In the example below, the vertical bars represent the crime rate per 100,000 population and the trend line in dark green shows the actual count of crimes.
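To make the idea of a derived statistic concrete, here is a minimal Python sketch of the rate calculation described above. The function name rate_per_100k and the figures are hypothetical, chosen only for illustration; they are not Data Planet data and this is not necessarily how any particular source agency computes its published rates.

```python
def rate_per_100k(count, population):
    """Derive a rate per 100,000 population from a raw count."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so that round figures stay exact.
    return count * 100_000 / population


# Hypothetical figures for illustration only, not real crime data.
crime_count = 1_250
population = 500_000

rate = rate_per_100k(crime_count, population)
print(f"{rate:.1f} crimes per 100,000 population")
```

The count is the underlying statistic and the rate is the derived one. Expressing crime per 100,000 population is what makes areas of different sizes comparable.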
Data Planet statistics are generally categorized as descriptive statistics, meaning counts, averages (means), percentages, etc., that summarize the quantitative information obtained during the data collection effort. Descriptive statistics are usually contrasted with inferential statistics, which are used to draw inferences about a population based on information collected on a sample of the population. For example, the Census Bureau collects data from a sample of the US population in conducting the American Community Survey, and then uses a series of statistical tests to create estimates of what the statistic would be for the entire nation. Information on the calculations used (t-tests, ANOVA, regression, etc.) in developing the estimates is typically found in the technical documentation on the survey methodology – you’ll find a link to this documentation in Data Planet in the information provided with the chart you create.
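The contrast between descriptive and inferential statistics can be sketched in a few lines of Python. The sample values below are invented, and the interval uses a simple normal approximation (critical value 1.96) as an assumption for illustration. This is not the method the Census Bureau uses for the American Community Survey; consult the survey's technical documentation for that.

```python
import math
import statistics

# Hypothetical sample of household sizes; invented values for illustration.
sample = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3]

# Descriptive statistics: summarize the sample itself.
n = len(sample)
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation

# Inferential statistics: estimate a plausible range for the mean of the
# whole population. Assumes a normal approximation with critical value 1.96;
# a sample this small would normally call for a t-based interval instead.
stderr = stdev / math.sqrt(n)
margin = 1.96 * stderr
ci = (mean - margin, mean + margin)

print(f"sample size: {n}, sample mean: {mean:.2f}")
print(f"approximate 95% interval for the population mean: "
      f"{ci[0]:.2f} to {ci[1]:.2f}")
```

The first group of values describes only the ten observations collected. The interval is a statement about the larger population they were drawn from, which is what makes it inferential.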
Finally, Data Planet publishes statistical datasets: collections of related data items, e.g., a count of the responses of survey participants. This term is used very loosely – the entire Census 2010 Summary File 1 can be considered a dataset, as can any table published from the Census 2010 Demographic Profiles, e.g., DPSF3. Sex for the Population Age 16 and Over from the 2010 US Census of Population and Housing: Demographic Profiles Databases. Some Data Planet datasets are also time series, which means that the information is recorded over a period of time. The Census 2010 represents a single point in time, so it would be considered a dataset but not a time series.
Definitions provided here are drawn from W. Paul Vogt’s Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences, 3rd edition, published in 2005 by SAGE Publications, Inc. Check your library – you likely have access to this or another dictionary/encyclopedia of statistics!