Data Formats: A Taxonomy of Variable Types

This came up in a statistics class and initially felt like vocabulary for its own sake. It turned out to be load-bearing. Every mistake I later made in early EDA (wrong chart type, wrong test, wrong encoding) traced back to not having classified the column correctly first. The taxonomy is not theory; it is a checklist that makes the analysis easier from the start.

Before any EDA, before any test, before any model, there is a prior question that determines almost every decision that follows: what kind of data is this column? The answer constrains which visualisations are valid, which statistical tests are applicable, and how the variable should be encoded for a model. Getting this classification wrong at the start propagates errors through the entire analysis.

The taxonomy splits into two families: qualitative (categorical) and quantitative (numerical). Within each family, the sub-types carry specific properties that matter in practice.

Quick Reference

Type	Family	Ranked?	Charts	Examples
Nominal	Qualitative	No	Bar, Pie, Stacked bar	Ethnicity, Gender
Ordinal	Qualitative	Yes (unequal gaps)	Bar (not Pie)	Grades, Ratings, Very bad/Bad/Good
Binary	Qualitative (Nominal)	No	Bar	1/0, yes/no, on/off
Continuous	Quantitative	N/A	Histogram, Box plot, Line graph	Height, Weight
Discrete	Quantitative	N/A	Histogram, Box plot, Bar chart, Scatter plot	Age, Number of students, Population
Interval	Quantitative	N/A (no true zero)	Histogram, Box plot, Bar chart, Scatter plot	Temperature (°C / °F)

Qualitative Data

Qualitative variables describe categories or labels. The key property that differentiates the sub-types is whether the categories have a natural order.

Nominal data has no order. The categories are names, and no arithmetic relationship exists between them. Ethnicity, gender, and country of origin are nominal. Saying that category A is “greater than” category B is meaningless. Appropriate charts are bar charts and pie charts. One-hot encoding is the default encoding strategy for models. The mode is the only valid measure of center.
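As a minimal sketch of the two operations named above (the mode and one-hot encoding), assuming pandas is available. The `country` series and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical nominal column; the values are illustrative only.
country = pd.Series(["FR", "DE", "FR", "JP", "FR"], name="country")

# The mode is the one measure of center that makes sense for labels.
most_common = country.mode().iloc[0]

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(country, prefix="country", dtype=int)

print(most_common)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is treated as larger or smaller than another.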

Ordinal data has a natural order, but the distances between categories are not uniform or meaningful. A student grade of A is higher than B, and B is higher than C, but the gap between A and B is not necessarily equal to the gap between B and C. Survey ratings (1 to 5), satisfaction levels (poor, fair, good), and academic standings are ordinal. Bar charts are appropriate, with the bars drawn in rank order. Pie charts are a poor fit, because slices arranged around a circle have no start or end and so do not convey the ordering. Ordinal encoding (often loosely called label encoding: mapping the categories to integers in rank order) is preferable to one-hot encoding for ordinal variables passed to tree-based models. The integers must follow the rank order, not the alphabetical order of the labels. The median is the appropriate measure of center.
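One way to keep the rank order explicit is an ordered categorical. This is a sketch assuming pandas; the `levels` list and the sample responses are hypothetical.

```python
import pandas as pd

# Hypothetical scale, listed from lowest to highest.
levels = ["poor", "fair", "good"]
satisfaction = pd.Series(["good", "poor", "fair", "good", "good"])

# An ordered categorical stores the rank order alongside the labels.
ordered = pd.Categorical(satisfaction, categories=levels, ordered=True)

# Integer codes follow the declared order: poor=0, fair=1, good=2.
codes = pd.Series(ordered.codes)

# The median is taken on the rank codes and mapped back to a label.
median_code = int(codes.median())
median_label = levels[median_code]

# Order comparisons are valid; arithmetic on the labels is not.
at_least_fair = int((ordered >= "fair").sum())

print(median_label, at_least_fair)
```

With an even number of observations the median code can fall between two levels, so the `int()` cast is only safe here because the example has five values.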

Binary is a special case of nominal with exactly two categories: yes/no, 1/0, on/off, true/false. It is nominal because there is no order (0 is not less than 1 in any meaningful sense for most binary variables; it is simply the other state). Binary variables are encoded as a single integer column. Fraud labels, churn flags, and pass/fail outcomes are binary.
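A small sketch of the single-column encoding, assuming pandas. The `churned_raw` values and the choice of which state maps to 1 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw flag as it might arrive from a source system.
churned_raw = pd.Series(["yes", "no", "no", "yes", "no"])

# Which state becomes 1 is a convention, not a ranking.
mapping = {"no": 0, "yes": 1}

# map() yields NaN for any unexpected label, and the integer cast then
# raises, so stray values surface early instead of being silently encoded.
churned = churned_raw.map(mapping).astype("int64")

print(churned.tolist())
```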

Quantitative Data

Quantitative variables represent measured or counted quantities. The sub-types differ in whether the values are counts or measurements and whether a true zero exists.

Discrete data consists of countable values, typically whole-number counts. You cannot have 2.7 students. Age as reported in whole years, number of transactions, and population counts are discrete. Discrete variables can be visualised with bar charts, histograms (with wider bins to avoid over-granularity), box plots, and scatter plots. Because the values are counts, gaps between values are meaningful and the data has a natural grid structure.

Continuous data exists on a number line with no gaps. Height, weight, temperature, and time elapsed are continuous. Any value within a range is theoretically possible. Histograms, box plots, and line graphs are appropriate. Unlike discrete data, a bin in a histogram for continuous data represents a range, not a single value.

Interval data (also called interval-scale data) consists of measured values where the difference between values is meaningful but there is no true zero. Temperatures in Celsius and Fahrenheit are the standard examples: 0°C does not mean the absence of temperature, and 20°C is not twice as warm as 10°C in any physically meaningful sense. Ratios are not interpretable, but differences are. Interval versus ratio is a separate question from discrete versus continuous: Celsius temperature is both continuous and interval-scale. The usual quantitative visualisations apply (histogram, box plot, bar chart, scatter plot).

A true zero would make the variable ratio-scale, where ratios are interpretable. Height has a true zero (no height), so 180 cm is meaningfully twice 90 cm. Income has a true zero. Most physical measurements are ratio scale. Temperature in kelvin is ratio scale because 0 K is absolute zero, the lowest possible temperature, rather than an arbitrary reference point.
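The arithmetic behind the interval/ratio distinction can be checked directly. This sketch uses only the standard temperature conversions; the helper names are my own.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c: float) -> float:
    """Standard conversion: K = C + 273.15."""
    return c + 273.15


low_c, high_c = 10.0, 20.0

# The "ratio" changes with the unit, so it carries no physical meaning
# on an interval scale.
ratio_c = high_c / low_c
ratio_f = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)

# Kelvin has a true zero, so this ratio is the physically meaningful one.
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Differences survive a change of unit up to a constant factor (9/5).
diff_c = high_c - low_c
diff_f = celsius_to_fahrenheit(high_c) - celsius_to_fahrenheit(low_c)

print(ratio_c, ratio_f, round(ratio_k, 3))
print(diff_c, diff_f)
```

The same pair of temperatures gives a ratio of 2.0 in Celsius, 1.36 in Fahrenheit, and roughly 1.035 in kelvin, while the difference is 10 degrees Celsius or, equivalently, 18 degrees Fahrenheit.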

Why This Matters in Practice

The classification is not academic. It determines which operations are valid, and the typical mistakes follow directly from ignoring it:

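Computing the mean of a nominal variable (such as averaging arbitrary integer codes assigned to three or more countries) produces a number that has no interpretation. A 0/1 binary column is the one exception: its mean is the proportion of ones.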
Applying ordinal encoding to a nominal variable implies a ranking that does not exist, which can produce spurious patterns in a linear model.
Using a pie chart for ordinal data obscures the ordering and invites misreading.
Running a chi-squared test on a continuous variable requires binning it first, and the result depends heavily on the bin choice.
Scaling an interval variable is valid (differences are meaningful), but interpreting a scaled value as a ratio (twice as large) is not.
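The first two pitfalls share a cause: integer codes for a nominal variable are arbitrary. As a small sketch with made-up data and two hypothetical coding schemes, the "mean" changes when the labels are renumbered, which is a sign that it measures nothing.

```python
from statistics import mean

# Hypothetical nominal values.
countries = ["FR", "DE", "FR", "JP", "FR"]

# Two equally arbitrary integer codings of the same categories.
coding_a = {"DE": 0, "FR": 1, "JP": 2}
coding_b = {"FR": 0, "JP": 1, "DE": 2}

mean_a = mean(coding_a[c] for c in countries)
mean_b = mean(coding_b[c] for c in countries)

# Same data, different "average": the number depends only on the coding.
print(mean_a, mean_b)
```

A linear model fed either coding would see a different numeric feature each time, even though the underlying data is identical.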

The practical starting point for any dataset is a column-by-column classification before any summary statistics or visualisations are produced. A column named “rating” stored as an integer looks quantitative but may be ordinal. A column named “zip_code” is nominal despite containing numbers. The dtype in a dataframe is not a reliable indicator of the variable type.
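To illustrate the gap between dtype and variable type, the sketch below (assuming pandas, with a made-up frame) shows three columns that all look numeric on load. After casting them to match a manual classification, only one remains numeric.

```python
import pandas as pd

# Hypothetical frame: every column loads with a numeric dtype.
df = pd.DataFrame({
    "zip_code": [10001, 94105, 60601, 10001],
    "rating": [5, 3, 4, 5],
    "income": [52000.0, 61000.5, 48000.0, 75000.25],
})

# Judging by dtype alone, all three columns appear quantitative.
numeric_before = df.select_dtypes("number").columns.tolist()

# Cast to match the manual classification: zip_code is nominal,
# rating is ordinal on an assumed 1-to-5 scale.
df["zip_code"] = df["zip_code"].astype(str)
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

# Only the genuinely quantitative column remains numeric.
numeric_after = df.select_dtypes("number").columns.tolist()

print(numeric_before)
print(numeric_after)
```

Storing zip codes as strings also avoids losing leading zeros, which integer storage silently drops.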

What You Can Do Now

Build a simple data dictionary for your current dataset by annotating each column with its variable type before touching the data:

import pandas as pd

# Template for a data dictionary
data_dict = pd.DataFrame({
    "column": ["age", "gender", "satisfaction_rating", "income", "churned"],
    "dtype":  ["int", "str", "int", "float", "int"],
    "var_type": ["discrete", "nominal", "ordinal", "continuous", "binary"],
    "valid_center": ["mean/median", "mode", "median", "mean/median", "mode"],
    "valid_charts": [
        "histogram, box plot",
        "bar chart",
        "bar chart",
        "histogram, box plot",
        "bar chart"
    ],
})

print(data_dict.to_string(index=False))

Filling in the var_type column forces explicit decisions about each feature before any automatic processing touches it. It also serves as documentation that survives the analysis and explains choices made downstream.
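Once the classification exists, it can drive downstream choices instead of the dtype. The following is one possible sketch, assuming pandas. The `center` helper, the sample frame, and the rule of reporting the mean for quantitative columns are illustrative choices, not a fixed recipe.

```python
import pandas as pd


def center(series: pd.Series, var_type: str):
    """Return a measure of center that is valid for the declared variable type."""
    if var_type in ("nominal", "binary"):
        return series.mode().iloc[0]
    if var_type == "ordinal":
        return series.median()
    if var_type in ("discrete", "continuous"):
        return series.mean()
    raise ValueError(f"unknown variable type: {var_type!r}")


# Hypothetical data matching some of the columns in the dictionary above.
df = pd.DataFrame({
    "age": [23, 35, 41, 35],
    "gender": ["f", "m", "f", "f"],
    "satisfaction_rating": [4, 2, 5, 4],
    "churned": [0, 1, 0, 0],
})

# The manual classification, one entry per column.
var_types = {
    "age": "discrete",
    "gender": "nominal",
    "satisfaction_rating": "ordinal",
    "churned": "binary",
}

centers = {col: center(df[col], vt) for col, vt in var_types.items()}
print(centers)
```

An unknown or missing `var_type` raises an error rather than falling back to the mean, which keeps unclassified columns from slipping through.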