
INFORMATION ENTROPY AS A MEASURE OF DIVERSITY

Updated: May 24, 2021

G. R. Harp


Introduction

This white paper is about machine learning problems that use gender, ethnicity, or similar variables as predictors in a model. It does not deal with the definition of race categories or how to 'measure' race (or gender). Instead, it considers what you can do with categorical data such as gender or race once you have it, and in particular how to quantify "diversity." Different fields (e.g. physics or evolutionary biology) offer multiple reasonable answers to choose from, but an overly simple measure such as the variance is not a theoretically sound measure of diversity.


Statement of the problem and definition of entropy


In one project, we were trying to come up with a practical definition of ethnic diversity and gender diversity in a human population. One approach is simply to count the number of categories that appear in the sample. But a population that is 99% one category and 1% another is not as diverse as such a raw count would suggest. We also discussed using A) the variance or B) the 'information entropy' as a measure of diversity. Here we compare those two choices.


One suggestion is to use the variance as a measure of diversity. This may be sufficient for some simple comparisons, but the variance has a couple of drawbacks. Variance is an extrinsic quantity, meaning it carries units, which makes it difficult to compare the variance (diversity) of two measures with different units. Suppose you have one column for race (categories = 1, 2, 3, and 4) and a second column for gender (categories = 1, 2, and 3). If you compute the variances of these two columns, it is not clear how you can compare one with the other.
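To make this concrete, here is a minimal sketch (with made-up category codes) showing that the variance of a categorical column depends entirely on the arbitrary numeric labels, so two differently coded columns cannot be compared on a common scale.

import numpy as np

race = np.array([1, 1, 2, 2, 3, 4])      # race coded as categories 1-4
gender = np.array([1, 1, 1, 2, 2, 3])    # gender coded as categories 1-3

print(np.var(race))     # variance in "race-code" units
print(np.var(gender))   # variance in "gender-code" units -- not comparable

# Relabeling one category (4 -> 40) changes the variance completely,
# even though the composition of the group is identical.
race_relabeled = np.array([1, 1, 2, 2, 3, 40])
print(np.var(race_relabeled))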


Another proposal is to use information entropy (Shannon entropy) as a diversity measure.


"The information entropy, often just entropy, is a basic quantity in information theory associated to any random variable, which can be interpreted as the average level of 'information', 'surprise', or 'uncertainty' inherent in the variable's possible outcomes." This quote, from Wikipedia's definition of information entropy, gives some idea of what entropy means. An example helps: suppose you have a classroom with 29 children of various races, and a 30th child from the same community is about to join. How would you predict the race of the 30th child before they arrive? How likely are you to guess wrongly, assuming you make the theoretically best guess? If your guess is wrong, you are 'surprised' by the result. Entropy measures that average surprise.


The Shannon entropy is defined as

H = -Σ_i p_i log(p_i)

where p_i is the probability of category i and the sum runs over all categories. If you haven't thought about it before, entropy can be a difficult concept to get your head around, so I refer you to the explanation on the Wikipedia page for information entropy to get started.
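As a minimal sketch, here is the formula translated directly into Python and checked against scipy.stats.entropy() (the same function used in the notebook of Figure 1 below). The class composition is made up for illustration, and the natural logarithm (scipy's default base) is assumed.

import numpy as np
from scipy.stats import entropy

counts = np.array([20, 5, 3, 1])   # e.g. 29 children in four race categories
p = counts / counts.sum()          # p_i = probability of category i

H_manual = -np.sum(p * np.log(p))  # H = -sum_i p_i * log(p_i)
H_scipy = entropy(p)               # same value computed by scipy

print("chance the best guess is wrong:", 1 - p.max())
print("entropy (average surprise):", H_manual, H_scipy)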


An illuminating example


Instead, let's consider an example. You have two cohorts of 100 people and some extrinsic variable X (one with units), whatever that may be. For convenience, we assume that X takes integer values with 1 ≤ X ≤ 100; that is, there are 100 possible categories for this variable.


In the first cohort, ninety-nine people have X = 1 and only one person has X = 100. Intuitively, we think this sample has very low diversity (whatever that means). A simple calculation shows that X has a variance of 97 units.


In the second cohort, X follows a (discretized) normal distribution over the same range of 1-100. Intuitively, this is a highly diverse sample. Yet, by coincidence, this cohort also has an X variance of 97. (In case you think this is impossible, I have prepared two samples with exactly these distributions, shown below.)


As an intuitive measure of diversity, the variance is disappointing, because it doesn't distinguish between these two very different groups. What about the entropy? Figure 1 shows a screenshot of a Jupyter notebook that calculates entropy using the scipy.stats.entropy() function in Python. (A similar function exists in all mainstream statistics packages, including R.)


Figure 1: Jupyter notebook comparing the variance and information entropy of two different cohorts, designed to give an intuitive feeling of either high or low diversity. The source code is attached to this post.

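Since the screenshot itself is not reproduced here, the following sketch reconstructs what the Figure 1 notebook computes. Cohort 1 matches the description in the text exactly; the construction of cohort 2 (a rounded, clipped normal distribution with a standard deviation chosen so the variance comes out near 97) is an assumption, not the original notebook code.

import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)

# Cohort 1: 99 people with X = 1 and one person with X = 100.
cohort1 = np.array([1] * 99 + [100])

# Cohort 2: 100 values drawn from a normal distribution, rounded and clipped
# into the range 1-100 (standard deviation tuned so the variance is near 97).
cohort2 = np.clip(np.round(rng.normal(loc=50, scale=9.85, size=100)), 1, 100)

def diversity_quotient(x, n_categories=100):
    # Normalized entropy: the sample entropy divided by the maximum possible
    # entropy, log(n_categories), giving a dimensionless value between 0 and 1.
    _, counts = np.unique(x, return_counts=True)
    return entropy(counts / counts.sum()) / np.log(n_categories)

for name, cohort in [("cohort 1", cohort1), ("cohort 2", cohort2)]:
    print(name,
          "variance:", round(float(np.var(cohort)), 1),
          "diversity quotient:", round(diversity_quotient(cohort), 3))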

Skipping to the last cell in Figure 1, we compare the variances and normalized entropies of the two cohorts. I'm coining a new name for the normalized entropy here: the 'diversity quotient.' The diversity quotient is a dimensionless measure that varies over the range 0-1. Comparing the two cohorts, the variances are almost equal, while the diversity quotient captures our intuitive sense of what diversity means.


Normalization


The only detail I haven't explained is how we normalize the entropy to a 0-1 scale. This is done by considering the entropy of the sample with maximum diversity: a sample where X takes a different value for every case. For a cohort of sample size N = 100, this means X takes 100 different values. The entropy here is easy to calculate, since every category i has the same probability p = 1/100 = 1/N. With N equally likely categories, the maximum entropy is

H_max = -Σ_i (1/N) log(1/N) = log(N),

and we form the normalized entropy (aka the diversity quotient) as the sample entropy divided by this maximum entropy.
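As a quick numerical check of this normalization (a sketch, again assuming the natural logarithm), the entropy of N = 100 equally likely categories does indeed equal log(N):

import numpy as np
from scipy.stats import entropy

N = 100
uniform = np.full(N, 1.0 / N)   # every category has probability 1/N

print(entropy(uniform))         # maximum entropy for 100 categories
print(np.log(N))                # log(100), the same value (about 4.605)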



Summary


Here we have presented a short analysis that motivates using information entropy as our diversity measure. Many other diversity measures appear in the literature. For example, a very simple one is to take the probability of the dominant class. This measure ranges from 1/N to 1 and is monotonically related to our intuition of diversity: the closer it is to 1, the less diverse the sample.
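For comparison, here is a short sketch of that dominant-class measure, applied to hypothetical counts matching the first cohort from the earlier example:

import numpy as np

counts = np.array([99, 1])                  # e.g. cohort 1: 99% one category
p_dominant = counts.max() / counts.sum()

print(p_dominant)   # 0.99 -- close to 1, indicating low diversity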


The best reason to choose entropy is that it has a well-defined meaning in information theory: the average information content carried by each observation in the sample, and hence, in aggregate, the information content the sample can express. Entropy is commonly used in genetics, where researchers want to compute the information-carrying capacity of a single gene or piece of DNA, which speaks to the interesting question of how much information is required to fully specify a single human being.


By dividing the sample entropy by the maximum entropy we obtain a metric, the 'diversity quotient,' which gives an intuitive measure of the fractional information content embedded in our sample, relative to the maximum information it could possibly contain.







 
 
 
