fbpixel Part 2 | IFT World
IFT Notes for Level I CFA® Program

R02 Organizing, Visualizing, and Describing Data

Part 2


5. Summarizing Data Using a Contingency Table

A contingency table is a tabular format that displays the frequency distributions of two or more categorical variables simultaneously. It can be used to find patterns between the variables.

Contingency tables are constructed by listing all levels of one variable as rows and all the levels of the other variables as columns in a table. For example, consider a contingency table created for a portfolio of 500 stocks based on two variables – sector and market capitalization.

Market Capitalization Variable

(3 Levels)

Sector Variable

 (4 Levels)

Small Mid Large Total
Financial 44 38 20 102
FMCG 130 54 46 230
Information Technology 57 34 21 112
Real estate 30 16 10 56
Total 261 142 97 500

Key points to note from the table are:

  • Each cell shows the number of stocks of each sector with a given market cap level. For example, there are 130 small-cap FMCG stocks. This count is also called joint frequencies.
  • The joint frequencies are added across rows and across columns, and the corresponding sub-totals are called marginal frequencies. For example, the marginal frequency of FMCG sector is 230, and the marginal frequency of small cap stocks is 261

Contingency tables can also be created using relative frequencies based on total count. Each number is expressed as percentage of the total number of stocks. For example, small cap FMCG stocks are 130 / 500 = 26% of the portfolio.

Applications

One application of contingency tables is for evaluating the performance of a classification model (using a confusion matrix). Suppose we have a model for classifying companies into two groups: those that default on their bond payments and those that do not default. The table below shows a confusion matrix for a sample of 1,000 non-investment-grade bonds.

Predicted Default

 

Actual Default Total
Yes No
Yes 150 10 160
No 6 834 840
Total 156 844 1,000

The table shows that the classification model incorrectly predicts default in 10 cases where an actual default did not occur. It also incorrectly predicts no default in 6 cases where a default did actually occur.

Another application of contingency tables is to investigate a potential association between two categorical variables. One way to test the potential association is to follow a three-step process:

  1. Add the marginal frequencies and overall total to the contingency table.
  2. Use the marginal frequencies to construct a table with expected values of the observations.
  3. Compare with chi-square value for a given level of significance.

These steps are demonstrated in the following example.

Example: Contingency Tables and Association between Two Categorical Variables

Suppose we randomly pick 200 mutual funds and classify them based on two parameters:

  • Fund style – Growth versus Value
  • Risk level – Low risk versus High risk.

This data is summarized in a 2 x 2 contingency table shown below.

Low Risk High Risk
Growth 67 19
Value 98 16

 

  1. Calculate the number of growth funds and the number of value funds.
  2. Calculate the number of low-risk and high-risk funds.
  3. Describe how the contingency table is used to set up a test for independence between fund style and risk level.

Solution to 1:

The marginal frequency for growth is 67 + 19 = 86

The marginal frequency for value is 98 + 16 = 114

Solution to 2:

The marginal frequency for low risk is 67 + 98 = 165

The marginal frequency for high risk is 19 + 16 = 35

Solution to 3:

To conduct a chi-square test of independence, we perform the following three steps.

Step 1: Add the marginal frequencies and overall total to the contingency table. We also show the relative frequency table for observed values.

Observed Values   Observed Values
  Low Risk High Risk       Low Risk High Risk  
Growth 67 19 86 Growth 78% 22% 100%
Value 98 16 114 Value 86% 14% 100%
165 35 200

 

Step 2: Use the marginal frequencies to construct a table with expected values of the observations.

Expected Valuei,j = (Total Row i × Total Column j)/Overall Total

For example,

Expected value for Growth / Low Risk is: (86 x 165) / 200 = 70.95

Expected value for Value / High Risk is: (114 x 35) / 200 = 19.95`1       qA

The table of expected values and the corresponding relative frequency table is presented below:

Observed Values   Observed Values
  Low Risk High Risk       Low Risk High Risk  
Growth 70.95 15.05 86 Growth 82.5% 17.5% 100%
Value 94.05 19.95 114 Value 82.5% 17.5% 100%
165 35 200

 

Step 3: The actual values and the expected values are used to derive the chi-square test statistic. This is then compared to a value from the chi-square distribution table for a given level of significance. If the test statistic is greater than the chi-square distribution value, then we can conclude that there is significant association between the categorical variables.

Instructor’s Note: You will understand this step better when you go over the reading on ‘Hypothesis Testing’.

6. Data Visualization

Visualization refers to the presentation of data in pictorial or graphical format to aid understanding of the data and for gaining insights into the data. There are multiple data visualization techniques, which are covered in the following sub-sections.

6.1 Histogram and Frequency Polygon

Histogram: A histogram presents the distribution of numerical data by using the height of a bar to represent the absolute frequency of each bin. The advantage of the visual display is that we can quickly see where most of the observations lie.

Suppose we are evaluating 200 stocks presented in the following frequency distribution table.

Price Range Number of Stocks
46.00 – 51.00 20
51.00 – 56.00 60
56.00 – 61.00 100
61.00 – 65.00 20

We can depict this data graphically through a histogram.

Frequency polygon:

A frequency polygon plots the midpoints of each interval on the X-axis and the absolute frequency of that interval on the Y-axis. Each point is then connected with a straight line.

Cumulative frequency distribution

Another graphical tool is the cumulative frequency distribution chart. Such a graph can plot either the cumulative frequency or cumulative relative frequency against the upper interval limit. The cumulative frequency distribution allows us to see how many or what percent of the observations lie below a certain value. The figure below is an example of a cumulative frequency distribution.

Notice that the slope is steep in the ‘51.00 -56.00’ to ’56.00 – 61.00’ segment because a large number of stocks (100) are added. The slope flattens out in the last segment because only 20 stocks are added in the last segment.

Example:

Which of the following statements is most likely to be inaccurate about histograms?

  1. A histogram is the graphical equivalent of a frequency distribution.
  2. A histogram is a form of a bar chart.
  3. In a histogram, the height represents the relative frequency for each interval.

Solution:

C is correct. In a histogram, the height represents the absolute frequency for each interval.

6.2 Bar Chart

A bar chart is used to plot the frequency distribution of categorical data. Each bar represents a distinct category, and the bar’s height is proportional to the frequency of that category.

The bar chart below shows that the sector in which the portfolio holds the most stocks is FMCG, with 230 stocks, followed by the IT sector, with 112 stocks.

A grouped bar chart (also called a clustered bar chart) can be used to show the frequency distribution of multiple categorical variables simultaneously.

The chart below shows that small cap FMCG stocks have the highest frequency – 130. Also, we can easily observe that small cap stocks are the largest sub-group within each sector.

A stacked bar chart is an alternative form for presenting the frequency distribution of multiple categorical variables simultaneously.

Bar charts can also be presented vertically instead of horizontally as shown below. Normally, the height of each bar is proportional to the value it depicts. However, sometimes the y-axis may be truncated, in which case the heights may not be proportional to the depicted values. In such cases, the graph needs to be evaluated more carefully.