A contingency table is a tabular format that displays the frequency distributions of two or more categorical variables simultaneously. It can be used to find patterns between the variables.
Contingency tables are constructed by listing all levels of one variable as rows and all levels of the other variable as columns. For example, consider a contingency table created for a portfolio of 500 stocks based on two variables: sector and market capitalization.
Rows: Sector Variable (4 Levels). Columns: Market Capitalization Variable (3 Levels).
Sector  Small  Mid  Large  Total
Financial  44  38  20  102
FMCG  130  54  46  230
Information Technology  57  34  21  112
Real estate  30  16  10  56
Total  261  142  97  500
Key points to note from the table are:
Contingency tables can also be created using relative frequencies based on the total count. Each number is expressed as a percentage of the total number of stocks. For example, small-cap FMCG stocks are 130 / 500 = 26% of the portfolio.
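As a rough illustration, the marginal totals and the relative frequencies can be reproduced in a few lines of Python. This is only a sketch: the variable names (`counts`, `cap_levels`, `relative`) are made up for this example, and the counts are copied from the 500-stock table.

```python
# Counts copied from the 500-stock table: sector -> (small, mid, large).
counts = {
    "Financial": (44, 38, 20),
    "FMCG": (130, 54, 46),
    "Information Technology": (57, 34, 21),
    "Real estate": (30, 16, 10),
}
cap_levels = ("Small", "Mid", "Large")

# Marginal totals: one per sector (row) and one per market-cap level (column).
row_totals = {sector: sum(row) for sector, row in counts.items()}
col_totals = {
    cap: sum(row[i] for row in counts.values())
    for i, cap in enumerate(cap_levels)
}
grand_total = sum(row_totals.values())

# Relative frequencies based on the total count of stocks.
relative = {
    sector: {cap: n / grand_total for cap, n in zip(cap_levels, row)}
    for sector, row in counts.items()
}

print(f"{relative['FMCG']['Small']:.0%}")  # 26%
```

Dividing by a row total or a column total instead of `grand_total` would give relative frequencies based on that row or column.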
Applications
One application of contingency tables is evaluating the performance of a classification model (using a confusion matrix). Suppose we have a model for classifying companies into two groups: those that default on their bond payments and those that do not default. The table below shows a confusion matrix for a sample of 1,000 non-investment-grade bonds.
Rows: Predicted Default. Columns: Actual Default.
Predicted Default  Actual: Yes  Actual: No  Total
Yes  150  10  160
No  6  834  840
Total  156  844  1,000
The table shows that the classification model incorrectly predicts default in 10 cases where an actual default did not occur. It also incorrectly predicts no default in 6 cases where a default did actually occur.
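The two kinds of error described above can be read off the matrix programmatically. The sketch below assumes the cells are keyed as (predicted, actual); the names `confusion`, `false_alarms`, and `missed_defaults` are illustrative, and the share of correct predictions is an extra summary figure computed here rather than one stated in the text.

```python
# Keys are (predicted default, actual default); counts are from the 1,000-bond table.
confusion = {
    ("Yes", "Yes"): 150,  # default predicted and it occurred
    ("Yes", "No"): 10,    # default predicted but it did not occur
    ("No", "Yes"): 6,     # no default predicted but it occurred
    ("No", "No"): 834,    # no default predicted and none occurred
}

total = sum(confusion.values())
false_alarms = confusion[("Yes", "No")]
missed_defaults = confusion[("No", "Yes")]

# Share of bonds the model classified correctly (the two diagonal cells).
correct = confusion[("Yes", "Yes")] + confusion[("No", "No")]
share_correct = correct / total

print(false_alarms, missed_defaults, f"{share_correct:.1%}")  # 10 6 98.4%
```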
Another application of contingency tables is to investigate a potential association between two categorical variables. One way to test the potential association is to follow a three-step process:
These steps are demonstrated in the following example.
Example: Contingency Tables and Association between Two Categorical Variables
Suppose we randomly pick 200 mutual funds and classify them based on two parameters: fund style (growth or value) and risk (low or high).
This data is summarized in a 2 x 2 contingency table shown below.
Fund Style  Low Risk  High Risk
Growth  67  19
Value  98  16
Solution to 1:
The marginal frequency for growth is 67 + 19 = 86
The marginal frequency for value is 98 + 16 = 114
Solution to 2:
The marginal frequency for low risk is 67 + 98 = 165
The marginal frequency for high risk is 19 + 16 = 35
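The same marginal frequencies can be obtained by summing the rows and columns of the table in code. This is a minimal sketch; the nested-dictionary layout and the names `observed`, `row_marginals`, and `col_marginals` are assumptions made for illustration.

```python
# Observed counts from the 2 x 2 table: style -> risk level -> number of funds.
observed = {
    "Growth": {"Low Risk": 67, "High Risk": 19},
    "Value": {"Low Risk": 98, "High Risk": 16},
}

# Row marginals: total funds per style.
row_marginals = {style: sum(cells.values()) for style, cells in observed.items()}

# Column marginals: total funds per risk level.
col_marginals = {}
for cells in observed.values():
    for risk, n in cells.items():
        col_marginals[risk] = col_marginals.get(risk, 0) + n

print(row_marginals)  # {'Growth': 86, 'Value': 114}
print(col_marginals)  # {'Low Risk': 165, 'High Risk': 35}
```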
Solution to 3:
To conduct a chi-square test of independence, we perform the following three steps.
Step 1: Add the marginal frequencies and overall total to the contingency table. We also show the relative frequency table for observed values.
Observed Values (absolute frequencies)
Fund Style  Low Risk  High Risk  Total
Growth  67  19  86
Value  98  16  114
Total  165  35  200

Observed Values (relative frequencies by row)
Fund Style  Low Risk  High Risk  Total
Growth  78%  22%  100%
Value  86%  14%  100%
Step 2: Use the marginal frequencies to construct a table with expected values of the observations.
Expected Value_{i,j} = (Total Row_{i} × Total Column_{j}) / Overall Total
For example,
Expected value for Growth / Low Risk is: (86 x 165) / 200 = 70.95
Expected value for Value / High Risk is: (114 x 35) / 200 = 19.95
The table of expected values and the corresponding relative frequency table are presented below:
Expected Values (absolute frequencies)
Fund Style  Low Risk  High Risk  Total
Growth  70.95  15.05  86
Value  94.05  19.95  114
Total  165  35  200

Expected Values (relative frequencies by row)
Fund Style  Low Risk  High Risk  Total
Growth  82.5%  17.5%  100%
Value  82.5%  17.5%  100%
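The expected-value formula from Step 2 can be applied to every cell at once. The following is a small sketch using plain Python lists; the row order (Growth, Value) and column order (Low Risk, High Risk) are assumed to match the table, and the variable names are illustrative.

```python
# Observed counts: rows are Growth, Value; columns are Low Risk, High Risk.
observed = [[67, 19], [98, 16]]

row_totals = [sum(row) for row in observed]        # [86, 114]
col_totals = [sum(col) for col in zip(*observed)]  # [165, 35]
overall = sum(row_totals)                          # 200

# Expected value for cell (i, j) = row total i * column total j / overall total.
expected = [[r * c / overall for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
# [70.95, 15.05]
# [94.05, 19.95]
```

Every row of the expected table has the same relative frequencies (82.5% and 17.5%), which is what independence between the two variables would look like.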
Step 3: The actual values and the expected values are used to derive the chi-square test statistic. This is then compared to a value from the chi-square distribution table for a given level of significance. If the test statistic is greater than the chi-square distribution value, then we can conclude that there is a significant association between the categorical variables.
Instructor’s Note: You will understand this step better when you go over the reading on ‘Hypothesis Testing’.
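For readers who want to see Step 3 carried through, the sketch below computes the usual chi-square statistic, the sum over all cells of (observed − expected)² / expected, for the mutual fund example. The 5% significance level is an assumption chosen for illustration (the text does not specify one); 3.841 is the standard chi-square table value for 1 degree of freedom at that level.

```python
# Observed and expected counts: rows are Growth, Value; columns are Low Risk, High Risk.
observed = [[67, 19], [98, 16]]
expected = [[70.95, 15.05], [94.05, 19.95]]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumed 5% significance level; table value for 1 degree of freedom.
critical_value_5pct = 3.841
significant = chi_square > critical_value_5pct

print(f"{chi_square:.2f}", degrees_of_freedom, significant)
```

Under these assumptions the statistic comes to roughly 2.20, which is below 3.841, so at the 5% level this sample would not indicate a significant association between fund style and risk.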
Visualization refers to the presentation of data in pictorial or graphical format to aid understanding of the data and to gain insights from it. There are multiple data visualization techniques, which are covered in the following subsections.
Histogram: A histogram presents the distribution of numerical data by using the height of a bar to represent the absolute frequency of each bin. The advantage of the visual display is that we can quickly see where most of the observations lie.
Suppose we are evaluating 200 stocks presented in the following frequency distribution table.
Price Range  Number of Stocks 
46.00 – 51.00  20 
51.00 – 56.00  60 
56.00 – 61.00  100 
61.00 – 65.00  20 
We can depict this data graphically through a histogram.
Frequency polygon:
A frequency polygon plots the midpoint of each interval on the x-axis and the absolute frequency of that interval on the y-axis. The points are then connected with straight lines.
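The points of a frequency polygon for the 200-stock frequency distribution can be derived as follows. This is a sketch only: the interval limits and counts are copied from the frequency distribution table as given, and the names `bins` and `points` are illustrative.

```python
# (lower limit, upper limit, number of stocks), copied from the frequency table.
bins = [
    (46.0, 51.0, 20),
    (51.0, 56.0, 60),
    (56.0, 61.0, 100),
    (61.0, 65.0, 20),
]

# Each point is (interval midpoint, absolute frequency).
points = [((lower + upper) / 2, freq) for lower, upper, freq in bins]

print(points)  # [(48.5, 20), (53.5, 60), (58.5, 100), (63.0, 20)]
```

Joining these points in order with straight lines gives the frequency polygon.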
Cumulative frequency distribution
Another graphical tool is the cumulative frequency distribution chart. Such a graph can plot either the cumulative frequency or cumulative relative frequency against the upper interval limit. The cumulative frequency distribution allows us to see how many or what percent of the observations lie below a certain value. The figure below is an example of a cumulative frequency distribution.
Notice that the slope is steep in the ‘51.00 – 56.00’ to ‘56.00 – 61.00’ segment because a large number of stocks (100) are added. The slope flattens out in the last segment because only 20 stocks are added in the last segment.
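The values plotted in a cumulative frequency distribution can be computed with a running sum. The sketch below uses `itertools.accumulate` from the standard library; the variable names are illustrative and the data come from the frequency distribution table.

```python
from itertools import accumulate

# Upper interval limits and stock counts from the frequency distribution table.
upper_limits = [51.0, 56.0, 61.0, 65.0]
frequencies = [20, 60, 100, 20]

# Running total of stocks at or below each upper limit.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# The same running total expressed as a share of all stocks.
cumulative_relative = [c / total for c in cumulative]

for limit, c, rel in zip(upper_limits, cumulative, cumulative_relative):
    print(limit, c, f"{rel:.0%}")
# 51.0 20 10%
# 56.0 80 40%
# 61.0 180 90%
# 65.0 200 100%
```

The jump from 80 to 180 is the steep segment, and the rise from 180 to 200 is the flat final segment.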
Example:
Which of the following statements is most likely to be inaccurate about histograms?
Solution:
C is correct. In a histogram, the height represents the absolute frequency for each interval.
A bar chart is used to plot the frequency distribution of categorical data. Each bar represents a distinct category, and the bar’s height is proportional to the frequency of that category.
The bar chart below shows that the sector in which the portfolio holds the most stocks is FMCG, with 230 stocks, followed by the IT sector, with 112 stocks.
A grouped bar chart (also called a clustered bar chart) can be used to show the frequency distribution of multiple categorical variables simultaneously.
The chart below shows that small-cap FMCG stocks have the highest frequency (130). Also, we can easily observe that small-cap stocks are the largest subgroup within each sector.
A stacked bar chart is an alternative form for presenting the frequency distribution of multiple categorical variables simultaneously.
Bar charts can also be presented vertically instead of horizontally as shown below. Normally, the height of each bar is proportional to the value it depicts. However, sometimes the y-axis may be truncated, in which case the heights may not be proportional to the depicted values. In such cases, the graph needs to be evaluated more carefully.
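To see why a truncated axis calls for care, compare the ratio of two drawn bar heights with and without truncation. The sketch below uses the FMCG (230) and IT (112) stock counts from the portfolio example; the helper `apparent_ratio` and the axis starting point of 100 are hypothetical, chosen only to illustrate the effect.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn heights of two bars when the y-axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)


fmcg, it_sector = 230, 112

# Axis starting at zero: drawn heights are proportional to the values.
true_ratio = apparent_ratio(fmcg, it_sector)

# Hypothetical truncated axis starting at 100: the gap looks much larger.
distorted_ratio = apparent_ratio(fmcg, it_sector, axis_start=100)

print(round(true_ratio, 2), round(distorted_ratio, 2))  # 2.05 10.83
```

With the axis starting at zero the FMCG bar is about twice as tall as the IT bar, but with the assumed truncation it appears almost eleven times as tall.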