 Part 1 | IFT World
IFT Notes for Level I CFA® Program

# Part 1

## 1. Introduction

This reading presents tools and techniques for organizing, visualizing and describing data. These tools and techniques can help us convert raw data into useful information for investment analysis.

## 2. Data Types

Data can be defined as a collection of numbers, characters, words and text – as well as images, audio, and video – in a raw or organized format to represent facts or information.

Data can be classified in three ways:

• Numerical versus categorical data
• Cross-sectional versus time-series versus panel data
• Structured versus unstructured data

### 2.1 Numerical versus Categorical Data

Based on a statistical perspective, data can be classified into numerical data and categorical data.

Numerical data:  Numerical data (also called quantitative data) are values that represent measured or counted quantities as numbers. Numerical data can be further classified into two types:

• Continuous data: Data that can be measured and can take on any numerical value in a specified range of values. For example, the future value of a sum of money invested today. The FV can take on a range of values depending on the investment period and interest rate.
• Discrete data: Data that can take numerical values that result from a counting process. The data is limited to a finite number of values. For example, the frequency of discrete compounding (m). The frequency could be monthly (m = 12), quarterly (m = 4), semi-yearly (m = 2), or yearly (m = 1).

Categorical data: Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations. It can usually take only a limited number of values that are mutually exclusive.  Categorical data can be further classified into two types:

• Nominal data: Categorical values that cannot be organized in a logical order. For example, classification of publicly listed stocks into different sectors, such as: energy, information technology, financials, health care etc.
• Ordinal data: Categorical values that can be organized in a logical order or ranked. For example, Standard & Poor’s star ratings for mutual funds. One star represents the group of mutual funds with the worst performance. Similarly, groups with two, three, four, and five stars represent groups with increasingly better performance.

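Although the categories represented by ordinal data can be ranked, the differences between adjacent categories are not necessarily equal, so the size of those differences cannot be used to draw inferences. For example, the performance gap between one-star and two-star funds need not match the gap between four-star and five-star funds.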
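The distinction between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the fund names and their ratings are hypothetical, and the helper `rank` is an assumed name, not part of any library.

```python
# Nominal data: sector labels have no logical order, so a set is enough.
SECTORS = {"energy", "information technology", "financials", "health care"}

# Ordinal data: star ratings can be ranked, so we keep them in an ordered list.
STAR_RATINGS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]  # worst to best


def rank(rating: str) -> int:
    """Return the position of a rating in the ordering (1 = worst)."""
    return STAR_RATINGS.index(rating) + 1


# Hypothetical funds and ratings, for illustration only.
funds = {"Fund A": "4 stars", "Fund B": "2 stars", "Fund C": "5 stars"}

# Ranking is meaningful for ordinal data.
best_to_worst = sorted(funds, key=lambda name: rank(funds[name]), reverse=True)
print(best_to_worst)

# The rank numbers only encode order: a gap of 2 between ranks does not
# mean "twice the gap" in performance, so arithmetic on ranks is avoided.
```

Sorting by rank is valid here, whereas averaging or subtracting ranks would treat the gaps between categories as equal, which ordinal data does not guarantee.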

Example: Identifying Data Types

Identify the data type for each of the following items:

• Number of coupon payments for a bond
• Dividends paid by a stock
• Credit ratings for corporate bonds
• Hedge fund classification types

Solution:

Based on our above discussion, we can classify these items as follows:

• Number of coupon payments for a bond – Discrete data
• Dividends paid by a stock – Continuous data
• Credit ratings for corporate bonds – Ordinal data
• Hedge fund classification types – Nominal data

### 2.2 Cross-Sectional versus Time-Series versus Panel Data

Based on how data is collected, it can be classified into three types: cross-sectional, time-series, and panel.

Before we describe these data types, we need to understand two terms: ‘variable’ and ‘observation’.

• A variable (also called a field, attribute, or feature) is a characteristic or quantity that can be measured, counted, or categorized. A variable is subject to change. For example, the returns on Microsoft stock in a given quarter can be considered a variable.
• An observation is a value of a specific variable collected at a point in time or over a specified period of time. For example, if the returns on Microsoft stock in 2019 Q1 were 3%, then 3% is an observation.

Time-series data: Time-series data consists of observations for a single subject taken at specific and equally spaced intervals of time. For example, the quarterly returns of Microsoft stock from 2019 to 2020.

Cross-sectional data: Cross-sectional data consists of observations for multiple subjects taken at a specific point in time. For example, the quarterly returns in 2019 Q1 of a group of similar stocks – Microsoft, Oracle, and HP.

Panel data: Panel data is a combination of time-series and cross-sectional data. It consists of observations through time on one or more variables for multiple subjects. It is generally presented as a table. For example, the quarterly returns of Microsoft, Oracle, and HP from 2019 to 2020.
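As a rough sketch of how the three collection types relate, the Python snippet below stores a small panel and slices it two ways. Apart from Microsoft's 3% return in 2019 Q1, which comes from the text above, every return is a hypothetical placeholder, and the function names `time_series` and `cross_section` are assumptions made for this illustration.

```python
# Panel data: observations through time (quarters) for multiple subjects (stocks).
# Returns are decimals; all values except Microsoft 2019Q1 are hypothetical.
panel = {
    "Microsoft": {"2019Q1": 0.03, "2019Q2": 0.02, "2019Q3": 0.04},
    "Oracle": {"2019Q1": 0.01, "2019Q2": 0.05, "2019Q3": -0.02},
    "HP": {"2019Q1": -0.01, "2019Q2": 0.03, "2019Q3": 0.02},
}


def time_series(panel, subject):
    """One subject, many periods: a list of (period, observation) pairs."""
    return list(panel[subject].items())


def cross_section(panel, period):
    """Many subjects, one period: a mapping of subject to observation."""
    return {subject: obs[period] for subject, obs in panel.items()}


print(time_series(panel, "Microsoft"))   # time-series data
print(cross_section(panel, "2019Q1"))    # cross-sectional data
```

Fixing the subject gives a time series, and fixing the period gives a cross-section. The full table is the panel.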

### 2.3 Structured versus Unstructured Data

Based on whether data is available in a highly organized form or not, it can be classified into structured and unstructured data.

Structured data: Structured data is highly organized in a pre-defined manner, usually with repeating patterns. It is easier to enter, store, query and analyze, without much manual processing. Examples:

• Market data: Daily closing stock prices and trading volumes.
• Fundamental data: Data contained in financial statements, such as earnings per share.
• Analytical data: Data derived from analytics, such as cash flow projections.

Unstructured data: Unstructured data does not follow any conventionally organized forms. It is typically alternative data and is usually collected from unconventional sources. Based on the source, unstructured data can be classified into:

• Produced by individuals (e.g., via social media posts, web searches, etc.);
• Generated by business processes (e.g., via credit card transactions, corporate regulatory filings, etc.); and
• Generated by sensors (e.g., via satellite imagery, foot traffic by mobile devices, etc.).

### 2.4 Data Summarization

Raw data typically cannot be used by humans or computers directly to extract information and insights. The data usually has to be organized first. In the following sections we will discuss various techniques for organizing and summarizing data.

## 3. Organizing Data for Quantitative Analysis

Raw data is typically organized into either a one-dimensional array or a two-dimensional rectangular array (also called a data table) for quantitative analysis.

• A one-dimensional array is suitable for representing a single variable. For example, the closing price for the first 10 trading days for a company after it went public.
• A two-dimensional array consists of columns and rows to hold multiple variables and multiple observations, respectively. For example, quarterly revenue, EPS, and DPS for a company for the past two years.
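The two layouts can be sketched with plain Python lists. All figures below are hypothetical, only four quarters are shown to keep the example short, and the helper `column` is an assumed name used for illustration.

```python
# One-dimensional array: a single variable.
# Hypothetical closing prices for the first 10 trading days after an IPO.
closing_prices = [21.5, 22.1, 21.8, 22.4, 23.0, 22.7, 23.3, 23.9, 23.6, 24.2]

# Two-dimensional array (data table): each column is a variable,
# each row is an observation (here, one quarter). Values are hypothetical.
columns = ["quarter", "revenue", "eps", "dps"]
table = [
    ["2019Q1", 120.0, 1.10, 0.25],
    ["2019Q2", 125.5, 1.15, 0.25],
    ["2019Q3", 131.2, 1.22, 0.30],
    ["2019Q4", 140.8, 1.31, 0.30],
]


def column(table, columns, name):
    """Extract one variable (column) from the table as a one-dimensional array."""
    idx = columns.index(name)
    return [row[idx] for row in table]


print(column(table, columns, "eps"))
```

Pulling a single column out of the table gives back a one-dimensional array for that variable.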

## 4. Summarizing Data Using Frequency Distributions

A frequency distribution (also called a one-way table) is a tabular display of data summarized into a relatively small number of intervals.

Frequency distributions for categorical variables: The steps for constructing a frequency distribution for a categorical variable are:

1. Count the number of observations for each unique value of the variable.
2. Construct a table listing each unique value and the corresponding counts.
3. Sort the records by number of counts in descending or ascending order.

A sample frequency distribution of 200 companies across four sectors is presented below:

| Sector | Absolute Frequency | Relative Frequency |
|---|---|---|
| Technology | 22 | 11% |
| Healthcare | 50 | 25% |
| Financial | 58 | 29% |
| Industrial | 70 | 35% |
| Total | 200 | 100% |

Points to note:

• Absolute frequency: The actual number of observations in a given interval is called the absolute frequency.
• Relative frequency: It is the absolute frequency of each interval divided by the total number of observations.
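The three steps for a categorical variable can be sketched in Python using the standard library's `collections.Counter`. The raw list of 200 sector labels is rebuilt from the counts in the sample table, since the underlying observations are not given. The variable names are assumptions for this illustration.

```python
from collections import Counter

# Rebuild 200 raw observations from the sector counts in the sample table.
sectors = (
    ["Technology"] * 22
    + ["Healthcare"] * 50
    + ["Financial"] * 58
    + ["Industrial"] * 70
)

# Step 1: count the observations for each unique value.
counts = Counter(sectors)
total = sum(counts.values())

# Step 3: sort by count (ascending here; descending is equally valid).
ordered = sorted(counts.items(), key=lambda item: item[1])

# Step 2: build the table of (value, absolute frequency, relative frequency).
freq_table = [(sector, n, n / total) for sector, n in ordered]

for sector, n, rel in freq_table:
    print(f"{sector:<12}{n:>5}{rel:>8.0%}")
```

Relative frequency is simply each count divided by the total of 200, so the relative frequencies sum to 1.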

Frequency distributions for numerical variables: The steps for constructing a frequency distribution for numerical variables are:

1. Sort the data in ascending order.
2. Calculate the range of data.
3. Decide on the number of bins (k).
4. Determine bin width.
5. Determine bins.
6. Determine the number of observations in each bin.
7. Construct a table of the bins listed from smallest to largest.

A sample frequency distribution for 100 stocks with prices ranging between 45.00 and 65.00 is presented below.

| Stock Price (Min – Max) | Absolute Frequency | Cumulative Frequency | Relative Frequency | Cumulative Relative Frequency |
|---|---|---|---|---|
| 45.00 – 50.00 | 25 | 25 | 0.25 | 0.25 |
| 50.00 – 55.00 | 35 | 60 | 0.35 | 0.60 |
| 55.00 – 60.00 | 29 | 89 | 0.29 | 0.89 |
| 60.00 – 65.00 | 11 | 100 | 0.11 | 1.00 |

Points to note:

• Range of the data = Maximum value – Minimum value = 65.00 – 45.00 = 20.00
• We decided to have 4 bins
• Bin width = Range / Number of bins = 20/ 4 = 5
• The end points of the bins are determined by successively adding the bin width, starting from the minimum value, i.e. 45.00 + 5.00 = 50.00, 50.00 + 5.00 = 55.00, 55.00 + 5.00 = 60.00, 60.00 + 5.00 = 65.00. Thus, we get the bins listed in the table above.
• Each bin includes its minimum value and excludes its maximum value. For example, the observation 50.00 will fall in the 50.00 – 55.00 bin and not in the 45.00 – 50.00 bin. The last bin is the exception: it also includes its maximum value, so the observation 65.00 will fall in the 60.00 – 65.00 bin.
• Cumulative frequency: For an interval, it is calculated as the sum of the absolute frequencies of all intervals lower than and including that interval.
• Cumulative relative frequency: For an interval, it is calculated as the sum of the relative frequencies of all intervals lower than and including that interval.
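The seven steps for a numerical variable can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the ten prices are hypothetical (the 100 underlying stock prices are not given), the function name `frequency_distribution` is made up for this example, and the bins follow the rule described above, where each bin includes its minimum, excludes its maximum, and the last bin also includes the overall maximum.

```python
def frequency_distribution(data, k):
    """Return rows of (bin_min, bin_max, abs_freq, cum_freq, rel_freq, cum_rel_freq)."""
    values = sorted(data)                              # Step 1: sort ascending
    lo, hi = values[0], values[-1]
    data_range = hi - lo                               # Step 2: range
    if data_range == 0:
        raise ValueError("need at least two distinct values to form bins")
    width = data_range / k                             # Steps 3-4: k bins, bin width
    edges = [lo + i * width for i in range(k + 1)]     # Step 5: bin end points

    counts = [0] * k
    for x in values:                                   # Step 6: count per bin
        # Lower bound inclusive, upper bound exclusive;
        # the maximum value is assigned to the last bin.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1

    n = len(values)
    rows, cumulative = [], 0
    for i in range(k):                                 # Step 7: table, smallest bin first
        cumulative += counts[i]
        rows.append(
            (edges[i], edges[i + 1], counts[i], cumulative, counts[i] / n, cumulative / n)
        )
    return rows


# Hypothetical stock prices between 45.00 and 65.00.
prices = [45.0, 47.5, 50.0, 52.25, 54.0, 55.0, 58.5, 60.0, 62.0, 65.0]

for row in frequency_distribution(prices, k=4):
    print(row)
```

With these inputs the range is 20 and the bin width is 5, which matches the bins in the sample table. The price 50.0 lands in the second bin, and 65.0 lands in the last bin.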

Instructor’s Note: On the exam you are unlikely to be asked to construct a frequency distribution. However, you may be tested on the process and the terminology.

Example:

The actual number of observations in a given interval is called the:

A. absolute frequency.
B. relative frequency.
C. cumulative relative frequency.

Solution:

A is correct. The actual number of observations in a given interval is known as the absolute frequency. Relative frequency is the absolute frequency of each interval divided by the total number of observations. Cumulative relative frequency is the running total of the relative frequencies up to and including a given interval.

Example:

Which of the following is most likely to be accurate?

A. An observation can fall in more than one interval.
B. The data is sorted in descending order for the construction of a frequency distribution.
C. The cumulative relative frequency tells the observer the fraction of the observations that are less than the upper limit of each interval.

Solution:

C is correct. The cumulative relative frequency tells the observer the fraction of the observations that are less than the upper limit of each interval. An observation cannot fall in more than one interval. The data is sorted in an ascending order for the construction of a frequency distribution.