This reading presents tools and techniques for organizing, visualizing and describing data. These tools and techniques can help us convert raw data into useful information for investment analysis.
Data can be defined as a collection of numbers, characters, words and text – as well as images, audio, and video – in a raw or organized format to represent facts or information.
Data can be classified in three ways:
From a statistical perspective, data can be classified into numerical data and categorical data.
Numerical data: Numerical data (also called quantitative data) are values that represent measured or counted quantities as numbers. Numerical data can be further classified into two types:
Categorical data: Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations. They can usually take only a limited number of mutually exclusive values. Categorical data can be further classified into two types:
Although the categories represented by ordinal data can be ranked, the numerical differences between categories are not necessarily equal, so those differences cannot be used to draw inferences.
Example: Identifying Data Types
Identify the data type for each of the following items:
Solution:
Based on our above discussion, we can classify these items as follows:
Based on how data is collected, it can be classified into three types: cross-sectional, time-series, and panel.
Before we describe these data types, we need to understand two terms: ‘variable’ and ‘observation’.
Time-series data: Time-series data consists of observations for a single subject taken at specific and equally spaced intervals of time. For example, the quarterly returns of Microsoft stock from 2019 to 2020.
Cross-sectional data: Cross-sectional data consists of observations for multiple subjects taken at a specific point in time. For example, the quarterly returns in 2019 Q1 of a group of similar stocks – Microsoft, Oracle, and HP.
Panel data: Panel data is a combination of time-series and cross-sectional data. It consists of observations through time on one or more variables for multiple subjects. It is generally presented as a table. For example, the quarterly returns of Microsoft, Oracle, and HP from 2019 to 2020.
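The three collection-based types can be viewed as different slices of the same table. The sketch below is illustrative only: the return figures are made-up placeholders rather than actual market data, and the variable names (`panel`, `time_series`, `cross_section`) are hypothetical.

```python
# Hypothetical quarterly returns (placeholder values, NOT real market data).
# Panel data: multiple subjects, each observed over multiple periods.
panel = {
    "Microsoft": {"2019Q1": 0.05, "2019Q2": 0.03, "2019Q3": 0.02, "2019Q4": 0.04},
    "Oracle": {"2019Q1": 0.01, "2019Q2": 0.02, "2019Q3": -0.01, "2019Q4": 0.03},
    "HP": {"2019Q1": -0.02, "2019Q2": 0.04, "2019Q3": 0.00, "2019Q4": 0.01},
}

# Time-series data: one subject, several equally spaced periods.
time_series = panel["Microsoft"]

# Cross-sectional data: several subjects, one point in time.
cross_section = {stock: returns["2019Q1"] for stock, returns in panel.items()}

print(time_series)
print(cross_section)
```

Fixing the subject yields a time series, while fixing the period yields a cross-section.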
Based on whether data is available in a highly organized form or not, it can be classified into structured and unstructured data.
Structured data: Structured data is highly organized in a pre-defined manner, usually with repeating patterns. It is easy to enter, store, query, and analyze without much manual processing. Examples:
Unstructured data: Unstructured data does not follow any conventionally organized form. It is typically alternative data and is usually collected from unconventional sources. Based on the source, unstructured data can be classified into:
Raw data typically cannot be used by humans or computers directly to extract information and insights. The data usually has to be organized first. In the following sections we will discuss various techniques for organizing and summarizing data.
Raw data is typically organized into either a one-dimensional array or a two-dimensional rectangular array (also called a data table) for quantitative analysis.
A frequency distribution (also called a one-way table) is a tabular display of data summarized into a relatively small number of intervals.
Frequency distributions for categorical variables: The steps for constructing a frequency distribution for a categorical variable are:
A sample frequency distribution of 200 companies across four sectors is presented below:
Sector | Absolute Frequency | Relative Frequency |
Technology | 22 | 11% |
Healthcare | 50 | 25% |
Financial | 58 | 29% |
Industrial | 70 | 35% |
Total | 200 | 100% |
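As a rough sketch, the absolute and relative frequencies in the table above can be reproduced in Python. The raw list of sector labels is reconstructed from the table's counts purely for illustration, and the variable names are assumptions.

```python
from collections import Counter

# Rebuild a raw list of sector labels that matches the counts in the table.
sectors = (
    ["Technology"] * 22
    + ["Healthcare"] * 50
    + ["Financial"] * 58
    + ["Industrial"] * 70
)

absolute = Counter(sectors)      # absolute frequency of each category
total = sum(absolute.values())   # total number of observations
relative = {sector: n / total for sector, n in absolute.items()}

# Print the categories in ascending order of frequency.
for sector, n in sorted(absolute.items(), key=lambda item: item[1]):
    print(f"{sector:<12} {n:>4} {relative[sector]:>6.0%}")
```

The relative frequency is simply each category's count divided by the total, so the relative frequencies sum to 100%.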
Points to note:
Frequency distributions for numerical variables: The steps for constructing a frequency distribution for numerical variables are:
A sample frequency distribution for 100 stocks with prices ranging between 45.00 and 65.00 is presented below.
Stock Price (Min – Max) | Absolute Frequency | Cumulative Frequency | Relative Frequency | Cumulative Relative Frequency |
45.00 – 50.00 | 25 | 25 | 0.25 | 0.25 |
50.00 – 55.00 | 35 | 60 | 0.35 | 0.60 |
55.00 – 60.00 | 29 | 89 | 0.29 | 0.89 |
60.00 – 65.00 | 11 | 100 | 0.11 | 1.00 |
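The cumulative and relative columns of the table above follow mechanically from the absolute frequencies. A minimal sketch, with the interval bounds and counts taken from the table and the variable names chosen for illustration:

```python
from itertools import accumulate

# Interval bounds and absolute frequencies taken from the table.
intervals = [(45.0, 50.0), (50.0, 55.0), (55.0, 60.0), (60.0, 65.0)]
absolute = [25, 35, 29, 11]

total = sum(absolute)
cumulative = list(accumulate(absolute))            # running total of counts
relative = [n / total for n in absolute]           # share of each interval
cumulative_relative = [c / total for c in cumulative]

rows = zip(intervals, absolute, cumulative, relative, cumulative_relative)
for (low, high), n, cum, rel, cum_rel in rows:
    print(f"{low:.2f} – {high:.2f} | {n} | {cum} | {rel:.2f} | {cum_rel:.2f}")
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.00.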
Points to note:
Instructor’s Note: On the exam you are unlikely to be asked to construct a frequency distribution. However, you may be tested on the process and the terminology.
Example:
The actual number of observations in a given interval is called the:
Solution:
A is correct. The actual number of observations in a given interval is known as absolute frequency. Relative frequency is the absolute frequency of each interval divided by the total number of observations. Cumulative absolute frequency is the running total of all absolute frequencies.
Example:
Which of the following is most likely to be accurate?
Solution:
C is correct. The cumulative relative frequency tells the observer the fraction of the observations that are less than the upper limit of each interval. An observation cannot fall in more than one interval. The data is sorted in ascending order for the construction of a frequency distribution.
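To illustrate why an observation cannot fall in more than one interval, the sketch below assigns a price to exactly one interval of the earlier stock-price table. It assumes a common convention that the text does not state explicitly: each interval includes its lower limit and excludes its upper limit, except the last interval, which also includes the maximum. The function name `interval_index` is hypothetical.

```python
from bisect import bisect_right

# Interval edges from the stock-price table.
# ASSUMPTION: intervals are closed on the left and open on the right,
# except the last one, which also includes the maximum (65.00).
edges = [45.0, 50.0, 55.0, 60.0, 65.0]


def interval_index(price):
    """Return the 0-based index of the single interval containing price."""
    if not edges[0] <= price <= edges[-1]:
        raise ValueError("price is outside the range covered by the intervals")
    # bisect_right sends a price equal to an interior edge to the higher
    # interval; min() keeps the maximum value inside the last interval.
    return min(bisect_right(edges, price) - 1, len(edges) - 2)


print(interval_index(49.99))  # falls in the first interval
print(interval_index(50.00))  # boundary value goes to the second interval
```

Under this convention, a boundary price such as 50.00 is counted once, in the interval that starts at 50.00.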