Introduction to Statistics

What print run should be used for a 9th-grade algebra textbook? Should a certain politician run for mayor in the upcoming election? How many kilograms of fish and seafood does the average Ukrainian consume per year?

Statistics helps answer these and many other questions.

📐Definition — Statistics

Statistics (from Latin status — state) is the science of collecting, processing, and analyzing quantitative data that characterize mass phenomena.

A statistical study consists of several stages: data collection, data processing and presentation in a convenient form, data analysis, and drawing conclusions and recommendations.

Data Collection

In statistics, the collection of objects on which a study is based is called a sample. A conclusion that rests on sample size alone is not always reliable: statisticians require that a sample also be representative (from French représentatif, meaning indicative), that is, that it reflect the makeup of the whole set of objects under study.

Thus, data should be collected from a sample that is both sufficiently large and representative. Sometimes the sample may coincide with the entire set of objects under study.

Ways of Presenting Data

Collected information (data sets) can be conveniently presented in the form of tables, charts, and diagrams.

Example 1 — Ukrainian Students at Math Olympiads

Statement. A table presents the results of Ukrainian students at International Mathematical Olympiads from 1993 to 2016.

Solution. In many cases, data is conveniently presented as a bar chart (also called a histogram, from Greek histos — column and gramma — writing). Such information is easily perceived and well remembered.

Information can also be presented as graphs and pie charts: the circle represents the total quantity, and each category corresponds to a sector of the circle.

Data Analysis. Arithmetic Mean

📐Definition — Arithmetic Mean

The arithmetic mean (or simply mean) of a data set is the sum of all values divided by the number of values.
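
The definition can be tried out in a few lines of Python. This is only a minimal sketch: the function name `arithmetic_mean` and the sample values are illustrative assumptions and do not come from the lesson. The function expects whole-number data so that the result can be kept as an exact fraction.

```python
from fractions import Fraction

def arithmetic_mean(values):
    """Return the sum of all values divided by the number of values.

    Assumes integer data, so the result is an exact fraction.
    """
    if not values:
        raise ValueError("the data set must contain at least one value")
    return Fraction(sum(values), len(values))

# Hypothetical data set, chosen only for illustration.
sample = [3, 5, 7, 10]
print(arithmetic_mean(sample))         # 25/4
print(float(arithmetic_mean(sample)))  # 6.25
```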

Example 2 — Average Number of Medals

Statement. Determine how many medals Ukrainian students won per year on average at International Mathematical Olympiads during 1993–2016.

Solution. We need to divide the total number of medals obtained over the period by the number of years. For the period 1993–2016 (24 years):

$$\frac{5 + 4 + 3 + 6 + 6 + 5 + 4 + 6 + 4 + 6 + 5 + 4 + 6 + 6 + 6 + 6 + 5 + 6 + 6 + 6 + 6 + 5 + 6 + 6}{24} = \frac{130}{24} = 5\frac{5}{12}.$$

Since no more than 6 medals can be won per year, the mean of $5\frac{5}{12}$ indicates that the Ukrainian team performs admirably at this prestigious competition.

Note — Limitations of the Mean

The mean does not always accurately represent the situation. For example, if incomes in a country vary greatly between different social groups, the average income per person may not reflect the financial situation of the majority.
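
A small numerical illustration may make this limitation concrete. The incomes below are hypothetical values invented for this sketch and are not real data: nine people earn 10 units each and one person earns 910 units.

```python
# Hypothetical incomes (arbitrary units), invented for illustration only:
# nine people earn 10 each and one person earns 910.
incomes = [10] * 9 + [910]

mean_income = sum(incomes) / len(incomes)

# Count how many people earn less than the mean.
below_mean = sum(1 for x in incomes if x < mean_income)

print(mean_income)  # 100.0
print(below_mean)   # 9
```

In this made-up case the mean income is 100, yet nine of the ten people earn only a tenth of that amount.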

Frequency Table and Mode

📐Definition — Frequency Table

A table in which data values and the corresponding number of occurrences are recorded is called a frequency table, and the numbers in the second row are called frequencies.

📐Definition — Mode

The value that occurs most frequently in a data set is called the mode.

The word has the same origin as "fashion" (French mode). We often say that something is "in fashion" or "out of fashion." In everyday life, fashion refers to the set of views and preferences that the majority favors at a given moment.

The mode is the most important characteristic when the data set is not numerical, since a mean cannot be computed for such data.

Example 3 — Jeans Sizes

Statement. A well-known company planning to supply jeans to Ukraine surveyed a representative sample of 500 people.

Solution. The survey results:

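| Size | XS | S | M | L | XL | XXL | XXXL |
|---|---|---|---|---|---|---|---|
| Frequency | 52 | 71 | 145 | 126 | 59 | 40 | 7 |
| Relative frequency (%) | 10.4 | 14.2 | 29 | 25.2 | 11.8 | 8 | 1.4 |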

The mode of this sample is size M, with a relative frequency of 29%. The company thus learned that the largest share of supply (approximately 29%) should consist of size M jeans.
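
The relative frequencies and the mode can be recomputed from the frequencies in the survey table. The following is a minimal sketch. The variable names are assumptions made for illustration, and the dictionary simply restates the table's frequency row.

```python
# Frequencies from the survey table in Example 3 (500 respondents in total).
frequencies = {"XS": 52, "S": 71, "M": 145, "L": 126, "XL": 59, "XXL": 40, "XXXL": 7}

total = sum(frequencies.values())

# Relative frequency: the share of each size in the sample, in percent.
relative = {size: 100 * count / total for size, count in frequencies.items()}

# Mode: the size with the largest frequency.
mode_size = max(frequencies, key=frequencies.get)

print(total)                # 500
print(mode_size)            # M
print(relative[mode_size])  # 29.0
```

Because sizes are not numbers, no arithmetic is done on the sizes themselves. Only their counts are compared.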

Median

📐Definition — Median

The number in the middle of an ordered data set is called the median of that sample.

Example 4 — Choosing a Company by Price

Statement. A family plans to renovate their kitchen and wants to know the price of one square meter of ceramic tile. After studying price lists from 11 construction companies, they obtained the following data (prices in hryvnias, in ascending order):

$$80, 80, 90, 90, 100, \mathbf{130}, 180, 200, 300, 450, 500.$$

Solution. The mean of this data set is 200. However, the data shows that a price of 200 hryvnias is closer to the high end than the middle. The number 130 stands in the middle of the ordered data set. It is called the median. In this situation, the median helps the family choose a company with mid-range prices.

If the data set has an even number of values, for example:

$$1, 4, 4, \mathbf{7}, \mathbf{8}, 15, 24, 24,$$

then the "middle" consists of two numbers: 7 and 8. The median is defined as their arithmetic mean: $\dfrac{7 + 8}{2} = 7.5$.
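
Both cases, an odd and an even number of values, can be handled by one short function. This is a sketch under the definition given above. The function name `median` and the variable names are assumptions chosen for illustration.

```python
def median(values):
    """Return the middle value of the sorted data,
    or the mean of the two middle values when the count is even."""
    if not values:
        raise ValueError("the data set must contain at least one value")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Tile prices from Example 4 (11 values) and the even-sized data set (8 values).
tile_prices = [80, 80, 90, 90, 100, 130, 180, 200, 300, 450, 500]
even_sample = [1, 4, 4, 7, 8, 15, 24, 24]

print(median(tile_prices))  # 130
print(median(even_sample))  # 7.5
```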

Note — Measures of Central Tendency

The mean, mode, and median are called measures of central tendency of a data set. These measures complement each other, and in a particular situation one of them may reflect the data more accurately than the others.
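
To compare the three measures side by side, one option is Python's standard `statistics` module, applied here to the tile prices from Example 4. This is a sketch only. `statistics.multimode` needs Python 3.8 or later and returns every value that shares the highest frequency.

```python
import statistics

# Tile prices from Example 4, in hryvnias per square meter.
tile_prices = [80, 80, 90, 90, 100, 130, 180, 200, 300, 450, 500]

print(statistics.mean(tile_prices))       # 200
print(statistics.median(tile_prices))     # 130
print(statistics.multimode(tile_prices))  # [80, 90]
```

For this data set the three measures differ noticeably: two prices tie as the most frequent, the median is 130, and the mean of 200 is pulled upward by the few expensive offers.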

Exercise — Statistics Problems
  1. Using the table of average annual temperatures in selected Ukrainian cities, construct a bar chart.

  2. Find the measures of central tendency of the data set: 3, 3, 4, 4, 7, 7, 7, 7, 8, 8, 10.

  3. Girls in a 9th-grade PE class performed high jumps. The teacher recorded the following results: 105, 65, 115, 100, 105, 110, 110, 115, 110, 100, 115 (in cm). Find the mean and the median of the data.