Probability and Research Design
Definition(s) of probability
We
could choose one of several technical definitions for probability, but for our
purposes it refers to an assessment of the likelihood of the various possible
outcomes in an experiment or some other situation with a “random”
outcome. Note that in probability theory the term “outcome” is used in a more
general sense than the outcome vs. explanatory variable terminology that is
used in the rest of this book.
In probability theory the term “outcome” applies
not only to the “outcome variables” of experiments but also to “explanatory
variables” if their values are not fixed. For example, the dose of a drug is
normally fixed by the experimenter, so it is not an outcome in probability
theory, but the age of a randomly chosen subject, even if it serves as an
explanatory variable in an experiment, is not “fixed” by the experimenter, and
thus can be an “outcome” under probability theory.
The
collection of all possible outcomes of a particular random experiment (or other
well-defined random situation) is called the sample space, usually abbreviated
as S or Ω (omega). The outcomes in this set (list) must be exhaustive (cover
all possible outcomes) and mutually exclusive (non-overlapping), and should be
as simple as possible.
We
use the term event to represent any subset of the sample space. One way to
think about events is that they can be defined before the experiment is carried
out, and they either occur or do not occur when the experiment is carried out.
In probability theory we learn to compute the chance that events like “odd side up” (in an experiment of rolling a six-sided die) will occur, based on assumptions about the probabilities of the elementary outcomes in the sample space.
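To make these definitions concrete, here is a minimal Python sketch using the six-sided die example discussed below (the outcome labels such as 1du follow that example, and the fair-die probabilities of 1/6 each are an illustrative assumption):

```python
from fractions import Fraction

# Sample space for one roll of a die: exhaustive, mutually exclusive outcomes.
sample_space = {"1du", "2du", "3du", "4du", "5du", "6du"}

# Assumed elementary probabilities for a fair die (1/6 each).
prob = {outcome: Fraction(1, 6) for outcome in sample_space}

# An event is any subset of the sample space, definable before the experiment.
odd_side_up = {"1du", "3du", "5du"}

# The chance of an event is the sum of its elementary outcome probabilities.
p_odd = sum(prob[o] for o in odd_side_up)
print(p_odd)  # 1/2
```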
Each outcome in the sample space can be mapped to a real number. Technically, this mapping is called a random variable, but more commonly and informally we refer to the unknown numeric outcome itself (before the experiment is run) as a “random variable”. Random variables are commonly represented as upper case English letters towards the end of the alphabet, such as Z, Y, or X. Sometimes the lower case equivalents are used to represent the actual outcomes after the experiment is run.
Random
variables are maps from the sample space to the real numbers, but they need not
be one-to-one maps. For example, in the die experiment we could map all of the
outcomes in the set {1du, 3du, 5du} to the number 0 and all of the outcomes in
the set {2du, 4du, 6du} to the number 1, and call this random variable Y.
If we call X the random variable that maps the die outcomes to the numbers 1 through 6, then random variable Y could also be thought of as a map from X to Y, where the odd values of X map to 0 in Y and the even values to 1. Often the term transformation is used
when we create a new random variable out of an old one in this way. It should
now be obvious that many, many different random variables can be
defined/invented for a given experiment.
A
few more basic definitions are worth learning at this point. A random variable
that takes on only the numbers 0 and 1 is commonly referred to as an indicator
(random) variable. It is usually named to match the set that corresponds to
the number 1. So in the previous example, random variable Y is an indicator for
even outcomes.
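These maps are easy to express directly; here is a brief sketch in which the dictionary representation of X and Y is an illustrative choice, not standard notation:

```python
# X maps each physical die outcome to the number of dots showing (one-to-one).
X = {"1du": 1, "2du": 2, "3du": 3, "4du": 4, "5du": 5, "6du": 6}

# Y transforms X into an indicator for even outcomes (many-to-one):
# odd values of X map to 0, even values map to 1.
Y = {outcome: 1 if x % 2 == 0 else 0 for outcome, x in X.items()}

# The set of numbers each map can produce (called the "support" below).
print(set(X.values()))  # {1, 2, 3, 4, 5, 6}
print(set(Y.values()))  # {0, 1}
```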
For any random variable, the term support is used to
refer to the set of possible real numbers defined by the mapping from the
physical experimental outcomes to the numbers. Therefore, for random variables
we use the term “event” to represent any subset of the support.
Ignoring certain technical issues, probability theory takes a basic set of assigned (or assumed) probabilities and uses them (possibly with additional assumptions about something called independence) to compute the probabilities of various more complex events.
The core of probability theory is making predictions about the
chances of occurrence of events based on a set of assumptions about the
underlying probability processes.
One
way to think about probability is that it quantifies how much we can know when
we cannot know something exactly. Probability theory is deductive, in the sense
that it involves making assumptions about a random (not completely predictable)
process, and then deriving valid statements about what is likely to happen
based on mathematical principles.
For this course a fairly small number of
probability definitions, concepts, and skills will suffice.
For
those who are unsatisfied with the loose definition of probability above, here
is a brief description of three different approaches to probability, although
it is not necessary to understand this material to continue through the
chapter. If you want even more detail, I recommend Comparative Statistical
Inference by Vic Barnett.
The first approach is the classical (a priori) approach. Valid probability statements do not claim which events will happen, but rather which are likely to happen. The starting point is a judgment that certain events are a priori equally likely.
Then using only the additional assumption
that the occurrence of one event has no bearing on the occurrence of another
separate event (called the assumption of independence), the likelihood of
various complex combinations of events can be worked out through logic and
mathematics. This approach has logical consistency, but cannot be applied to
situations where it is unreasonable to assume equally likely outcomes and
independence.
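As a small sketch of this first approach, the outcomes of rolling two independent fair dice can be enumerated as 36 a priori equally likely pairs, and the probability of a complex event computed by counting agrees with the product rule for independent events (the fairness and independence are the assumed starting points):

```python
from fractions import Fraction
from itertools import product

# 36 a priori equally likely outcomes for two independent fair dice.
pairs = list(product(range(1, 7), repeat=2))

# Complex event: both dice show an even number.
both_even = [(a, b) for a, b in pairs if a % 2 == 0 and b % 2 == 0]

p_counted = Fraction(len(both_even), len(pairs))  # 9/36 = 1/4
p_product = Fraction(1, 2) * Fraction(1, 2)       # via independence
assert p_counted == p_product
print(p_counted)  # 1/4
```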
A second approach, the frequentist approach, defines the probability of an outcome as the limit of the long-term fraction of times that outcome occurs in an ever-larger number of independent trials. This allows us to work with basic events that are not equally likely, but has the disadvantage that probabilities can be assigned only through observation.
Nevertheless this approach is sufficient for
our purposes, which are mostly to figure out what would happen if certain
probabilities are assigned to some events.
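A quick simulation illustrates the long-run-fraction idea (the fair die is an assumption, and the exact running fractions will vary from run to run):

```python
import random

random.seed(1)  # for reproducibility
rolls, odd_count = 0, 0
for n in (100, 10_000, 1_000_000):
    while rolls < n:
        odd_count += random.randint(1, 6) % 2  # adds 1 for odd, 0 for even
        rolls += 1
    # The running fraction of "odd" outcomes approaches 0.5 as n grows.
    print(n, odd_count / n)
```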
A
third approach is subjective probability, where the probabilities of various
events are our subjective (but consistent) assignments of probability. This has
the advantage that events that only occur once, such as the next presidential
election, can be studied probabilistically.
Despite the seemingly bizarre premise, this is a valid and useful approach. It may give different answers for different people who hold different beliefs, but it still lets you calculate a rational, personal probability for future uncertain events, given your prior beliefs.
Regardless
of which definition of probability you use, the calculations we need are
basically the same. First we need to note that probability applies to some
well-defined unknown or future situation in which some outcome will occur, the
list of possible outcomes is well defined, and the exact outcome is unknown.
If
the outcome is categorical or discrete quantitative, then each possible
outcome gets a probability in the form of a number between 0 and 1 such that
the sum of all of the probabilities is 1.
This indicates that impossible outcomes are assigned probability zero, but assigning a probability of zero to an outcome does not necessarily mean that the outcome is impossible (see below). (Note that a probability is technically written as a number from 0 to 1, but is often converted to a percent from 0% to 100%. In case you have forgotten, to convert to a percent multiply by 100, e.g., 0.25 is 25%, 0.5 is 50%, and 0.975 is 97.5%.)
“Every valid probability must be a number between 0 and 1 (or a
percent between 0% and 100%).”
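A small check of these rules, using invented eye-color probabilities purely for illustration:

```python
import math

# An assumed probability assignment for a discrete outcome (numbers invented).
eye_color = {"blue": 0.25, "brown": 0.50, "green": 0.05, "other": 0.20}

# Every valid probability is a number between 0 and 1...
assert all(0 <= p <= 1 for p in eye_color.values())
# ...and the probabilities of all possible outcomes sum to 1.
assert math.isclose(sum(eye_color.values()), 1.0)

# Converting to percents: multiply by 100.
print({color: f"{100 * p:g}%" for color, p in eye_color.items()})
```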
We
will need to distinguish two types of random variables. Discrete random
variables correspond to the categorical variables plus the discrete
quantitative variables. Their support is a (finite or infinite) list of numeric
outcomes, each of which has a non-zero probability. (Here we will loosely use
the term “support” not only for the numeric outcomes of the random variable
mapping, but also for the sample space when we do not explicitly map an outcome
to a number.)
Examples
of discrete random variables include the result of a coin toss (the support
using curly brace set notation is {H,T}), the number of tosses out of 5 that
are heads ({0, 1, 2, 3, 4, 5}), the color of a random person’s eyes ({blue,
brown, green, other}), and the number of coin tosses until a head is obtained
({1, 2, 3, 4, 5, ...}). Note that the last example has an infinitely sized support.
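The last example is easy to simulate, which also makes the infinite support plain: any positive integer is a possible value (the fair coin is an assumption):

```python
import random

def tosses_until_head(rng: random.Random) -> int:
    """Count tosses of a coin (assumed fair) until the first head."""
    tosses = 1
    while rng.random() >= 0.5:  # tails: keep tossing
        tosses += 1
    return tosses

rng = random.Random(1)
print([tosses_until_head(rng) for _ in range(10)])  # values from {1, 2, 3, ...}
```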
Continuous
random variables correspond to the continuous quantitative variables. Their
support is a continuous range of real numbers (or rarely several disconnected
ranges) with no gaps. When working with continuous random variables in
probability theory we think as if there is no rounding, and each value has an
infinite number of decimal places.
In practice we can only measure things to a certain number of decimal places, so actual measurements of the continuous variable “length” might be 3.14, 3.15, etc., which do have gaps. But we approximate this with a continuous random variable rather than a discrete random variable because more precise measurement is possible in theory.
A
strange aspect of working with continuous random variables is that each
particular outcome in the support has probability zero, while none is actually
impossible. The reason each outcome value has probability zero is that otherwise the probabilities of all of the outcomes would add up to more than 1. So for continuous random variables we usually work with intervals of outcomes to say, e.g., that the probability that an outcome is between 3.14 and 3.15 might be 0.02, while each real number in that range, e.g., π (exactly), has zero
probability. Examples of continuous random variables include ages, times,
weights, lengths, etc. All of these can theoretically be measured to an
infinite number of decimal places.
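For intervals, the usual tool is the cumulative distribution function (CDF). Here is a sketch using scipy, where the choice of a Normal model with mean 3.1 and standard deviation 0.1 is purely an illustrative assumption:

```python
import math
from scipy.stats import norm

# A continuous "length" modeled (for illustration) as Normal(3.1, 0.1).
length = norm(loc=3.1, scale=0.1)

# An interval gets positive probability: the difference of two CDF values.
print(length.cdf(3.15) - length.cdf(3.14))  # a small positive number

# A single exact value, e.g. pi, gets probability zero.
print(length.cdf(math.pi) - length.cdf(math.pi))  # 0.0
```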
It is also possible for a random variable to be a mixture of discrete and continuous random variables, e.g., if an experiment is to flip a coin and report 0 if it is heads and the time it was in the air if it is tails, then this variable is a mixture of the discrete and continuous types because the outcome “0” has a non-zero (positive) probability, while each individual positive number has zero probability (though intervals between two positive numbers would have probability greater than zero).
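A simulation sketch of this mixture (the fair coin and the uniform 0.3 to 0.7 second air time are invented for illustration):

```python
import random

def coin_air_time(rng: random.Random) -> float:
    """Report 0 for heads, otherwise the simulated time the coin was in the air."""
    if rng.random() < 0.5:        # heads
        return 0.0
    return rng.uniform(0.3, 0.7)  # tails: air time in seconds (assumed model)

rng = random.Random(1)
outcomes = [coin_air_time(rng) for _ in range(100_000)]

# The single value 0 has positive probability (the discrete part)...
print(sum(o == 0.0 for o in outcomes) / len(outcomes))       # near 0.5

# ...while positive values get probability only over intervals (the continuous part).
print(sum(0.4 < o < 0.5 for o in outcomes) / len(outcomes))  # near 0.125
```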