How to find a and b in exponential function

Q: What is A and B in an exponential function?

Exponential functions are based on relationships involving a constant multiplier. You can write. an exponential function in general form. In this form, a represents an initial value or amount, and b, the constant multiplier, is a growth factor or factor of decay.

Q: How do you find a and b of an exponential function from two points?

If you have two points, (x1, y1) and (x2, y2), you can define the exponential function that passes through these points by substituting them in the equation y = abx and solving for a and b. In general, you have to solve this pair of equations: y1 = abx1 and y2 = abx2, .

Q: How do you find the a value of an exponential function?

Exponential functions are written in the form f(x)=abx f ( x ) = a b x . Initial Value: The initial value of an exponential function is the result of substituting x=0 into the function. In the exponential function f(x)=abx f ( x ) = a b x , the initial value is a .

If you're seeing this message, it means we're having trouble loading external resources on our website.

Table of Contents Show

Definition[edit]
Examples of exponential family distributions[edit]
Scalar parameter[edit]
Factorization of the variables involved[edit]
Vector parameter[edit]
Vector parameter, vector variable[edit]
Measure-theoretic formulation[edit]
Interpretation[edit]
Properties[edit]
Examples[edit]
Normal distribution: unknown mean, known variance[edit]
Normal distribution: unknown mean and unknown variance[edit]
Binomial distribution[edit]
Table of distributions[edit]
Moments and cumulants of the sufficient statistic[edit]
Normalization of the distribution[edit]
Moment-generating function of the sufficient statistic[edit]
Entropy[edit]
Relative entropy[edit]
Maximum-entropy derivation[edit]
Role in statistics[edit]
Classical estimation: sufficiency[edit]
Bayesian estimation: conjugate distributions[edit]
Hypothesis testing: uniformly most powerful tests[edit]
Generalized linear models[edit]
What is A and B in an exponential function?
How do you find a and b of an exponential function from two points?
How do you find the a value of an exponential function?

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Play with our fun little avatar builder to create and customize your own avatar on StudyPug. Choose your face, eye colour, hair colour and style, and background. Unlock more options the more you use StudyPug.

To write an exponential function given a rate and an initial value, start by determining the initial value and the rate of interest. For example if a bank account was opened with $1000 at an annual interest rate of 3%, the initial value is 1000 and the rate is .03. Then, rewrite the time variable of t/h as t/12, since the money increases by 3% every 12 months. Finally, plug in the values and write your exponential function as f(t)=1,000(1.03)t/12. To learn more, including how to find the continuous growth rate from an exponential function, scroll down.

Did this summary help you?YesNo

Thanks to all authors for creating a page that has been read 210,710 times.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family",[1] or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter;[a] however, a parametric family of distributions is often referred to as "a distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

The concept of exponential families is credited to[2] E. J. G. Pitman,[3] G. Darmois,[4] and B. O. Koopman[5] in 1935–1936. Exponential families of distributions provides a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.

Definition[edit]

Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Examples of exponential family distributions[edit]

Exponential families include many of the most common distributions. Among many others, exponential families includes the following:[6]

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:

Notice that in each case, the parameters which must be fixed determine a limit on the size of observation values.

Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.

Scalar parameter[edit]

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

fX(x∣θ)=h(x)exp[η(θ)⋅T(x)−A(θ)]{\displaystyle f_{X}(x\mid \theta )=h(x)\,\exp \!{\bigl [}\,\eta (\theta )\cdot T(x)-A(\theta )\,{\bigr ]}}

where T(x), h(x), η(θ), and A(θ) are known functions. The function h(x) must of course be non-negative.

An alternative, equivalent form often given is

fX(x∣θ)=h(x)g(θ)exp[η(θ)⋅T(x)]{\displaystyle f_{X}(x\mid \theta )=h(x)\,g(\theta )\,\exp \!{\bigl [}\,\eta (\theta )\cdot T(x)\,{\bigr ]}}

or equivalently

fX(x∣θ)=exp[η(θ)⋅T(x)−A(θ)+B(x)]{\displaystyle f_{X}(x\mid \theta )=\exp \!{\bigl [}\,\eta (\theta )\cdot T(x)-A(\theta )+B(x)\,{\bigr ]}}

The value θ is called the parameter of the family.

In addition, the support of fX(x∣θ){\displaystyle f_{X}\!\left(x\mid \theta \right)}

(i.e. the set of all x{\displaystyle x}

for which fX(x∣θ){\displaystyle f_{X}\!\left(x\mid \theta \right)} is greater than 0) does not depend on θ{\displaystyle \theta }

.[7] This can be used to exclude a parametric family distribution from being an exponential family. For example, the Pareto distribution has a pdf which is defined for x≥xm{\displaystyle x\geq x_{m}}

(xm{\displaystyle x_{m}}

being the scale parameter) and its support, therefore, has a lower limit of xm{\displaystyle x_{m}}. Since the support of fα,xm(x){\displaystyle f_{\alpha ,x_{m}}\!(x)}

is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when xm{\displaystyle x_{m}} is unknown).

Often x is a vector of measurements, in which case T(x) may be a function from the space of possible values of x to the real numbers. More generally, η(θ) and T(x) can each be vector-valued such that η′(θ)⋅T(x){\displaystyle \eta '(\theta )\cdot T(x)}

is real-valued. However, see the discussion below on vector parameters, regarding the curved exponential family.

If η(θ) = θ, then the exponential family is said to be in canonical form. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(x) is multiplied by that constant's reciprocal, or a constant c can be added to η(θ) and h(x) multiplied by exp[−c⋅T(x)]{\displaystyle \exp \!{\bigl [}-c\cdot T(x)\,{\bigr ]}}

to offset it. In the special case that η(θ) = θ and T(x) = x then the family is called a natural exponential family.

Even when x is a scalar, and there is only a single parameter, the functions η(θ) and T(x) can still be vectors, as described below.

The function A(θ), or equivalently g(θ), is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of η, even when η(θ) is not a one-to-one function, i.e. two or more different values of θ map to the same value of η(θ), and hence η(θ) cannot be inverted. In such a case, all values of θ mapping to the same η(θ) will also have the same value for A(θ) and g(θ).

Factorization of the variables involved[edit]

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

f(x),g(θ),cf(x),cg(θ),[f(x)]c,[g(θ)]c,[f(x)]g(θ),[g(θ)]f(x),[f(x)]h(x)g(θ), or [g(θ)]h(x)j(θ),{\displaystyle f(x),g(\theta ),c^{f(x)},c^{g(\theta )},{[f(x)]}^{c},{[g(\theta )]}^{c},{[f(x)]}^{g(\theta )},{[g(\theta )]}^{f(x)},{[f(x)]}^{h(x)g(\theta )},{\text{ or }}{[g(\theta )]}^{h(x)j(\theta )},}

where f and h are arbitrary functions of x; g and j are arbitrary functions of θ; and c is an arbitrary "constant" expression (i.e. an expression not involving x or θ).

There are further restrictions on how many such factors can occur. For example, the two expressions:

[f(x)g(θ)]h(x)j(θ),[f(x)]h(x)j(θ)[g(θ)]h(x)j(θ),{\displaystyle {[f(x)g(\theta )]}^{h(x)j(\theta )},\qquad {[f(x)]}^{h(x)j(\theta )}[g(\theta )]^{h(x)j(\theta )},}

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

[f(x)g(θ)]h(x)j(θ)=[f(x)]h(x)j(θ)[g(θ)]h(x)j(θ)=e[h(x)log⁡f(x)]j(θ)+h(x)[j(θ)log⁡g(θ)],{\displaystyle {[f(x)g(\theta )]}^{h(x)j(\theta )}={[f(x)]}^{h(x)j(\theta )}[g(\theta )]^{h(x)j(\theta )}=e^{[h(x)\log f(x)]j(\theta )+h(x)[j(\theta )\log g(\theta )]},}

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.[citation needed])

To see why an expression of the form

[f(x)]g(θ){\displaystyle {[f(x)]}^{g(\theta )}}

qualifies,

[f(x)]g(θ)=eg(θ)log⁡f(x){\displaystyle {[f(x)]}^{g(\theta )}=e^{g(\theta )\log f(x)}}

and hence factorizes inside of the exponent. Similarly,

[f(x)]h(x)g(θ)=eh(x)g(θ)log⁡f(x)=e[h(x)log⁡f(x)]g(θ){\displaystyle {[f(x)]}^{h(x)g(\theta )}=e^{h(x)g(\theta )\log f(x)}=e^{[h(x)\log f(x)]g(\theta )}}

and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form 1+f(x)g(θ){\displaystyle 1+f(x)g(\theta )}

) cannot be factorized in this fashion (except in some cases where occurring directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.

Vector parameter[edit]

The definition in terms of one real-number parameter can be extended to one real-vector parameter

θ≡[θ1,θ2,…,θs]T .{\displaystyle {\boldsymbol {\theta }}\equiv \left[\,\theta _{1},\,\theta _{2},\,\ldots ,\,\theta _{s}\,\right]^{\mathsf {T}}~.}

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

fX(x∣θ)=h(x)exp⁡(∑i=1sηi(θ)Ti(x)−A(θ)) ,{\displaystyle f_{X}(x\mid {\boldsymbol {\theta }})=h(x)\,\exp \left(\sum _{i=1}^{s}\eta _{i}({\boldsymbol {\theta }})T_{i}(x)-A({\boldsymbol {\theta }})\right)~,}

or in a more compact form,

fX(x∣θ)=h(x)exp⁡(η(θ)⋅T(x)−A(θ)){\displaystyle f_{X}(x\mid {\boldsymbol {\theta }})=h(x)\,\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x)-A({\boldsymbol {\theta }}){\Big )}}

This form writes the sum as a dot product of vector-valued functions η(θ){\displaystyle {\boldsymbol {\eta }}({\boldsymbol {\theta }})}

and T(x){\displaystyle \mathbf {T} (x)\,}

An alternative, equivalent form often seen is

fX(x∣θ)=h(x)g(θ)exp⁡(η(θ)⋅T(x)){\displaystyle f_{X}(x\mid {\boldsymbol {\theta }})=h(x)\,g({\boldsymbol {\theta }})\,\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x){\Big )}}

As in the scalar valued case, the exponential family is said to be in canonical form if

ηi(θ)=θi∀i.{\displaystyle \quad \eta _{i}({\boldsymbol {\theta }})=\theta _{i}\quad \forall i\,.}

A vector exponential family is said to be curved if the dimension of

θ≡[θ1,θ2,…,θd]T{\displaystyle {\boldsymbol {\theta }}\equiv \left[\,\theta _{1},\,\theta _{2},\,\ldots ,\,\theta _{d}\,\,\right]^{\mathsf {T}}}

is less than the dimension of the vector

η(θ)≡[η1(θ),η2(θ),…,ηs(θ)]T .{\displaystyle {\boldsymbol {\eta }}({\boldsymbol {\theta }})\equiv \left[\,\eta _{1}({\boldsymbol {\theta }}),\,\eta _{2}({\boldsymbol {\theta }}),\,\ldots ,\,\eta _{s}({\boldsymbol {\theta }})\,\right]^{\mathsf {T}}~.}

That is, if the dimension, d, of the parameter vector is less than the number of functions, s, of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.

As in the above case of a scalar-valued parameter, the function A(θ){\displaystyle A({\boldsymbol {\theta }})}

or equivalently g(θ){\displaystyle g({\boldsymbol {\theta }})}

is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of η{\displaystyle {\boldsymbol {\eta }}}

, regardless of the form of the transformation that generates η{\displaystyle {\boldsymbol {\eta }}} from θ{\displaystyle {\boldsymbol {\theta }}\,}

. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like

fX(x∣η)=h(x)exp⁡(η⋅T(x)−A(η)){\displaystyle f_{X}(x\mid {\boldsymbol {\eta }})=h(x)\,\exp {\Big (}{\boldsymbol {\eta }}\cdot \mathbf {T} (x)-A({\boldsymbol {\eta }}){\Big )}}

or equivalently

fX(x∣η)=h(x)g(η)exp⁡(η⋅T(x)){\displaystyle f_{X}(x\mid {\boldsymbol {\eta }})=h(x)\,g({\boldsymbol {\eta }})\,\exp {\Big (}{\boldsymbol {\eta }}\cdot \mathbf {T} (x){\Big )}}

The above forms may sometimes be seen with ηTT(x){\displaystyle {\boldsymbol {\eta }}^{\mathsf {T}}\mathbf {T} (x)}

in place of η⋅T(x){\displaystyle {\boldsymbol {\eta }}\cdot \mathbf {T} (x)\,}

. These are exactly equivalent formulations, merely using different notation for the dot product.

Vector parameter, vector variable[edit]

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector

x=(x1,x2,⋯,xk)T .{\displaystyle \mathbf {x} =\left(x_{1},x_{2},\cdots ,x_{k}\right)^{\mathsf {T}}~.}

The dimensions k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential function) the dimension s of the natural parameter η{\displaystyle {\boldsymbol {\eta }}} and sufficient statistic T(x) .

The distribution in this case is written as

fX(x∣θ)=h(x)exp(∑i=1sηi(θ)Ti(x)−A(θ)){\displaystyle f_{X}\!\left(\mathbf {x} \mid {\boldsymbol {\theta }}\right)=h(\mathbf {x} )\,\exp \!\left(\,\sum _{i=1}^{s}\eta _{i}({\boldsymbol {\theta }})T_{i}(\mathbf {x} )-A({\boldsymbol {\theta }})\,\right)}

Or more compactly as

fX(x∣θ)=h(x)exp(η(θ)⋅T(x)−A(θ)){\displaystyle f_{X}\!\left(\,\mathbf {x} \mid {\boldsymbol {\theta }}\,\right)=h(\mathbf {x} )\,\exp \!{\Big (}\,{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} )-A({\boldsymbol {\theta }})\,{\Big )}}

Or alternatively as

fX(x∣θ)=g(θ)h(x)exp(η(θ)⋅T(x)){\displaystyle f_{X}\!\left(\,\mathbf {x} \mid {\boldsymbol {\theta }}\,\right)=g({\boldsymbol {\theta }})\;h(\mathbf {x} )\,\exp \!{\Big (}\,{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} )\,{\Big )}}

Measure-theoretic formulation[edit]

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(x){\displaystyle {\rm {d\,}}H(\mathbf {x} )}

are integrals with respect to the reference measure of the exponential family generated by H .

Any member of that exponential family has cumulative distribution function

dF(x∣θ)=exp⁡(η(θ)⋅T(x)−A(θ)) dH(x) .{\displaystyle {\rm {d\,}}F\left(\,\mathbf {x} \mid {\boldsymbol {\theta }}\,\right)=\exp {\bigl (}\,{\boldsymbol {\eta }}(\theta )\cdot \mathbf {T} (\mathbf {x} )\,-\,A({\boldsymbol {\theta }})\,{\bigr )}~{\rm {d\,}}H(\mathbf {x} )~.}

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density f(x){\displaystyle f(x)}

with respect to a reference measure dx{\displaystyle \,{\rm {d\,}}x\,}

(typically Lebesgue measure), one can write dF(x)=f(x) dx{\displaystyle \,{\rm {d\,}}F(x)=f(x)~{\rm {d\,}}x\,}

. In this case, H is also absolutely continuous and can be written dH(x)=h(x)dx{\displaystyle \,{\rm {d\,}}H(x)=h(x)\,{\rm {d\,}}x\,}

so the formulas reduce to that of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as

P(dx∣θ)=exp⁡(η(θ)⋅T(x)−A(θ)) μ(dx) .{\displaystyle P\left(\,{\rm {d\,}}\mathbf {x} \mid {\boldsymbol {\theta }}\,\right)=\exp {\bigl (}\,{\boldsymbol {\eta }}(\theta )\cdot \mathbf {T} (\mathbf {x} )-A({\boldsymbol {\theta }})\,{\bigr )}~\mu ({\rm {d\,}}\mathbf {x} )~.}

for some reference measure μ{\displaystyle \mu \,}

Interpretation[edit]

In the definitions above, the functions T(x), η(θ), and A(η) were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.

T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data x provides with regard to the unknown parameter values. This means that, for any data sets x{\displaystyle x} and y{\displaystyle y}
, the likelihood ratio is the same, that is f(x;θ1)f(x;θ2)=f(y;θ1)f(y;θ2){\displaystyle {\frac {f(x;\theta _{1})}{f(x;\theta _{2})}}={\frac {f(y;\theta _{1})}{f(y;\theta _{2})}}}
if T(x) = T(y) . This is true even if x and y are quite distinct – that is, even if d(x,y)>0{\displaystyle d(x,y)>0\,}
. The dimension of T(x) equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)
η is called the natural parameter. The set of values of η for which the function fX(x;η){\displaystyle f_{X}(x;\eta )}
is integrable is called the natural parameter space. It can be shown that the natural parameter space is always convex.
A(η) is called the log-partition function[b] because it is the logarithm of a normalization factor, without which fX(x;θ){\displaystyle f_{X}(x;\theta )}
would not be a probability distribution:

A(η)=log⁡(∫Xh(x)exp⁡(η(θ)⋅T(x))dx){\displaystyle A(\eta )=\log \left(\int _{X}h(x)\,\exp(\eta (\theta )\cdot T(x))\,\mathrm {d} x\right)}

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η). For example, because log(x) is one of the components of the sufficient statistic of the gamma distribution, E⁡[log⁡x]{\displaystyle \operatorname {\mathcal {E}} [\log x]}

can be easily determined for this distribution using A(η). Technically, this is true because

K(u∣η)=A(η+u)−A(η),{\displaystyle K\left(u\mid \eta \right)=A(\eta +u)-A(\eta )\,,}

is the cumulant generating function of the sufficient statistic.

Properties[edit]

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that only exponential families have these properties. Examples:

Given an exponential family defined by fX(x∣θ)=h(x)exp[θ⋅T(x)−A(θ)]{\displaystyle f_{X}(x\mid \theta )=h(x)\,\exp \!{\bigl [}\,\theta \cdot T(x)-A(\theta )\,{\bigr ]}}

, where Θ{\displaystyle \Theta }

is the parameter space, such that θ∈Θ⊂Rk{\displaystyle \theta \in \Theta \subset \mathbb {R} ^{k}}

. Then

If Θ{\displaystyle \Theta } has nonempty interior in Rk{\displaystyle \mathbb {R} ^{k}}
, then given any IID samples X1,...,Xn∼fX{\displaystyle X_{1},...,X_{n}\sim f_{X}}
, the statistic T(X1,...,Xn):=∑i=1nT(Xi){\displaystyle T(X_{1},...,X_{n}):=\sum _{i=1}^{n}T(X_{i})}
is a complete statistic for θ{\displaystyle \theta }.[9][10]
T{\displaystyle T}
is a minimal statistic for θ{\displaystyle \theta } iff for all θ1,θ2∈Θ{\displaystyle \theta _{1},\theta _{2}\in \Theta }
, and x1,x2{\displaystyle x_{1},x_{2}}
in the support of X{\displaystyle X}
, if (θ1−θ2)⋅(T(x1)−T(x2))=0{\displaystyle (\theta _{1}-\theta _{2})\cdot (T(x_{1})-T(x_{2}))=0}
, then θ1=θ2{\displaystyle \theta _{1}=\theta _{2}}
or x1=x2{\displaystyle x_{1}=x_{2}}
.[11]

Examples[edit]

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound xm form an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families — in all of the cases, the parameter in question affects the support (particularly, changing the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor continuous uniform distribution are exponential families as one or both bounds vary.

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distribution as exponential families.

Normal distribution: unknown mean, known variance[edit]

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ2. The probability density function is then

fσ(x;μ)=12πσ2e−(x−μ)2/(2σ2).{\displaystyle f_{\sigma }(x;\mu )={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-(x-\mu )^{2}/(2\sigma ^{2})}.}

This is a single-parameter exponential family, as can be seen by setting

hσ(x)=12πσ2e−x2/(2σ2)Tσ(x)=xσAσ(μ)=μ22σ2ησ(μ)=μσ.{\displaystyle {\begin{aligned}h_{\sigma }(x)&={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-x^{2}/(2\sigma ^{2})}\\[4pt]T_{\sigma }(x)&={\frac {x}{\sigma }}\\[4pt]A_{\sigma }(\mu )&={\frac {\mu ^{2}}{2\sigma ^{2}}}\\[4pt]\eta _{\sigma }(\mu )&={\frac {\mu }{\sigma }}.\end{aligned}}}

If σ = 1 this is in canonical form, as then η(μ) = μ.

Normal distribution: unknown mean and unknown variance[edit]

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

f(y;μ,σ)=12πσ2e−(y−μ)2/2σ2.{\displaystyle f(y;\mu ,\sigma )={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-(y-\mu )^{2}/2\sigma ^{2}}.}

This is an exponential family which can be written in canonical form by defining

η=[μσ2, −12σ2]h(y)=12πT(y)=(y,y2)TA(η)=μ22σ2+log⁡|σ|=−η124η2+12log⁡|12η2|{\displaystyle {\begin{aligned}{\boldsymbol {\eta }}&=\left[\,{\frac {\mu }{\sigma ^{2}}},~-{\frac {1}{2\sigma ^{2}}}\,\right]\\h(y)&={\frac {1}{\sqrt {2\pi }}}\\T(y)&=\left(y,y^{2}\right)^{\rm {T}}\\A({\boldsymbol {\eta }})&={\frac {\mu ^{2}}{2\sigma ^{2}}}+\log |\sigma |=-{\frac {\eta _{1}^{2}}{4\eta _{2}}}+{\frac {1}{2}}\log \left|{\frac {1}{2\eta _{2}}}\right|\end{aligned}}}

Binomial distribution[edit]

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

f(x)=(nx)px(1−p)n−x,x∈{0,1,2,…,n}.{\displaystyle f(x)={n \choose x}p^{x}(1-p)^{n-x},\quad x\in \{0,1,2,\ldots ,n\}.}

This can equivalently be written as

f(x)=(nx)exp⁡(xlog⁡(p1−p)+nlog⁡(1−p)),{\displaystyle f(x)={n \choose x}\exp \left(x\log \left({\frac {p}{1-p}}\right)+n\log(1-p)\right),}

which shows that the binomial distribution is an exponential family, whose natural parameter is

η=log⁡p1−p.{\displaystyle \eta =\log {\frac {p}{1-p}}.}

This function of p is known as logit.

Table of distributions[edit]

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[12] for main exponential families.

For a scalar variable and scalar parameter, the form is as follows:

fX(x∣θ)=h(x)exp⁡(η(θ)T(x)−A(η)){\displaystyle f_{X}(x\mid \theta )=h(x)\exp {\Big (}\eta ({\theta })T(x)-A({\eta }){\Big )}}

For a scalar variable and vector parameter:

fX(x∣θ)=h(x)exp⁡(η(θ)⋅T(x)−A(η)){\displaystyle f_{X}(x\mid {\boldsymbol {\theta }})=h(x)\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x)-A({\boldsymbol {\eta }}){\Big )}}fX(x∣θ)=h(x)g(θ)exp⁡(η(θ)⋅T(x)){\displaystyle f_{X}(x\mid {\boldsymbol {\theta }})=h(x)g({\boldsymbol {\theta }})\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x){\Big )}}

For a vector variable and vector parameter:

fX(x∣θ)=h(x)exp⁡(η(θ)⋅T(x)−A(η)){\displaystyle f_{X}(\mathbf {x} \mid {\boldsymbol {\theta }})=h(\mathbf {x} )\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} )-A({\boldsymbol {\eta }}){\Big )}}

The above formulas choose the functional form of the exponential-family with a log-partition function A(η){\displaystyle A({\boldsymbol {\eta }})}

. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter θ{\displaystyle {\boldsymbol {\theta }}}

instead of the natural parameter, and/or using a factor g(η){\displaystyle g({\boldsymbol {\eta }})}

outside of the exponential. The relation between the latter and the former is:

A(η)=−log⁡g(η){\displaystyle A({\boldsymbol {\eta }})=-\log g({\boldsymbol {\eta }})}g(η)=e−A(η){\displaystyle g({\boldsymbol {\eta }})=e^{-A({\boldsymbol {\eta }})}}

To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.

DistributionParameter(s) θ{\displaystyle {\boldsymbol {\theta }}}Natural parameter(s) η{\displaystyle {\boldsymbol {\eta }}}Inverse parameter mappingBase measure h(x){\displaystyle h(x)}Sufficient statistic T(x){\displaystyle T(x)}Log-partition A(η){\displaystyle A({\boldsymbol {\eta }})}Log-partition A(θ){\displaystyle A({\boldsymbol {\theta }})}Bernoulli distributionp{\displaystyle p}log⁡p1−p{\displaystyle \log {\frac {p}{1-p}}}11+e−η=eη1+eη{\displaystyle {\frac {1}{1+e^{-\eta }}}={\frac {e^{\eta }}{1+e^{\eta }}}}1{\displaystyle 1}x{\displaystyle x}log⁡(1+eη){\displaystyle \log(1+e^{\eta })}−log⁡(1−p){\displaystyle -\log(1-p)}binomial distribution
with known number of trials n{\displaystyle n}

p{\displaystyle p}log⁡p1−p{\displaystyle \log {\frac {p}{1-p}}}11+e−η=eη1+eη{\displaystyle {\frac {1}{1+e^{-\eta }}}={\frac {e^{\eta }}{1+e^{\eta }}}}(nx){\displaystyle {n \choose x}}x{\displaystyle x}nlog⁡(1+eη){\displaystyle n\log(1+e^{\eta })}−nlog⁡(1−p){\displaystyle -n\log(1-p)}Poisson distributionλ{\displaystyle \lambda }log⁡λ{\displaystyle \log \lambda }eη{\displaystyle e^{\eta }}1x!{\displaystyle {\frac {1}{x!}}}x{\displaystyle x}eη{\displaystyle e^{\eta }}λ{\displaystyle \lambda }negative binomial distribution
with known number of failures r{\displaystyle r}p{\displaystyle p}log⁡p{\displaystyle \log p}eη{\displaystyle e^{\eta }}(x+r−1x){\displaystyle {x+r-1 \choose x}}x{\displaystyle x}−rlog⁡(1−eη){\displaystyle -r\log(1-e^{\eta })}−rlog⁡(1−p){\displaystyle -r\log(1-p)}exponential distributionλ{\displaystyle \lambda }−λ{\displaystyle -\lambda }−η{\displaystyle -\eta }1{\displaystyle 1}x{\displaystyle x}−log⁡(−η){\displaystyle -\log(-\eta )}−log⁡λ{\displaystyle -\log \lambda }Pareto distribution
with known minimum value xm{\displaystyle x_{m}}α{\displaystyle \alpha }−α−1{\displaystyle -\alpha -1}−1−η{\displaystyle -1-\eta }1{\displaystyle 1}log⁡x{\displaystyle \log x}−log⁡(−1−η)+(1+η)log⁡xm{\displaystyle -\log(-1-\eta )+(1+\eta )\log x_{\mathrm {m} }}−log⁡α−αlog⁡xm{\displaystyle -\log \alpha -\alpha \log x_{\mathrm {m} }}Weibull distribution
with known shape kλ{\displaystyle \lambda }−1λk{\displaystyle -{\frac {1}{\lambda ^{k}}}}(−η)−1k{\displaystyle (-\eta )^{-{\frac {1}{k}}}}xk−1{\displaystyle x^{k-1}}xk{\displaystyle x^{k}}−log⁡(−η)−log⁡k{\displaystyle -\log(-\eta )-\log k}klog⁡λ−log⁡k{\displaystyle k\log \lambda -\log k}Laplace distribution
with known mean μ{\displaystyle \mu }b{\displaystyle b}−1b{\displaystyle -{\frac {1}{b}}}−1η{\displaystyle -{\frac {1}{\eta }}}1{\displaystyle 1}|x−μ|{\displaystyle |x-\mu |}log⁡(−2η){\displaystyle \log \left(-{\frac {2}{\eta }}\right)}log⁡2b{\displaystyle \log 2b}chi-squared distributionν{\displaystyle \nu }

ν2−1{\displaystyle {\frac {\nu }{2}}-1}2(η+1){\displaystyle 2(\eta +1)}e−x2{\displaystyle e^{-{\frac {x}{2}}}}log⁡x{\displaystyle \log x}log⁡Γ(η+1)+(η+1)log⁡2{\displaystyle \log \Gamma (\eta +1)+(\eta +1)\log 2}log⁡Γ(ν2)+ν2log⁡2{\displaystyle \log \Gamma \left({\frac {\nu }{2}}\right)+{\frac {\nu }{2}}\log 2}normal distribution
known varianceμ{\displaystyle \mu }μσ{\displaystyle {\frac {\mu }{\sigma }}}ση{\displaystyle \sigma \eta }e−x22σ22πσ{\displaystyle {\frac {e^{-{\frac {x^{2}}{2\sigma ^{2}}}}}{{\sqrt {2\pi }}\sigma }}}xσ{\displaystyle {\frac {x}{\sigma }}}η22{\displaystyle {\frac {\eta ^{2}}{2}}}μ22σ2{\displaystyle {\frac {\mu ^{2}}{2\sigma ^{2}}}}continuous Bernoulli distributionλ{\displaystyle \lambda }log⁡λ1−λ{\displaystyle \log {\frac {\lambda }{1-\lambda }}}eη1+eη{\displaystyle {\frac {e^{\eta }}{1+e^{\eta }}}}1{\displaystyle 1}x{\displaystyle x}log⁡eη−1η{\displaystyle \log {\frac {e^{\eta }-1}{\eta }}}log⁡(1−2λ(1−λ)log⁡(1−λλ)){\displaystyle \log \left({\frac {1-2\lambda }{(1-\lambda )\log \left({\frac {1-\lambda }{\lambda }}\right)}}\right)}normal distributionμ, σ2{\displaystyle \mu ,\ \sigma ^{2}}[μσ2−12σ2]{\displaystyle {\begin{bmatrix}{\dfrac {\mu }{\sigma ^{2}}}\\[10pt]-{\dfrac {1}{2\sigma ^{2}}}\end{bmatrix}}}[−η12η2−12η2]{\displaystyle {\begin{bmatrix}-{\dfrac {\eta _{1}}{2\eta _{2}}}\\[15pt]-{\dfrac {1}{2\eta _{2}}}\end{bmatrix}}}12π{\displaystyle {\frac {1}{\sqrt {2\pi }}}}[xx2]{\displaystyle {\begin{bmatrix}x\\x^{2}\end{bmatrix}}}−η124η2−12log⁡(−2η2){\displaystyle -{\frac {\eta _{1}^{2}}{4\eta _{2}}}-{\frac {1}{2}}\log(-2\eta _{2})}μ22σ2+log⁡σ{\displaystyle {\frac {\mu ^{2}}{2\sigma ^{2}}}+\log \sigma }log-normal distributionμ, σ2{\displaystyle \mu ,\ \sigma ^{2}}[μσ2−12σ2]{\displaystyle {\begin{bmatrix}{\dfrac {\mu }{\sigma ^{2}}}\\[10pt]-{\dfrac {1}{2\sigma ^{2}}}\end{bmatrix}}}[−η12η2−12η2]{\displaystyle {\begin{bmatrix}-{\dfrac {\eta _{1}}{2\eta _{2}}}\\[15pt]-{\dfrac {1}{2\eta _{2}}}\end{bmatrix}}}12πx{\displaystyle {\frac {1}{{\sqrt {2\pi }}x}}}[log⁡x(log⁡x)2]{\displaystyle {\begin{bmatrix}\log x\\(\log x)^{2}\end{bmatrix}}}−η124η2−12log⁡(−2η2){\displaystyle -{\frac {\eta _{1}^{2}}{4\eta _{2}}}-{\frac {1}{2}}\log(-2\eta _{2})}μ22σ2+log⁡σ{\displaystyle {\frac {\mu ^{2}}{2\sigma ^{2}}}+\log \sigma }inverse Gaussian distributionμ, λ{\displaystyle \mu ,\ \lambda }[−λ2μ2−λ2]{\displaystyle {\begin{bmatrix}-{\dfrac {\lambda }{2\mu ^{2}}}\\[15pt]-{\dfrac {\lambda }{2}}\end{bmatrix}}}[η2η1−2η2]{\displaystyle {\begin{bmatrix}{\sqrt {\dfrac {\eta _{2}}{\eta _{1}}}}\\[15pt]-2\eta _{2}\end{bmatrix}}}12πx32{\displaystyle {\frac {1}{{\sqrt {2\pi }}x^{\frac {3}{2}}}}}[x1x]{\displaystyle {\begin{bmatrix}x\\[5pt]{\dfrac {1}{x}}\end{bmatrix}}}−2η1η2−12log⁡(−2η2){\displaystyle -2{\sqrt {\eta _{1}\eta _{2}}}-{\frac {1}{2}}\log(-2\eta _{2})}−λμ−12log⁡λ{\displaystyle -{\frac {\lambda }{\mu }}-{\frac {1}{2}}\log \lambda }gamma distributionα, β{\displaystyle \alpha ,\ \beta }[α−1−β]{\displaystyle {\begin{bmatrix}\alpha -1\\-\beta \end{bmatrix}}}[η1+1−η2]{\displaystyle {\begin{bmatrix}\eta _{1}+1\\-\eta _{2}\end{bmatrix}}}1{\displaystyle 1}[log⁡xx]{\displaystyle {\begin{bmatrix}\log x\\x\end{bmatrix}}}log⁡Γ(η1+1)−(η1+1)log⁡(−η2){\displaystyle \log \Gamma (\eta _{1}+1)-(\eta _{1}+1)\log(-\eta _{2})}log⁡Γ(α)−αlog⁡β{\displaystyle \log \Gamma (\alpha )-\alpha \log \beta }k, θ{\displaystyle k,\ \theta }[k−1−1θ]{\displaystyle {\begin{bmatrix}k-1\\[5pt]-{\dfrac {1}{\theta }}\end{bmatrix}}}[η1+1−1η2]{\displaystyle {\begin{bmatrix}\eta _{1}+1\\[5pt]-{\dfrac {1}{\eta _{2}}}\end{bmatrix}}}log⁡Γ(k)+klog⁡θ{\displaystyle \log \Gamma (k)+k\log \theta }inverse gamma distributionα, β{\displaystyle \alpha ,\ \beta }[−α−1−β]{\displaystyle {\begin{bmatrix}-\alpha -1\\-\beta \end{bmatrix}}}[−η1−1−η2]{\displaystyle {\begin{bmatrix}-\eta _{1}-1\\-\eta _{2}\end{bmatrix}}}1{\displaystyle 1}[log⁡x1x]{\displaystyle {\begin{bmatrix}\log x\\{\frac {1}{x}}\end{bmatrix}}}log⁡Γ(−η1−1)−(−η1−1)log⁡(−η2){\displaystyle \log \Gamma (-\eta _{1}-1)-(-\eta _{1}-1)\log(-\eta _{2})}log⁡Γ(α)−αlog⁡β{\displaystyle \log \Gamma (\alpha )-\alpha \log \beta }generalized inverse Gaussian distributionp, a, b{\displaystyle p,\ a,\ b}[p−1−a/2−b/2]{\displaystyle {\begin{bmatrix}p-1\\-a/2\\-b/2\end{bmatrix}}}[η1+1−2η2−2η3]{\displaystyle {\begin{bmatrix}\eta _{1}+1\\-2\eta _{2}\\-2\eta _{3}\end{bmatrix}}}1{\displaystyle 1}[log⁡xx1x]{\displaystyle {\begin{bmatrix}\log x\\x\\{\frac {1}{x}}\end{bmatrix}}}log⁡2Kη1+1(4η2η3)−η1+12log⁡η2η3{\displaystyle \log 2K_{\eta _{1}+1}({\sqrt {4\eta _{2}\eta _{3}}})-{\frac {\eta _{1}+1}{2}}\log {\frac {\eta _{2}}{\eta _{3}}}}log⁡2Kp(ab)−p2log⁡ab{\displaystyle \log 2K_{p}({\sqrt {ab}})-{\frac {p}{2}}\log {\frac {a}{b}}}scaled inverse chi-squared distributionν, σ2{\displaystyle \nu ,\ \sigma ^{2}}[−ν2−1−νσ22]{\displaystyle {\begin{bmatrix}-{\dfrac {\nu }{2}}-1\\[10pt]-{\dfrac {\nu \sigma ^{2}}{2}}\end{bmatrix}}}[−2(η1+1)η2η1+1]{\displaystyle {\begin{bmatrix}-2(\eta _{1}+1)\\[10pt]{\dfrac {\eta _{2}}{\eta _{1}+1}}\end{bmatrix}}}1{\displaystyle 1}[log⁡x1x]{\displaystyle {\begin{bmatrix}\log x\\{\frac {1}{x}}\end{bmatrix}}}log⁡Γ(−η1−1)−(−η1−1)log⁡(−η2){\displaystyle \log \Gamma (-\eta _{1}-1)-(-\eta _{1}-1)\log(-\eta _{2})}log⁡Γ(ν2)−ν2log⁡νσ22{\displaystyle \log \Gamma \left({\frac {\nu }{2}}\right)-{\frac {\nu }{2}}\log {\frac {\nu \sigma ^{2}}{2}}}beta distribution

(variant 1)

α, β{\displaystyle \alpha ,\ \beta }[αβ]{\displaystyle {\begin{bmatrix}\alpha \\\beta \end{bmatrix}}}[η1η2]{\displaystyle {\begin{bmatrix}\eta _{1}\\\eta _{2}\end{bmatrix}}}1x(1−x){\displaystyle {\frac {1}{x(1-x)}}}[log⁡xlog⁡(1−x)]{\displaystyle {\begin{bmatrix}\log x\\\log(1-x)\end{bmatrix}}}log⁡Γ(η1)+log⁡Γ(η2)−log⁡Γ(η1+η2){\displaystyle \log \Gamma (\eta _{1})+\log \Gamma (\eta _{2})-\log \Gamma (\eta _{1}+\eta _{2})}log⁡Γ(α)+log⁡Γ(β)−log⁡Γ(α+β){\displaystyle \log \Gamma (\alpha )+\log \Gamma (\beta )-\log \Gamma (\alpha +\beta )}beta distribution

(variant 2)

α, β{\displaystyle \alpha ,\ \beta }[α−1β−1]{\displaystyle {\begin{bmatrix}\alpha -1\\\beta -1\end{bmatrix}}}[η1+1η2+1]{\displaystyle {\begin{bmatrix}\eta _{1}+1\\\eta _{2}+1\end{bmatrix}}}1{\displaystyle 1}[log⁡xlog⁡(1−x)]{\displaystyle {\begin{bmatrix}\log x\\\log(1-x)\end{bmatrix}}}log⁡Γ(η1+1)+log⁡Γ(η2+1)−log⁡Γ(η1+η2+2){\displaystyle \log \Gamma (\eta _{1}+1)+\log \Gamma (\eta _{2}+1)-\log \Gamma (\eta _{1}+\eta _{2}+2)}log⁡Γ(α)+log⁡Γ(β)−log⁡Γ(α+β){\displaystyle \log \Gamma (\alpha )+\log \Gamma (\beta )-\log \Gamma (\alpha +\beta )}multivariate normal distributionμ, Σ{\displaystyle {\boldsymbol {\mu }},\ {\boldsymbol {\Sigma }}}[Σ−1μ−12Σ−1]{\displaystyle {\begin{bmatrix}{\boldsymbol {\Sigma }}^{-1}{\boldsymbol {\mu }}\\[5pt]-{\frac {1}{2}}{\boldsymbol {\Sigma }}^{-1}\end{bmatrix}}}[−12η2−1η1−12η2−1]{\displaystyle {\begin{bmatrix}-{\frac {1}{2}}{\boldsymbol {\eta }}_{2}^{-1}{\boldsymbol {\eta }}_{1}\\[5pt]-{\frac {1}{2}}{\boldsymbol {\eta }}_{2}^{-1}\end{bmatrix}}}(2π)−k2{\displaystyle (2\pi )^{-{\frac {k}{2}}}}[xxxT]{\displaystyle {\begin{bmatrix}\mathbf {x} \\[5pt]\mathbf {x} \mathbf {x} ^{\mathsf {T}}\end{bmatrix}}}−14η1Tη2−1η1−12log⁡|−2η2|{\displaystyle -{\frac {1}{4}}{\boldsymbol {\eta }}_{1}^{\mathsf {T}}{\boldsymbol {\eta }}_{2}^{-1}{\boldsymbol {\eta }}_{1}-{\frac {1}{2}}\log \left|-2{\boldsymbol {\eta }}_{2}\right|}12μTΣ−1μ+12log⁡|Σ|{\displaystyle {\frac {1}{2}}{\boldsymbol {\mu }}^{\mathsf {T}}{\boldsymbol {\Sigma }}^{-1}{\boldsymbol {\mu }}+{\frac {1}{2}}\log |{\boldsymbol {\Sigma }}|}categorical distribution

(variant 1)

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where ∑i=1kpi=1{\displaystyle \textstyle \sum _{i=1}^{k}p_{i}=1}

[log⁡p1⋮log⁡pk]{\displaystyle {\begin{bmatrix}\log p_{1}\\\vdots \\\log p_{k}\end{bmatrix}}}[eη1⋮eηk]{\displaystyle {\begin{bmatrix}e^{\eta _{1}}\\\vdots \\e^{\eta _{k}}\end{bmatrix}}}

where ∑i=1keηi=1{\displaystyle \textstyle \sum _{i=1}^{k}e^{\eta _{i}}=1}

1{\displaystyle 1}[[x=1]⋮[x=k]]{\displaystyle {\begin{bmatrix}[x=1]\\\vdots \\{[x=k]}\end{bmatrix}}}0{\displaystyle 0}0{\displaystyle 0}categorical distribution

(variant 2)

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where ∑i=1kpi=1{\displaystyle \textstyle \sum _{i=1}^{k}p_{i}=1}

[log⁡p1+C⋮log⁡pk+C]{\displaystyle {\begin{bmatrix}\log p_{1}+C\\\vdots \\\log p_{k}+C\end{bmatrix}}}[1Ceη1⋮1Ceηk]={\displaystyle {\begin{bmatrix}{\dfrac {1}{C}}e^{\eta _{1}}\\\vdots \\{\dfrac {1}{C}}e^{\eta _{k}}\end{bmatrix}}=}

[eη1∑i=1keηi⋮eηk∑i=1keηi]{\displaystyle {\begin{bmatrix}{\dfrac {e^{\eta _{1}}}{\sum _{i=1}^{k}e^{\eta _{i}}}}\\[10pt]\vdots \\[5pt]{\dfrac {e^{\eta _{k}}}{\sum _{i=1}^{k}e^{\eta _{i}}}}\end{bmatrix}}}

where ∑i=1keηi=C{\displaystyle \textstyle \sum _{i=1}^{k}e^{\eta _{i}}=C}

1{\displaystyle 1}[[x=1]⋮[x=k]]{\displaystyle {\begin{bmatrix}[x=1]\\\vdots \\{[x=k]}\end{bmatrix}}}0{\displaystyle 0}0{\displaystyle 0}categorical distribution

(variant 3)

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where pk=1−∑i=1k−1pi{\displaystyle p_{k}=1-\textstyle \sum _{i=1}^{k-1}p_{i}}

[log⁡p1pk⋮log⁡pk−1pk0]={\displaystyle {\begin{bmatrix}\log {\dfrac {p_{1}}{p_{k}}}\\[10pt]\vdots \\[5pt]\log {\dfrac {p_{k-1}}{p_{k}}}\\[15pt]0\end{bmatrix}}=}

[log⁡p11−∑i=1k−1pi⋮log⁡pk−11−∑i=1k−1pi0]{\displaystyle {\begin{bmatrix}\log {\dfrac {p_{1}}{1-\sum _{i=1}^{k-1}p_{i}}}\\[10pt]\vdots \\[5pt]\log {\dfrac {p_{k-1}}{1-\sum _{i=1}^{k-1}p_{i}}}\\[15pt]0\end{bmatrix}}}

[eη1∑i=1keηi⋮eηk∑i=1keηi]={\displaystyle {\begin{bmatrix}{\dfrac {e^{\eta _{1}}}{\sum _{i=1}^{k}e^{\eta _{i}}}}\\[10pt]\vdots \\[5pt]{\dfrac {e^{\eta _{k}}}{\sum _{i=1}^{k}e^{\eta _{i}}}}\end{bmatrix}}=}

[eη11+∑i=1k−1eηi⋮eηk−11+∑i=1k−1eηi11+∑i=1k−1eηi]{\displaystyle {\begin{bmatrix}{\dfrac {e^{\eta _{1}}}{1+\sum _{i=1}^{k-1}e^{\eta _{i}}}}\\[10pt]\vdots \\[5pt]{\dfrac {e^{\eta _{k-1}}}{1+\sum _{i=1}^{k-1}e^{\eta _{i}}}}\\[15pt]{\dfrac {1}{1+\sum _{i=1}^{k-1}e^{\eta _{i}}}}\end{bmatrix}}}

1{\displaystyle 1}[[x=1]⋮[x=k]]{\displaystyle {\begin{bmatrix}[x=1]\\\vdots \\{[x=k]}\end{bmatrix}}}log⁡(∑i=1keηi)=log⁡(1+∑i=1k−1eηi){\displaystyle \log \left(\sum _{i=1}^{k}e^{\eta _{i}}\right)=\log \left(1+\sum _{i=1}^{k-1}e^{\eta _{i}}\right)}−log⁡pk=−log⁡(1−∑i=1k−1pi){\displaystyle -\log p_{k}=-\log \left(1-\sum _{i=1}^{k-1}p_{i}\right)}multinomial distribution

(variant 1)
with known number of trials n{\displaystyle n}

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where ∑i=1kpi=1{\displaystyle \textstyle \sum _{i=1}^{k}p_{i}=1}

[log⁡p1⋮log⁡pk]{\displaystyle {\begin{bmatrix}\log p_{1}\\\vdots \\\log p_{k}\end{bmatrix}}}[eη1⋮eηk]{\displaystyle {\begin{bmatrix}e^{\eta _{1}}\\\vdots \\e^{\eta _{k}}\end{bmatrix}}}

where ∑i=1keηi=1{\displaystyle \textstyle \sum _{i=1}^{k}e^{\eta _{i}}=1}

n!∏i=1kxi!{\displaystyle {\frac {n!}{\prod _{i=1}^{k}x_{i}!}}}[x1⋮xk]{\displaystyle {\begin{bmatrix}x_{1}\\\vdots \\x_{k}\end{bmatrix}}}0{\displaystyle 0}0{\displaystyle 0}multinomial distribution

(variant 2)
with known number of trials n{\displaystyle n}

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where ∑i=1kpi=1{\displaystyle \textstyle \sum _{i=1}^{k}p_{i}=1}

where ∑i=1keηi=C{\displaystyle \textstyle \sum _{i=1}^{k}e^{\eta _{i}}=C}

(variant 3)
with known number of trials n{\displaystyle n}

p1, …,pk{\displaystyle p_{1},\ \ldots ,\,p_{k}}

where pk=1−∑i=1k−1pi{\displaystyle p_{k}=1-\textstyle \sum _{i=1}^{k-1}p_{i}}

[log⁡p1pk⋮log⁡pk−1pk0]={\displaystyle {\begin{bmatrix}\log {\dfrac {p_{1}}{p_{k}}}\\[10pt]\vdots \\[5pt]\log {\dfrac {p_{k-1}}{p_{k}}}\\[15pt]0\end{bmatrix}}=}

n!∏i=1kxi!{\displaystyle {\frac {n!}{\prod _{i=1}^{k}x_{i}!}}}[x1⋮xk]{\displaystyle {\begin{bmatrix}x_{1}\\\vdots \\x_{k}\end{bmatrix}}}nlog⁡(∑i=1keηi)=nlog⁡(1+∑i=1k−1eηi){\displaystyle n\log \left(\sum _{i=1}^{k}e^{\eta _{i}}\right)=n\log \left(1+\sum _{i=1}^{k-1}e^{\eta _{i}}\right)}−nlog⁡pk=−nlog⁡(1−∑i=1k−1pi){\displaystyle -n\log p_{k}=-n\log \left(1-\sum _{i=1}^{k-1}p_{i}\right)}Dirichlet distribution

(variant 1)

α1, …,αk{\displaystyle \alpha _{1},\ \ldots ,\,\alpha _{k}}[α1⋮αk]{\displaystyle {\begin{bmatrix}\alpha _{1}\\\vdots \\\alpha _{k}\end{bmatrix}}}[η1⋮ηk]{\displaystyle {\begin{bmatrix}\eta _{1}\\\vdots \\\eta _{k}\end{bmatrix}}}1∏i=1kxi{\displaystyle {\frac {1}{\prod _{i=1}^{k}x_{i}}}}[log⁡x1⋮log⁡xk]{\displaystyle {\begin{bmatrix}\log x_{1}\\\vdots \\\log x_{k}\end{bmatrix}}}∑i=1klog⁡Γ(ηi)−log⁡Γ(∑i=1kηi){\displaystyle \sum _{i=1}^{k}\log \Gamma (\eta _{i})-\log \Gamma \left(\sum _{i=1}^{k}\eta _{i}\right)}∑i=1klog⁡Γ(αi)−log⁡Γ(∑i=1kαi){\displaystyle \sum _{i=1}^{k}\log \Gamma (\alpha _{i})-\log \Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)}Dirichlet distribution

(variant 2)

α1, …,αk{\displaystyle \alpha _{1},\ \ldots ,\,\alpha _{k}}[α1−1⋮αk−1]{\displaystyle {\begin{bmatrix}\alpha _{1}-1\\\vdots \\\alpha _{k}-1\end{bmatrix}}}[η1+1⋮ηk+1]{\displaystyle {\begin{bmatrix}\eta _{1}+1\\\vdots \\\eta _{k}+1\end{bmatrix}}}1{\displaystyle 1}[log⁡x1⋮log⁡xk]{\displaystyle {\begin{bmatrix}\log x_{1}\\\vdots \\\log x_{k}\end{bmatrix}}}∑i=1klog⁡Γ(ηi+1)−log⁡Γ(∑i=1k(ηi+1)){\displaystyle \sum _{i=1}^{k}\log \Gamma (\eta _{i}+1)-\log \Gamma \left(\sum _{i=1}^{k}(\eta _{i}+1)\right)}∑i=1klog⁡Γ(αi)−log⁡Γ(∑i=1kαi){\displaystyle \sum _{i=1}^{k}\log \Gamma (\alpha _{i})-\log \Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)}Wishart distributionV, n{\displaystyle \mathbf {V} ,\ n}[−12V−1n−p−12]{\displaystyle {\begin{bmatrix}-{\frac {1}{2}}\mathbf {V} ^{-1}\\[5pt]{\dfrac {n-p-1}{2}}\end{bmatrix}}}[−12η1−12η2+p+1]{\displaystyle {\begin{bmatrix}-{\frac {1}{2}}{{\boldsymbol {\eta }}_{1}}^{-1}\\[5pt]2\eta _{2}+p+1\end{bmatrix}}}1{\displaystyle 1}[Xlog⁡|X|]{\displaystyle {\begin{bmatrix}\mathbf {X} \\\log |\mathbf {X} |\end{bmatrix}}}−(η2+p+12)log⁡|−η1|{\displaystyle -\left(\eta _{2}+{\frac {p+1}{2}}\right)\log |-{\boldsymbol {\eta }}_{1}|}

+log⁡Γp(η2+p+12)={\displaystyle +\log \Gamma _{p}\left(\eta _{2}+{\frac {p+1}{2}}\right)=}

−n2log⁡|−η1|+log⁡Γp(n2)={\displaystyle -{\frac {n}{2}}\log |-{\boldsymbol {\eta }}_{1}|+\log \Gamma _{p}\left({\frac {n}{2}}\right)=}

(η2+p+12)(plog⁡2+log⁡|V|){\displaystyle \left(\eta _{2}+{\frac {p+1}{2}}\right)(p\log 2+\log |\mathbf {V} |)}

+log⁡Γp(η2+p+12){\displaystyle +\log \Gamma _{p}\left(\eta _{2}+{\frac {p+1}{2}}\right)}

Three variants with different parameterizations are given, to facilitate computing moments of the sufficient statistics.

n2(plog⁡2+log⁡|V|)+log⁡Γp(n2){\displaystyle {\frac {n}{2}}(p\log 2+\log |\mathbf {V} |)+\log \Gamma _{p}\left({\frac {n}{2}}\right)}Note: Uses the fact that tr(ATB)=vec⁡(A)⋅vec⁡(B),{\displaystyle {\rm {tr}}(\mathbf {A} ^{\mathsf {T}}\mathbf {B} )=\operatorname {vec} (\mathbf {A} )\cdot \operatorname {vec} (\mathbf {B} ),} i.e. the trace of a matrix product is much like a dot product. The matrix parameters are assumed to be vectorized (laid out in a vector) when inserted into the exponential form. Also, V{\displaystyle \mathbf {V} } and X{\displaystyle \mathbf {X} } are symmetric, so e.g. VT=V .{\displaystyle \mathbf {V} ^{\mathsf {T}}=\mathbf {V} \ .}inverse Wishart distributionΨ,m{\displaystyle \mathbf {\Psi } ,\,m}[−12Ψ−m+p+12]{\displaystyle {\begin{bmatrix}-{\frac {1}{2}}{\boldsymbol {\Psi }}\\[5pt]-{\dfrac {m+p+1}{2}}\end{bmatrix}}}[−2η1−(2η2+p+1)]{\displaystyle {\begin{bmatrix}-2{\boldsymbol {\eta }}_{1}\\[5pt]-(2\eta _{2}+p+1)\end{bmatrix}}}1{\displaystyle 1}[X−1log⁡|X|]{\displaystyle {\begin{bmatrix}\mathbf {X} ^{-1}\\\log |\mathbf {X} |\end{bmatrix}}}(η2+p+12)log⁡|−η1|{\displaystyle \left(\eta _{2}+{\frac {p+1}{2}}\right)\log |-{\boldsymbol {\eta }}_{1}|}

+log⁡Γp(−(η2+p+12))={\displaystyle +\log \Gamma _{p}\left(-{\Big (}\eta _{2}+{\frac {p+1}{2}}{\Big )}\right)=}

−m2log⁡|−η1|+log⁡Γp(m2)={\displaystyle -{\frac {m}{2}}\log |-{\boldsymbol {\eta }}_{1}|+\log \Gamma _{p}\left({\frac {m}{2}}\right)=}

−(η2+p+12)(plog⁡2−log⁡|Ψ|){\displaystyle -\left(\eta _{2}+{\frac {p+1}{2}}\right)(p\log 2-\log |{\boldsymbol {\Psi }}|)}

+log⁡Γp(−(η2+p+12)){\displaystyle +\log \Gamma _{p}\left(-{\Big (}\eta _{2}+{\frac {p+1}{2}}{\Big )}\right)}

m2(plog⁡2−log⁡|Ψ|)+log⁡Γp(m2){\displaystyle {\frac {m}{2}}(p\log 2-\log |{\boldsymbol {\Psi }}|)+\log \Gamma _{p}\left({\frac {m}{2}}\right)}normal-gamma distributionα, β, μ, λ{\displaystyle \alpha ,\ \beta ,\ \mu ,\ \lambda }[α−12−β−λμ22λμ−λ2]{\displaystyle {\begin{bmatrix}\alpha -{\frac {1}{2}}\\-\beta -{\dfrac {\lambda \mu ^{2}}{2}}\\\lambda \mu \\-{\dfrac {\lambda }{2}}\end{bmatrix}}}[η1+12−η2+η324η4−η32η4−2η4]{\displaystyle {\begin{bmatrix}\eta _{1}+{\frac {1}{2}}\\-\eta _{2}+{\dfrac {\eta _{3}^{2}}{4\eta _{4}}}\\-{\dfrac {\eta _{3}}{2\eta _{4}}}\\-2\eta _{4}\end{bmatrix}}}12π{\displaystyle {\dfrac {1}{\sqrt {2\pi }}}}[log⁡τττxτx2]{\displaystyle {\begin{bmatrix}\log \tau \\\tau \\\tau x\\\tau x^{2}\end{bmatrix}}}log⁡Γ(η1+12)−12log⁡(−2η4){\displaystyle \log \Gamma \left(\eta _{1}+{\frac {1}{2}}\right)-{\frac {1}{2}}\log \left(-2\eta _{4}\right)}

−(η1+12)log⁡(−η2+η324η4){\displaystyle -\left(\eta _{1}+{\frac {1}{2}}\right)\log \left(-\eta _{2}+{\dfrac {\eta _{3}^{2}}{4\eta _{4}}}\right)}

log⁡Γ(α)−αlog⁡β−12log⁡λ{\displaystyle \log \Gamma \left(\alpha \right)-\alpha \log \beta -{\frac {1}{2}}\log \lambda }* The Iverson bracket is a generalization of the discrete delta-function: If the bracketed expression is true, the bracket has value 1; if the enclosed statement is false, the Iverson bracket is zero. There are many variant notations, e.g. wavey brackets: ⧙a=b⧘ is equivalent to the [a=b] notation used above.

The three variants of the categorical distribution and multinomial distribution are due to the fact that the parameters pi{\displaystyle p_{i}}

are constrained, such that

∑i=1kpi=1 .{\displaystyle \sum _{i=1}^{k}p_{i}=1~.}

Thus, there are only k−1{\displaystyle k-1}

independent parameters.

Variant 1 uses k{\displaystyle k}
natural parameters with a simple relation between the standard and natural parameters; however, only k−1{\displaystyle k-1} of the natural parameters are independent, and the set of k{\displaystyle k} natural parameters is nonidentifiable. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: Adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent on the constant that is added.
Variant 3 shows how to make the parameters identifiable in a convenient way by setting C=−log⁡pk .{\displaystyle C=-\log p_{k}\ .}
This effectively "pivots" around pk{\displaystyle p_{k}}
and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access pk {\displaystyle p_{k}\ }
, so that effectively the model has only k−1{\displaystyle k-1} parameters, both of the usual and natural kind.

Variants 1 and 2 are not actually standard exponential families at all. Rather they are curved exponential families, i.e. there are k−1{\displaystyle k-1} independent parameters embedded in a k{\displaystyle k}-dimensional parameter space.[13] Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function A(x) {\displaystyle A(x)\ }

, which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the i{\displaystyle i}

th sufficient statistic should be pi {\displaystyle p_{i}\ }

. (This does emerge correctly when using the form of A(x) {\displaystyle A(x)\ } shown in variant 3.)

Moments and cumulants of the sufficient statistic[edit]

Normalization of the distribution[edit]

We start with the normalization of the probability distribution. In general, any non-negative function f(x) that serves as the kernel of a probability distribution (the part encoding all dependence on x) can be made into a proper distribution by normalizing: i.e.

p(x)=1Zf(x){\displaystyle p(x)={\frac {1}{Z}}f(x)}

where

Z=∫xf(x)dx.{\displaystyle Z=\int _{x}f(x)\,dx.}

The factor Z is sometimes termed the normalizer or partition function, based on an analogy to statistical physics.

In the case of an exponential family where

p(x;η)=g(η)h(x)eη⋅T(x),{\displaystyle p(x;{\boldsymbol {\eta }})=g({\boldsymbol {\eta }})h(x)e^{{\boldsymbol {\eta }}\cdot \mathbf {T} (x)},}

the kernel is

K(x)=h(x)eη⋅T(x){\displaystyle K(x)=h(x)e^{{\boldsymbol {\eta }}\cdot \mathbf {T} (x)}}

and the partition function is

Z=∫xh(x)eη⋅T(x)dx.{\displaystyle Z=\int _{x}h(x)e^{{\boldsymbol {\eta }}\cdot \mathbf {T} (x)}\,dx.}

Since the distribution must be normalized, we have

1=∫xg(η)h(x)eη⋅T(x)dx=g(η)∫xh(x)eη⋅T(x)dx=g(η)Z.{\displaystyle 1=\int _{x}g({\boldsymbol {\eta }})h(x)e^{{\boldsymbol {\eta }}\cdot \mathbf {T} (x)}\,dx=g({\boldsymbol {\eta }})\int _{x}h(x)e^{{\boldsymbol {\eta }}\cdot \mathbf {T} (x)}\,dx=g({\boldsymbol {\eta }})Z.}

In other words,

g(η)=1Z{\displaystyle g({\boldsymbol {\eta }})={\frac {1}{Z}}}

or equivalently

A(η)=−log⁡g(η)=log⁡Z.{\displaystyle A({\boldsymbol {\eta }})=-\log g({\boldsymbol {\eta }})=\log Z.}

This justifies calling A the log-normalizer or log-partition function.

Moment-generating function of the sufficient statistic[edit]

Now, the moment-generating function of T(x) is

MT(u)≡E[eu⊤T(x)∣η]=∫xh(x)e(η+u)⊤T(x)−A(η)dx=eA(η+u)−A(η){\displaystyle M_{T}(u)\equiv E[e^{u^{\top }T(x)}\mid \eta ]=\int _{x}h(x)e^{(\eta +u)^{\top }T(x)-A(\eta )}\,dx=e^{A(\eta +u)-A(\eta )}}

proving the earlier statement that

K(u∣η)=A(η+u)−A(η){\displaystyle K(u\mid \eta )=A(\eta +u)-A(\eta )}

is the cumulant generating function for T.

An important subclass of exponential families are the natural exponential families, which have a similar form for the moment-generating function for the distribution of x.

Differential identities for cumulants[edit]

In particular, using the properties of the cumulant generating function,

E⁡(Tj)=∂A(η)∂ηj{\displaystyle \operatorname {E} (T_{j})={\frac {\partial A(\eta )}{\partial \eta _{j}}}}

and

cov⁡(Ti, Tj)=∂2A(η)∂ηi∂ηj.{\displaystyle \operatorname {cov} \left(T_{i},\ T_{j}\right)={\frac {\partial ^{2}A(\eta )}{\partial \eta _{i}\,\partial \eta _{j}}}.}

The first two raw moments and all mixed second moments can be recovered from these two identities. Higher-order moments and cumulants are obtained by higher derivatives. This technique is often useful when T is a complicated function of the data, whose moments are difficult to calculate by integration.

Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.

In the one-dimensional case, we have

p(x)=g(η)h(x)eηT(x).{\displaystyle p(x)=g(\eta )h(x)e^{\eta T(x)}.}

This must be normalized, so

1=∫xp(x)dx=∫xg(η)h(x)eηT(x)dx=g(η)∫xh(x)eηT(x)dx.{\displaystyle 1=\int _{x}p(x)\,dx=\int _{x}g(\eta )h(x)e^{\eta T(x)}\,dx=g(\eta )\int _{x}h(x)e^{\eta T(x)}\,dx.}

Take the derivative of both sides with respect to η:

0=g(η)ddη∫xh(x)eηT(x)dx+g′(η)∫xh(x)eηT(x)dx=g(η)∫xh(x)(ddηeηT(x))dx+g′(η)∫xh(x)eηT(x)dx=g(η)∫xh(x)eηT(x)T(x)dx+g′(η)∫xh(x)eηT(x)dx=∫xT(x)g(η)h(x)eηT(x)dx+g′(η)g(η)∫xg(η)h(x)eηT(x)dx=∫xT(x)p(x)dx+g′(η)g(η)∫xp(x)dx=E⁡[T(x)]+g′(η)g(η)=E⁡[T(x)]+ddηlog⁡g(η){\displaystyle {\begin{aligned}0&=g(\eta ){\frac {d}{d\eta }}\int _{x}h(x)e^{\eta T(x)}\,dx+g'(\eta )\int _{x}h(x)e^{\eta T(x)}\,dx\\&=g(\eta )\int _{x}h(x)\left({\frac {d}{d\eta }}e^{\eta T(x)}\right)\,dx+g'(\eta )\int _{x}h(x)e^{\eta T(x)}\,dx\\&=g(\eta )\int _{x}h(x)e^{\eta T(x)}T(x)\,dx+g'(\eta )\int _{x}h(x)e^{\eta T(x)}\,dx\\&=\int _{x}T(x)g(\eta )h(x)e^{\eta T(x)}\,dx+{\frac {g'(\eta )}{g(\eta )}}\int _{x}g(\eta )h(x)e^{\eta T(x)}\,dx\\&=\int _{x}T(x)p(x)\,dx+{\frac {g'(\eta )}{g(\eta )}}\int _{x}p(x)\,dx\\&=\operatorname {E} [T(x)]+{\frac {g'(\eta )}{g(\eta )}}\\&=\operatorname {E} [T(x)]+{\frac {d}{d\eta }}\log g(\eta )\end{aligned}}}

Therefore,

E⁡[T(x)]=−ddηlog⁡g(η)=ddηA(η).{\displaystyle \operatorname {E} [T(x)]=-{\frac {d}{d\eta }}\log g(\eta )={\frac {d}{d\eta }}A(\eta ).}

Example 1[edit]

As an introductory example, consider the gamma distribution, whose distribution is defined by

p(x)=βαΓ(α)xα−1e−βx.{\displaystyle p(x)={\frac {\beta ^{\alpha }}{\Gamma (\alpha )}}x^{\alpha -1}e^{-\beta x}.}

Referring to the above table, we can see that the natural parameter is given by

η1=α−1,{\displaystyle \eta _{1}=\alpha -1,}η2=−β,{\displaystyle \eta _{2}=-\beta ,}

the reverse substitutions are

α=η1+1,{\displaystyle \alpha =\eta _{1}+1,}β=−η2,{\displaystyle \beta =-\eta _{2},}

the sufficient statistics are (log⁡x,x),{\displaystyle (\log x,x),}

and the log-partition function is

A(η1,η2)=log⁡Γ(η1+1)−(η1+1)log⁡(−η2).{\displaystyle A(\eta _{1},\eta _{2})=\log \Gamma (\eta _{1}+1)-(\eta _{1}+1)\log(-\eta _{2}).}

We can find the mean of the sufficient statistics as follows. First, for η1:

E⁡[log⁡x]=∂A(η1,η2)∂η1=∂∂η1(log⁡Γ(η1+1)−(η1+1)log⁡(−η2))=ψ(η1+1)−log⁡(−η2)=ψ(α)−log⁡β,{\displaystyle {\begin{aligned}\operatorname {E} [\log x]&={\frac {\partial A(\eta _{1},\eta _{2})}{\partial \eta _{1}}}={\frac {\partial }{\partial \eta _{1}}}\left(\log \Gamma (\eta _{1}+1)-(\eta _{1}+1)\log(-\eta _{2})\right)\\&=\psi (\eta _{1}+1)-\log(-\eta _{2})\\&=\psi (\alpha )-\log \beta ,\end{aligned}}}

Where ψ(x){\displaystyle \psi (x)}

is the digamma function (derivative of log gamma), and we used the reverse substitutions in the last step.

Now, for η2:

E⁡[x]=∂A(η1,η2)∂η2=∂∂η2(log⁡Γ(η1+1)−(η1+1)log⁡(−η2))=−(η1+1)1−η2(−1)=η1+1−η2=αβ,{\displaystyle {\begin{aligned}\operatorname {E} [x]&={\frac {\partial A(\eta _{1},\eta _{2})}{\partial \eta _{2}}}={\frac {\partial }{\partial \eta _{2}}}\left(\log \Gamma (\eta _{1}+1)-(\eta _{1}+1)\log(-\eta _{2})\right)\\&=-(\eta _{1}+1){\frac {1}{-\eta _{2}}}(-1)={\frac {\eta _{1}+1}{-\eta _{2}}}\\&={\frac {\alpha }{\beta }},\end{aligned}}}

again making the reverse substitution in the last step.

To compute the variance of x, we just differentiate again:

Var⁡(x)=∂2A(η1,η2)∂η22=∂∂η2η1+1−η2=η1+1η22=αβ2.{\displaystyle {\begin{aligned}\operatorname {Var} (x)&={\frac {\partial ^{2}A\left(\eta _{1},\eta _{2}\right)}{\partial \eta _{2}^{2}}}={\frac {\partial }{\partial \eta _{2}}}{\frac {\eta _{1}+1}{-\eta _{2}}}\\&={\frac {\eta _{1}+1}{\eta _{2}^{2}}}\\&={\frac {\alpha }{\beta ^{2}}}.\end{aligned}}}

All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.

Example 2[edit]

As another example consider a real valued random variable X with density

pθ(x)=θe−x(1+e−x)θ+1{\displaystyle p_{\theta }(x)={\frac {\theta e^{-x}}{\left(1+e^{-x}\right)^{\theta +1}}}}

indexed by shape parameter θ∈(0,∞){\displaystyle \theta \in (0,\infty )}

(this is called the skew-logistic distribution). The density can be rewritten as

e−x1+e−xexp⁡(−θlog⁡(1+e−x)+log⁡(θ)){\displaystyle {\frac {e^{-x}}{1+e^{-x}}}\exp \left(-\theta \log \left(1+e^{-x}\right)+\log(\theta )\right)}

Notice this is an exponential family with natural parameter

η=−θ,{\displaystyle \eta =-\theta ,}

sufficient statistic

T=log⁡(1+e−x),{\displaystyle T=\log \left(1+e^{-x}\right),}

and log-partition function

A(η)=−log⁡(θ)=−log⁡(−η){\displaystyle A(\eta )=-\log(\theta )=-\log(-\eta )}

So using the first identity,

E⁡(log⁡(1+e−X))=E⁡(T)=∂A(η)∂η=∂∂η[−log⁡(−η)]=1−η=1θ,{\displaystyle \operatorname {E} (\log(1+e^{-X}))=\operatorname {E} (T)={\frac {\partial A(\eta )}{\partial \eta }}={\frac {\partial }{\partial \eta }}[-\log(-\eta )]={\frac {1}{-\eta }}={\frac {1}{\theta }},}

and using the second identity

var⁡(log⁡(1+e−X))=∂2A(η)∂η2=∂∂η[1−η]=1(−η)2=1θ2.{\displaystyle \operatorname {var} (\log \left(1+e^{-X}\right))={\frac {\partial ^{2}A(\eta )}{\partial \eta ^{2}}}={\frac {\partial }{\partial \eta }}\left[{\frac {1}{-\eta }}\right]={\frac {1}{(-\eta )^{2}}}={\frac {1}{\theta ^{2}}}.}

This example illustrates a case where using this method is very simple, but the direct calculation would be nearly impossible.

Example 3[edit]

The final example is one where integration would be extremely difficult. This is the case of the Wishart distribution, which is defined over matrices. Even taking derivatives is a bit tricky, as it involves matrix calculus, but the respective identities are listed in that article.

From the above table, we can see that the natural parameter is given by

η1=−12V−1,{\displaystyle {\boldsymbol {\eta }}_{1}=-{\frac {1}{2}}\mathbf {V} ^{-1},}η2=n−p−12,{\displaystyle \eta _{2}={\frac {n-p-1}{2}},}

the reverse substitutions are

V=−12η1−1,{\displaystyle \mathbf {V} =-{\frac {1}{2}}{{\boldsymbol {\eta }}_{1}}^{-1},}n=2η2+p+1,{\displaystyle n=2\eta _{2}+p+1,}

and the sufficient statistics are (X,log⁡|X|).{\displaystyle (\mathbf {X} ,\log |\mathbf {X} |).}

The log-partition function is written in various forms in the table, to facilitate differentiation and back-substitution. We use the following forms:

A(η1,n)=−n2log⁡|−η1|+log⁡Γp(n2),{\displaystyle A({\boldsymbol {\eta }}_{1},n)=-{\frac {n}{2}}\log |-{\boldsymbol {\eta }}_{1}|+\log \Gamma _{p}\left({\frac {n}{2}}\right),}A(V,η2)=(η2+p+12)(plog⁡2+log⁡|V|)+log⁡Γp(η2+p+12).{\displaystyle A(\mathbf {V} ,\eta _{2})=\left(\eta _{2}+{\frac {p+1}{2}}\right)(p\log 2+\log |\mathbf {V} |)+\log \Gamma _{p}\left(\eta _{2}+{\frac {p+1}{2}}\right).}Expectation of X (associated with η1)

To differentiate with respect to η1, we need the following matrix calculus identity:

∂log⁡|aX|∂X=(X−1)T{\displaystyle {\frac {\partial \log |a\mathbf {X} |}{\partial \mathbf {X} }}=(\mathbf {X} ^{-1})^{\rm {T}}}

Then:

E⁡[X]=∂A(η1,⋯)∂η1=∂∂η1[−n2log⁡|−η1|+log⁡Γp(n2)]=−n2(η1−1)T=n2(−η1−1)T=n(V)T=nV{\displaystyle {\begin{aligned}\operatorname {E} [\mathbf {X} ]&={\frac {\partial A\left({\boldsymbol {\eta }}_{1},\cdots \right)}{\partial {\boldsymbol {\eta }}_{1}}}\\&={\frac {\partial }{\partial {\boldsymbol {\eta }}_{1}}}\left[-{\frac {n}{2}}\log |-{\boldsymbol {\eta }}_{1}|+\log \Gamma _{p}\left({\frac {n}{2}}\right)\right]\\&=-{\frac {n}{2}}({\boldsymbol {\eta }}_{1}^{-1})^{\rm {T}}\\&={\frac {n}{2}}(-{\boldsymbol {\eta }}_{1}^{-1})^{\rm {T}}\\&=n(\mathbf {V} )^{\rm {T}}\\&=n\mathbf {V} \end{aligned}}}

The last line uses the fact that V is symmetric, and therefore it is the same when transposed.

Expectation of log |X| (associated with η2)

Now, for η2, we first need to expand the part of the log-partition function that involves the multivariate gamma function:

log⁡Γp(a)=log⁡(πp(p−1)4∏j=1pΓ(a+1−j2))=p(p−1)4log⁡π+∑j=1plog⁡Γ[a+1−j2]{\displaystyle \log \Gamma _{p}(a)=\log \left(\pi ^{\frac {p(p-1)}{4}}\prod _{j=1}^{p}\Gamma \left(a+{\frac {1-j}{2}}\right)\right)={\frac {p(p-1)}{4}}\log \pi +\sum _{j=1}^{p}\log \Gamma \left[a+{\frac {1-j}{2}}\right]}

We also need the digamma function:

ψ(x)=ddxlog⁡Γ(x).{\displaystyle \psi (x)={\frac {d}{dx}}\log \Gamma (x).}

Then:

E⁡[log⁡|X|]=∂A(…,η2)∂η2=∂∂η2[−(η2+p+12)(plog⁡2+log⁡|V|)+log⁡Γp(η2+p+12)]=∂∂η2[(η2+p+12)(plog⁡2+log⁡|V|)+p(p−1)4log⁡π+∑j=1plog⁡Γ(η2+p+12+1−j2)]=plog⁡2+log⁡|V|+∑j=1pψ(η2+p+12+1−j2)=plog⁡2+log⁡|V|+∑j=1pψ(n−p−12+p+12+1−j2)=plog⁡2+log⁡|V|+∑j=1pψ(n+1−j2){\displaystyle {\begin{aligned}\operatorname {E} [\log |\mathbf {X} |]&={\frac {\partial A\left(\ldots ,\eta _{2}\right)}{\partial \eta _{2}}}\\&={\frac {\partial }{\partial \eta _{2}}}\left[-\left(\eta _{2}+{\frac {p+1}{2}}\right)(p\log 2+\log |\mathbf {V} |)+\log \Gamma _{p}\left(\eta _{2}+{\frac {p+1}{2}}\right)\right]\\&={\frac {\partial }{\partial \eta _{2}}}\left[\left(\eta _{2}+{\frac {p+1}{2}}\right)(p\log 2+\log |\mathbf {V} |)+{\frac {p(p-1)}{4}}\log \pi +\sum _{j=1}^{p}\log \Gamma \left(\eta _{2}+{\frac {p+1}{2}}+{\frac {1-j}{2}}\right)\right]\\&=p\log 2+\log |\mathbf {V} |+\sum _{j=1}^{p}\psi \left(\eta _{2}+{\frac {p+1}{2}}+{\frac {1-j}{2}}\right)\\&=p\log 2+\log |\mathbf {V} |+\sum _{j=1}^{p}\psi \left({\frac {n-p-1}{2}}+{\frac {p+1}{2}}+{\frac {1-j}{2}}\right)\\&=p\log 2+\log |\mathbf {V} |+\sum _{j=1}^{p}\psi \left({\frac {n+1-j}{2}}\right)\end{aligned}}}

This latter formula is listed in the Wishart distribution article. Both of these expectations are needed when deriving the variational Bayes update equations in a Bayes network involving a Wishart distribution (which is the conjugate prior of the multivariate normal distribution).

Computing these formulas using integration would be much more difficult. The first one, for example, would require matrix integration.

Entropy[edit]

Relative entropy[edit]

The relative entropy (Kullback–Leibler divergence, KL divergence) of two distributions in an exponential family has a simple expression as the Bregman divergence between the natural parameters with respect to the log-normalizer.[14] The relative entropy is defined in terms of an integral, while the Bregman divergence is defined in terms of a derivative and inner product, and thus is easier to calculate and has a closed-form expression (assuming the derivative has a closed-form expression). Further, the Bregman divergence in terms of the natural parameters and the log-normalizer equals the Bregman divergence of the dual parameters (expectation parameters), in the opposite order, for the convex conjugate function.[15]

Fixing an exponential family with log-normalizer A{\displaystyle A}

(with convex conjugate A∗{\displaystyle A^{*}}

), writing PA,θ{\displaystyle P_{A,\theta }}

for the distribution in this family corresponding a fixed value of the natural parameter θ{\displaystyle \theta } (writing θ′{\displaystyle \theta '}

for another value, and with η,η′{\displaystyle \eta ,\eta '}

for the corresponding dual expectation/moment parameters), writing KL for the KL divergence, and BA{\displaystyle B_{A}}

for the Bregman divergence, the divergences are related as:

KL(PA,θ∥PA,θ′)=BA(θ′∥θ)=BA∗(η∥η′).{\displaystyle {\rm {{KL}(P_{A,\theta }\parallel P_{A,\theta '})=B_{A}(\theta '\parallel \theta )=B_{A^{*}}(\eta \parallel \eta ').}}}

The KL divergence is conventionally written with respect to the first parameter, while the Bregman divergence is conventionally written with respect to the second parameter, and thus this can be read as "the relative entropy is equal to the Bregman divergence defined by the log-normalizer on the swapped natural parameters", or equivalently as "equal to the Bregman divergence defined by the dual to the log-normalizer on the expectation parameters".

Maximum-entropy derivation[edit]

Exponential families arise naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?

The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x).

The entropy of dF(x) relative to dH(x) is

S[dF∣dH]=−∫dFdHlog⁡dFdHdH{\displaystyle S[dF\mid dH]=-\int {\frac {dF}{dH}}\log {\frac {dF}{dH}}\,dH}

S[dF∣dH]=∫log⁡dHdFdF{\displaystyle S[dF\mid dH]=\int \log {\frac {dH}{dF}}\,dF}

where dF/dH and dH/dF are Radon–Nikodym derivatives. The ordinary definition of entropy for a discrete distribution supported on a set I, namely

S=−∑i∈Ipilog⁡pi{\displaystyle S=-\sum _{i\in I}p_{i}\log p_{i}}

assumes, though this is seldom pointed out, that dH is chosen to be the counting measure on I.

Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is an exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.

The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to T0.

For examples of such derivations, see Maximum entropy probability distribution.

Role in statistics[edit]

Classical estimation: sufficiency[edit]

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases.

Less tersely, suppose Xk, (where k = 1, 2, 3, ... n) are independent, identically distributed random variables. Only if their distribution is one of the exponential family of distributions is there a sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases; the statistic T may be a vector or a single scalar number, but whatever it is, its size will neither grow nor shrink when more data are obtained.

As a counterexample if these conditions are relaxed, the family of uniform distributions (either discrete or continuous, with either or both bounds unknown) has a sufficient statistic, namely the sample maximum, sample minimum, and sample size, but does not form an exponential family, as the domain varies with the parameters.

Bayesian estimation: conjugate distributions[edit]

Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to an exponential family there exists a conjugate prior, which is often also in an exponential family. A conjugate prior π for the parameter η{\displaystyle {\boldsymbol {\eta }}} of an exponential family

f(x∣η)=h(x)exp⁡(ηTT(x)−A(η)){\displaystyle f(x\mid {\boldsymbol {\eta }})=h(x)\exp \left({\boldsymbol {\eta }}^{\rm {T}}\mathbf {T} (x)-A({\boldsymbol {\eta }})\right)}

is given by

pπ(η∣χ,ν)=f(χ,ν)exp⁡(ηTχ−νA(η)),{\displaystyle p_{\pi }({\boldsymbol {\eta }}\mid {\boldsymbol {\chi }},\nu )=f({\boldsymbol {\chi }},\nu )\exp \left({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }}-\nu A({\boldsymbol {\eta }})\right),}

or equivalently

pπ(η∣χ,ν)=f(χ,ν)g(η)νexp⁡(ηTχ),χ∈Rs{\displaystyle p_{\pi }({\boldsymbol {\eta }}\mid {\boldsymbol {\chi }},\nu )=f({\boldsymbol {\chi }},\nu )g({\boldsymbol {\eta }})^{\nu }\exp \left({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }}\right),\qquad {\boldsymbol {\chi }}\in \mathbb {R} ^{s}}

where s is the dimension of η{\displaystyle {\boldsymbol {\eta }}} and ν>0{\displaystyle \nu >0}

and χ{\displaystyle {\boldsymbol {\chi }}}

are hyperparameters (parameters controlling parameters). ν{\displaystyle \nu } corresponds to the effective number of observations that the prior distribution contributes, and χ{\displaystyle {\boldsymbol {\chi }}} corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations. f(χ,ν){\displaystyle f({\boldsymbol {\chi }},\nu )}

is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized). A(η){\displaystyle A({\boldsymbol {\eta }})} and equivalently g(η){\displaystyle g({\boldsymbol {\eta }})} are the same functions as in the definition of the distribution over which π is the conjugate prior.

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the theta parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.

An arbitrary likelihood will not belong to an exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

To show that the above prior distribution is a conjugate prior, we can derive the posterior.

First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter:

pF(x∣η)=h(x)g(η)exp⁡(ηTT(x)){\displaystyle p_{F}(x\mid {\boldsymbol {\eta }})=h(x)g({\boldsymbol {\eta }})\exp \left({\boldsymbol {\eta }}^{\rm {T}}\mathbf {T} (x)\right)}

Then, for data X=(x1,…,xn){\displaystyle \mathbf {X} =(x_{1},\ldots ,x_{n})}

, the likelihood is computed as follows:

p(X∣η)=(∏i=1nh(xi))g(η)nexp⁡(ηT∑i=1nT(xi)){\displaystyle p(\mathbf {X} \mid {\boldsymbol {\eta }})=\left(\prod _{i=1}^{n}h(x_{i})\right)g({\boldsymbol {\eta }})^{n}\exp \left({\boldsymbol {\eta }}^{\rm {T}}\sum _{i=1}^{n}\mathbf {T} (x_{i})\right)}

Then, for the above conjugate prior:

pπ(η∣χ,ν)=f(χ,ν)g(η)νexp⁡(ηTχ)∝g(η)νexp⁡(ηTχ){\displaystyle {\begin{aligned}p_{\pi }({\boldsymbol {\eta }}\mid {\boldsymbol {\chi }},\nu )&=f({\boldsymbol {\chi }},\nu )g({\boldsymbol {\eta }})^{\nu }\exp({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }})\propto g({\boldsymbol {\eta }})^{\nu }\exp({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }})\end{aligned}}}

We can then compute the posterior as follows:

p(η∣X,χ,ν)∝p(X∣η)pπ(η∣χ,ν)=(∏i=1nh(xi))g(η)nexp⁡(ηT∑i=1nT(xi))f(χ,ν)g(η)νexp⁡(ηTχ)∝g(η)nexp⁡(ηT∑i=1nT(xi))g(η)νexp⁡(ηTχ)∝g(η)ν+nexp⁡(ηT(χ+∑i=1nT(xi))){\displaystyle {\begin{aligned}p({\boldsymbol {\eta }}\mid \mathbf {X} ,{\boldsymbol {\chi }},\nu )&\propto p(\mathbf {X} \mid {\boldsymbol {\eta }})p_{\pi }({\boldsymbol {\eta }}\mid {\boldsymbol {\chi }},\nu )\\&=\left(\prod _{i=1}^{n}h(x_{i})\right)g({\boldsymbol {\eta }})^{n}\exp \left({\boldsymbol {\eta }}^{\rm {T}}\sum _{i=1}^{n}\mathbf {T} (x_{i})\right)f({\boldsymbol {\chi }},\nu )g({\boldsymbol {\eta }})^{\nu }\exp({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }})\\&\propto g({\boldsymbol {\eta }})^{n}\exp \left({\boldsymbol {\eta }}^{\rm {T}}\sum _{i=1}^{n}\mathbf {T} (x_{i})\right)g({\boldsymbol {\eta }})^{\nu }\exp({\boldsymbol {\eta }}^{\rm {T}}{\boldsymbol {\chi }})\\&\propto g({\boldsymbol {\eta }})^{\nu +n}\exp \left({\boldsymbol {\eta }}^{\rm {T}}\left({\boldsymbol {\chi }}+\sum _{i=1}^{n}\mathbf {T} (x_{i})\right)\right)\end{aligned}}}

The last line is the kernel of the posterior distribution, i.e.

p(η∣X,χ,ν)=pπ(η| χ+∑i=1nT(xi),ν+n){\displaystyle p({\boldsymbol {\eta }}\mid \mathbf {X} ,{\boldsymbol {\chi }},\nu )=p_{\pi }\left({\boldsymbol {\eta }}\left|~{\boldsymbol {\chi }}+\sum _{i=1}^{n}\mathbf {T} (x_{i}),\nu +n\right.\right)}

This shows that the posterior has the same form as the prior.

The data X enters into this equation only in the expression

T(X)=∑i=1nT(xi),{\displaystyle \mathbf {T} (\mathbf {X} )=\sum _{i=1}^{n}\mathbf {T} (x_{i}),}

which is termed the sufficient statistic of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size — it has only as many components as the components of η{\displaystyle {\boldsymbol {\eta }}} (equivalently, the number of parameters of the distribution of a single data point).

The update equations are as follows:

χ′=χ+T(X)=χ+∑i=1nT(xi)ν′=ν+n{\displaystyle {\begin{aligned}{\boldsymbol {\chi }}'&={\boldsymbol {\chi }}+\mathbf {T} (\mathbf {X} )\\&={\boldsymbol {\chi }}+\sum _{i=1}^{n}\mathbf {T} (x_{i})\\\nu '&=\nu +n\end{aligned}}}

This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. This can be seen clearly in the various examples of update equations shown in the conjugate prior page. Because of the way that the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms — a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions don't exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form — often specifically because the above form is defined over the natural parameter η{\displaystyle {\boldsymbol {\eta }}} while conjugate priors are usually defined over the actual parameter θ.{\displaystyle {\boldsymbol {\theta }}.}

Hypothesis testing: uniformly most powerful tests[edit]

A one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic T(x), provided that η(θ) is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis H0: θ ≥ θ0 vs. H1: θ < θ0.

Generalized linear models[edit]

Exponential families form the basis for the distribution functions used in generalized linear models (GLM), a class of model that encompasses many of the commonly used regression models in statistics. Examples include logistic regression using the binomial family and Poisson regression.

What is A and B in an exponential function?

Exponential functions are based on relationships involving a constant multiplier. You can write. an exponential function in general form. In this form, a represents an initial value or amount, and b, the constant multiplier, is a growth factor or factor of decay.

How do you find a and b of an exponential function from two points?

If you have two points, (x₁, y₁) and (x₂, y₂), you can define the exponential function that passes through these points by substituting them in the equation y = ab^x and solving for a and b. In general, you have to solve this pair of equations: y₁ = ab^x1 and y₂ = ab^x2^, .

How do you find the a value of an exponential function?

Exponential functions are written in the form f(x)=abx f ( x ) = a b x . Initial Value: The initial value of an exponential function is the result of substituting x=0 into the function. In the exponential function f(x)=abx f ( x ) = a b x , the initial value is a .