SPSS for Windows: Descriptive and Inferential Statistics
This document is the second module of a four-module tutorial series.
It describes the use of SPSS to
obtain descriptive and inferential statistics. The first section of this
module introduces procedures used to obtain several descriptive statistics,
frequency tables, and crosstabulations. The second
section covers the Chi-square test of independence, independent- and
paired-samples t tests, bivariate and partial correlations, regression, and
the general linear model. If you are not familiar with
SPSS or
need more information about how to get SPSS to
read your data, consult the first module of this four-part tutorial,
SPSS for
Windows: Getting Started. This set of documents uses a sample dataset,
Employee data.sav, that SPSS
provides. It can be found in the root SPSS
directory. If you installed SPSS in
the default location, then this file will be located in the following
location: C:\Program Files\SPSS\Employee Data.sav.
Some users prefer to use keystrokes to navigate through SPSS.
Information on common keystrokes is available in our SPSS
10 for Windows Keystroke Manual.
Section 4: Summarizing Data
Descriptive Statistics
A common first step in data analysis is to summarize information about
variables in your dataset, such as the averages and variances of
variables. Several summary or descriptive statistics are available under
the Descriptives option available from the Analyze and
Descriptive Statistics menus:
Analyze
Descriptive
Statistics
Descriptives...
After selecting the Descriptives option, the following dialog
box will appear:
[Image: Descriptives dialog box]
This dialog box allows you to select the variables for which
descriptive statistics are desired. To select a variable, first click on its
name in the box on the left side of the dialog box, then click on
the arrow button to move it to the Variable(s)
box. In the above example, the variables salbegin and salary have
been selected in this manner. To view the available
descriptive statistics, click on the button labeled Options. This
will produce the following dialog box:
[Image: Descriptives Options dialog box]
Clicking on the boxes next to the statistics' names will result in
these statistics being displayed in the output for this procedure. In the
above example, only the default statistics have been selected (mean,
standard deviation, minimum, and maximum); however, several
others could be selected. After selecting all of the statistics you
desire, output can be generated by first clicking on the Continue
button in the Options dialog box, then clicking on the OK
button in the Descriptives dialog box. The statistics that you
selected will be printed in the Output Viewer. For example, the selections
from the preceding example would produce the following output:
[Image: Descriptive Statistics output table]
This output contains several pieces of information that can be useful
to you in understanding the descriptive qualities of your data. The number
of cases in the dataset is recorded under the column labeled N.
Information about the range of variables is contained in the
Minimum and Maximum columns. For example, beginning salaries
range from $9,000 to $79,980, whereas current salaries range from $15,750
to $135,000. The average salary is contained in the Mean column.
Variability can be assessed by examining the values in the Std.
Deviation column. The standard deviation measures the amount of variability in
the distribution of a variable. Thus, the more that the individual data
points differ from each other, the larger the standard deviation will be.
Conversely, if there is a great deal of similarity between data points,
the standard deviation will be quite small. The standard deviation
describes the typical amount by which values differ from the mean. For example,
a starting salary of $24,886.73 is one standard deviation
above the mean in the above example, in which the variable salbegin
has a mean of $17,016.09 and a standard deviation of $7,870.64. Examining
differences in variability could be useful for anticipating further
analyses: in the above example, it is clear that there is much greater
variability in the current salaries than in the beginning salaries. Because equality of
variances is an assumption of many inferential statistics, this
information is important to a data analyst.
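The default Descriptives statistics can also be reproduced outside SPSS. The following is a minimal Python sketch, not SPSS syntax. The salary values are hypothetical and are not taken from Employee data.sav, the function name describe is an illustrative choice, and the sample (n - 1) standard deviation is used on the assumption that it corresponds to the Std. Deviation column.

```python
import statistics

# Hypothetical beginning salaries; these are NOT the values in Employee data.sav.
salbegin = [9000, 12000, 15750, 17000, 21000, 27500]

def describe(values):
    """Return the default Descriptives statistics for one variable."""
    return {
        "N": len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": statistics.mean(values),
        # Sample standard deviation (n - 1 in the denominator).
        "Std. Deviation": statistics.stdev(values),
    }

summary = describe(salbegin)
for name, value in summary.items():
    print(f"{name:>15}: {value:,.2f}")

# One standard deviation above the mean, using the figures quoted in the text.
one_sd_above = 17016.09 + 7870.64
print(f"One SD above the mean: {one_sd_above:,.2f}")
```

The last two lines repeat the arithmetic from the text: the mean plus one standard deviation gives the starting salary of $24,886.73.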
Frequencies
While the descriptive statistics procedure described above is useful
for summarizing data with an underlying continuous distribution, the
Descriptives procedure will not prove helpful for interpreting
categorical data. Instead, it is more useful to investigate the numbers of
cases that fall into various categories. The Frequencies option
allows you to obtain, for example, the number of people within each education level in
the dataset. The Frequencies procedure is found under the
Analyze menu:
Analyze
Descriptive
Statistics
Frequencies...
Selecting this menu item produces the following dialog box:
[Image: Frequencies dialog box]
Select variables by clicking on them in the left box, then clicking the
arrow in between the two boxes. Frequencies will be obtained for all of
the variables in the box labeled Variable(s). This is the only step
necessary for obtaining frequency tables; however, there are several other
descriptive statistics available, many of which are described in the
preceding section. The example in the above dialog box would produce the
following output:
[Image: Frequencies output table]
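A frequency table is simply a count of cases per category, with each count expressed as a percentage of the total. The following Python sketch illustrates the idea. The education values are hypothetical, and the function name frequency_table and the column layout are illustrative assumptions rather than SPSS's own.

```python
from collections import Counter

# Hypothetical years-of-education values; not taken from Employee data.sav.
educ = [12, 12, 15, 16, 12, 8, 15, 12, 16, 19]

def frequency_table(values):
    """Return (value, count, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        percent = 100.0 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], percent, cumulative))
    return rows

table = frequency_table(educ)
print(f"{'Value':>5} {'Freq':>5} {'Percent':>8} {'Cum. %':>8}")
for value, count, percent, cumulative in table:
    print(f"{value:>5} {count:>5} {percent:>8.1f} {cumulative:>8.1f}")
```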
Clicking on the Statistics button produces a dialog box with
several additional descriptive statistics. Clicking on the Charts
button produces the following box, which allows you to graphically examine
your data in several different formats:
[Image: Frequencies Charts dialog box]
Each of the available options provides a visual display of the data.
For example, clicking on the Histograms button with its suboption,
With normal curve, will provide you with a chart similar to that
shown below. This will allow you to assess whether your data are normally
distributed, which is an assumption of several inferential statistics. You
can also use the Explore procedure, available from the
Descriptive Statistics menu, to obtain the Kolmogorov-Smirnov test,
which is a hypothesis test of whether your data are normally
distributed.
[Image: Histogram with normal curve]
Crosstabulation
While frequencies show the numbers of cases in each level of a
categorical variable, they do not give information about the
relationship between categorical variables. For example, frequencies can
give you the number of men and women in a company AND the number of people
in each employment category, but not the number of men and women IN each
employment category. The Crosstabs procedure is useful for
investigating this type of information because it can provide information
about the intersection of two variables. The number of men and women in
each of three employment categories is one example of information that can
be crosstabulated. The Crosstabs procedure is found in the
Analyze menu in the Data Editor window:
Analyze
Descriptive
Statistics
Crosstabs…
After selecting Crosstabs from the menu, the dialog box shown
below will appear on your monitor. The box on the left side of the dialog
box contains a list of all of the variables in the working dataset.
Variables from this list can be selected for rows, columns, or layers in a
crosstabulation. For example, selecting the variable gender for the
rows of the table and jobcat for the columns would produce a
crosstabulation of gender by job category.
[Image: Crosstabs dialog box]
The options available by selecting the Statistics and Cells
buttons provide you with several additional output features. Selecting
the Cells button will produce a menu that allows you to add
additional values to your table. For example, in the dialog box shown below,
the Expected option in the
Counts box and the Row, Column, and Total
options in the Percentages box have been selected.
[Image: Cells dialog box]
The combination of the two dialog boxes shown above will produce the
following output table:
[Image: Crosstabulation output table]
The crosstabulation statistics provide several interesting observations
about the data. In the above table, there appears to be an association
between gender and employment category because the expected values, which are
the values expected if the two variables were unrelated, differ from the actual
counts. The following section will discuss how to further examine this
relationship with inferential statistics.
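The cell contents described above (observed count, expected count, and row percentage) can be computed directly from raw records. The following Python sketch assumes a small hypothetical list of (gender, jobcat) pairs, not the sample dataset, and the function name crosstab is illustrative. The expected count for a cell is its row total multiplied by its column total, divided by the number of cases.

```python
from collections import Counter

# Hypothetical (gender, jobcat) records; not taken from Employee data.sav.
records = [
    ("f", "Clerical"), ("f", "Clerical"), ("f", "Clerical"), ("f", "Manager"),
    ("m", "Clerical"), ("m", "Custodial"), ("m", "Manager"), ("m", "Manager"),
]

def crosstab(pairs):
    """Return observed count, expected count, and row percentage for every cell."""
    observed = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    n = len(pairs)
    cells = {}
    for row in sorted(row_totals):
        for col in sorted(col_totals):
            count = observed[(row, col)]  # a Counter returns 0 for empty cells
            cells[(row, col)] = {
                "count": count,
                # Expected count under independence: row total * column total / N.
                "expected": row_totals[row] * col_totals[col] / n,
                "row_pct": 100.0 * count / row_totals[row],
            }
    return cells

cells = crosstab(records)
for (row, col), cell in cells.items():
    print(row, col, cell["count"], round(cell["expected"], 2), round(cell["row_pct"], 1))
```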
Section 5: Inferential Statistics
Chi-Square Test
The Chi-square test for independence is used in situations where you
have two categorical variables. A categorical variable is a
qualitative variable in which cases are classified in one and only one of
the possible levels. A classic example is gender, in which cases are
classified in one of two possible levels. The example in the above
section, in which Gender and Employment Category are
crosstabulated using the SPSS
Crosstabs procedure, is an example of data with which you could
conduct a Chi-square test of independence testing the null
hypothesis that there is no relationship between the two variables.
For instance, you could conduct a test of the hypothesis that there is
no relationship between Gender and Employment Category. If
this hypothesis were true, you would expect that the proportion of men and
women would be the same within each level of Employment Category.
In other words, there should be little difference between observed
and expected values, where the expected values represent the
numbers that would be in each cell when the variables are independent of
each other. The difference between observed and expected values is the
basis of the Chi-square statistic: it evaluates how likely
differences of this size between the observed and expected values would be under the
null hypothesis that the two variables are independent. The
expected values can be obtained by clicking on the Cells box in the
Crosstabs dialog box, as described in the preceding section.
Examining the table above, it appears that
gender and employment category are not independent of each other. It appears
that there are more women in clerical positions than would be expected by
chance, whereas there are more men in custodial and managerial positions
than would be expected by chance. Conducting a Chi-square test of
independence would tell us whether the observed pattern is statistically
different from the pattern expected due to chance.
The Chi-square test of independence can be obtained through the
Crosstabs dialog boxes that were used above to get a
crosstabulation of the data. After opening the Crosstabs dialog box
as described in the preceding section, click the Statistics button
to get the following dialog box:
[Image: Crosstabs Statistics dialog box]
By clicking on the box labeled Chi-Square, you will obtain the
Chi-square test of independence for the variables you have crosstabulated.
This will produce the following table in the Output Viewer:
[Image: Chi-Square Tests output table]
Inspecting the table in the previous section, it appears that the
two variables, gender and employment category, are related to each other
in some way. This finding is suggested by the substantial differences between
the observed and expected counts: these differences represent the
difference between values expected if gender and employment classification
were independent of each other (expected counts) and the actual numbers of
cases in each cell (observed counts). For example, if gender and
employment classification were unrelated, then it is expected that 38.3
women would be in the manager classification as opposed to the observed
number, 10. In this example, the expected value of 38.3 represents the
fact that 45.6% of the cases in this dataset are women, so it is expected
that 45.6% of the 84 managers in the dataset would also be women if gender
and employment classification were independent of each other. The output
above provides a statistical hypothesis test for the hypothesis that
gender and employment category are independent of each other. The large
Chi-Square statistic (79.28) and its small significance level (p
< .001) indicate that it is very unlikely that these variables are
independent of each other. Thus, you can conclude that there is a
relationship between a person's gender and their employment
classification.
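The Chi-square statistic itself is the sum, over all cells, of the squared difference between observed and expected counts divided by the expected count. The Python sketch below works through that calculation. The observed counts are an assumption: they are chosen to be consistent with the figures quoted in the text (84 managers, 10 of them women, 45.6% women overall), but they should be checked against your own Crosstabs output. The function name chi_square is illustrative.

```python
import math

# Observed counts ASSUMED to match the gender-by-jobcat crosstabulation for the
# sample dataset; verify them against your own output before relying on them.
observed = {
    "Female": {"Clerical": 206, "Custodial": 0, "Manager": 10},
    "Male": {"Clerical": 157, "Custodial": 27, "Manager": 74},
}

def chi_square(table):
    """Return the Pearson chi-square statistic, its degrees of freedom,
    and the table of expected counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())
    expected = {
        r: {c: row_totals[r] * col_totals[c] / n for c in cols} for r in rows
    }
    statistic = sum(
        (table[r][c] - expected[r][c]) ** 2 / expected[r][c]
        for r in rows
        for c in cols
    )
    df = (len(rows) - 1) * (len(cols) - 1)
    return statistic, df, expected

statistic, df, expected = chi_square(observed)
# For df = 2 only, the upper-tail chi-square probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.2e}")
print(f"expected female managers = {expected['Female']['Manager']:.1f}")
```

With these counts the sketch gives a statistic of about 79.28 and an expected count of about 38.3 female managers, in line with the values discussed above. The closed-form p value applies only because this table has two degrees of freedom.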
T tests
The t test is a useful technique for comparing mean values of
two sets of numbers. The comparison will provide you with a statistic for
evaluating whether the difference between two means is statistically
significant. T tests can be used either to compare two independent
groups (independent-samples t test) or to compare observations from
two measurement occasions for the same group (paired-samples t
test). To conduct a t test, your data should be a sample drawn from
a continuous underlying distribution. If you are using the t test
to compare two groups, the groups should be randomly drawn from normally
distributed and independent populations. For example, if you were
comparing clerical and managerial salaries, the independent
populations are clerks and managers, which are two nonoverlapping
groups. If you have more than two groups or more than two variables in a
single group that you want to compare, you should use one of the
General Linear Model procedures in SPSS,
which are described below.
There are three types of t tests; the options are all
located under the Analyze menu item:
Analyze
Compare Means
One-Sample T test...
Independent-Samples T test...
Paired-Samples T test...
While each of these t tests compares mean values of two sets of
numbers, they are designed for distinctly different situations:
- The one-sample t test is used to compare a single sample with a
population value. For example, a test could be conducted to compare the
average salary of managers within a company with a value that was known
to represent the national average for managers.
- The independent-samples t test is used to compare two groups'
scores on the same variable. For example, it could be used to compare
the salaries of clerks and managers to evaluate whether there is a
difference in their salaries.
- The paired-samples t test is used to compare the means of two
variables within a single group. For example, it could be used to see if
there is a statistically significant difference between starting
salaries and current salaries among the custodial staff in an
organization.
To conduct an independent-samples t test, first select the menu
option shown above to produce the following dialog box:
[Image: Independent-Samples T Test dialog box]
To select variables for the analysis, first highlight them by clicking
on them in the box on the left. Then move them into the appropriate box on
the right by clicking on the arrow button in the center of the box. Your
independent variable, which defines the groups being compared, should go
in the Grouping Variable box. For example,
because employment categories are being compared in this analysis, the
jobcat variable is selected. However, because jobcat has
more than two levels, you will need to click on Define Groups to
specify the two levels of jobcat that you want to compare. This
will produce another dialog box, as shown below:
[Image: Define Groups dialog box]
Here, the groups to be compared are limited to the groups with the
values 2 and 3, which represent the clerical and managerial groups. After
selecting the groups to be compared, click the Continue button, and
then click the OK button in the main dialog box. The above choices
will produce the following output:
[Image: Group Statistics output table]
[Image: Independent Samples Test output table]
The first output table, labeled Group Statistics, displays
descriptive statistics. The second output table, labeled Independent
Samples Test, contains the statistics that are critical to evaluating
the current research question. This table contains two sets of analyses:
the first assumes equal variances and the second does not. To assess
whether you should use the statistics for equal or unequal variances, use
the significance level associated with the value under the heading,
Levene's Test for Equality of Variances. It tests the hypothesis
that the variances of the two groups are equal. A small value in the
column labeled Sig. indicates that this hypothesis should be rejected and
that the groups have unequal variances. In the above case, the
small value in that column indicates that the variances of the two groups,
clerks and managers, are not equal. Thus, you should use the statistics in
the row labeled Equal variances not assumed.
The SPSS
output reports a t statistic and degrees of freedom for all
t test procedures. Every unique combination of a t statistic and
its associated degrees of freedom has a significance value. In the above
example, which tests the hypothesis that clerks and managers do not differ in
their salaries, the t statistic under the assumption of unequal
variances has a value of -16.3, and the degrees of freedom has a value of
89.6, with an associated significance level of .000. The significance level
tells us that a difference this large would be very unlikely if clerical
and managerial salaries did not differ: specifically, less than one time in
a thousand would we obtain a mean difference of $33,038 or larger between
these groups if there were really no differences in their salaries.
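The fractional degrees of freedom in the Equal variances not assumed row come from an approximation commonly known as the Welch-Satterthwaite formula. The Python sketch below computes the t statistic and degrees of freedom in that way. It assumes this is the calculation behind that row; the two salary lists are hypothetical and the function name welch_t is illustrative.

```python
import math
import statistics

# Hypothetical salaries for two non-overlapping groups; not from Employee data.sav.
clerical = [24000, 26500, 27000, 28500, 30000, 31500]
manager = [48000, 55000, 61000, 66000, 72000]

def welch_t(a, b):
    """Return (t, df) for a two-sample t test that does not assume equal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error contributed by each group: s^2 / n.
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df

t, df = welch_t(clerical, manager)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The t statistic is negative here for the same reason as in the output above: the first group has the lower mean.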
To obtain a paired-samples t test, select the menu items
described above and the following dialog box will appear:
[Image: Paired-Samples T Test dialog box]
The above example illustrates a t test between the variables
salbegin and salary, which represent employees' beginning
salaries and their current salaries. To set up a paired-samples t test
as in the above example, click on the two variables that you want to
compare. The variable names will appear in the section of the box labeled
Current Selections. When these variable names appear there, click
the arrow in the middle of the dialog box and they will appear in the
Paired Variables box. Clicking the OK button with the above
variables selected will produce output for the paired-samples t
test. The following output is an example of the statistics you would
obtain from the above example.
[Image: Paired Samples Test output table]
As with the independent-samples t test, there is a t
statistic and degrees of freedom that have a significance level
associated with them. The t test in this example tests the hypothesis
that there is no difference between employees' beginning and current salaries. The
t statistic (35.04) and its associated significance level
(p < .001) indicate that this is not the case. In fact, a
mean difference as large as the observed $17,403.48 between beginning and current
salaries would occur fewer than once in a thousand times if there really
were no difference between employees' beginning and current salaries.
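A paired-samples t test is a one-sample t test on the differences between the two measurements for each case. The Python sketch below shows the calculation on hypothetical beginning and current salaries for five employees. The data and the function name paired_t are illustrative assumptions.

```python
import math
import statistics

# Hypothetical beginning and current salaries for the same five employees.
salbegin = [12000, 15000, 13500, 18000, 21000]
salary = [21000, 27000, 22500, 30000, 40500]

def paired_t(before, after):
    """Return (t, df, mean difference) for a paired-samples t test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference.
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / se, len(diffs) - 1, mean_diff

t, df, mean_diff = paired_t(salbegin, salary)
print(f"mean difference = {mean_diff:,.2f}, t = {t:.2f}, df = {df}")
```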
Correlation
Correlation is one of the most common forms of data analysis, both
because it can provide an analysis that stands on its own and
because it underlies many other analyses and can be a good way to
support conclusions after primary analyses have been completed.
Correlations are a measure of the linear relationship between two
variables. A correlation coefficient has a value ranging from -1 to 1.
Values whose absolute value is closer to 1 indicate that there is a
strong relationship between the variables being correlated, whereas values
closer to 0 indicate that there is little or no linear relationship. The
sign of a correlation coefficient describes the type of relationship
between the variables being correlated. A positive correlation coefficient
indicates that there is a positive linear relationship between the
variables: as one variable increases in value, so does the other. Two
variables that are likely to be positively correlated are
the number of days a student attended class and test grades because, as
the number of classes attended increases in value, so do test grades. A
negative value indicates a negative linear relationship between variables:
as one variable increases in value, the other variable decreases in value.
The number of days students miss class and their test scores are likely to
be negatively correlated because as the number of days of missed classes
increases, test scores typically decrease.
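The attendance example can be made concrete with a short Python sketch of the Pearson correlation coefficient. The attendance and grade values are hypothetical, and the function name pearson_r is illustrative.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for five students in a 20-day course.
days_attended = [10, 12, 15, 18, 20]
days_missed = [20 - d for d in days_attended]
grades = [55, 62, 70, 81, 88]

r_attended = pearson_r(days_attended, grades)  # positive relationship
r_missed = pearson_r(days_missed, grades)      # negative relationship
print(f"attended vs. grades: {r_attended:.3f}")
print(f"missed vs. grades:   {r_missed:.3f}")
```

Because days missed is just 20 minus days attended in this made-up data, the two coefficients have the same size and opposite signs.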
To obtain a correlation in SPSS,
start at the Analyze menu. Select the Correlate option from
this menu. By selecting this menu item, you will see that there are three
options for correlating variables: (1) Bivariate, (2)
Partial, and (3) Distances. This document will cover the
first two types of correlations. The bivariate correlation is for
situations where you are interested only in the relationship between two
variables. Partial correlations should be used when you are
measuring the association between two variables but want to factor out the
effect of one or more other variables.
To obtain a bivariate correlation, choose the following menu
option:
Analyze
Correlate
Bivariate...
This will produce the following dialog box:
[Image: Bivariate Correlations dialog box]
To obtain correlations, first click on the variable names in the
variable list on the left side of the dialog box. Next, click on the arrow
between the two white boxes which will move the selected variables into
the Variables box. Each variable listed in the Variables box
will be correlated with every other variable in the box. For example, with
the above selections, we would obtain correlations between Education
Level and Current Salary, between Education Level and
Previous Experience, and between Current Salary and
Previous Experience. We will maintain the default options shown in
the above dialog box in this example. The first option to consider is the
type of correlation coefficient. Pearson's is appropriate for continuous
data as noted in the above example, whereas the other two correlation
coefficients, Kendall's tau-b and Spearman's, are designed for ranked
data. The choice between a one- and two-tailed significance test in the
Test of Significance box should be determined by whether the
hypothesis you are testing makes a prediction about the direction of the
effect between the two variables: if you are predicting that
there is a negative or positive relationship between the variables, then
the one-tailed test is appropriate; if you are not making a specific
prediction about the direction of the relationship between the variables
you are correlating, you should use the two-tailed test. The selections
in the above dialog box will produce
the following output:
[Image: Correlations output table]
This output gives us a correlation matrix for the three correlations
requested in the above dialog box. Note that despite there being nine
cells in the above matrix, there are only three correlation coefficients
of interest: (1) the correlation between current salary and educational
level, (2) the correlation between previous experience and educational level,
and (3) the correlation between current salary and previous experience.
Only three of the nine correlations are of interest because the
diagonal consists of correlations of each variable with itself, always
resulting in a value of 1.00, and the values on each side of the diagonal
replicate the values on the opposite side of the diagonal. For example,
the first of the three unique correlation coefficients shows that there is a positive
correlation between employees' number of years of education and their
current salary. This positive correlation coefficient (.661) indicates
that there is a statistically significant (p < .001) linear
relationship between these two variables such that the more education a
person has, the larger that person's salary is. Also observe that there is
a statistically significant (p < .001) negative correlation
coefficient (-.252) for the association between education level and
previous experience, indicating that the linear relationship between these
two variables is one in which the values of one variable decrease as the
other increases. The third correlation coefficient (-.097) also indicates
a negative association between employees' current salaries and their
previous work experience, although this correlation is fairly weak.
The second type of correlation listed under the Correlate menu
item is the partial correlation, which measures an association between two
variables with the effects of one or more other variables factored out. To
obtain a partial correlation, select the following menu item:
Analyze
Correlate
Partial...
This will produce the following dialog box:
[Image: Partial Correlations dialog box]
Here, we have selected the variables we want to correlate as well as
the variable for which we want to control by first clicking on variable
names to highlight them on the left side of the box, then moving them to
the boxes on the right by clicking on the arrow immediately to the left of
either the Variables box or the Controlling for box. In this
example, we are correlating current salaries with years of education while
controlling for beginning salaries. Thus, we will have a measure of the
association between current salaries and years of education, while
removing the association between beginning salaries and the two variables
we are correlating. The above example will produce the following
output:
- - -  P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S  - - -

Controlling for..    SALBEGIN

              SALARY       EDUC

SALARY        1.0000      .2810
             (    0)    (  471)
             P= .       P= .000

EDUC           .2810     1.0000
             (  471)    (    0)
             P= .000    P= .

(Coefficient / (D.F.) / 2-tailed Significance)

" . " is printed if a coefficient cannot be computed
Notice that the correlation coefficient is considerably smaller in the
output above than in the bivariate correlation example: the bivariate correlation
between these variables was .661, whereas it is only .281 in the partial
correlation. Nevertheless, a statistically significant association
(p < .001) exists between these variables. While it may be
obvious in the above example that starting and current salaries will be
highly correlated, the example illustrates how partial correlations can be
used to assess the extent to which variables explain unique
variance by removing the effects of other variables that may be highly
correlated with the variables of interest.
Partial correlations can be especially useful in situations where it is
not obvious whether variables possess a unique relationship or whether
several variables overlap with each other. For example, if you were
attempting to correlate anxiety with job performance and stress with job
performance, it would be useful to conduct partial correlations. You could
correlate anxiety and a job performance measure while controlling for
stress to determine if there were a unique relationship between anxiety
and job performance or whether perhaps stress is highly correlated
with anxiety--which would result in little remaining variance that could
be uniquely attributed to the association between anxiety and job
performance.
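When a single variable is controlled for, the partial correlation can be computed from the three pairwise correlations. The Python sketch below applies that standard first-order formula to the anxiety, stress, and job performance example. All scores are hypothetical, and the function names pearson_r and partial_r are illustrative.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with a single control variable z removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical scores for six employees.
anxiety = [3, 5, 4, 7, 8, 6]
stress = [2, 6, 4, 6, 9, 5]
performance = [9, 6, 8, 5, 3, 6]

r_bivariate = pearson_r(anxiety, performance)
r_partial = partial_r(
    r_bivariate,
    pearson_r(anxiety, stress),
    pearson_r(performance, stress),
)
print(f"bivariate r = {r_bivariate:.3f}")
print(f"partial r (controlling for stress) = {r_partial:.3f}")
```

Comparing the two printed values shows how much of the anxiety and performance association remains once the overlap with stress has been removed.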
Regression
Regression is a technique that can be used to investigate the effect of
one or more predictor variables on an outcome variable. Regression allows
you to make statements about how well one or more independent variables
will predict the value of a dependent variable. For example, if you were
interested in investigating which variables in the employee database were
good predictors of employees' current salaries, you could create a
regression equation that would use several of the variables in the dataset
to predict employees' salaries. By doing this you will be able to make
statements about whether variables such as
employees' number of years of education, their starting salary, or their
number of months on the job are good predictors of their current
salaries.
To conduct a regression analysis, select the following from the
Analyze menu:
Analyze
Regression
Linear...
This will produce the following dialog box:
[Image: Linear Regression dialog box]
This dialog box illustrates an example regression equation. As with
other analyses, you select variables from the box on the left by clicking
on them, then moving them to the boxes on the right by clicking the arrow
next to the box where you want to enter a particular variable. Here,
employees' current salary has been entered as the dependent variable. In
the Independent(s) box, several predictor variables have been
entered, including education level, beginning salary, months since hire,
and previous experience.
NOTE: Before you run a regression model, you should consider the method
that you use for selecting or rejecting variables in that model. The box
labeled Method allows you to select from one of five methods:
Enter, Remove, Forward, Backward, and
Stepwise. Unfortunately, we cannot offer a comprehensive discussion
of the characteristics of each of these methods here, but you have several
options regarding the method you use to remove and retain predictor
variables in your regression equation. In this example, we will use the
SPSS
default method, Enter, which is a standard approach in regression
models. If you have questions about which method is most appropriate for
your data analysis, consult a regression textbook, the SPSS
help facilities, or contact a consultant.
The following output assumes that only the default options have been
requested. If you have selected options from the Statistics,
Plots, or Options boxes, then you will have more output than
is shown below and some of your tables may contain additional statistics
not shown here.
The first table in the output, shown below, includes information about
the quantity of variance that is explained by your predictor variables.
The first statistic, R, is the multiple correlation coefficient
between all of the predictor variables and the dependent variable. In this
model, the value is .90, which indicates that there is a great deal of
variance shared by the independent variables and the dependent variable.
The next value, R Square, is simply the squared value of R.
This is frequently used to describe the goodness-of-fit or the amount of
variance explained by a given set of predictor variables. In this example,
the value is .81, which indicates that 81% of the variance in the
dependent variable is explained by the independent variables in the
model.
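The meaning of R Square can be illustrated with a one-predictor regression: it is one minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below uses a single predictor for brevity, whereas the example in the text uses four. The data are hypothetical and the function name fit_line is illustrative.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns (slope, intercept, r_square)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * a for a in x]
    ss_residual = sum((b - p) ** 2 for b, p in zip(y, predicted))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    # Proportion of variance in y explained by the predictor.
    r_square = 1 - ss_residual / ss_total
    return slope, intercept, r_square

# Hypothetical years of education and current salaries for six employees.
educ = [8, 12, 12, 15, 16, 19]
salary = [21000, 24500, 28000, 33000, 41000, 60000]

slope, intercept, r_square = fit_line(educ, salary)
print(f"R = {math.sqrt(r_square):.2f}, R Square = {r_square:.2f}")
```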

The second table in the output is an ANOVA table that describes the
overall variance accounted for in the model. The F statistic
represents a test of the null hypothesis that the
regression coefficients are all equal to zero.
Put another way, this F statistic tests whether the R square
proportion of variance in the dependent variable accounted for by the
predictors is zero. If the null hypothesis were true, then that would
indicate that there is not a regression relationship between the dependent
variable and the predictor variables. But, instead, it appears that the
coefficients of the four predictor variables in the present example are not
all equal to zero and that the predictors could be used to predict the
dependent variable, current salary,
as is indicated by a large F value and a small significance
level.
[Image: ANOVA output table]
The third table in standard regression output provides information
about the effects of individual predictor variables. Generally, there are
two types of information in the Coefficients table: coefficients
and significance tests. The coefficients indicate the increase in the
value of the dependent variable for each unit increase in the predictor
variable, holding the other predictors constant. For example, the
unstandardized coefficient for Educational Level in the example is
669.91, which indicates that for each additional year of education, a
person's predicted salary increases by $669.91. A well-known problem with
the interpretation of unstandardized coefficients is that their values
depend on the scale of the variable for which they were calculated, which
makes it difficult to assess the relative influence of independent
variables through a comparison of unstandardized coefficients. For
example, comparing the unstandardized coefficient of Educational Level,
669.91, with the unstandardized coefficient of the variable Beginning
Salary, 1.77, it could appear that Educational Level is a stronger
predictor of a person's current salary than is Beginning Salary. We can
see that this is deceiving, however, if we examine the standardized
coefficients, or Beta coefficients. Beta coefficients are based on data
expressed in standardized, or z score, form. Thus, all variables have a
mean of zero and a standard deviation of one and are expressed in the
same units of measurement. Examining the Beta coefficients for
Educational Level and Beginning Salary, we can see that when these two
variables are expressed on the same scale, Beginning Salary is clearly
the better predictor of Current Salary.
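The rescaling behind a Beta coefficient can be sketched in a few lines of Python. The two unstandardized coefficients below are the ones quoted in the text; the three standard deviations are assumptions made up for illustration, so the resulting Beta values will not match the SPSS output exactly.

```python
def standardized_coefficient(b: float, sd_predictor: float, sd_outcome: float) -> float:
    """Convert an unstandardized coefficient b into a Beta coefficient."""
    return b * sd_predictor / sd_outcome


# Unstandardized coefficients quoted in the text.
b_educ = 669.91
b_salbegin = 1.77

# Hypothetical standard deviations (assumptions, not SPSS output).
sd_educ = 3.0          # years of education
sd_salbegin = 8000.0   # dollars
sd_salary = 17000.0    # dollars

beta_educ = standardized_coefficient(b_educ, sd_educ, sd_salary)
beta_salbegin = standardized_coefficient(b_salbegin, sd_salbegin, sd_salary)
print(f"Beta for educational level: {beta_educ:.3f}")
print(f"Beta for beginning salary:  {beta_salbegin:.3f}")
```

Because beginning salary varies over thousands of dollars while education varies over a few years, the small unstandardized coefficient of 1.77 turns into the larger standardized one.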
In addition to the coefficients, the table also provides a significance
test for each of the independent variables in the model. The significance
test evaluates the null hypothesis that the unstandardized regression
coefficient for the predictor is zero when all other predictors are
included in the model. This test is presented as a t statistic. For
example, examining the t statistic for the variable Months Since Hire,
you can see that it is associated with a significance value of .000 (that
is, less than .0005), indicating that the null hypothesis, which states
that this variable's regression coefficient is zero when all other
predictors are in the model, can be rejected.
[Figure: regression output table]
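Each coefficient in a multiple regression is a partial coefficient: it describes a predictor's relationship with the dependent variable after the other predictors have been taken into account. The sketch below illustrates this with two predictors in plain Python. The data and function names are hypothetical, and the outcome is constructed as exactly 1 + 2*x1 + 3*x2 so that the partial slope for x1 is known in advance.

```python
from statistics import mean


def slope_and_intercept(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """The part of y that is left over after predicting it from x."""
    slope, intercept = slope_and_intercept(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: y is built as exactly 1 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Slope for x1 when x2 is ignored.
simple_slope, _ = slope_and_intercept(x1, y)

# Slope for x1 after x2 has been removed from both x1 and y.
partial_slope, _ = slope_and_intercept(residuals(x2, x1), residuals(x2, y))

print(round(simple_slope, 3), round(partial_slope, 3))
```

The simple slope is inflated because x1 and x2 are correlated, whereas the partial slope recovers the value of 2 used to build the data. The t statistic in the Coefficients table is a test of this partial slope.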
General Linear Model
The majority of procedures used for conducting analysis of variance
(ANOVA) in SPSS can
be found under the General Linear Model (GLM) menu item in the
Analyze menu. Analysis of variance can be used in many situations
to determine whether there are differences between groups on the basis of
one or more outcome variables or if a continuous variable is a good
predictor of one or more dependent variables. There are three varieties
of the general linear model available in SPSS:
univariate, multivariate, and repeated measures. The univariate
general linear model is used in situations where you only have a
single dependent variable, but may have several independent variables that
can be fixed between-subjects factors, random between-subjects factors, or
covariates. The multivariate general linear model is used in
situations where there is more than one dependent variable and independent
variables are either fixed between-subjects factors or covariates. The
repeated measures general linear model is used in situations where
you have more than one measurement occasion for a dependent variable and
have fixed between-subjects factors or covariates as independent
variables. Because it is beyond the scope of this document to cover all
three varieties of the general linear model in detail, we will focus on
the univariate version of the general linear model with some attention
given to topics that are unique to the repeated measures general linear
model. Understanding the univariate model will also prove useful for
understanding the other GLM options that are provided in SPSS.
The univariate general linear model is used to compare group means and
to estimate the effect of covariates on a single dependent variable. For
example, you may want to see if there are differences between men's and
women's salaries in a sample of employee data. To do this, you would want
to demonstrate that the average salary is significantly different between
men and women. However, you are likely aware that other factors could
affect a person's salary and need to be controlled for in such an
analysis. Educational background and starting salary are two such
variables. By including these variables in your analysis, you will be
able to evaluate the differences between men's and women's salaries while
controlling for the influence of these other variables.
To specify a univariate general linear model in SPSS, go
to the Analyze menu and select Univariate from the General Linear Model
menu:
Analyze
General Linear Model
Univariate...
This will produce the following dialog box:
[Figure: Univariate dialog box]
The above box demonstrates a model with multiple types of independent
variables. The variable gender has been designated as a fixed
factor because it contains all of the levels of interest.
In contrast, random factors are variables that represent a
random sample of the possible levels that could be sampled. There are no
true random factors in our dataset; therefore, this input box has
been left blank here. However, you could imagine a situation similar to
the above example where you sampled data from multiple corporations for
your employee database. In that case, you would have introduced a random
factor into the model: the corporation to which an employee belongs.
Corporation is a random factor because you would only be sampling a few of
the many possible corporations to which you would want to generalize your
results.
The next input box contains the covariates in your model. A
covariate is a quantitative independent variable. Covariates are
often entered in models to reduce error variance: by removing the effects
of the relationship between the covariate and the dependent variable, you
can often get a better estimate of the amount of variance that is being
accounted for by the factors in the model. Covariates can also be used to
measure the linear association between the covariate and a dependent
variable, as is done in regression models. In this situation, a linear
relationship indicates that the dependent variable increases or decreases
in value as the covariate increases or decreases in value.
The box labeled WLS Weight can contain a variable that is used
to weight cases in a weighted least-squares analysis. This
procedure is infrequently used, however, and is not discussed in any
detail here.
The default model for the SPSS univariate GLM will
include main effects for all independent variables and will provide
interaction terms for all possible combinations of fixed and random
factors. You may not want this default model, or you may want to create
interaction terms between your covariates and some of the factors. In
fact, if you intend to conduct an analysis of covariance, you should test
for interactions between covariates and factors. Doing so will determine
whether you have met the homogeneity of regression slopes
assumption, which states that the regression slopes for all groups in your
analysis are equal. This assumption is important because the means for
each group are adjusted using a single slope that is pooled across the
groups, so that group differences in the covariate are removed from the
dependent variable. Thus, it is assumed that the relationship between the
covariate and the dependent variable is the same at all levels of the
independent variables. To make changes in the default model, click on the
Model button, which will produce the following dialog box:
[Figure: Model dialog box]
The first step for modifying the default model is to click on the
button labeled Custom, to activate the grayed out areas of the
dialog box. At this point, you can begin to move variables in the
Factors & Covariates box into the Model box. First, move
all of the main effects into the Model box. The quickest way to do
that is to double-click on their names in the Factors &
Covariates box. After entering all of the main effects, you can begin
building interaction terms. To build the interactions, click on the arrow
facing downwards in the Build Term(s) section and select
interaction, as shown in the figure above. After you have selected the
interaction, you can click on the names of the variables with which you
would like to build an interaction, then click on the arrow facing right
under the Build Term(s) heading. In the above example, the
educ*gender term has already been created. The
salbegin*gender and salbegin*educ terms can be created by
highlighting two variables at a time as shown above, then clicking on the
right-facing arrow. Some of the other options in the Build Term(s)
list that you may find useful are the All n-way options. For
example, if you highlighted all three variables in the Factors &
Covariates box, you could create all three of the possible 2-way
interactions by selecting the All 2-way option from the Build
Term(s) drop-down menu, then clicking the right-facing arrow.
If you are testing the homogeneity of regression slopes assumption, you
should examine your group by covariate interactions, as well as any
covariate by covariate interactions. In order to meet the ANCOVA
assumption, these interactions should not be significant. Examining the
output from the example above, we expect to see nonsignificant effects for
the gender*educ and the gender*salbegin interaction
effects:
[Figure: output table showing the tests of the interaction effects]
Examining the interaction effects, you can see that all three were
nonsignificant. The gender*salbegin effect has a small F
statistic (.660) and a large significance value (.417), the
gender*educ effect also has a small F statistic (1.808)
and a large significance value (.369), and the salbegin*educ effect
also has a small F statistic (1.493) and a large significance level
(.222). Because all of these significance levels are greater than .05, the
homogeneity of regression slopes assumption has been met and you can
proceed with the ANCOVA.
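The homogeneity of regression slopes assumption can be pictured by fitting a separate regression line within each group and comparing the slopes. The sketch below does this descriptively in Python for made-up data; the group labels and values are hypothetical, and the code does not reproduce the F tests for the interaction terms that SPSS reports.

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope for predicting y from the covariate x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical (covariate, outcome) values for three groups.
groups = {
    "group_a": ([1, 2, 3, 4], [3, 5, 7, 9]),
    "group_b": ([1, 2, 3, 4], [6, 8, 10, 12]),
    "group_c": ([1, 2, 3, 4], [4.0, 4.5, 5.0, 5.5]),
}

slopes = {name: slope(x, y) for name, (x, y) in groups.items()}
for name, value in slopes.items():
    print(name, round(value, 2))
```

Here group_a and group_b have parallel regression lines, which is what the assumption requires, while group_c has a flatter line. A significant factor by covariate interaction in the SPSS output signals slope differences of the kind shown by group_c.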
Knowing that the model does not violate the homogeneity of regression
slopes assumption, you can remove the interaction terms from the model by
returning to the GLM Univariate dialog box,
clicking the Model button, and selecting Full Factorial.
This will return the model to its default form, in which there are no
interactions with covariates. After you have done this, click OK in
the GLM Univariate
dialog box to produce the following output:
[Figure: GLM Univariate output table]
The default output for the univariate general linear model contains all
main effects and interactions between fixed factors. The above output
contains no interactions because gender is the only fixed factor. Each
factor, covariate, or other source of variance is listed in the left
column. For each source of variance, there are several test statistics. To
evaluate the influence of each independent variable, look at the F
statistic and its associated significance level. Examining the first
covariate, education level, the F statistic (37.87) and its
associated significance level (.000) indicate that it has a significant
linear relationship with the dependent variable. The second covariate,
salbegin, also has a significant F statistic (696.12), as can
be seen from its associated significance level (.000). In both cases, this
indicates that the values of the dependent variable, salary,
increase as the values of education level and beginning salary
increase. The next source of variance, gender, provides us with a test of
the null hypothesis that there are no differences between gender groups,
or more specifically, that there are no differences between men's and
women's salaries. This test provides a small F statistic (3.87) and a
significance level that is not statistically significant (p = .05).
In the above model containing education level and beginning salaries as
covariates, we are not able to say that there is a statistically
significant difference between men's and women's salaries. That is, when
the model takes into account the variance accounted for by education level
and beginning salaries, the variance that can be uniquely attributed to
gender is not significantly different from zero.
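The way a covariate changes a group comparison can be sketched with the standard adjusted-means calculation for a single covariate: each group mean is shifted along a pooled within-group slope to the grand mean of the covariate. The Python below is an illustration under stated assumptions, not the SPSS algorithm. The data are made up so that both groups follow the same line (outcome = 10 + 2 * covariate), and the group labels and function name are hypothetical.

```python
from statistics import mean


def adjusted_means(groups):
    """groups maps a label to (covariate values, outcome values)."""
    all_x = [xi for x, _ in groups.values() for xi in x]
    grand_mean_x = mean(all_x)

    # Pooled within-group slope of the outcome on the covariate.
    sxy = 0.0
    sxx = 0.0
    for x, y in groups.values():
        mean_x, mean_y = mean(x), mean(y)
        sxy += sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx += sum((xi - mean_x) ** 2 for xi in x)
    pooled_slope = sxy / sxx

    # Move each group mean to the grand mean of the covariate.
    return {
        label: mean(y) - pooled_slope * (mean(x) - grand_mean_x)
        for label, (x, y) in groups.items()
    }


# Hypothetical data in thousands of dollars: (beginning, current) salary.
data = {
    "men": ([20, 22, 24, 26], [50, 54, 58, 62]),
    "women": ([14, 16, 18, 20], [38, 42, 46, 50]),
}

raw = {label: mean(y) for label, (_, y) in data.items()}
adjusted = adjusted_means(data)
print(raw)
print(adjusted)
```

In this constructed example the raw means differ by 12, but the adjusted means are identical because the entire gap is explained by the difference in the covariate. This is the pattern behind a group effect that shrinks once covariates are included in the model.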
The repeated measures version of the general linear model has many
similarities to the univariate model described above. However, the key
difference between the models is that there are multiple measurement
occasions of the dependent variable in repeated measures models, whereas
the univariate model only permits a single dependent variable. You could
fit a similar model with repeated measurements by using beginning
salaries and current salaries as the repeated measurement occasions.
To conduct this analysis, you should select the Repeated
Measures option from the General Linear Model submenu of the
Analyze menu:
Analyze
General Linear Model
Repeated Measures...
Selecting this option will produce the following dialog box:
[Figure: dialog box for defining within-subject factors]
This dialog box is used for defining the repeated measures, or
within-subjects, dependent variables. You first give the within-subject
factor a name in the box labeled Within-Subject Factor Name. This
name should be something that describes the dependent variables you are
grouping together. For example, in this dialog box, salaries are being
analyzed, so the within-subject factor was given the name
salaries. Next, specify the number of levels, or number of
measurement occasions, in the box labeled Number of Levels. This is
the number of times the dependent variable was measured. Thus, in the
present example, there are two measurement occasions for salary because
you are measuring beginning salaries and current salaries. After you have
filled in the Within-Subject Factor Name and the Number of
Levels input boxes, click the Add button, which will transfer
the information in the input boxes into the box below. Repeat this process
until you have specified all of your within-subject factors. Then, click
on the Define button, and the following dialog box will appear:
[Figure: Repeated Measures dialog box]
When this box initially appears, you will see a slot for each level of
the within-subject factor variables that you specified in the previous
dialog box. These slots are labeled numerically for each level of the
within-subject factor but do not contain variable names. You still need to
specify which variable fills each slot of the within-subject factors. To
do this, click the variable's name in the variable list on the left side
of the dialog box. Next, click on the arrow pointing towards the
Within-Subject Variables box to move the variable name from
the list to the top slot in the within-subjects box. This process has been
completed for salbegin, the first level of the salaries
within-subject factor. The same process should be repeated for
salary, the variable representing an employee's current salary.
After you have completed the specifications for the within-subjects
factors, you can define your independent variables. Between-subjects
factors, or fixed factors, should be moved into the box labeled
Between-Subjects Factor(s) by first clicking on the variable name
in the variable list, then clicking on the arrow to the left of the
Between-Subjects Factor(s) box. In this example, gender has been
selected as a between-subjects factor. Covariates, or continuous predictor
variables, can be moved into the Covariates box in the same manner
as the between-subjects factors. Above, educ, the variable
representing employees' number of years of education, has been specified
as a covariate.
Running this analysis will produce several output tables, but we will
focus here on the tables describing between-subject and within-subject
effects. However, the univariate analysis of variance tables may not
always be appropriate. The univariate tests have an additional
assumption: the assumption of sphericity. If this assumption is violated,
you should use the multivariate output or adjust your results using one
of the correction factors in the SPSS output. For a more detailed
discussion of this topic, see the usage note, Repeated Measures ANOVA
Using SPSS MANOVA, in the section, "Within-Subjects Tests: The Univariate
versus the Multivariate Approach." This usage note can be found at
http://www.utexas.edu/cc/rack/stat.html.
The following output contains the statistics for the effects in the
model specified in the above dialog boxes:
[Figure: output table of within-subjects effects]
This table contains information about the within-subject factor,
salaries, and its interactions with the independent variables. The
main effect for salaries is a test of the null hypothesis that all levels
of the within-subjects factor are equal, or, more specifically, it is a
test of the hypothesis that beginning and current salaries are equal. The
F statistic (18.91) and its associated significance level (p
< .001) indicate that you can reject this hypothesis. In other
words, it appears that there is a statistically significant difference
between beginning salaries and current salaries.
After you have tested this hypothesis, you can then investigate whether
the increase in salaries is the same across all values or levels of the
other independent variables that are included in the model. The first
interaction term in the table tests the hypothesis that the increase in
salaries is constant, regardless of educational background. The F
statistic (172.10) and its associated significance level (p <
.001) allow us to reject this hypothesis as well. The knowledge that this
interaction is significant indicates that it is worthwhile to examine
characteristics of the interaction. Here, the interaction reflects the
fact that employees with higher levels of education received greater pay
raises than those with lower levels of education. However, you
should always investigate the properties of your interactions through
graphical displays, mean comparisons, or statistical tests because a
significant interaction can take on many forms. Finally, the second
interaction term tests the null hypothesis that the increase in salaries
does not differ by gender. The F statistic (25.07)
and its significance level (p < .001) indicate that the increase
in salaries does vary by gender.
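When a within-subject factor has only two levels and the model contains no between-subjects factors or covariates, the F test for the within-subject main effect is equivalent to a paired-samples t test: F equals t squared. The Python sketch below checks this on made-up data. The values and function names are hypothetical, and because the model in this tutorial also includes gender and educ, its F statistics will not match a plain paired t test on the same variables.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired-samples t statistic for two measurement occasions."""
    diffs = [b - a for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def within_subject_f(first, second):
    """Repeated measures F for a single two-level within-subject factor."""
    n = len(first)
    grand = mean(first + second)
    occasion_means = [mean(first), mean(second)]
    subject_means = [(a + b) / 2 for a, b in zip(first, second)]

    # Variation between the two measurement occasions (1 df).
    ss_occasion = n * sum((m - grand) ** 2 for m in occasion_means)

    # Occasion-by-subject variation, used as the error term (n - 1 df).
    ss_error = 0.0
    for occasion, m in zip((first, second), occasion_means):
        for value, s in zip(occasion, subject_means):
            ss_error += (value - s - m + grand) ** 2

    return ss_occasion / (ss_error / (n - 1))


# Hypothetical salaries (in thousands) for five employees.
begin = [10, 12, 9, 15, 11]
current = [14, 15, 13, 22, 13]

t_stat = paired_t(begin, current)
f_stat = within_subject_f(begin, current)
print(round(t_stat ** 2, 3), round(f_stat, 3))
```

The two printed values agree, which shows that the within-subject main effect in this simple case is a test of the mean difference between the two occasions.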
[Figure: output table of between-subjects effects]
The output for the repeated measures general linear model also provides
statistics for between-subject effects. In this example, the model
contains two between-subjects effects: employees' education level and
their gender. Education level was entered as a covariate in the model, and
therefore the statistics associated with it are a measure of the linear
relationship between education level and salaries. In contrast, the
statistics for the between-subjects factor, gender, represent a
comparison between groups across all levels of the within-subjects
factor. Specifically, it is a comparison between males and females on
their salaries averaged over the beginning and current measurements. In
the above example, both education level and gender are statistically
significant. The F statistic (277.96) and significance level (p <
.001) associated with education level allow us to reject the null
hypothesis that there is not a linear relationship between education and
salaries. By rejecting the null hypothesis, you can conclude that there is
a positive linear relationship between the two variables, indicating that
as number of years of education increases, salaries do as well. The
F statistic (55.79) for gender and its associated
significance level (p < .001) represent a test of the null
hypothesis that there are no group differences in salaries. The
significant F statistic indicates that you can reject this null
hypothesis and conclude that there is a statistically significant
difference between men's and women's salaries.
To learn more about SPSS,
please proceed to the next
SPSS
tutorial.
14 September 2001 Statistical Support, a division of Research Consulting at ITS
Send us e-mail at stats@its.utexas.edu
Copyright 2001, UT Austin