Upper Midwest Environmental Sciences Center

LTRM Statistics

Estimating Means and Standard Errors from LTRM Survey Data

 

Introduction

The LTRM collects data at sampling locations that have been selected both probabilistically and nonprobabilistically (in LTRM parlance, "stratified random" and "fixed-site" data, respectively). Sample information from probabilistically selected locations may be used to make inferences about the populations from which those samples were derived. For example, the prevalence of submersed aquatic vegetation may be estimated for the entire population of sample units (generally defined for an entire reach) by combining data collected at probabilistically selected locations with information about the sampling design. Our "fixed-site" data do not permit such design-based generalizations. For this reason, this web site is primarily concerned with the use of sample information from probabilistically selected sites.

The LTRM estimates annual means and associated standard errors by relying on the sampling design (rather than on distributional assumptions about the observed data). These so-called “design-based” methods accommodate complexities often associated with survey designs, including stratification and nonproportional sampling ("strata" represent subpopulations from which independent samples are drawn). A useful comparison of design- and model-based methods for the analysis of survey data is provided by Lohr (1999).

The LTRM does not presently adjust statistics from the biological components for detection probabilities or capture efficiencies. Consequently, statistics from the Program's biological components are more properly termed index statistics. Index statistics are presumed to be correlated with parameters of interest (e.g., abundance, percent frequency of occurrence). However, because index statistics have not been adjusted for variation in detection probabilities, changes in index statistics cannot be explicitly differentiated from changes in detection probabilities. Further information on index statistics and detection probabilities is provided in Thompson et al. (1998).

The LTRM treats missing data as missing completely at random (MCAR). MCAR means that both missing and nonmissing data arose from the same distribution. Data may be missing because, for example, a site could not be visited (owing to flow conditions, say) or because a sample was taken but the sample data were subsequently lost. Because the proportions of missing data within LTRM datasets are typically small, we believe that violations of the MCAR assumption will typically have inconsequential effects. Exceptions are addressed on a case-by-case basis in the application web pages.

Sample Inclusion Probabilities

A sample inclusion probability is the probability that an individual population unit—for the LTRM, a grid point—is selected for sampling. Example: 20 grid points from each of strata i and j are selected using simple random sampling. If the population sizes of these strata are 1000 and 2000 grid points, then the sample inclusion probabilities are 20/1000 = 2% and 20/2000 = 1%, respectively. For inclusion probabilities to be constant across strata (i.e., “proportional to size”), the number of grid points selected would need to be directly proportional to the strata sizes (e.g., select 10 and 20 units from strata i and j, respectively). For a given component, these inclusion probabilities have varied across strata (within a given pool) but, with few exceptions, have been essentially constant within strata.

Sampling Weights

The LTRM uses sampling weights to adjust for nonproportional sampling. Sampling weights for the LTRM are generally defined as the inverses of the sample inclusion probabilities, and they may also be viewed as the number of potentially sampleable units represented by a given sampled unit. Continuing with the previous example, each sampled unit in strata i and j may be viewed as representing 50 and 100 (i.e., 1000/20 and 2000/20) potentially sampleable units, respectively.
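The arithmetic in the two examples above can be sketched in a few lines of Python. The function names are hypothetical, and the stratum sizes and sample counts are the illustrative values from the text rather than actual LTRM population sizes; simple random sampling within each stratum is assumed.

```python
# Illustrative sketch only: function names are hypothetical, and the numbers
# are the worked example from the text, not LTRM population sizes.

def inclusion_probability(n_sampled: int, population_size: int) -> float:
    """Probability that a given unit is selected under simple random sampling."""
    if not 0 < n_sampled <= population_size:
        raise ValueError("need 0 < n_sampled <= population_size")
    return n_sampled / population_size


def sampling_weight(n_sampled: int, population_size: int) -> float:
    """Inverse of the inclusion probability: units represented by one sampled unit."""
    return 1.0 / inclusion_probability(n_sampled, population_size)


# stratum -> (number of grid points sampled, number of grid points in the stratum)
strata = {"i": (20, 1000), "j": (20, 2000)}

for name, (n, N) in strata.items():
    p = inclusion_probability(n, N)
    w = sampling_weight(n, N)
    print(f"stratum {name}: inclusion probability = {p:.2%}, weight = {w:.0f}")
```

Selecting 10 and 20 units from strata i and j instead would give both strata the same inclusion probability (1%) and hence the same weight (100).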

In some instances, locations selected for sampling by the LTRM were not sampled. This might have occurred, for example, when the intended sampling location was inaccessible. At present, the LTRM handles these missing observations either by treating them as missing completely at random (vegetation component) or by substituting predefined alternate locations (all other components). If unsampled locations were either (1) not missing completely at random or (2) not interchangeable with the alternate locations for the given metric, then we may expect our reported statistics to reflect bias of unknown magnitude. The LTRM does not otherwise adjust for missing data, and sample inclusion probabilities are estimated using the observed rather than the intended number of sample units.

Sample inclusion probabilities and sampling weights by strata, component, and reach are calculated using the number of sampling observations and the corresponding population sizes. Population sizes are provided below in both pdf and Excel format.

Population units (Excel file) (pdf file)

Estimating Design-based Means and Standard Errors

For the LTRM, means are adjusted for nonproportional sampling, and standard errors are adjusted for both nonproportional sampling and stratification. Design-based means and standard errors are estimated using the SAS SURVEYMEANS procedure (proc surveymeans); further technical details are provided in SAS (2003).
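As a rough illustration of what these adjustments involve, the following Python sketch computes a textbook stratified mean and standard error (see, e.g., Lohr 1999). It is not the LTRM's code and is not claimed to reproduce proc surveymeans output; the function name, stratum labels, and data values are hypothetical, and simple random sampling with at least two observations per stratum is assumed.

```python
from math import sqrt
from statistics import mean, variance


def stratified_mean_se(samples, population_sizes, use_fpc=False):
    """Textbook stratified estimator of a population mean and its standard error.

    samples: dict mapping stratum -> list of observations (at least 2 each)
    population_sizes: dict mapping stratum -> number of population units N_h
    use_fpc: apply the finite population correction (1 - n_h / N_h) if True
    """
    total_units = sum(population_sizes[h] for h in samples)
    estimate = 0.0
    var_estimate = 0.0
    for h, y in samples.items():
        n_h = len(y)
        share = population_sizes[h] / total_units  # stratum share of population
        fpc = 1.0 - n_h / population_sizes[h] if use_fpc else 1.0
        estimate += share * mean(y)
        var_estimate += share ** 2 * fpc * variance(y) / n_h
    return estimate, sqrt(var_estimate)


# Hypothetical data: equal sample sizes drawn from strata of unequal size.
samples = {"i": [2.0, 4.0, 6.0], "j": [10.0, 12.0, 14.0]}
population_sizes = {"i": 1000, "j": 2000}

est, se = stratified_mean_se(samples, population_sizes)
unweighted = mean(samples["i"] + samples["j"])
print(f"design-based mean = {est:.3f} (SE {se:.3f}); unweighted mean = {unweighted:.3f}")
```

Because stratum j holds twice as many units as stratum i but was sampled at the same intensity, the design-based mean is pulled toward the stratum j observations, whereas the unweighted mean is not.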

Comments by sampling component:

Fish

  • Designs generally include both spatial and temporal strata.
  • Sample sizes within spatial and temporal strata have typically been small (n < 10). Consequently, means are typically reported by reach and sampling year.
  • Sampling locations at wing dams and in tailwaters are not selected using the methods used for the larger strata (Gutreuter et al. 1995). Consequently, information about wing dam and tailwater sampling is excluded from annual means reported by the fish component.
  • Further information about statistics used by the LTRM's fish component is provided in Gutreuter (1993).

Macroinvertebrate

  • The macroinvertebrate component defined strata in space only. Sample sizes have typically been moderate in backwater and impounded strata (n ~ 50), intermediate in the side channel stratum (n ~ 20), and small in the main channel border stratum (n ~ 10). Means of macroinvertebrate outcomes will generally be estimated by reach and year.
  • Further information about statistics used by the LTRM's macroinvertebrate component is provided in Sauer (1998).

Vegetation

  • The vegetation component defines strata in space only and, with the exception of the percent cover variable, uses a cluster design. Sample sizes vary considerably by strata but are often large (n > 50). Means and standard errors will generally be estimated separately for each taxa group, reach, and sampling year.
  • The vegetation component’s sampling frame for all pools was revised in 1999. Consequently, means and standard errors from 1998 (all pools) represent subpopulations (see below).
  • Sampling date is confounded with strata in Pool 8. Consequently, strata-specific means are confounded with sampling date.
  • Further information about statistics used by the LTRM's vegetation component is provided in Yin et al. (2001).

Water quality

  • The water quality component uses a spatially stratified sampling design within each of the four seasons. As interest is typically in season- and strata-specific estimates, means and standard errors will generally be estimated separately for each reach, season and stratum.
  • Means for Pool 26 exclude Swan Lake (which is not technically part of Pool 26).
  • Due to frequent problems with missing data, estimates from Navigation Pool 26's Swan Lake should be viewed with caution.

Means of Subpopulations

Estimating means and standard errors from a subpopulation not defined by the design typically requires methods that acknowledge that the number of samples in the subpopulation is a random variable (Thompson 2002). See "Estimating a mean from a portion of one or more strata."
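One common way to acknowledge the random subpopulation sample size is to treat the subpopulation mean as a ratio estimator and to approximate its variance by Taylor linearization. The Python sketch below does this for the simplest case, a simple random sample from a single stratum. It is an assumption-laden illustration, not necessarily the method used on the page referenced above; the function name and data are hypothetical.

```python
from math import sqrt


def domain_mean_se(values, in_domain, population_size=None):
    """Mean of a subpopulation (domain) from a simple random sample, with a
    linearization standard error that treats the domain sample size as random.

    values: observations for all n sampled units
    in_domain: booleans flagging which sampled units fall in the subpopulation
    population_size: total units N in the sampled population (optional; when
        given, the finite population correction 1 - n / N is applied)
    """
    n = len(values)
    domain_values = [y for y, flag in zip(values, in_domain) if flag]
    n_d = len(domain_values)
    if n < 2 or n_d < 1:
        raise ValueError("need n >= 2 and at least one unit in the domain")
    domain_mean = sum(domain_values) / n_d
    # Residuals are (y - domain mean) inside the domain and 0 outside it.
    ss_resid = sum((y - domain_mean) ** 2 for y in domain_values)
    fpc = 1.0 - n / population_size if population_size else 1.0
    var_estimate = fpc * (n / (n - 1)) * ss_resid / n_d ** 2
    return domain_mean, sqrt(var_estimate)


# Hypothetical data: six sampled units, three of which fall in the domain.
values = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
in_domain = [True, True, True, False, False, False]
m, se = domain_mean_se(values, in_domain)
print(f"domain mean = {m:.2f}, SE = {se:.3f}")
```

For designs with several strata, the same idea applies stratum by stratum, but the bookkeeping is more involved and is best left to survey software.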

Species Richness

Estimating species richness is generally beyond the scope of this document. Methods for estimating species richness that adjust for species-specific detection probabilities are reviewed by MacKenzie et al. (2005).

Finite Population Correction Factors

Survey statisticians typically adjust standard errors when large proportions (>10%) of potential sampling units are sampled. As these proportions increase, we become increasingly sure about our estimates—and, hence, standard errors should decrease and confidence interval widths narrow. Corrections for sampling large proportions of sampling sites are termed finite population corrections.
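For a simple random sample of n units drawn from a population (or stratum) of N units, the standard textbook form of this correction can be written as follows, where s² denotes the sample variance. The notation is introduced here for illustration and does not appear elsewhere on this page.

```latex
\[
\widehat{\mathrm{SE}}_{\mathrm{fpc}}(\bar{y})
  = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^{2}}{n}}
  = \sqrt{1 - \frac{n}{N}}\;\widehat{\mathrm{SE}}(\bar{y})
\]
```

When the sampling fraction n/N is below 10%, the multiplier exceeds 0.94, so the adjustment changes the standard error by only a few percent.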

Sampling fractions are typically low (<10%) for all LTRM components. Occasional exceptions are seen for the water quality component during 1993 through winter 1994-95 (a design change occurred in spring 1995). 

Precision estimates reported by the water quality component for 1993 through winter 1994-95 do not currently adjust for finite population sampling. This approach reflects two considerations: the sampling proportions in the 1990s were a function of a sampling design that changed in 1995, and failing to adjust for finite population sampling yields conservative precision estimates (i.e., standard error estimates will be too large).
 
We demonstrate the effects of adjusting precision estimates for finite sampling using chlorophyll a and total suspended solids data from the side channel stratum (strat = 2), summer episode (episode = 3) and year 1994. Total numbers of sampling sites are provided at http://www.umesc.usgs.gov/ltrmp/stats/population_sizes.pdf.

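/* total=180 supplies the number of population units, which invokes the */
/* finite population correction; omit total= for unadjusted estimates.  */
proc surveymeans data=WQall total=180;
  var chlf ss;
  where fs = 3 and strat = 2 and episode = 3 and year = 1994;
run;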

Results (below) indicate that adjusting for finite sampling yielded modest decreases in both the standard error of the mean (SE) and the width of the 95% confidence interval on the mean (95% CI). For this stratum and year, approximately 17% of sites were sampled (30 of 180).

Variable    N     Mean     SE      95% CI

With finite sampling adjustment
CHLF        30    14.57    0.41    (13.73, 15.41)
SS          30    42.45    1.49    (39.40, 45.50)

Without finite sampling adjustment
CHLF        30    14.57    0.45    (13.64, 15.50)
SS          30    42.45    1.64    (39.08, 45.81)
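The adjusted values reported above can be approximately reproduced from the unadjusted ones. The Python sketch below multiplies each unadjusted SE by the finite population multiplier for n = 30 and N = 180 and then forms a t-based 95% interval. The function names are hypothetical, and the t quantile (0.975, 29 degrees of freedom) is hard-coded as an assumption about how the interval was formed. Because the inputs are the rounded values reported above, results may differ from the reported adjusted values in the last digit.

```python
from math import sqrt

# Assumed t quantile for a 95% interval with n - 1 = 29 degrees of freedom.
T_975_DF29 = 2.045


def fpc_adjust(se_unadjusted: float, n: int, N: int) -> float:
    """Scale an unadjusted standard error by sqrt(1 - n / N)."""
    return se_unadjusted * sqrt(1.0 - n / N)


def confidence_interval(mean: float, se: float, t_quantile: float):
    """Symmetric interval: mean plus or minus t_quantile * se."""
    half_width = t_quantile * se
    return mean - half_width, mean + half_width


n, N = 30, 180  # sampled sites and total sites in the stratum
reported = [("CHLF", 14.57, 0.45), ("SS", 42.45, 1.64)]  # name, mean, unadjusted SE

for name, mean, se in reported:
    se_adj = fpc_adjust(se, n, N)
    lower, upper = confidence_interval(mean, se_adj, T_975_DF29)
    print(f"{name}: adjusted SE = {se_adj:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```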



Confidence Intervals

Most survey-based confidence intervals on the mean, including those supplied by the LTRM, rely on a normality assumption for the mean. When the underlying data are not normally distributed, that assumption is justified only for large sample sizes. The definition of “large” depends on the skewness or asymmetry of the data and on the number of strata (Lohr 1999; Thompson 2002). Unless otherwise indicated, all data from the LTRM should be presumed nontrivially skewed (exceptions include some water quality metrics). Within-strata sample sizes have often been small (i.e., n < 30 to 50, and possibly < 10), as has the number of strata, which has ranged from approximately 15 in the fish component (10 beginning in 2005) to approximately 4 for metrics associated with the other components. For these reasons, confidence intervals on the mean that rely on a normality assumption (whether supplied by the LTRM or otherwise) should be viewed as approximate. Exceptions arise where the underlying data may be assumed approximately symmetric (e.g., percent frequencies of occurrence near 50%) or where sample sizes are large. Alternative methods of estimating confidence intervals are described by Lohr (1999).


References

Gutreuter, S. 1993. A statistical review of sampling of fishes in the Long Term Resource Monitoring Program. National Biological Survey, Environmental Management Technical Center, Onalaska, Wisconsin, December 1993. EMTC 93-T004. 15 pp. (NTIS PB94-150828) 

Gutreuter, S., R. Burkhardt, and K. Lubinski. 1995. Long Term Resource Monitoring Program Procedures: Fish monitoring. National Biological Service, Environmental Management Technical Center, Onalaska, Wisconsin, July 1995. LTRMP 95-P002-1. 42 pp. + Appendixes A-J.

Lohr, S. L. 1999. Sampling: Design and analysis. Duxbury Press Publishing Company, Pacific Grove, California.

MacKenzie, D. I., J. D. Nichols, N. Sutton, K. Kawanishi, and L. L. Bailey. 2005. Improving inferences in population studies of rare species that are detected imperfectly. Ecology 86:1101–1113.

SAS Institute Inc. 2003. SAS OnlineDoc® 9.1. SAS Institute Inc., Cary, North Carolina.

Sauer, J. 1998. Temporal analyses of select macroinvertebrates in the Upper Mississippi River System, 1992-1995. U.S. Geological Survey, Environmental Management Technical Center, Onalaska, Wisconsin, April 1998. LTRMP 98-T001. 26 pp. + Appendix. (NTIS PB98-140874) 

Snijders, T. A. B., and R. J. Bosker. 1999. Multilevel analysis. Sage, London. 266 pp.

Thompson, S. K. 2002. Sampling. Second edition. Wiley & Sons, New York.

Thompson, W. L., G. C. White, and C. Gowan. 1998. Monitoring vertebrate populations. Academic Press, San Diego, California.

Yin, Y., H. Langrehr, T. Blackburn, M. Moore, J. Winkelman, R. Cosgriff, and T. Cook. 2001. 1998 annual status report: Submersed and rooted floating leaf vegetation in Pools 4, 8, 13, and 26 and La Grange Pool of the Upper Mississippi River System. U.S. Geological Survey, Upper Midwest Environmental Sciences Center, La Crosse, Wisconsin, May 2001. LTRMP 2001-P001. 9 pp. + Appendix + Chapters 1-5. (DTIC ADA392067)

Contact: Further information about estimating means and standard errors from LTRM data may be obtained from Brian Gray, LTRM statistician, Upper Midwest Environmental Sciences Center, La Crosse, Wisconsin, at brgray@usgs.gov.

URL: http://www.umesc.usgs.gov/lltrmp/stats/means.html
Page Contact Information: Contacting the Upper Midwest Environmental Sciences Center
Page Last Modified: June 1, 2016