SOME NON-RESPONSE SAMPLING THEORY FOR TWO STAGE DESIGNS

by GEORGE THOMAS FORADORI

Institute of Statistics Mimeograph Series No. 297
November, 1961

ERRATA SHEET

Page 23: Seventh line, second word: "now" instead of "not".
Page 26: Formula on line 8 should have [illegible] instead of [illegible].
Page 29: Second line below formula (2.17) should have "now" instead of "not".
Page 29: Formula 5 lines below formula (2.17): [illegible] instead of [illegible].
Page 32: First line of formula (2.26): [illegible] instead of "$y_i$".
Page 33: Formula 3 lines from bottom of page should read: [illegible].
Page 34: Formula (2.30): [illegible] instead of [illegible].
Page 34: Formula (2.31): [illegible] instead of "$N_i$".
Page 70: $f_i^{-1}$ instead of [illegible] on lines 11, 12, 14.
Page 71: $f_i^{-1}$ on all graphs instead of $f_i$.
Page 72: $f_i^{-1}$ on all graphs instead of $f_i$.
Page 73: $f_i^{-1}$ on all graphs instead of $f_i$.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1.0 THE PROBLEM
    1.1 Introduction
    1.2 Review of Literature
    1.3 Notation
2.0 THE THEORY OF SAMPLING WITH NON-RESPONSE AT THE SECOND STAGE
    2.1 Sampling with Unequal Probabilities and without Replacement at the First Stage
        2.1.1 The Sampling Procedure
        2.1.2 An Estimator of the Total
        2.1.3 Variance of the Estimator, $\hat{T}_1$
        2.1.4 An Unbiased Estimator of $V(\hat{T}_1)$
    2.2 Sampling with Unequal Probabilities and with Replacement at the First Stage
        2.2.1 The Sampling Procedure
        2.2.2 An Estimator of the Total
        2.2.3 Variance of the Estimator, $\hat{T}_2$
        2.2.4 An Unbiased Estimator of $V(\hat{T}_2)$
    2.3 A Comparison of $V(\hat{T}_1)$ and $V(\hat{T}_2)$ Where the FSUs Are Selected with Equal Probability for Each Case
3.0 OPTIMUM SAMPLE ALLOCATION FOR NON-RESPONSE
    3.1 A Cost Function
    3.2 Sampling with Replacement at the First Stage
        3.2.1 Fixed Variance and Minimum Cost
        3.2.2 Fixed Cost and Minimum Variance
    3.3 Comments on the Optimum Solutions
    3.4 Sampling Without Replacement at the First Stage With Equal First Stage Inclusion Probabilities
    3.5 Relative Efficiency
    3.6 Application of Theory to Data
    3.7 Extension to Stratified Case
    3.8 Some Graphic Solutions for Optimum Recall Rates
4.0 TWO STAGE NON-RESPONSE THEORY EXTENDED TO MULTIPHASE SAMPLING
    4.1 Sampling with Unequal Probabilities and without Replacement at the First Stage
        4.1.1 An Estimator of the Total
        4.1.2 Variance of the Estimator, $\hat{T}_{1E}$
        4.1.3 Optimum Allocation
    4.2 Sampling with Unequal Probabilities and with Replacement at the First Stage
        4.2.1 An Estimator of the Total
        4.2.2 Variance of the Estimator, $\hat{T}_{2E}$
        4.2.3 Optimum Allocation
    4.3 Comments on the Solutions
    4.4 Extension of the Theory to Stratified Sampling
5.0 SUMMARY AND CONCLUSIONS
    5.1 Summary of Results
    5.2 Summary of Conclusions
    5.3 Suggestions for Further Research
LIST OF REFERENCES

LIST OF TABLES

1. Notation relevant to unstratified single phase sampling
2. Notation relevant to unstratified multiphase sampling
3. Population data
4. Population values of respondent SSUs to various mailings for the i-th FSU

LIST OF FIGURES

1. Contours of equal $f_i^{-1}$ for [illegible] = 1 and $R_i$ = .25
2. Contours of equal $f_i^{-1}$ for [illegible] = 1 and $R_i$ = .50
3. Contours of equal $f_i^{-1}$ for [illegible] = 1 and $R_i$ = .75

1.0 THE PROBLEM

1.1 Introduction

The theory of statistical sampling is principally concerned with the measurement of variation in an estimate based on the ultimate units selected in the context of the frame used to identify these units and the probability system constructed for the selection of these units. The variation in an estimate arising purely from the operation of sampling is called sampling error and arises because not every sampling unit in the population is enumerated. However, through the use of probability theory, appropriate to the selection scheme, the variation can be assessed and even controlled.

For those sample surveys where measurements of the ultimate sampling units depend on the response from human beings, difficulty may be encountered in enumerating the selected units. Thus, a sample drawn properly, according to some known probability scheme, may admit of errors due to such non-response.

Certainly, the general class of non-sampling errors includes much more than non-response errors. Items such as interviewer bias, coding errors, false or erroneous replies and plain mistakes also fall into this class. In essence, we shall include all variations other than the random variations introduced by the selection process itself in the class of non-sampling errors. In this thesis, the non-response errors will be singled out for critical analysis. Unless otherwise stated, the term bias used in this thesis will refer to that arising from non-response.

A non-response situation will be said to exist when some of the selected sample units fail to be enumerated.
It should be noted that the problem of non-response is generally greater in surveys conducted by mail than in those conducted by personal interview methods. Notwithstanding, the significantly lower cost of mail studies may make their use desirable.

In the context of this thesis, the reasons for such incompleteness are of no consequence. Rather, we will consider the theoretical implications of non-response and the subsequent impact it has on sample design. The primary assumption made is that by expenditure of sufficient funds, all selected units can be contacted and enumerated. This is not absolutely true in practice but it is a sufficiently close approximation to the real situation that departures can be ignored as having insignificant effects on the results.

One simple, if questionable, means of avoiding modifications in the development of sampling theories brought about by non-response is to postulate complete homogeneity among both responding and non-responding population units. This being the case, one need merely substitute another sampling unit (chosen according to some a priori probability system) for any non-responding one. The net effect is to attain the predetermined sample size prescribed by the sampling plan. However, to ascribe homogeneity to both respondents and non-respondents without strong evidence is not realistic in most instances, especially in human populations. Such a replacement scheme will usually introduce some bias in the estimates, except in cases where the measured characteristics among responding and non-responding units in the population are, in fact, homogeneous.

A more nearly acceptable means of correcting for non-response is to expend some effort in the direction of the non-respondents themselves. Thus, by subsampling the non-respondents one or more times (phases) and improving the interviewing effort, some representation of the latter group can be managed.
This has the advantage that by appropriate modifications to the usual estimators of population values, the subsample results can be used to provide unbiased estimates and, in some cases, to reduce mean square error.

In most surveys and even for particular questions in multi-purpose surveys the value of the measured characteristic of the sampling units will be correlated with, if not have a direct effect on, the likelihood of response. Such a situation makes it imperative that measures be taken to elicit a higher proportion of responses from the selected units. Only in this manner can bias be entirely eliminated from subsequent estimates of population values.

Reduction of bias can sometimes be accomplished empirically by utilizing accumulated data on both respondents and non-respondents. Thus, historical data are sometimes used for "adjustment" purposes. Other methods include better interviewers and training methods, continued recall, improved questionnaires, or any combinations of these and similar techniques.

When methods for collecting information from human populations are modified, there is the possibility of introducing further biases into the responses themselves. In fact, the non-response segment of a given population may actually change. Such problems as these are merely indicated in this thesis. The results obtained herein postulate their virtual non-existence. Much work has been done in this area and is to be found in the literature of statistics, psychology, and sociology. By proper attention to the techniques developed in these disciplines, the effective bias introduced by attempts to increase the response rate through modification of procedure can be minimized.

As of the present writing all published work on the statistical aspects of the non-response problem has been confined to single stage sampling plans.
It is the primary purpose of this thesis to:

(1) examine the effects of non-response on the most common estimators in light of some non-response population models.
(2) extend present procedures to both stratified and unstratified two-stage sampling. This will include the optimum allocation of effort to various sample stages and phases.
(3) develop some unbiased estimators of variance for the two-stage one-phase case.

In the main, it is the work of Hansen and Hurwitz (1946) and El-Badry (1956) which will be extended. Where earlier authors have neglected to find variance estimators, such will be derived where possible.

It should be stressed that the plans considered in this study are intended to utilize the information available to the investigator in order to achieve maximum economic efficiency; i.e., to secure estimates of population values which are as free as possible of non-response bias and which have maximum precision for the available funds or which minimize cost while attaining a required precision.

1.2 Review of Literature

In conducting sample surveys of human populations in which the sampling units (households, persons) under study have the choice of not responding, the risk of bias is ever present. However, choice on the part of the individual selected is not the only criterion demanding consideration of bias. Individuals who are not at home or who live in out of the way places also present the possibility of non-response.

A number of studies are available showing some of the personal, educational and other traits which have been found correlated with the willingness to respond or ability to contact certain units. Several of these investigations will be discussed below followed by a discussion of studies concerned with the statistical and design aspects of the problem.

Pace (1939) made a study of former students to discover the attitudes of a representative group of young adults 5 to 13 years out of school. Comparison of
the respondents and non-respondents showed that whether the respondent actually graduated and the number of years completed were both important factors influencing the decision to respond immediately.

Stanton (1939) studied 11,169 school teachers, inquiring about their ownership and use of classroom receiving equipment and other radio facilities. He found that a higher proportion of those having such equipment answered the initial inquiry. Those not having such facilities responded mostly to follow-up questionnaires.

In a survey of listeners to child training broadcasts in Iowa City, Suchman and McCandless (1940) found that willingness to respond was highest among regular listeners and from individuals most interested in the subject of the inquiry. Educational level was also related to willingness to respond. The study itself was carried out in three phases: first, a mail questionnaire; second, a follow-up mail questionnaire; and, finally, a telephone interview of a random subsample of the non-respondents to the earlier inquiries.

Gaudet and Wilson (1940) reported on a study consisting of initial interviews on 2,800 individuals followed by reinterview of a random sample of 1,800 of the original sample. They concluded that the refusal rate was negatively correlated to the educational level. A second non-responding group consisting mainly of industrial workers was not available for interview.

Shuttleworth (1940) conducted a study on the employment status of technology and chemistry graduates in New York City. The survey consisted of first and second mail questionnaires and personal interviews of the non-respondents. On the basis of the results, it was concluded that unemployment was more prevalent among the non-respondents than among the respondents at the initial phase.
Reporting on a study of over 4,900 completed interviews of consumer requirements, Hilgard and Payne (1944) have indicated the magnitude of bias resulting from the respondents successfully contacted at the initial phase. Subsequent interviews were made on different days of the week and at various hours of the day. In this manner, it was found that housewives with young children and those who worked only in the home were contacted more easily than those without children or who worked outside the home.

Wallace (1947) reported on a study of Time subscribers comparing the characteristics of mail respondents and personal interview results on the non-respondents. It was found that there were few important differences between the two groups. There was, however, a significantly higher proportion of college educated persons among the mail respondents. Because no statistically significant differences were found between the groups with respect to marital status, income, residence, ownership of appliances and other features, he concluded that the sample of mail respondents was in fact a random sample of all Time subscribers.

It should be noted that even though a population is homogeneous with respect to some characteristics (i.e., respondent characteristics = non-respondent characteristics) others may differ substantially. It is in the latter instance that elimination of bias becomes essential.

Up to this point, the literature reviewed was concerned primarily with pointing up the existence or nonexistence of bias. The articles reviewed below are mostly concerned with the theoretical aspects of bias and its control.

Generally speaking, one of three devices has been used to adjust results from mail and personal interview surveys.
These are:

(1) enumeration by initial contact and then recall of a subsample of the non-respondents
(2) obtaining auxiliary information on both respondents and non-respondents that is correlated with items to be estimated
(3) postulating a mathematical relationship (model) of the population and using information from successive interviews to estimate the unknowns of the model and using this to describe non-respondents.

In 1948, Ferber considered the use of tests for random order as a basis for measuring the correlation of sample unit characteristics to willingness to respond. The responding questionnaires in a mail survey were ranked according to the value of the measured characteristics. Another ranking was also made on the chronological order of receipt. By calculating the Spearman rank correlation coefficient ($r$) one could test for correlation. A significant value of $r$ implies that premature termination of the study or a complete disregard of non-respondents could introduce bias. Naturally, such a test is made on the null hypothesis of independence of the value of the measured item and the promptness of response. Such a null hypothesis would not be applicable if one of the items measured were geographical location. However, the procedure seems appropriate in many instances.

Commenting on the procedure suggested by Ferber (1948), Ford and Zeisel (1949) questioned its usefulness as a predictor of characteristics of non-respondents. They showed that projections of trend differences among early and late responses were unreliable in particular cases. Their data were taken from a survey of employee attitudes with respect to previous employment. They found that willingness to respond, as measured by earlier returns, was correlated with the ratings made by the present supervisors.
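The rank correlation used in Ferber's test can be sketched in a few lines. The following is a minimal illustration only, not Ferber's own computation: the function names `ranks` and `spearman_r` and the sample figures are hypothetical, ties are assumed to share the average rank, and no significance test is included.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: order of receipt of each questionnaire and the
# value of the measured characteristic reported on it.
order_of_receipt = [1, 2, 3, 4, 5, 6, 7, 8]
measured_value = [52, 47, 49, 40, 41, 35, 38, 30]
print(spearman_r(order_of_receipt, measured_value))
```

A value of $r$ far from zero in such data would suggest that later respondents differ from earlier ones, which is the situation in which ignoring non-respondents is most likely to bias the estimates.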
Projection of results based on an estimate of the trend was found to include a disproportionately large number of unsatisfactory employees when the results were compared to personnel records and the responses of follow-up respondents.

Probably the first published work on the statistical aspects of non-response in sample surveys was by Hansen and Hurwitz (1946). They studied sample design, including estimation, appropriate to a single stage mail survey. Follow-up, i.e., the first phase, is made by all possible means to insure response. This is usually accomplished by personal interview and intensive recall. The follow-up interviews are made on a randomly selected subsample of the non-respondents. The authors proposed the following unbiased estimator of the population total for this plan:

\[ x' = \frac{N}{n}\left(m\bar{x}' + s\bar{x}''\right) \]

where
    $N$ = total number of addresses in the population
    $n$ = number of questionnaires mailed
    $m$ = number of mail replies
    $s$ = number in the sample of $n$ who do not respond to mail questionnaires
    $\bar{x}'$ = sample mean of the $m$ respondents to the mail questionnaires
    $\bar{x}''$ = sample mean of the $r$ respondents to the personal interviews.

Also derived in the above paper was the variance of the proposed estimator. However, an estimate of this variance was not given by the authors. Application of the method to several survey conditions was presented. Formulae for optimum allocation of effort between mail and field interviews were derived. These values are optimum in the sense that they provide minimum expected cost for specified precision (variance). Comparisons were made with plans calling for complete enumeration of all non-respondents.
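The Hansen and Hurwitz estimator of the total, $x' = (N/n)(m\bar{x}' + s\bar{x}'')$, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hansen_hurwitz_total`, its argument names and the figures in the usage example are hypothetical, and the interviewed subsample is assumed to have been drawn at random from the $s$ mail non-respondents.

```python
def hansen_hurwitz_total(N, n, mail_values, interview_values):
    """Estimate a population total from mail replies plus a follow-up subsample.

    N                -- number of addresses in the population
    n                -- number of questionnaires mailed
    mail_values      -- measurements from the m mail respondents
    interview_values -- measurements from the r interviewed non-respondents,
                        a random subsample of the s = n - m non-respondents
    """
    m = len(mail_values)
    s = n - m
    if s < 0:
        raise ValueError("more mail replies than questionnaires mailed")
    if s > 0 and not interview_values:
        raise ValueError("non-respondents exist but none were interviewed")
    mean_mail = sum(mail_values) / m if m else 0.0
    mean_interview = sum(interview_values) / len(interview_values) if s else 0.0
    # The s non-respondents are represented by the mean of the r interviewed.
    return N / n * (m * mean_mail + s * mean_interview)


# Hypothetical figures: 10 questionnaires mailed to a population of 1,000,
# 4 mail replies, and 2 of the 6 non-respondents interviewed.
print(hansen_hurwitz_total(1000, 10, [2, 4, 6, 8], [1, 3]))
```

The interviewed mean stands in for all $s$ non-respondents, so the weight attached to each interview grows as the subsample shrinks; this is the source of the trade-off between interview cost and variance that the optimum allocation formulae address.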
In this thesis, the greater efficiency of subsampling of non-respondents will be proven in general. The authors stressed that their sample design was most efficient when the response rate to the mail questionnaire is high and the difference between the cost of obtaining this response and the personal interview costs is large. This last comment should be borne in mind when reading Durbin's 1953 paper.

In 1949, Hendricks suggested finding general mathematical laws which would enable one to project trends based on results obtained from at least three successive mailings. He postulated the following relationship for responses to successive mailings:

\[ \ln\frac{X}{\bar{x}} \sim N(0, \sigma^2) \]

where
    $X$ = mailing number (1, 2, ...)
    $\bar{x}$ = average number of mailings required for response.

Using the data made available by a North Carolina farm study, the variance of this postulated distribution was estimated. Further, he approximated the relationship between the items of interest by a quadratic function in $X$, the number of mailings. Using this equation, values of the items of interest as well as the number responding to future mailings may be predicted and used to adjust the results.

Another method proposed by Politz and Simmons (1949) was aimed at reducing non-response bias and at the same time eliminating expensive recalls. The technique is applicable to human populations and then only where the method of contact is instantaneous, as in personal interviews or telephone interviews. The technique consists of contacting each individual in the sample at a "random" time in the interview period. From each person successfully contacted, information is obtained as to whether or not that person was at home at certain other "random" times during the interviewing period. This information was then used to estimate the proportion of time that person was at home during the interviewing period.
For practical purposes, the population was divided into six categories: those at home for 1/6, 2/6, 3/6, etc., of the interview period. The average response in each of these categories was then weighted by the reciprocal of the associated proportion. This provided quasi-unbiased estimates, since the category 0/6 was not sampled at all. The practical shortcomings of this plan include the selection of "random" times for interviewing and the reliance on interviewee memory for estimating his "at home" record.

In 1955, Durbin compared a method utilizing continued recall to the technique suggested by Politz and Simmons. He also showed that the former method, which is essentially the Hansen and Hurwitz (1946) approach, gave little increase in efficiency when compared to a 100 per cent recall scheme unless the relative cost of obtaining non-respondents was at least an order of magnitude greater than the cost of obtaining initial responses. (This conclusion seems to be in keeping with the application as made to mail surveys with personal recall by Hansen and Hurwitz.) However, the suggestion made by Cochran (1953) that the Hansen and Hurwitz method be used even when initial contact is by personal interview is not encouraged by Durbin's results. Durbin found that the relative efficiency of the Politz-Simmons approach to a recall method on all non-respondents is approximately equal to $1 + \rho^2$, where $\rho$ is the correlation between the response characteristic and the probability of successful contact and response during the interviewing period. This result is true only for the case where the relation between the value of the response variable and the probability of successful contact is linear. The distribution of the probabilities was assumed, by Durbin, to be triangular. The principal objection to the conclusions is that the author assumed that interviewing costs were equal at all phases.
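The reciprocal weighting in the Politz and Simmons plan described above can be sketched as follows. This is an illustrative reading only: it weights each completed interview by the reciprocal of that person's estimated at-home proportion and forms a weighted mean, which is one simple way of applying the category weights. The function name `politz_simmons_mean` and the data are hypothetical.

```python
def politz_simmons_mean(interviews, periods=6):
    """Weighted mean with weights equal to the reciprocal of time at home.

    interviews -- list of (value, times_at_home) pairs, where times_at_home
                  counts the periods (out of `periods`) the person reports
                  having been at home, including the time of the interview
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, times_at_home in interviews:
        if not 1 <= times_at_home <= periods:
            # The 0/6 category cannot occur: such persons are never interviewed.
            raise ValueError("times_at_home must lie between 1 and periods")
        weight = periods / times_at_home  # reciprocal of the at-home proportion
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no interviews supplied")
    return weighted_sum / weight_total


# Hypothetical data: a person always at home and one at home half the time.
print(politz_simmons_mean([(10.0, 6), (20.0, 3)]))
```

Persons who are seldom at home receive large weights, so a few such interviews can dominate the estimate; persons never at home receive no weight at all, which is why the text calls the estimates quasi-unbiased.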
This, of course, is not true in most practical cases, particularly in mail-personal recall methods.

Birnbaum and Sirken (1950) considered the problem of bias due to non-response when sampling from large binomial type populations, i.e., responses of the type "yes or no". The results are based on normal approximations to the binomial. The authors present tables giving limits on the bias and expected cost of the survey. Variances of the estimators are tabulated as functions of sample size and the number of call-backs made on the non-availables. Their method of sampling differs from the Hansen and Hurwitz type in that each unit is contacted up to $k$ times. If a response is obtained on the $i$-th call ($i \leq k$), the unit is considered to be a respondent. Tables of minimum sample size required for a given precision are presented for values of $k$, the maximum number of call-backs, from one to five. The techniques and tables presented do not apply to responses which are not of the "yes-no" type.

In the development of the theory of sampling with non-response, Deming (1953) states "the bias of non-response is probably so serious in many if not most surveys that the specification of the number of recalls, and the adjustment of the original size of sample to permit either the use of the Politz plan or the requisite number of recalls to balance the bias of non-response against the variance, and to stay within the allowable budget, are an essential part of sample design where the aim is to produce as much information as possible per unit cost."

Deming's sampling plan consists of making successive recalls on a fixed proportion of non-respondents to the previous call. This is in lieu of procuring information on the proportion of time home as in the Politz plan. However, in assessing the relative bias, mean square error and cost for varying number of recalls, the author constructs a hypothetical population in the same manner as did Politz (1949).
He concludes, on the basis of several contrived examples, that most recall plans should allow for at least 4 or 5 recalls. Only when the precision requirements were high was the Politz plan as efficient as the Deming recall plan.

Yates (1949) suggested that, "the simplest way of dealing with non-response is to regard the non-respondents as similar to the remainder of the sample, i.e., to treat the sample as if it were a sample on a smaller number of units." In this instance, one may take the approach due to Durbin (1957).

The theory outlined by Durbin (1957) is an extension to stratified and multistage sampling of standard ratio methods of estimation. The problem of non-response is essentially ignored by definition of a domain of study consisting of respondents only. Its application to sampling in the presence of non-response is therefore only valid under the conditions implied by Yates above.

Another plan for utilizing population information to achieve maximum efficiency was put forth by El-Badry (1956). The method of sampling involves using waves (phases) of mail questionnaires each to a subsample of the non-respondents to the previous mailing. The final phase of interviewing is carried out by personal interview on a subsample of the non-respondents remaining after all the mailings have been made. Clearly, this is an extension of the Hansen-Hurwitz (1946) plan which involved using only one mail attempt followed by personal interview. El-Badry considered the following unbiased estimator of the population total:

\[ x' = \frac{N}{n_1}\left[ n_{11}\bar{x}_{11} + k_2 n_{21}\bar{x}_{21} + k_2 k_3 n_{31}\bar{x}_{31} + \cdots + n_{m1}\left(\prod_{i=2}^{m} k_i\right)\bar{x}_{m1} + n_{m2}\left(\prod_{i=2}^{m} k_i\right)\bar{x}_{m2} \right] \]

where
    $n_1$ = number in the original mailing
    $N$ = number of sampling units in the population
    $n_{i1}$ = number of sampling units responding to the $i$-th mailing
    $n_{m2}$ = number of sampling units not responding to the $m$-th mailing
    $k_i^{-1}$ = sampling rate to be used at the $i$-th mailing
    $m$ = number of mail attempts.
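The wave design can be illustrated by a short sketch. It assumes, as a reading of the plan rather than a transcription of El-Badry's formula, that the respondents to each mailing are expanded by the product of the reciprocal sampling rates applied up to that mailing, and that the final non-respondents are represented by the mean of those interviewed. The function name `multiwave_total` and all argument names are hypothetical.

```python
def multiwave_total(N, n1, wave_totals, k, n_m2, interview_values):
    """Expansion estimate of a population total under a multi-wave mail plan.

    N                -- number of sampling units in the population
    n1               -- number of units in the original mailing
    wave_totals      -- sum of the measurements of the respondents to each
                        mailing, in order (length m)
    k                -- reciprocal sampling rates k_2, ..., k_m (length m - 1)
    n_m2             -- number of units not responding to the last mailing
    interview_values -- measurements from the interviewed subsample of the
                        n_m2 final non-respondents
    """
    if not wave_totals or len(k) != len(wave_totals) - 1:
        raise ValueError("need one reciprocal rate for each mailing after the first")
    weight = 1.0
    expanded = 0.0
    for wave, total in enumerate(wave_totals):
        if wave > 0:
            weight *= k[wave - 1]  # cumulative product k_2 * ... * k_i
        expanded += weight * total
    if n_m2 > 0:
        if not interview_values:
            raise ValueError("final non-respondents exist but none were interviewed")
        mean_interview = sum(interview_values) / len(interview_values)
        expanded += weight * n_m2 * mean_interview
    return N / n1 * expanded


# Hypothetical two-mailing example: the second mailing goes to half of the
# first-wave non-respondents (k_2 = 2), and one unit remains for interview.
print(multiwave_total(100, 10, [5.0, 3.0], [2.0], 1, [4.0]))
```

With a single mailing (`k` empty) the sketch reduces to the Hansen and Hurwitz form $(N/n)(m\bar{x}' + s\bar{x}'')$, which is consistent with the description of the plan as an extension of theirs.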
The final phase is carried out by interviewing a fraction of the $n_{m2}$ non-respondents at the $m$-th mailing. El-Badry derives optimum values for the sampling fractions and the original sample size based on a generalized Hansen-Hurwitz cost function.

A study of the results of a stratified one-stage sample survey on morbidity carried out in Nashville, Tennessee, was presented by Vaivanijkul (1961). The study consisted of comparing the responses of 2,564 respondents to the responses of a subsample of 85 non-respondents. The latter were made to respond by an intensive recall procedure. It was found that of 34 items studied there were significant differences between the respondents and non-respondents in only six. It was concluded that the bias due to non-response would have been negligible had the non-respondents not been considered at all.

Some important conclusions which can be drawn from the above studies and from the existing literature on the problem of non-response are as follows.

(1) There is a certain amount of risk in putting confidence in survey results that neglect the possible effects of non-response.
(2) It is inefficient and even futile to strive for complete coverage, or merely to resort to larger samples, as a means of avoiding non-response error.
(3) It is important to utilize past experience about a survey or similar surveys in planning.
(4) Present theory of non-response needs to be developed and extended to more commonly used sample designs.

1.3 Notation

When possible, standard notation has been used throughout this thesis. For reading ease and to maintain continuity many of the symbols used are defined where introduced. However, for ready reference the following tables are presented.

Table 1.
Notation relevant to unstratified single phase sampling

Definition                                      Population                     Sample
Number of first stage units (FSUs)              $M$                            $m$
Within the i-th FSU:
  Number of times PSU i appears in sample
  Selection probability
  Inclusion probability                         $P_i$
  Number of second stage units (SSUs)           $N_i$                          $n_i$
  Number of responding SSUs                     $N_{1i}$                       $n_{1i}$
  Number of non-responding SSUs                 $N_{2i}$                       $n_{2i}$
  Number of first phase SSUs                                                   $h_{2i}$
  Resample rate at first phase                                                 $f_i^{-1}$
  Proportion of respondents                     $R_i$
  Proportion of non-respondents                 $\bar{R}_i$
  Measure of responding SSU j                   $Y_{1ij}$                      $y_{1ij}$
  Measure of non-responding SSU j               $Y_{2ij}$                      $y_{2ij}$
  Mean of all SSUs                              $\bar{Y}_i$                    $\bar{y}_i$
  Mean of all responding SSUs                   $\bar{Y}_{1i}$                 $\bar{y}_{1i}$
  Mean of all non-responding SSUs               $\bar{Y}_{2i}$                 $\bar{y}_{2i}$
  Mean of first phase SSUs
  Total of all SSUs                             $Y_i$
  Total of all responding SSUs                  $Y_{1i}$
  Total of all non-responding SSUs              $Y_{2i}$
  Total of first phase SSUs
  Mean square of all SSUs                       $(N_i-1)\sigma_i^2$            $(n_i-1)s_i^2$
  Mean square of all responding SSUs            $(N_{1i}-1)\sigma_{1i}^2$      $(n_{1i}-1)s_{1i}^2$
  Mean square of all non-responding SSUs        $(N_{2i}-1)\sigma_{2i}^2$      $(n_{2i}-1)s_{2i}^2$
  Mean square of first phase SSUs

For the case of stratified single phase sampling, all the notation of Table 1 is made applicable by adding to each symbol the subscript $k$ as, for example, $N_{2ki}$ representing the total number of non-responding SSUs in FSU $i$ of stratum $k$.

Table 2. Notation relevant to unstratified multi-phase sampling

Definition                                                   Population                    Sample
Within the i-th FSU:
  Number of SSUs selected at second stage                                                  $n_i$
  Number of responding SSUs at phase u (u = 1, 2, ..., p-1)  $N^i_{(u+1)1}$                $n^i_{(u+1)1}$
  Number of non-responding SSUs at phase u                   $N^i_{(u+1)2}$                $n^i_{(u+1)2}$
  Sampling rate at phase u                                                                 $[f^i_{(u+1)}]^{-1}$
  Sampling rate at phase [illegible]                                                       $w_i$
  Mean of all responding SSUs at phase u                     $\bar{Y}^i_{(u+1)1}$
  Mean of all non-responding SSUs at phase u                 $\bar{Y}^i_{(u+1)2}$
  Mean of u-th phase SSUs                                                                  $\bar{y}^i_{(u+1)2}$
  Fraction of responding SSUs at phase u                     $F^i_{(u+1)1}$
  Fraction of non-responding SSUs after phase u              $F^i_{(u+1)2}$
  Mean square of responding SSUs at phase u                  $\sigma^2_{(u+1)1}$
  Mean square of non-responding SSUs after phase u           $\sigma^2_{(u+1)2}$

For the case of stratified multiphase sampling all the notation of Table 2 is made applicable by adding to each symbol the subscript $k$ as, for example, $N^i_{(u+1)2k}$ representing the total number of non-responding $u$-th phase SSUs in the $i$-th FSU of stratum $k$.

In defining inclusion probabilities, consider a particular sample of $n$ units drawn out of $N$. If the $n$ units are drawn without replacement there are $\binom{N}{n}$ different samples. Further, if order of draw is considered each sample may be drawn in $n!$ ways. Associated with each of the $\binom{N}{n}$ samples is the unconditional probability of its being drawn. In particular, for the sample $s$ this probability is denoted by $P_s$. It is computed according to the elementary laws of probability from the selection probabilities defined at each draw. If sampling is performed with replacement, the definition of $P_s$ remains the same except that $s$ now ranges over $\binom{N+n-1}{N-1}$ different possible samples. The definition of inclusion probabilities is as follows:

\[ P_i = \sum_{s \ni i} P_s, \qquad P_{ij} = \sum_{s \ni i,j} P_s \]

where the first summation is over probabilities $P_s$ associated with samples $s$ which contain FSU $i$ and the second summation is over probabilities $P_s$ associated with samples $s$ containing FSUs $i$ and $j$.

2.0 THE THEORY OF SAMPLING WITH NON-RESPONSE AT THE SECOND STAGE

In sampling from a finite population, the previously drawn units may or may not be replaced after each draw. This may apply at any stage or phase in the sampling procedure. In this chapter we shall consider sampling plans in which the following methods of selection apply at each step in sampling.
First stage units are selected with unequal probabilities ( a) without replacement (b) with replacement. Second stage units are selected with equal probability and without replacement. ··e 3. First phase units are selected from among non-responding second stage units with equal probability and without replacement. For the development of the theory, it is assumed that the frame from which sampling with replacement at the first stage is made is such that N i ~ ron for all!. i In practice, this premise should p~esent no problem, since the likelihood of repeated selection of a particular first stage unit, say up to ~ times, is extremely small for reasonable size M, the total number of first stage units in the population. A second point worthy of mention is that the development is restricted to only one type of estimator of the population total. is done for two reasons. Firstly, it is an extended version of the estimator already considered by Hansen-Hurwitz (1956). This (1946) and El-Badry Secondly, the results may be extended to other estimators 22 either by application of the same methodology developed herein or, by substitution of appropriate expressions in the results obtained below for this estimator. 2.1 2.1.1 Sampling with Unequal Probabilities and Without Replacement at the First Stage The Sampling Procedure. Let us assmne the existence of a population frame consisting of M first stage units. tion a sample of ~ first stage units is selected. From this populaThe method of selec- tion is such that for the i th primary unit the inclusion probability is Pi • The i th first stage unit consists of N second stage units. i the second stage n i :s ni :s Ni ) units (where 1 At are drawn from among N i with equal probability and without replacement. ··e Initial contact is then made with the selected second stage units. From the i th first stage unit ~i second stage units (where ~i respond and the remaining units ~i (where ~i spond. 
= ni :s ni ) - ~i) do not re- In practice, the second stage units might represent human beings, firms, drugstores, farms, etc., some of which respond to an initial inquiry and some of which do not. Subsequent to this initial attempt at soliciting a response the following procedure is carried out. Select from among the n non2i respondents in the i th primary unit a subsample, again with equal probability and without replacement. This step in the process of selection may be designated as second-stage first-phase sampling. units selected at this step is bol f i ~i· Let f i ~i = ~i. The number of Thus, the sym- is seen to be the reciprocal of the sampling rate used at the first phase among the non-respondents. Each of the selected h units 2i is now recalled until response is attained. It is at this phase where extraordinary means are adopted to seek out and solicit responses. The cost per schedule at this phase is many times higher than at the initial contact. Having obtained responses from (~i + ~i) second stage units we are not in a position to estimate the population total for the measured characteristic. The n initial respondents measure the characteristics li of respondents while the h initial non-respondents estimate the char2i acteristics of that part of the population which is not amenable to initial contact. 2.1.2 An Estimator of the Total. .···e Consider the following estimator of the population total for the measurable characteristic 'l: n,i ~i!l,..i ( r Ylij + h .'if' Y2ij ) j=l 2i J=l (2.1) This estimator provides an unbiased estimate of M Ni T = E E Yij i=l j=l the population total. The proof is as follows: on a given set of all (~~) sets of ~ Taking expectation of (2.1) conditional primary units and fixed sample sizes ~i ~1' ~i' units each, produces the following results: over 24 .... E(Tll m'~i,h2i) • where Y2i (2.2 ) is the average value for all non-respondents in the ~i sample. 
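The estimator (2.1) can be illustrated with a short sketch. The container `FsuSample` and the function `estimate_total` are hypothetical names introduced here for illustration only; the sketch assumes that the recalled subsample in each first stage unit is non-empty whenever there are non-respondents, and it computes $f_i$ as $n_{2i}/h_{2i}$.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FsuSample:
    """Observed data for one selected first stage unit (illustrative container)."""
    N: int                        # N_i: second stage units in the FSU
    P: float                      # P_i: first stage inclusion probability
    n: int                        # n_i: second stage units initially contacted
    respondents: Sequence[float]  # y_1ij for the n_1i initial respondents
    recalled: Sequence[float]     # y_2ij for the h_2i recalled non-respondents


def estimate_total(sample: Sequence[FsuSample]) -> float:
    """Estimate the population total following the form of (2.1)."""
    total = 0.0
    for fsu in sample:
        n2 = fsu.n - len(fsu.respondents)   # n_2i = n_i - n_1i
        h2 = len(fsu.recalled)              # h_2i
        if n2 < 0 or h2 > n2:
            raise ValueError("inconsistent sample sizes")
        if n2 > 0 and h2 == 0:
            raise ValueError("non-respondents present but none recalled")
        f = n2 / h2 if h2 else 0.0          # f_i = n_2i / h_2i
        inner = sum(fsu.respondents) + f * sum(fsu.recalled)
        total += fsu.N / (fsu.P * fsu.n) * inner
    return total


# One FSU: 5 contacted, 3 respond, 1 of the 2 non-respondents is recalled.
demo = [FsuSample(N=10, P=0.5, n=5, respondents=[1.0, 2.0, 3.0], recalled=[4.0])]
demo_total = estimate_total(demo)
```

In the demonstration the single recalled unit stands in for both non-respondents ($f_i = 2$), so the within-unit sum is $6 + 2 \times 4 = 14$ and the weight is $10 / (0.5 \times 5) = 4$.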
Now taking expectations over all sets $[n_{1i}, n_{2i}]$ such that $n_{1i} + n_{2i} = n_i$ and making use of the results $E(n_{1i}) = n_i R_i$ and $E(n_{2i}) = n_i \bar{R}_i$:

$$E(\hat{T}_1 \mid m) = \sum_{i=1}^{m} \frac{N_i}{P_i} \left( R_i \bar{Y}_{1i} + \bar{R}_i \bar{Y}_{2i} \right) = \sum_{i=1}^{m} \frac{Y_i}{P_i} \tag{2.3}$$

Finally, taking expectations over all possible sets of $m$ first stage sampling units:

$$E(\hat{T}_1) = \sum_{s} P_s \sum_{i=1}^{m} \frac{Y_i}{P_i} = \sum_{i=1}^{M} \frac{Y_i}{P_i} \left( \sum_{s \ni i} P_s \right) = \sum_{i=1}^{M} \frac{Y_i}{P_i} P_i = T \tag{2.4}$$

The probability $P_s$ is the unconditional probability of selecting a particular set, namely the set $s$, of $m$ out of $M$ first stage units. $P_s$ is calculable from the probabilities defined for the selection of the first stage units. The selection probabilities are arbitrary and the drawing is made without replacement. The notation $\sum_{s \ni i}$ indicates summation over all $P_s$ associated with sets that include the first stage unit $i$.

2.1.3 Variance of the Estimator, $\hat{T}_1$. The variance of $\hat{T}_1$ is found to be:

$$V(\hat{T}_1) = \sum_{i=1}^{M} \frac{N_i^2}{P_i n_i} \left[ \bar{R}_i (f_i - 1) \sigma_{2i}^2 + \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 \right] + \sum_{i=1}^{M} \frac{Y_i^2}{P_i} + \sum_{i \ne j}^{M} \frac{Y_i Y_j}{P_i P_j} P_{ij} - T^2 \tag{2.5}$$

It can be seen that this expression has three discernible parts. The first is a quadratic function of the first phase sampling units, i.e., the variance among the non-respondents. The second is the corresponding variance among the responding second stage sampling units. The third part represents a quadratic function of the first stage unit totals. It will be shown later that for certain selection schemes the third part reduces to the familiar expression for the between primary unit variance for finite populations.

The proof of (2.5) is accomplished most directly by utilization of a theorem formalized by Madow (1949), which is

$$V(\hat{T}_1) = E\left[ V(\hat{T}_1 \mid s) \right] + V\left[ E(\hat{T}_1 \mid s) \right] \tag{2.6}$$

where $s$ represents a particular sample configuration consisting of $m$ first stage units. The $i$th unit contains $n_{1i}$ out of $n_i$ responding second stage units and $h_{2i}$ responding second stage (first phase) units out of $n_{2i}$ second stage units which did not respond. It will be recalled that $n_i = n_{1i} + n_{2i}$ is the number of second stage units drawn from first stage unit $i$.

Consider the first term on the right hand side of (2.6):

$$V(\hat{T}_1 \mid s) = V\left[ \sum_{i=1}^{m} \frac{N_i}{P_i n_i} \left( \sum_{j=1}^{n_{1i}} y_{1ij} + n_{2i} \bar{y}_{2i} + \frac{n_{2i}}{h_{2i}} \sum_{j=1}^{h_{2i}} y_{2ij} - n_{2i} \bar{y}_{2i} \right) \,\Big|\, s \right] \tag{2.7}$$

where the quantity $n_{2i} \bar{y}_{2i}$ has been added and subtracted from the original expression for $\hat{T}_1$ given in (2.1). Since the sampling in each first stage unit is independent of the sampling carried out in other FSUs for the given set of $m$ first stage units, (2.7) may be written as a sum of separate within-unit variances, one for each selected first stage unit.

(2.8)

For the second stage sample of $n_i$, continued sampling at a rate $(f_i)^{-1}$ among the $n_{2i}$ non-respondents in primary unit $i$ and the process of taking expectations relevant to second-stage first-phase sampling leads to replacing $s_{2i}^2$ by $\sigma_{2i}^2$. Here the expected value of $n_{2i}$ is easily determined from its properties as a binomial variate. Thus the expression becomes:

$$V(\hat{T}_1 \mid s) = \sum_{i=1}^{m} \frac{N_i^2}{P_i^2 n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] \tag{2.9}$$

Finally, taking expectations relevant to first stage sampling gives:

$$E\left[ V(\hat{T}_1 \mid s) \right] = \sum_{i=1}^{M} \frac{N_i^2}{P_i n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] \tag{2.10}$$

which represents the first expression in (2.5).

The second term on the right hand side of (2.6) can be evaluated as follows:

$$V\left[ E(\hat{T}_1 \mid s) \right] = E\left( \sum_{i=1}^{m} \frac{Y_i}{P_i} \right)^2 - \left[ E\left( \sum_{i=1}^{m} \frac{Y_i}{P_i} \right) \right]^2 = \sum_{i=1}^{M} \frac{Y_i^2}{P_i^2} \sum_{s \ni i} P_s + \sum_{i \ne j}^{M} \frac{Y_i Y_j}{P_i P_j} \sum_{s \ni i,j} P_s - T^2 \tag{2.11}$$

The notation $\sum_{s \ni i,j}$ indicates summation over all probabilities $P_s$ which are associated with sets including both first stage units $i$ and $j$. Adding (2.10) and (2.11) yields the desired result (2.5).

In the particular case where the sampling is in one stage, i.e., dropping the subscripts $i$ and setting $P_i = 1$, we find that (2.5) reduces to:

$$V(\hat{T}) = \frac{N^2}{n} \left[ \left( 1 - \frac{n}{N} \right) \sigma^2 + \bar{R} (f - 1) \sigma_2^2 \right] \tag{2.12}$$

This result was first obtained by Hansen and Hurwitz (1946).

It should be noted that when $f_i = 1$, $\sigma_{2i}^2$ disappears from the variance formula (2.5). In effect, non-response has been eliminated by making the recall rate unity.

2.1.4 An Unbiased Estimator of $V(\hat{T}_1)$. An estimator for (2.5) and, incidentally, for (2.12) has not been given so far. One possible estimator will be developed below.
In order to develop an unbiased estimator for $V(\hat{T}_1)$, consider the following sample statistics for the $i$th first stage sampling unit:

(2.13)

(2.14)

(2.15)

Expanding the statistics (2.13) and (2.14) and adding them together we obtain:

(2.16)

Taking expectation of this quantity conditional on fixed $n_{2i}$, over all first phase samples of size $h_{2i}$ drawn from the selected $n_{2i}$ non-respondents, we have:

(2.17)

where $\bar{y}_{2i}$ is the average response of all $n_{2i}$ non-respondents. Note that sums taken over non-respondents are now over all $n_{2i}$ values.

To the right hand side of (2.17) add and subtract the same quantity. After some algebra we obtain the expression:

(2.18)

where the subscripts 1 and 2 have been dropped on the first term on the right hand side of (2.18), indicating summation over all $(n_{1i} + n_{2i})$ observations. In like manner $\bar{y}_i$ is the average of all $(n_{1i} + n_{2i})$ values.

Taking expectations over all samples of size $n_{2i}$ out of all possible $N_{2i}$ items, (2.18) becomes

(2.19)

where

$$(n_i - 1)\, s_i^2 = \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2 .$$

Finally, taking expectation of (2.19) over all $n_{2i} \le n_i$ gives

(2.20)

Let us now return to (2.15). Taking expectation conditional on $n_{2i}$, $h_{2i}$; followed by expectation conditional on $n_{2i}$; and then taking expectations over all values of $n_{2i}$, we obtain:

(2.21)

Now combine (2.15) and (2.16) linearly to obtain an unbiased estimator of the quantity:

$$B_i = \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \tag{2.22}$$

This is accomplished by using the coefficients

$$\frac{N_i - n_i}{N_i (n_i - 1)} \quad \text{for (2.16)} \qquad \text{and} \qquad \frac{(f_i - 1)(N_i - 1)}{N_i (n_i - 1)} \quad \text{for (2.15),}$$

respectively. Thus, an unbiased estimator of (2.22) is given by the sample statistic:

(2.23)

It should be noted that the terms considered involve second stage sampling units. The expectations have been taken over second stage and first phase sampling units.
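To complete the derivation of an estimator for $V(\hat{T}_1)$ consider the quantity:

$$v(\hat{T}_1) = \sum_{i=1}^{m} \frac{N_i^2}{P_i n_i} b_i + \sum_{i=1}^{m} \frac{1 - P_i}{P_i^2} \hat{Y}_i^2 + \sum_{i \ne j}^{m} \frac{P_{ij} - P_i P_j}{P_{ij} P_i P_j} \hat{Y}_i \hat{Y}_j \tag{2.24}$$

which shall be shown to be an unbiased estimator of $V(\hat{T}_1)$. Here $b_i$ is the sample statistic (2.23) and $\hat{Y}_i$ is the sample estimate of the total of the $i$th first stage unit.

Taking expectations of (2.24) relevant to those steps in sampling beyond the first stage for the given set of $m$ first stage units, the terms of (2.24) are

$$E(b_i) = B_i , \qquad E(\hat{Y}_i^2) = Y_i^2 + \frac{N_i^2}{n_i} B_i , \qquad E(\hat{Y}_i \hat{Y}_j) = Y_i Y_j .$$

The last expression is true because of the independent sampling carried out in the first stage units. Inserting these results in (2.24) and taking expectations over all sets of $m$ FSUs out of $M$, the expression becomes:

(2.25)

This may be rewritten:

(2.26)

Substituting $P_i$ for $\sum_{s \ni i} P_s$ and $P_{ij}$ for $\sum_{s \ni i,j} P_s$ in (2.26) gives the final result,

(2.27)

Equation (2.27) is identical with (2.5) and hence (2.24) is an unbiased estimate of $V(\hat{T}_1)$.

2.2 Sampling with Unequal Probabilities and with Replacement at the First Stage

2.2.1 The Sampling Procedure. The population is as defined above in 2.1. The only difference in the sampling scheme is that each of the first stage units is replaced prior to selection of the next unit. Thus, it is possible for the $i$th FSU to be selected up to $m$ times. Each time a particular FSU is drawn into the sample, a different set of second stage units is selected. Hence, it is possible that as many as $m n_i$ SSUs may be required from the $i$th first stage unit. For this reason it is necessary that $N_i \ge m n_i$ for all $i$.

In the following development, the $i$th FSU has probability $p_i$ of being drawn into the sample at any draw. The symbol $\gamma_i$ represents the number of times the $i$th primary unit is drawn in a sample consisting of $m$ units. Thus, $\gamma_i$ may take on the values $0, 1, 2, \ldots, m$. Further, it is obvious that $\sum_{i=1}^{M} \gamma_i = m$.

2.2.2 An Estimator of the Total. For the above case of replacement at the first stage of selection, it can be shown that $\hat{T}_2$ given by:

$$\hat{T}_2 = \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i N_i}{p_i n_i} \left( \bar{n}_{1i} \bar{y}_{1i} + \bar{n}_{2i} \bar{y}_{2i} \right) \tag{2.28}$$

is an unbiased estimator of the population total $T$. In (2.28) $\bar{y}_{1i}$ is the sample average of the respondents in the $i$th first stage unit and $\bar{n}_{1i}$ is the average number of respondents for the $\gamma_i$ times this unit is drawn into the sample. Similar definitions hold for the quantities $\bar{y}_{2i}$ and $\bar{n}_{2i}$. It is thus seen that $\bar{n}_{1i} + \bar{n}_{2i} = n_i$ must hold. The probability that $\gamma_i = k$ is given by:

$$P(\gamma_i = k) = \binom{m}{k} p_i^k (1 - p_i)^{m-k} .$$

Also

$$E(\bar{n}_{1i} \mid \gamma_i) = n_i R_i . \tag{2.29}$$

Proof: The expected total number of respondents over the $\gamma_i$ draws of the unit is

$$E\left( \sum_{\ell=1}^{\gamma_i} n_{1i\ell} \right) = \gamma_i n_i R_i .$$

But $\gamma_i \bar{n}_{1i} = \sum_{\ell=1}^{\gamma_i} n_{1i\ell}$, and the proof of (2.29) is complete.

Taking expectation of $\hat{T}_2$ conditional on fixed $\gamma_i$ and $\bar{n}_{1i}$ results in the expression

$$E(\hat{T}_2 \mid \gamma_i, \bar{n}_{1i}) = \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i N_i}{p_i n_i} \left( \bar{n}_{1i} \bar{Y}_{1i} + \bar{n}_{2i} \bar{Y}_{2i} \right) \tag{2.30}$$

Averaging over all values $\bar{n}_{1i}$ conditional on $\gamma_i$ and using (2.29), the following expression is obtained:

$$E(\hat{T}_2 \mid \gamma_i) = \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i N_i}{p_i} \left( R_i \bar{Y}_{1i} + \bar{R}_i \bar{Y}_{2i} \right) = \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i Y_i}{p_i} \tag{2.31}$$

Finally, taking the expected value of (2.31) over all integral values of $\gamma_i \le m$ and using the fact that $E(\gamma_i) = m p_i$, it is seen that

$$E(\hat{T}_2) = \sum_{i=1}^{M} Y_i = T \tag{2.32}$$

which shows that $\hat{T}_2$ is an unbiased estimator of $T$.

2.2.3 Variance of the Estimator, $\hat{T}_2$. The sampling variance of $\hat{T}_2$ is found to be:

$$V(\hat{T}_2) = \frac{1}{m} \sigma_{by}^2 + \frac{1}{m} \sum_{i=1}^{M} \frac{N_i^2}{p_i n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] - \frac{m-1}{m} \sum_{i=1}^{M} N_i \sigma_i^2 \tag{2.33}$$

where

$$\sigma_{by}^2 = \sum_{i=1}^{M} p_i \left( \frac{Y_i}{p_i} - T \right)^2 .$$

Direct application of each part of Madow's theorem to the estimator as written below leads to the simplest proof. Write $\hat{T}_2$ as:

$$\hat{T}_2 = \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i N_i}{p_i n_i} \left[ \bar{n}_{1i} \bar{y}_{1i} + \bar{n}_{2i} \tilde{y}_{2i} + \bar{n}_{2i} \left( \bar{y}_{2i} - \tilde{y}_{2i} \right) \right]$$

where the quantity $\tilde{y}_{2i}$ represents the unknown sample mean based on $\gamma_i \bar{n}_{2i}$ non-respondents. We may replace $(\bar{n}_{1i} \bar{y}_{1i} + \bar{n}_{2i} \tilde{y}_{2i}) / n_i$ by $\bar{y}_i$, the true sample mean. Thus, for the first part of Madow's theorem the conditional variance $V(\hat{T}_2 \mid s)$ is the sum of the variance contributed by $\bar{y}_i$ and that contributed by $(\bar{y}_{2i} - \tilde{y}_{2i})$, since there is zero covariance between $\bar{y}_i$ and $(\bar{y}_{2i} - \tilde{y}_{2i})$. The condition indicated by $s$ on the left hand side is on a given set of $\gamma_i$. Taking expectation on first phase sampling, then the expected value over $\bar{n}_{2i} \le n_i$ conditional on $\gamma_i$, and finally over $\gamma_i \le m$, the expression reduces to

$$E\left[ V(\hat{T}_2 \mid s) \right] = \frac{1}{m} \sum_{i=1}^{M} \frac{N_i^2}{p_i n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] - \frac{m-1}{m} \sum_{i=1}^{M} N_i \sigma_i^2 \tag{2.34}$$

This completes the first term of Madow's theorem.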
The second term in the theorem yields the following expression:

$$V\left[ E(\hat{T}_2 \mid s) \right] = V\left[ \frac{1}{m} \sum_{i=1}^{M} \frac{\gamma_i Y_i}{p_i} \right] = \frac{1}{m^2} \left[ \sum_{i=1}^{M} \frac{Y_i^2}{p_i^2} V(\gamma_i) + \sum_{i \ne j}^{M} \frac{Y_i Y_j}{p_i p_j} \operatorname{cov}(\gamma_i, \gamma_j) \right]$$

Now it is well known that for multinomial variables:

(a) $V(\gamma_i) = m p_i (1 - p_i)$
(b) $\operatorname{cov}(\gamma_i, \gamma_j) = -m p_i p_j$.

Thus

$$V\left[ E(\hat{T}_2 \mid s) \right] = \frac{1}{m} \left[ \sum_{i=1}^{M} \frac{Y_i^2}{p_i} - T^2 \right]$$

which is the same as

$$\frac{1}{m} \sigma_{by}^2 . \tag{2.35}$$

Adding the formulae (2.34) and (2.35) produces the desired expression (2.33).

2.2.4 An Unbiased Estimator of $V(\hat{T}_2)$. In order to derive an unbiased estimator of the sampling variance of $\hat{T}_2$, consider the mean square between first stage units defined by:

$$(m - 1)\, s_{by}^2 = \sum_{i=1}^{m} \left( \frac{\hat{Y}_i}{p_i} - \hat{T}_2 \right)^2 \tag{2.36}$$

where $\hat{Y}_i = (N_i / n_i)(\bar{n}_{1i} \bar{y}_{1i} + \bar{n}_{2i} \bar{y}_{2i})$. We may rewrite (2.36) in the form

$$(m - 1)\, s_{by}^2 = \sum_{i=1}^{M} \gamma_i \frac{\hat{Y}_i^2}{p_i^2} - m \hat{T}_2^2 .$$

Taking expectations relevant to the second stage and second stage first phase sampling this expression becomes:

(2.37)

Using the relation

$$E(x^2) = [E(x)]^2 + V(x)$$

and noting that $[E(\hat{Y}_i \mid \gamma_i)]^2 = Y_i^2$, (2.37) becomes

(2.38)

Consider the term in $\gamma_i$ which is included in the square brackets on the right hand side of (2.38). It can be written in an alternate form in which the symbol $m'$ indicates summation taken over the different units appearing in the sample. But the above expression can be written

(2.39)

where the sum is taken over all first stage units drawn into the sample and where $\gamma_i$ can only take on values in the range 1 through $m$, with the restriction that $\sum_{i=1}^{m'} \gamma_i = m$.

Substituting (2.39) into (2.38), noting that $E(\gamma_i) = m p_i$, and taking expectations relevant to first stage sampling, the following result is obtained after some algebra.

(2.40)

It is observed that the terms on the right hand side of (2.40) have the following relations to the expressions shown below:

$$\sigma_{by}^2 = \sum_{i=1}^{M} p_i \left( \frac{Y_i}{p_i} - T \right)^2 \tag{2.41}$$

(2.42)

(2.43)

(2.44)

In (2.42) and (2.43), the statistic $s_i^2$ is defined in terms of the statistics given by (2.13), (2.14) and (2.15), respectively.

After substitution of (2.41), (2.42), (2.43) and (2.44) into (2.40) and after rearrangement of terms, the result is

(2.45)

where the operation of expectation is over all relevant steps as above.

Substituting (2.41) in the right hand side of (2.33) defines the left hand side of (2.45). Thus (2.45) can now be written in terms of $m V(\hat{T}_2)$ and $E(s_{by}^2)$, and using (2.43) and (2.44) this may be rewritten, after some algebra, as

(2.46)

Therefore,

(2.47)

is an unbiased estimator of $V(\hat{T}_2)$. This form has a structure similar to the variance estimator given by Sukhatme (1954), page 388, for the case of complete response.

2.3 A Comparison of $V(\hat{T}_1)$ and $V(\hat{T}_2)$ Where the FSUs Are Selected with Equal Probability for Each Case

One case which may have much utility in practice is where the first stage selection probabilities are equal. In such instances, selection schemes featuring replacement of sample units prior to each draw are generally thought to possess a larger variance than non-replacement schemes. This matter will now be considered for the estimators $\hat{T}_1$ and $\hat{T}_2$ given earlier in the chapter. Their variances for arbitrary selection probabilities are given by (2.5) and (2.33), respectively. For the special case of equal selection probability, that is, in effect, where the inclusion probability $P_i$ is equal to $m/M$ and the selection probability $p_i$ is equal to $1/M$, the formulae become:

$$V(\hat{T}_1) = \frac{M}{m} \sum_{i=1}^{M} \frac{N_i^2}{n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] + \frac{M^2}{m} \left( 1 - \frac{m}{M} \right) \sigma_y^2$$

and

$$V(\hat{T}_2) = \frac{M}{m} \sum_{i=1}^{M} \frac{N_i^2}{n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] - \frac{m-1}{m} \sum_{i=1}^{M} N_i \sigma_i^2 + \frac{M^2}{m} \left( 1 - \frac{1}{M} \right) \sigma_y^2 ,$$

respectively.

In order to compare these quantities consider the difference $\Delta = V(\hat{T}_1) - V(\hat{T}_2)$, which is equal to:

$$\Delta = \frac{M^2}{m} \left( 1 - \frac{m}{M} \right) \sigma_y^2 + \frac{m-1}{m} \sum_{i=1}^{M} N_i \sigma_i^2 - \frac{M^2}{m} \left( 1 - \frac{1}{M} \right) \sigma_y^2$$

which reduces to the equation

$$\delta = \left( \frac{m}{m-1} \right) \Delta = \sum_{i=1}^{M} N_i \sigma_i^2 - M \sigma_y^2 \tag{2.48}$$

Now $\delta$ is a quadratic form in $Y_{ij}$, the values of the character being measured, for all individuals in the population, both respondents and non-respondents. If the matrix of such a form is examined according to well known determinantal laws, it can be shown that $\delta$ is an indefinite quadratic form. As such, it may, depending on the particular values of the variables, i.e., $Y_{ij}$, $N_i$ and $M$, be either positive, negative or zero. Thus no general statement as to the relative magnitudes of these variances can be made.

3.0 OPTIMUM SAMPLE ALLOCATION FOR NON-RESPONSE

The first aim of this chapter will be to develop cost functions appropriate to each of the sample designs considered in Chapter 2. These cost functions will then be used in combination with the corresponding variance formulae to achieve optimum allocation of the available resources. The solutions will be given to satisfy the following criteria:

(1) expected cost to be minimized subject to a preassigned precision (variance)
(2) variance to be minimized subject to a preassigned expected cost.

Next an investigation will be made into the comparative efficiencies of:

(1) optimum allocation of resources with complete enumeration at the first phase and,
(2) optimum allocation of resources with enumeration of a subsample of non-respondents at the first phase.

In addition, a set of graphs is presented which simplifies the calculation of optimum allocation values. Finally, the above results will be extended to stratified sample designs.

3.1 A Cost Function

A general cost function may be developed for two-stage one-phase surveys with non-response by considering the variable costs introduced into the sampling program. For the sample designs of the previous chapter, consider the following costs:

(1) the average cost associated with first stage units. This is necessary if the follow-up interview involves personal contact.
(2) the average cost of sending out the first questionnaire or otherwise making the initial contact. This average cost will be allowed to vary from FSU to FSU but, in general, it will usually be constant for mail questionnaires.
(3) the average cost of processing respondents from the initial contact. This will be different for each first stage unit.
(4) the average cost of obtaining and processing responses by personal interview or by other follow-up techniques.

These average costs per schedule for the $i$th FSU are denoted by $c_0$, $c_{1i}$, $c_{2i}$ and $c_{3i}$, respectively. Using these definitions and those outlined in the first chapter we may define the following cost function:

$$C = m c_0 + \sum_{i=1}^{m} n_i c_{1i} + \sum_{i=1}^{m} n_{1i} c_{2i} + \sum_{i=1}^{m} n_{2i} c_{3i} / f_i \tag{3.1}$$

The first term on the right hand side of (3.1) is the total cost associated with the various first stage units selected. The second term is the total cost of the initial mailing (or contact). The third term is the cost of processing the initial responses and the last term represents the total cost of soliciting and processing responses from a subsample of the non-respondents.

As (3.1) stands, $C$ is a random variable since the last two right hand terms vary from survey to survey. To avoid this complication, the expected cost associated with the sampling plan will be considered as the criterion.

If $R_i$ is the average proportion of responding SSUs in the $i$th FSU and if $\bar{R}_i = 1 - R_i$ is the average proportion of non-responding SSUs, then the expected cost for the case of selection without replacement of the FSUs is found to be:

$$C_1 = m c_0 + \sum_{i=1}^{M} P_i n_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} / f_i \right) \tag{3.2}$$

In (3.2), $P_i$ is an inclusion probability. For replacement of each FSU with general selection probabilities $p_i$, the expected cost equation is found to be:

$$C_2 = m c_0 + \sum_{i=1}^{M} m p_i n_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} / f_i \right) \tag{3.3}$$

The problem is to determine $f_i$, $n_i$ $(i = 1, 2, \ldots, M)$ and $m$ such that the expected cost function (e.g., 3.2) is a minimum subject to a fixed variance, or conversely.
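The expected cost functions (3.2) and (3.3) share one form and differ only in the weight attached to each first stage unit. The sketch below is illustrative: the function name `expected_cost` and its argument names are assumptions, and the caller is assumed to pass the inclusion probabilities $P_i$ as `weights` for sampling without replacement, or the products $m p_i$ for sampling with replacement.

```python
from typing import Sequence


def expected_cost(
    m: int,
    c0: float,
    weights: Sequence[float],  # P_i (without replacement) or m * p_i (with replacement)
    n: Sequence[float],        # n_i: second stage sample sizes
    c1: Sequence[float],       # c_1i: cost of initial contact per SSU
    c2: Sequence[float],       # c_2i: cost of processing an initial response
    c3: Sequence[float],       # c_3i: cost of a follow-up response
    R: Sequence[float],        # R_i: expected response proportion
    f: Sequence[float],        # f_i: reciprocal of the recall sampling rate
) -> float:
    """Expected survey cost: m*c0 + sum_i w_i n_i (c1 + R c2 + (1 - R) c3 / f)."""
    cost = m * c0
    for w, ni, a, b, c, r, fi in zip(weights, n, c1, c2, c3, R, f):
        if fi < 1:
            raise ValueError("f_i must be at least 1")
        cost += w * ni * (a + r * b + (1.0 - r) * c / fi)
    return cost


# Two FSUs in the frame, one drawn with equal inclusion probabilities,
# complete recall (f_i = 1): each contacted SSU costs 0.5 + 0.5*1 + 0.5*10 = 6.
demo_cost = expected_cost(
    m=1, c0=5.0, weights=[0.5, 0.5], n=[4, 6],
    c1=[0.5, 0.5], c2=[1.0, 1.0], c3=[10.0, 10.0], R=[0.5, 0.5], f=[1.0, 1.0],
)
```

Raising $f_i$ above one lowers only the follow-up component of the cost, which is the trade-off the allocation problem balances against the added variance.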
For reasons given in the introduction to Section 3.4, the sampling plan involving the selection of FSUs with replacement will be considered first.

3.2 Sampling with Replacement at the First Stage

3.2.1 Fixed Variance and Minimum Cost. Consider the problem of determining the optimum allocation for the sample design described in Section 2.2. The estimator of the population total, $T$, is the estimator $\hat{T}_2$ of (2.28):

(3.4)

For this estimator the appropriate sampling variance was found to be that given in (2.33):

(3.5)

To find values of $n_i$, $f_i$, and $m$ such that the expected cost function is a minimum subject to a fixed variance, $V(\hat{T}_2) = V_{20}$, the method of Lagrange can be used. However, to use the Lagrange method it is necessary to evaluate the definiteness of a matrix of order $(2M+2)$ in order to ascertain the nature of the stationary solution, i.e., maximum, minimum or neither. Because of its generality as well as its ease of application, an alternative procedure which makes use of the Cauchy Inequality will be used in all derivations of this nature. This approach is described in detail by Stuart (1954).

The Cauchy Inequality may be written:

$$\left( \sum_i a_i^2 \right) \left( \sum_i b_i^2 \right) \ge \left( \sum_i a_i b_i \right)^2 \tag{3.6}$$

where $a_i$ and $b_i$ are real numbers. Furthermore, equality in (3.6) (being necessarily a minimum of the left hand side) is attained if and only if

$$\frac{a_i}{b_i} = K = \text{constant for all } i .$$

To make use of this inequality, identify the term $\left( \sum_i a_i^2 \right)$ with the variance, $V(\hat{T}_2)$, and the term $\left( \sum_i b_i^2 \right)$ with the expected cost function $C_2$. If the product $[V(\hat{T}_2)][C_2]$ is minimized by insuring certain relationships among the allocation constants, then exact solutions may be found by requiring, in this instance, that $V(\hat{T}_2) = V_{20}$. This requirement is met by suitable definition of $K$, the constant of proportionality given above.

To apply this technique rearrange the formula for $V(\hat{T}_2)$ and write the expression:

$$V(\hat{T}_2) + \sum_{i=1}^{M} N_i \sigma_i^2 = \frac{1}{m} \left[ \sigma_{by}^2 - \sum_{i=1}^{M} \frac{1 - p_i}{p_i} N_i \sigma_i^2 \right] + \sum_{i=1}^{M} \frac{N_i^2}{m n_i p_i} \left[ \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right] + \sum_{i=1}^{M} \frac{f_i N_i^2}{m n_i p_i} \bar{R}_i \sigma_{2i}^2 \tag{3.7}$$

The corresponding expected cost function in this sampling situation is given by (3.3) and it may be rewritten as follows:

$$C_2 = m c_0 + \sum_{i=1}^{M} m n_i p_i \left( c_{1i} + R_i c_{2i} \right) + \sum_{i=1}^{M} \frac{m n_i p_i}{f_i} \bar{R}_i c_{3i} \tag{3.8}$$

Now define the following relationships between the terms in equations (3.7) and (3.8) and the $a_i$ and $b_i$ of (3.6):

$$a_0^2 = \frac{1}{m} \left[ \sigma_{by}^2 - \sum_{i=1}^{M} \frac{1 - p_i}{p_i} N_i \sigma_i^2 \right], \quad a_i^2 = \frac{N_i^2}{m n_i p_i} \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right), \quad a_{(i+M)}^2 = \frac{f_i N_i^2}{m n_i p_i} \bar{R}_i \sigma_{2i}^2 , \quad i = 1, 2, \ldots, M \tag{3.9}$$

and

$$b_0^2 = m c_0 , \quad b_i^2 = m n_i p_i \left( c_{1i} + R_i c_{2i} \right), \quad b_{(i+M)}^2 = \frac{m n_i p_i}{f_i} \bar{R}_i c_{3i} , \quad i = 1, 2, \ldots, M . \tag{3.10}$$

Applying the conditions required for a minimum produces the following equations:

$$\frac{a_0}{b_0} = \frac{1}{m} \left[ \frac{ \sigma_{by}^2 - \sum_{i=1}^{M} \frac{1 - p_i}{p_i} N_i \sigma_i^2 }{ c_0 } \right]^{1/2} = K \tag{3.11}$$

$$\frac{a_i}{b_i} = \frac{N_i}{m n_i p_i} \left[ \frac{ \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 }{ c_{1i} + c_{2i} R_i } \right]^{1/2} = K \tag{3.12}$$

$$\frac{a_{(i+M)}}{b_{(i+M)}} = \frac{f_i N_i}{m n_i p_i} \left[ \frac{ \sigma_{2i}^2 }{ c_{3i} } \right]^{1/2} = K , \quad i = 1, 2, \ldots, M . \tag{3.13}$$

These equations are readily solved for the variables $m$, $m n_i$, and $f_i / m n_i$ in terms of $K$. The solutions may then be substituted into the right hand side of (3.7). The constant $K$ is thus given by the relation:

(3.14)

Eliminating $m n_i K$ from equations (3.12) and (3.13) yields the solutions:

$$f_i = \left[ \frac{ c_{3i} \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \sigma_{2i}^2 \left( c_{1i} + c_{2i} R_i \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M . \tag{3.15}$$

Eliminating $mK$ from equations (3.11) and (3.12) yields the solutions:

$$n_i = \frac{N_i}{p_i} \left[ \frac{ c_0 \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \left( c_{1i} + c_{2i} R_i \right) \left( \sigma_{by}^2 - \sum_{j=1}^{M} \frac{1 - p_j}{p_j} N_j \sigma_j^2 \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M . \tag{3.16}$$

Using the value of $K$ given by (3.14) in equation (3.11) produces the result:

(3.17)

These results are identical to those found by use of the Lagrange method. But in addition, it is assured that these give a minimum expected cost subject to the fixed variance, $V_{20}$.

The minimum expected cost is now calculable as a function of the fixed variance $V_{20}$. Substituting $K$, from (3.14), and the equations (3.15) through (3.17) into the formula for $C_2$ gives for the minimum

(3.18)

It should be noted that the solutions given by equations (3.15), (3.16), (3.17), and (3.18) can only be evaluated by use of a priori information regarding the population, i.e., $\sigma_i^2$, $\sigma_{2i}^2$, $R_i$, etc.

3.2.2 Fixed Cost and Minimum Variance. There are instances where the cost of a survey may be specified in advance. Under these conditions the sample design, and in particular the over-all sample size, can be so chosen as to insure a minimum variance.
This section gives these solutions for the sample design outlined in Section 2.2. The solutions found by applying the Cauchy Inequality are:

$$f_i = \left[ \frac{ c_{3i} \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \sigma_{2i}^2 \left( c_{1i} + c_{2i} R_i \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M , \tag{3.19}$$

$$n_i = \frac{N_i}{p_i} \left[ \frac{ c_0 \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \left( c_{1i} + c_{2i} R_i \right) \left( \sigma_{by}^2 - \sum_{j=1}^{M} \frac{1 - p_j}{p_j} N_j \sigma_j^2 \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M , \tag{3.20}$$

(3.21)

where $C_0$ is the predetermined expected cost. It appears that only the solution for $m$, the number of first stage units to be selected, is affected by the constraint.

The variance, under the constraint that the expected cost be equal to $C_0$, is found to be:

(3.22)

3.3 Comments on the Optimum Solutions

The most important consideration required to make the entire procedure outlined above applicable is that:

$$\sigma_{by}^2 - \sum_{i=1}^{M} \frac{1 - p_i}{p_i} N_i \sigma_i^2 > 0 . \tag{3.23}$$

In the general case, there is no assurance that this will hold since the left hand side of (3.23) is an indefinite quadratic form. Thus, it must be evaluated in each particular application.

Consider the solutions for $f_i$ which may be put in the form:

$$f_i = \left[ \frac{ c_{3i} \left( \sigma_i^2 / \sigma_{2i}^2 - \bar{R}_i \right) }{ c_{1i} + c_{2i} R_i } \right]^{1/2} , \quad i = 1, 2, \ldots, M .$$

It is interesting to note that these solutions do not depend on the size of the first stage units, on the number of first stage units selected, nor on cost or variance criteria. However, since $f_i$ is the reciprocal of the sampling rate for FSU $i$, the solutions must be such that $f_i \ge 1$. This implies that

$$\frac{\sigma_i^2}{\sigma_{2i}^2} \ge \bar{R}_i + \frac{ c_{1i} + c_{2i} R_i }{ c_{3i} } \tag{3.25}$$

must hold for all $i$. However, this condition does not seem too restrictive, since usually the right hand side of (3.25) is less than one from the fact that $c_{3i} \gg c_{1i}$ and $c_{3i} > c_{2i}$ in practice, and the ratio of the over-all within FSU variance to the within FSU non-response variance is usually one or greater. Only in those instances where the tendency for non-response occurs at both extremes of the response range will the variance of non-respondents tend to be greater than the over-all variance. This might occur if the question involved, say, income and individuals earning either very high or very low salaries were reluctant to respond.
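The solution for $f_i$ and the accompanying requirement $f_i \ge 1$ can be evaluated directly. The following is a minimal sketch under assumptions: the function names are hypothetical, `var_ratio` stands for $\sigma_i^2 / \sigma_{2i}^2$, and the cost and response figures in the demonstration are illustrative values chosen so that follow-up is much more expensive than initial contact.

```python
import math


def optimum_recall_factor(c1: float, c2: float, c3: float,
                          R: float, var_ratio: float) -> float:
    """f_i = sqrt( c3 * (var_ratio - (1 - R)) / (c1 + c2 * R) ).

    var_ratio is sigma_i^2 / sigma_2i^2. A returned value below 1 means the
    unconstrained solution is not feasible as a recall factor.
    """
    r_bar = 1.0 - R
    if var_ratio <= r_bar:
        raise ValueError("sigma_i^2 - R_bar * sigma_2i^2 must be positive")
    return math.sqrt(c3 * (var_ratio - r_bar) / (c1 + c2 * R))


def cost_per_contacted_ssu(c1: float, c2: float, c3: float,
                           R: float, f: float) -> float:
    """Expected cost per initially contacted SSU: c1 + R*c2 + (1 - R)*c3/f."""
    return c1 + R * c2 + (1.0 - R) * c3 / f


# Illustrative values: c1 = 0.5, c2 = 1, c3 = 10, R = 0.5, equal variances.
f_opt = optimum_recall_factor(c1=0.5, c2=1.0, c3=10.0, R=0.5, var_ratio=1.0)
unit_cost_subsample = cost_per_contacted_ssu(0.5, 1.0, 10.0, 0.5, f_opt)
unit_cost_full = cost_per_contacted_ssu(0.5, 1.0, 10.0, 0.5, 1.0)
```

With these figures $f_i = \sqrt{5} \approx 2.236$, so roughly one non-respondent in 2.2 is recalled, and the expected cost per contacted unit falls from 6 under complete recall to about 3.236.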
Solutions are optimum, in the uncomplicated case, if the constant $K$, determined by a restriction on either the cost or variance, provides feasible solutions in the sense that the following inequalities are satisfied:

$$m \le M , \qquad n_i \le N_i , \qquad 1 \le f_i , \qquad i = 1, 2, \ldots, M . \tag{3.28}$$

In general, nothing can be done to modify the solutions $f_i$ and $n_i$ in the event that (3.28) is not satisfied. Only by amending the frame which, in effect, changes the population values $\sigma_i^2$, $Y_i$, $\sigma_{2i}^2$, etc., can (3.28) be satisfied, if at all.

Some situations may also arise where the constant $K$ gives a solution for $m$ which is not possible, i.e., $m > M$ for sampling FSUs without replacement. In this case, acceptable results may still be achieved. The solution provided by (3.35) below requires that:

$$m K = \left[ \frac{ M \left( M \sigma_y^2 - \sum_{i=1}^{M} N_i \sigma_i^2 \right) }{ c_0 } \right]^{1/2} \tag{3.29}$$

If this requires an $m > M$, set $m = M$ and redefine $K = K_1$ such that $M K_1$ equals the right hand side of (3.29). Thus (3.29) is satisfied. This has the effect of nullifying the restriction used to determine the original value, $K$. The effect of such a change on the expected cost may be evaluated from (3.33) and using the relation:

$$V(\hat{T}_1) + M \sigma_y^2 = K^2 C \tag{3.30}$$

where $K = K_1$ and $C$ = computed expected cost, to determine the variance. A judgement as to the acceptability of this precision and/or cost can then be made with the assurance that for this cost, precision is a maximum and, for this precision, cost is a minimum.

In the numerical example given in Section 3.6 the situation discussed above does, in fact, arise. The suggested procedure is followed and an adequate sampling program derived.

3.4 Sampling without Replacement at the First Stage with Equal First Stage Inclusion Probabilities

There is difficulty in applying the methods of Section 3.2 to the general case of sampling without replacement at the first stage. This is due to the fact that, in general, the inclusion probabilities, $P_i$, are functions of $m$.
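Since $m$ itself is to be determined in the procedure, explicit optimum allocation solutions cannot be obtained unless the functional relationship between $P_i$ and $m$ is known. The optimum solutions for $f_i$ $(i = 1, 2, \ldots, M)$ remain unchanged since the inclusion probabilities do not enter into the solutions. However, the $n_i$ and $m$ remain undetermined in the general case.

Because it is commonly used in practice, the special case of equal probability selection of the first stage units with the estimator (2.1) is considered in this section. Assuming that the inclusion probabilities are equal implies

$$P_i = \frac{m}{M} .$$

Substituting this relation into the appropriate formulae gives:

$$\hat{T}_1 = \frac{M}{m} \sum_{i=1}^{m} \frac{N_i}{n_i} \left[ \sum_{j=1}^{n_{1i}} y_{1ij} + \frac{n_{2i}}{h_{2i}} \sum_{k=1}^{h_{2i}} y_{2ik} \right] , \tag{3.31}$$

$$V(\hat{T}_1) = \frac{M}{m} \sum_{i=1}^{M} \frac{N_i^2}{n_i} \left[ \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \bar{R}_i (f_i - 1) \sigma_{2i}^2 \right] + \frac{M^2}{m} \left( 1 - \frac{m}{M} \right) \sigma_y^2 , \tag{3.32}$$

$$C_1 = m c_0 + \frac{m}{M} \sum_{i=1}^{M} n_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} / f_i \right) , \tag{3.33}$$

where

$$\sigma_y^2 = \frac{ M \sum_{i=1}^{M} Y_i^2 - \left( \sum_{i=1}^{M} Y_i \right)^2 }{ M (M - 1) } .$$

These formulae represent the estimator, its variance and the expected cost, respectively. Application of the Cauchy Inequality produces the following results:

$$f_i = \left[ \frac{ c_{3i} \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \sigma_{2i}^2 \left( c_{1i} + c_{2i} R_i \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M \tag{3.34}$$

$$n_i = N_i \left[ \frac{ c_0 M \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) }{ \left( c_{1i} + c_{2i} R_i \right) \left( M \sigma_y^2 - \sum_{j=1}^{M} N_j \sigma_j^2 \right) } \right]^{1/2} , \quad i = 1, 2, \ldots, M \tag{3.35}$$

(3.36)

for the case where $V(\hat{T}_1)$ is fixed at $V_{10}$ and the expected cost minimized. Equation (3.36) is the corresponding solution for $m$. The expected cost in this situation is found to be:

(3.37)

As in the case of sampling with replacement, the equations (3.34) and (3.35) also hold for the case of fixed expected cost equal to $C_0$ where the variance is to be minimized. In this case the equation for $m$ is:

$$m = \frac{ C_0 }{ c_0 + \left[ \dfrac{ c_0 }{ M \left( M \sigma_y^2 - \sum_{j} N_j \sigma_j^2 \right) } \right]^{1/2} \sum_{i=1}^{M} N_i \left\{ \left[ \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \left( c_{1i} + c_{2i} R_i \right) \right]^{1/2} + \bar{R}_i \sigma_{2i} \sqrt{c_{3i}} \right\} } \tag{3.38}$$

and the variance for cost fixed at $C_0$ is found to be:

$$V(\hat{T}_1) = \frac{1}{C_0} \left[ \left\{ c_0 M \left( M \sigma_y^2 - \sum_{i} N_i \sigma_i^2 \right) \right\}^{1/2} + \sum_{i=1}^{M} N_i \left\{ \left[ \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \left( c_{1i} + c_{2i} R_i \right) \right]^{1/2} + \bar{R}_i \sigma_{2i} \sqrt{c_{3i}} \right\} \right]^2 - M \sigma_y^2 \tag{3.39}$$

3.5 Relative Efficiency

In order to eliminate bias the sample schemes thus far discussed have employed the technique of recall on some subsample of initial non-respondents. Another, and possibly more common, technique used in personal interview surveys is to attempt to enumerate all initial non-respondents. The relative efficiency of these proposed techniques can be assessed for any estimator. Let us consider the estimator given by formula (3.31).

Suppose a sample survey is conducted as follows: $m$ FSUs are selected from $M$ without replacement, the $i$th FSU being drawn with inclusion probability $m/M$. From the $i$th selected FSU, $n_i$ SSUs are chosen from among $N_i$ without replacement and with equal probability. To each of the selected second stage units questionnaires are mailed and after a fixed time period all non-responding units are personally interviewed. The recall rate is $1/f_i = 1$.

The expected cost function associated with this plan is assumed to be:

$$C_3 = m c_0 + \frac{m}{M} \sum_{i=1}^{M} n_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right) \tag{3.40}$$

It should be noted that $f_i = 1$, substituted into equation (3.33), gives this equation. The variance under this plan is given by:

$$V(\hat{T}_1) = \frac{M}{m} \sum_{i=1}^{M} \frac{N_i^2}{n_i} \left( 1 - \frac{n_i}{N_i} \right) \sigma_i^2 + \frac{M^2}{m} \left( 1 - \frac{m}{M} \right) \sigma_y^2 \tag{3.41}$$

Assuming that the expected cost is set equal to $C_{30}$, the optimum number of SSUs to be selected from the $i$th first stage unit is given by:

$$n_i = N_i \sigma_i \left[ \frac{ c_0 M }{ \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right) \left( M \sigma_y^2 - \sum_{j} N_j \sigma_j^2 \right) } \right]^{1/2} \tag{3.42}$$

and the optimum number of first stage units to choose is given by:

$$m = \frac{ C_{30} }{ c_0 + \left[ \dfrac{ c_0 }{ M \left( M \sigma_y^2 - \sum_{j} N_j \sigma_j^2 \right) } \right]^{1/2} \sum_{i=1}^{M} N_i \sigma_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right)^{1/2} } \tag{3.43}$$

Substitution of these results into the formula for the variance (3.41) gives the following variance for this plan:

$$V(\hat{T}_1) = \frac{1}{C_{30}} \left[ \left\{ c_0 M \left( M \sigma_y^2 - \sum_{i} N_i \sigma_i^2 \right) \right\}^{1/2} + \sum_{i=1}^{M} N_i \sigma_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right)^{1/2} \right]^2 - M \sigma_y^2 \tag{3.44}$$

Now an alternate plan utilizing subsampling of non-response with the subsampling rate in FSU $i$ given by $1/f_i$ would give the results found in Section 3.2.2 for fixed cost. The appropriate allocation formulae are given by (3.34), (3.35) and (3.38) where $C_0$ is replaced by $C_{30}$. The variance under this plan is given by (3.39), again with $C_0$ replaced by $C_{30}$. This formula is:

$$V(\hat{T}_1) = \frac{1}{C_{30}} \left[ \left\{ c_0 M \left( M \sigma_y^2 - \sum_{i} N_i \sigma_i^2 \right) \right\}^{1/2} + \sum_{i=1}^{M} N_i \left\{ \left[ \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \left( c_{1i} + c_{2i} R_i \right) \right]^{1/2} + \bar{R}_i \sigma_{2i} \sqrt{c_{3i}} \right\} \right]^2 - M \sigma_y^2 \tag{3.45}$$

Accepting the criterion that the more efficient estimator is that possessing the smaller variance, we find on comparing (3.44) and (3.45) that they differ only in the second term in brackets. For the sampling plan requiring recall in the entire non-responding group, this term is

$$A = \sum_{i=1}^{M} N_i \sigma_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right)^{1/2}$$

while the subsampling scheme has the term

$$B = \sum_{i=1}^{M} N_i \left\{ \left[ \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \left( c_{1i} + c_{2i} R_i \right) \right]^{1/2} + \bar{R}_i \sigma_{2i} \sqrt{c_{3i}} \right\} .$$

Thus if it can be shown that $A > B$ then the sampling plan with partial recall is more efficient than the one calling for complete recall.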
Consider the following argument: if the general term in the aggregate composing $A$ can be shown to be greater than the corresponding term of $B$, then $A > B$. That is, it is required to show that

$$\sigma_i \left( c_{1i} + R_i c_{2i} + \bar{R}_i c_{3i} \right)^{1/2} > \left[ \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \left( c_{1i} + c_{2i} R_i \right) \right]^{1/2} + \bar{R}_i \sigma_{2i} \sqrt{c_{3i}} .$$

Now both sides of this expression are positive and so is the term $(\sigma_i^2 - \bar{R}_i \sigma_{2i}^2)$, as can be seen from the demonstrations (3.25) and (3.26). Thus it is sufficient to show that the square of the left hand side exceeds the square of the right hand side which, if true, implies

$$\bar{R}_i \left[ \sigma_{2i} \left( c_{1i} + c_{2i} R_i \right)^{1/2} - \left\{ c_{3i} \left( \sigma_i^2 - \bar{R}_i \sigma_{2i}^2 \right) \right\}^{1/2} \right]^2 > 0 . \tag{3.47}$$

But it is obvious that (3.47) is a perfect square, and it is positive provided the two terms inside the square are unequal, i.e., provided the optimum $f_i$ differs from unity. Hence, $A > B$ as was to be proven. Similar results can be shown to hold for the estimator discussed in Section 3.2.1.

3.6 Application of Theory to Data

In order to illustrate the application of the theory developed above, consider two alternative sampling designs as applied to the numerical data given in Table 3. It was shown above that a plan calling for subsampling of non-respondents was more efficient than a plan calling for complete coverage of the non-respondents. The extent of this increased efficiency will be indicated by the example.

Table 3. Population data

FSU     N_i     Y_i     N_i σ_i²     N_i σ_i
  1      17      73       394.75      81.919
  2      14      21        40.38      23.777
  3      18      44       241.88      65.984
  4      19      62       299.44      75.428
  5      18      42        69.88      35.466
  6      16      72       458.67      85.666
  7      12      25        75.18      30.035
  8      19      39       112.89      46.313
  9      12      84       597.82      84.698
 10      13      48       627.00      90.283
 11      20     101     1,293.63     160.85
 12      10      41       347.60      58.958
 13      18      69       282.18      71.270
 14      23      70       241.45      74.520
 15      12      43       234.45      53.041
 16      16      52       265.60      65.189
 17      18      44       140.24      50.242
 18      19      74       303.78      75.972
 19      14      45       144.69      45.007
 20      15      54       134.57      44.928
 21      13      58       250.50      57.066
 22       4       4         0           0
 23      15      56       116.71      41.839
 24      16      64       275.20      66.356
 25      23     158       198.91      67.638
TOTAL   394   1,443     7,147.40   1,552.445

The data in Table 3 are from a population of 25 first stage sampling units. The number of second stage sampling units ($N_i$) in each, the total for the characteristic ($Y_i$), the product $N_i \sigma_i^2$, and $N_i \sigma_i$ are given in columns 2, 3, 4 and 5, respectively.
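The columns of Table 3 can be cross-checked against one another, since $(N_i \sigma_i)^2 / N_i$ must reproduce $N_i \sigma_i^2$ up to rounding. The sketch below is an illustrative check only; the variable names are assumptions. It also evaluates $\sigma_y^2$ from the definition given with (3.32) and the quantity $M \sigma_y^2 - \sum N_i \sigma_i^2$ that enters the allocation formulae.

```python
# Table 3 columns, FSUs 1 through 25.
N = [17, 14, 18, 19, 18, 16, 12, 19, 12, 13, 20, 10, 18,
     23, 12, 16, 18, 19, 14, 15, 13, 4, 15, 16, 23]
Y = [73, 21, 44, 62, 42, 72, 25, 39, 84, 48, 101, 41, 69,
     70, 43, 52, 44, 74, 45, 54, 58, 4, 56, 64, 158]
N_sigma2 = [394.75, 40.38, 241.88, 299.44, 69.88, 458.67, 75.18, 112.89,
            597.82, 627.00, 1293.63, 347.60, 282.18, 241.45, 234.45, 265.60,
            140.24, 303.78, 144.69, 134.57, 250.50, 0.0, 116.71, 275.20, 198.91]
N_sigma = [81.919, 23.777, 65.984, 75.428, 35.466, 85.666, 30.035, 46.313,
           84.698, 90.283, 160.85, 58.958, 71.270, 74.520, 53.041, 65.189,
           50.242, 75.972, 45.007, 44.928, 57.066, 0.0, 41.839, 66.356, 67.638]

M = len(N)

# Row check: (N_i sigma_i)^2 / N_i should agree with N_i sigma_i^2 up to rounding.
max_row_gap = max(abs(ns * ns / n - ns2)
                  for n, ns, ns2 in zip(N, N_sigma, N_sigma2))

# Between-FSU mean square of the unit totals, as defined with (3.32).
sum_y = sum(Y)
sum_y2 = sum(y * y for y in Y)
sigma_y2 = (M * sum_y2 - sum_y ** 2) / (M * (M - 1))

# Quantity that must be positive for the optimum solutions to exist.
between_minus_within = M * sigma_y2 - sum(N_sigma2)
```

For these data the quantity $M \sigma_y^2 - \sum N_i \sigma_i^2$ is positive, so the square roots in (3.35) and (3.38) are real.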
In addition, the following costs, inclusion probabilities, variances and non-response rates will be assumed:

$C_0 = 500$, $c_0 = 5$,
$P_i = 1/25$, $c_{1i} = .5$,
$R_i = R = .5$, $c_{2i} = 1$,
$\sigma_i = \sigma_{2i}$, $c_{3i} = 10$.

It is assumed that the cost and response rates are identical for all first stage units. This simplifies considerably the computational work while illustrating the efficiency increase, even in such a special case. It is also assumed that the variance among all SSUs is the same as the variance among non-respondents.

Plan I. Complete recall at first phase. The estimator of the total is $\hat T_1$. For optimum allocation of sampling effort the following formulae are appropriate:

[equations not recoverable]

Using the given data we find that:

$\sum_{i=1}^{25} Y_i = 1{,}443$, $\sum_{i=1}^{25} N_i\sigma_i = 1{,}552.44$, $\alpha = 14{,}455.766$, $\beta = 3{,}802.67$.

Therefore:

[equation not recoverable]

Now, $m = 26.117 > 25$, which is not a feasible solution. Let us therefore redefine $\lambda$ by setting $m = 25$. From the equation:

[equation not recoverable]

we find that $\lambda = 10.7539$. Using this in the formula for $n_i$ we find that

$$n_i = .037962\,N_i\sigma_i .$$

Checking with the $N_i$ in Table 3, one finds that all solutions for $n_i$ are feasible (i.e., $n_i < N_i$) except for the fact that they must be rounded to integral values. The solutions as found can be used to check the new expected cost caused by the change in $\lambda$:

$$C_0 = (25)(5) + (.037962)(1{,}552.44)(6) = 125 + 353.60 = 478.60 .$$

Thus the cost is not too much altered from 500. We may now compute the variance of $\hat T_1$ to be expected under this plan from the equation:

$$V = \lambda^2 C_0 - \sum_{i=1}^{25} N_i\sigma_i^2 .$$

Plan II. Subsampling at first phase. For optimum allocation with the same cost and response rates as for Plan I, we must utilize the following equations:

[equations not recoverable]

For the data given we find, in addition to the totals found in Plan I, the value

$$\beta = 2.28826\sum_{i=1}^{M} N_i\sigma_i = 3{,}552.386 .$$

Therefore:

$$m = \frac{500}{5 + 13.2134} = 27.4523 .$$

This, too, is not feasible. Therefore, recompute $\lambda$ as before; for $m = 25$, $\lambda = 10.7539$.
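The allocation arithmetic of this example can be retraced numerically. The following is a sketch under stated assumptions, not the author's computation: the formulas for $m$, $\lambda$ and the $n_i$ coefficients are reconstructions chosen because they reproduce the printed figures, $\alpha$ is read as $M\sigma_y^2 - \sum N_i\sigma_i^2$, the variance is read as $\lambda^2 C_0 - \sum N_i\sigma_i^2$, $R$ is treated as the non-response rate (at $R = .5$ the distinction does not matter), and all variable names are hypothetical.

```python
import math

# Totals Y_i for the 25 FSUs as read from Table 3 (only their sum and sum of
# squares matter here, so the row order is immaterial).
Y = [73, 21, 44, 62, 42, 72, 25, 39, 84, 48, 101, 41, 69,
     70, 43, 52, 44, 74, 45, 54, 58, 4, 56, 64, 158]
SUM_N_SIGMA = 1552.445    # sum of N_i * sigma_i
SUM_N_SIGMA2 = 7147.40    # sum of N_i * sigma_i^2
M = 25                    # FSUs in the population
BUDGET, c0, c1, c2, c3 = 500.0, 5.0, 0.5, 1.0, 10.0
R = 0.5                   # assumed to be the non-response rate

# Assumed reading: alpha = M * sigma_y^2 - sum(N_i * sigma_i^2).
mean_y = sum(Y) / M
sigma_y2 = sum((y - mean_y) ** 2 for y in Y) / (M - 1)
alpha = M * sigma_y2 - SUM_N_SIGMA2

k_mail = c1 + c2 * (1 - R)              # mailing plus processing cost per SSU
f = math.sqrt(c3 * (1 - R) / k_mail)    # optimum f_i when sigma_i = sigma_2i

# Cost-variance factors per unit of N_i * sigma_i.
a_full = math.sqrt(k_mail + c3 * R)                       # Plan I, complete recall
b_sub = math.sqrt(k_mail * (1 - R)) + R * math.sqrt(c3)   # Plan II, subsampling
beta_1, beta_2 = a_full * SUM_N_SIGMA, b_sub * SUM_N_SIGMA


def unconstrained_m(beta):
    """Optimum m for a fixed expected cost, ignoring the bound m <= M."""
    return BUDGET / (c0 + math.sqrt(c0 / (M * alpha)) * beta)


m_1, m_2 = unconstrained_m(beta_1), unconstrained_m(beta_2)

# Both values exceed M, so m is set to M and lambda recomputed.
m = M
lam = math.sqrt(M * alpha / c0) / m

# n_i = coef * N_i * sigma_i under each plan.
coef_1 = (M / m) / (lam * a_full)
coef_2 = (M / m) * math.sqrt(1 - R) / (lam * math.sqrt(k_mail))

cost_1 = m * c0 + (m / M) * coef_1 * SUM_N_SIGMA * (k_mail + c3 * R)
cost_2 = m * c0 + (m / M) * coef_2 * SUM_N_SIGMA * (k_mail + c3 * R / f)

# Assumed reading of the variance line: V = lambda^2 * cost - sum(N_i * sigma_i^2).
var_1 = lam ** 2 * cost_1 - SUM_N_SIGMA2
var_2 = lam ** 2 * cost_2 - SUM_N_SIGMA2
efficiency = (cost_1 * var_1) / (cost_2 * var_2)

print(round(m_1, 3), round(m_2, 4), round(lam, 4))
print(round(cost_1, 2), round(cost_2, 2), round(efficiency, 3))
```

Under these readings the recomputed values agree with the printed ones to within rounding; the Plan II cost and the efficiency ratio are the quantities reported later in this section.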
Subject to this new constraint, we find that

$$n_i = .06575\,N_i\sigma_i .$$

It should be noted that the number of second stage units required to contact initially in this plan is greater than for Plan I. Also, all these solutions are feasible subject to rounding to the nearest integral value. Similarly we may solve for $f_i$ as required. Thus all conditions are satisfied simultaneously. Computing the expected cost of this plan we find

$$E(C) = C_0 = (25)(5) + (.06575)(1{,}552.44)(3.23606) = 125 + 330.31 = 455.31 .$$

Again this is less than the 500 specified. The variance of $\hat T_1$ for this plan is

[value not recoverable]

To compare the two plans we shall use the inverse ratio of the products (Cost)(Variance) for the two plans. Thus the degree of efficiency, in the sense of this problem, is given by

[ratio not recoverable]

or, in percentage terms, an 11 per cent increase in efficiency is achieved by sub-sampling among the non-respondents.

It should be pointed out that these calculations have not taken into account the effects of rounding off $n_i$ and $f_i$ to usable values. Assuming a rounding scheme allowing for both upward and downward changes, the compensation should not affect the results to any significant degree.

3.7 Extension to Stratified Case

The theory derived earlier in Section 2.2 will be extended to stratified populations. Also, the Cauchy Inequality will be applied to the corresponding optimum allocation theory.

Consider an extension of the results of Section 2.2. Suppose that the population now consists of $S$ strata. The $k$-th stratum contains $M_k$ first stage units. The sampling plan consists of selecting $m_k$ FSUs from stratum $k$ with replacement and with selection probability $P_{ki}$ for the $i$-th first stage unit. At the second stage one selects $n_{ki}$ second stage sampling units (stratum $k$, FSU $i$) from among $N_{ki}$ with equal probability and without replacement.
Consider the estimator

$$\hat T_{s2} = \sum_{k=1}^{S}\hat T_{2k} = \sum_{k=1}^{S}\frac{1}{m_k}\sum_{i=1}^{m_k}\frac{N_{ki}}{P_{ki}\,n_{ki}}\left[n_{1ki}\,\bar y_{1ki} + n_{2ki}\,\bar y_{2ki}\right]$$

which is clearly unbiased, since it is merely the sum of individual unbiased estimators of strata totals. The variance of this estimator is constructed similarly and, by extending (2.33), is found to be

[equation not recoverable]

since sampling in each stratum is performed independently.

The corresponding expected cost equation is written as

[equation (3.50) not recoverable]

Applying the Cauchy Inequality it is found that the solutions guaranteeing a minimum of the quantity

[expression (3.51) not recoverable]

must satisfy the set of equations

[equations (3.52) and (3.53) and a third set, each equated to $\lambda_1$ for $i = 1,2,\ldots,M_k$ and $k = 1,2,\ldots,S$; not recoverable]

where $\lambda_1$ is the constant determined by the particular restriction imposed. For the restriction

[restriction not recoverable]

the constant $\lambda_1$ is determined by the equation

[equation not recoverable]

where

[definition (3.56) not recoverable]

Optimum solutions for $f_{ki}$ and $n_{ki}$ are given by:

[equations not recoverable], $i = 1,2,\ldots,M_k$, $k = 1,2,\ldots,S$. (3.58)

Substitution of $\lambda_1$ into equation (3.52) and use of identities (3.55) and (3.56) gives:

[equation for $m_k$ not recoverable], $k = 1,2,\ldots,S$. (3.59)

The cost of attaining this variance is found to be

[equation (3.60) not recoverable]

Similar solutions are obtained when the restriction is made on the expected cost as, for example, $C_{s2} = C_{s20}$. In this case the formulae (3.59) and (3.60), respectively, become

[equations not recoverable] (3.61)

The procedure is extended in a similar manner for sampling without replacement and with equal inclusion probabilities. These results can easily be written down using the appropriate formulae from earlier sections of this chapter. These equations will not be reproduced here.

3.8 Some Graphic Solutions for Optimum Recall Rates
It has been shown that the recall rate $f_i^{-1}$ in FSU $i$ (a second subscript is implicit in stratified cases) is independent of over-all cost or variance constraints. Write the equation for the recall rate as:

[equation not recoverable]

where $T_i = \sigma_i^2/\sigma_{2i}^2$, which is a linear function of the relative variable costs at the various phases of sampling. Further, for instances in which the initial contact is via mail and the recall is on a personal interview basis, usually,

[relation not recoverable]

In practice these relative costs ordinarily will lie in the ranges

[ranges not recoverable]

respectively.

If, in addition, we can specify the approximate initial response rate ($R_i$) and the ratio of the variances ($T_i$), simple linear graphs can be used to provide rapid solutions for $f_i^{-1}$. Having these values and writing the formulae for $n_i$ and $m$ in terms of $f_i^{-1}$ lead to rapid solutions for the allocation constants.

Sample graphs providing contours of equal $f_i^{-1}$ for $T = 1$ and $R_i = .25$, $.50$ and $.75$ are shown in Figures 1, 2 and 3, respectively. Since the appropriate modifications necessary to express $n_i$ and $m$ as functions of $f_i$ are relatively straightforward, such modified forms will not be included here.

[Figure 1. Contours of $f_i^{-1}$ for $T = 1$ and $R_i = .25$]

[Figure 2. Contours of $f_i^{-1}$ for $T = 1$ and $R_i = .50$]

[Figure 3. Contours of $f_i^{-1}$ for $T = 1$ and $R_i = .75$]

4.0 TWO STAGE NON-RESPONSE THEORY EXTENDED TO MULTIPHASE SAMPLING

In this chapter El-Badry's (1956) multiphase extension of Hansen and Hurwitz's (1946) theory will be generalized. This theory assumes that an individual or unit will respond after a definite number of contacts. The sampling plan is as follows for the single stratum case:

(1) A sample of $m$ first stage units is selected from among $M$ according to some probability system.
(2) From the $i$-th chosen FSU, $n_i$ SSUs are selected from among $N_i$ with equal probability and without replacement.
(3) Mail questionnaires are sent to those selected individuals.
(4) Successive waves of questionnaires are mailed in attempts to reach the more reluctant individuals. A random sample of the non-respondents to the previous mailing is chosen at each mail phase.
(5) Finally, the units of a random sample of the non-respondents remaining at the $p$-th phase are contacted in person and continued recall made as necessary to secure response.

For this plan the following population frame will be considered. It is assumed that for any FSU, the proportion of SSUs requiring $1, 2, \ldots, (p-1)$ contacts prior to response is known. In other words, a subset of SSUs exists which responds to the first mailing (contact), a subset which responds to the second mailing, etc. It should be noted that any one mailing secures information from only one of these sets. Non-response occurs because contact has been made with SSUs not belonging to the corresponding set. It will appear subsequently that the responses to any particular mailing will furnish an unbiased estimate of the characteristic being studied for the corresponding set in any FSU in the population.

In the development which follows the problem of proper identification of symbols by sub- and super-scripts arises. In order to simplify the presentation, consider Table 4, which defines terms relating to the $i$-th first stage unit. Symbols referring to the $i$-th FSU will be identified by the appropriate superscript, as in $F^i_{11}$. However, this superscript will be omitted in the tabular presentation, it being understood to apply to the $i$-th FSU.

In addition, some symbols relating to the non-responding groups to each of the $p$ mail attempts must be defined. Corresponding to the set $u$, there is a well defined group comprising all the members of the FSU who would not respond even after $u$ mailings. Obviously, this group involves all the sets from the $(u+1)$-th through the $(p+1)$-th. The proportion of the $N_i$ SSUs in that group is denoted by $F_{u2}$ where, clearly,

$$F_{u2} = F_{(u+1)1} + F_{(u+1)2}, \qquad u = 1,2,\ldots,p-1 .$$
Thus it represents the sum of the fraction answering at the next mailing and the fraction not answering the next mailing. These sets of non-respondents also have corresponding means, $\bar Y_{u2}$; totals, $Y_{u2}$; and variances, $\sigma^2_{u2}$.

Table 4. Population values of respondent SSUs to various mailings for the $i$-th FSU

Mailing number    Number         Fraction       Mean                Total          Variance
1                 $N_{11}$       $F_{11}$       $\bar Y_{11}$       $Y_{11}$       $\sigma^2_{11}$
2                 $N_{21}$       $F_{21}$       $\bar Y_{21}$       $Y_{21}$       $\sigma^2_{21}$
3                 $N_{31}$       $F_{31}$       $\bar Y_{31}$       $Y_{31}$       $\sigma^2_{31}$
...
$u$               $N_{u1}$       $F_{u1}$       $\bar Y_{u1}$       $Y_{u1}$       $\sigma^2_{u1}$
...
$(p-1)$           $N_{(p-1)1}$   $F_{(p-1)1}$   $\bar Y_{(p-1)1}$   $Y_{(p-1)1}$   $\sigma^2_{(p-1)1}$
$p$               $N_{p1}$       $F_{p1}$       $\bar Y_{p1}$       $Y_{p1}$       $\sigma^2_{p1}$
$(p+1)$ a/        $N_{(p+1)1}$   $F_{(p+1)1}$   $\bar Y_{(p+1)1}$   $Y_{(p+1)1}$   $\sigma^2_{(p+1)1}$

a/ The set $(p+1)$ consists of SSUs responding only to personal interviews and recalls after not responding to $p$ mailed questionnaires.

Certain parallel notation is also needed to define sample values. These are:

$n_i$ = size of initial mailing,
$n_{u1}$ = sample number responding to the $u$-th mailing, where $u = 1,2,\ldots,p$,
$n_{u2}$ = sample number not responding to the $u$-th mailing, where $u = 1,2,\ldots,p$,
$f_u$ = reciprocal of the sampling rate to be applied to the non-respondents to the $(u-1)$-th mailing and to be sent a $u$-th wave of questionnaires, where $u = 2,3,\ldots,p$ and $f_1 = 1$.

Corresponding sample values for means and variances will carry similar subscripts, as $\bar y_{u1}$, $\hat\sigma^2_{u1}$, $\bar y_{u2}$ and $\hat\sigma^2_{u2}$.

4.1 Sampling with Unequal Probabilities and without Replacement at the First Stage

In this sampling plan, $m$ first stage units are selected without replacement and with general inclusion probability $P_i$ associated with FSU $i$. From the $i$-th FSU, $n_i$ second stage units are selected with equal probability and without replacement. From the $n_i$ questionnaires mailed we obtain $n^i_{11}$ responses. A random sample of $n^i_{12}/f^i_2$ of the non-respondents is drawn and sent a second mailing. From the $n^i_{22}$ non-respondents to the second mailing, randomly select $n^i_{22}/f^i_3$ units and send them a third mailing, etc., through $p$ mailings.
Finally, at the last phase, $n^i_{p2}/w^i$ units are chosen and personally interviewed until all of these have been successfully contacted.

4.1.1 An Estimator of the Total. Under the plan outlined above $\hat T_{1E}$ is an estimate of the population total

$$T = \sum_{i=1}^{M}\sum_{j=1}^{N_i} y_{ij}$$

and is given by the equation:

[equation (4.1) not recoverable]

$\hat T_{1E}$ can be shown to be unbiased. The proof is as follows:

[equation (4.2) not recoverable]

It is obvious that the quantity on the right hand side is equal to

[expression not recoverable]

and where, from binomial sampling theory,

[expectations not recoverable]

Hence, (4.2) reduces to

[chain of equalities not recoverable, ending in $E(\hat T_{1E}) = T$]

4.1.2 Variance of the Estimator, $\hat T_{1E}$. In order to find the variance of $\hat T_{1E}$, apply Madow's Theorem to the estimator which, after adding and subtracting appropriate identical terms, may be written in the following form:

[equation not recoverable]

The first term in the theorem, $V(\hat T_{1E} \mid s)$ ($s$ indicating a particular set of $m$ first stage units), is seen to be

[equation (4.4) not recoverable]

where the expected value of the bracketed quantity in (4.4), conditional on $i$, is

[expression not recoverable]

and the dot notation $\bar y^i_{u\cdot}$ is the sample mean of the SSUs in the $u$-th mailing, including both respondents and non-respondents. Squaring the quantity in brackets as indicated and taking expectations relevant to the phases, the cross product terms vanish since the non-respondents of the $u$-th mailing constitute the population sampled at the $(u+1)$-th attempt. This gives

[equation (4.6) not recoverable]

Now by adding and subtracting $\bar Y^i_{u2}$ and noting that the cross product term vanishes, the $u$-th term becomes:

[equation (4.7) not recoverable]

Taking conditional expectations one obtains

[two expressions not recoverable]

Substituting these expressions into (4.7) gives

[equation not recoverable]

Thus

[equation not recoverable]

where $\sigma_i^2$ is the population variance of the $i$-th primary sampling unit.
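The inverse-sampling-rate weighting that underlies an estimator of this kind can be sketched for a single FSU. This is an illustration under assumptions, not a transcription of (4.1): it supposes that a unit answering the $u$-th mailing stands for $f_2 f_3 \cdots f_u$ units of the initial sample and that an interviewed unit stands for $f_2 \cdots f_p\,w$ units. The function name and argument layout are hypothetical.

```python
from math import prod


def fsu_total_estimate(N_i, n_i, mail_waves, f, interviews):
    """Weighted estimate of one FSU total under the multiphase plan.

    mail_waves: list of lists; mail_waves[u - 1] holds the y-values returned
        at mailing u.
    f: reciprocal subsampling rates, with f[0] = 1 for the initial mailing and
        f[u - 1] applied when selecting the units sent mailing u.
    interviews: pair (w, values), where w is the reciprocal of the interview
        sampling rate and values are the y-values obtained by interview.
    """
    total = 0.0
    for u, responses in enumerate(mail_waves, start=1):
        weight = prod(f[:u])              # f_1 * f_2 * ... * f_u
        total += weight * sum(responses)
    w, values = interviews
    total += prod(f) * w * sum(values)    # interviewed units carry the largest weight
    return N_i / n_i * total


# Ten units mailed; four answer; three of the six non-respondents get a second
# mailing (f_2 = 2); one answers; one of the remaining two is interviewed (w = 2).
example = fsu_total_estimate(100, 10, [[1, 2, 3, 4], [5]], [1, 2], (2, [7]))

# With no subsampling at any phase the estimate is N_i times the sample mean.
no_subsampling = fsu_total_estimate(20, 5, [[1, 2], [3]], [1, 1], (1, [4, 6]))
```

Dividing each such FSU estimate by its inclusion probability $P_i$ and summing over the sampled FSUs would then give an estimate of the population total.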
Finally, by the same reasoning used in Section 2.1.3 in obtaining formula (2.10), the first term in the theorem becomes

[equation (4.9) not recoverable]

Since $E[\hat T_{1E} \mid s]$ is an unbiased estimator of $T$, the evaluation of the second term in the theorem, viz., $V[E(\hat T_{1E} \mid s)]$, is exactly the same as (2.11), that is:

[equation (4.10) not recoverable]

The sum of (4.9) and (4.10) results in the desired formula:

[equation (4.11) not recoverable]

4.1.3 Optimum Allocation. Define the following cost function appropriate to the sampling plan considered in this section:

[equation (4.12) not recoverable]

The terms in this cost function are easily interpreted by extending the description of the two-stage single-phase design. Now consider the expected value of the cost, which has the form:

[equation (4.13) not recoverable]

Since the inclusion probabilities, in general, are functions of $m$, each situation must be handled separately. As in Section 3.4.1 we shall consider the special case where $P_i = m/M$, that is, the inclusion probabilities are equal. In this situation the expected cost function is given by (4.13) with $P_i$ set equal to $m/M$. From the application of Cauchy's Inequality to minimize the product,

[expression (4.14) not recoverable]

the optimum solutions for $m$, $[n_i]$, $[w^i]$ and $[f^i_u]$ must satisfy the equations:

[equations (4.15) to (4.18), each equated to $K$, for $u = 1,2,\ldots,p-1$ and $i = 1,2,\ldots,M$; not recoverable]

These equations lead to the following solutions for the $f^i_u$'s, $w^i$'s and $n_i$'s, which are independent of any restriction on cost or precision:

[equations (4.19) to (4.22), $i = 1,2,\ldots,M$; not recoverable]

At this point, define the following relationships:

$$\alpha = M\left(M\sigma_y^2 - \sum_{i=1}^{M} N_i\sigma_i^2\right),$$

[definition of $\beta_i$ not recoverable]

The only remaining allocation constant is $m$. The two possible solutions depend on the restriction. If the expected cost is fixed at $C_{1E0}$ the solution for $m$ is found to be

[equation not recoverable]

and the least variance attained by the estimator is:

[equation (4.24) not recoverable]

For a fixed variance $V_{1E0}$, the solution for $m$ is:

[equation not recoverable]

and the minimum expected cost is:

[equation (4.26) not recoverable]

It is easy to verify that the solutions are generalizations of the earlier ones obtained for two-stage uni-phase sampling.

4.2 Sampling with Unequal Probabilities and with Replacement at the First Stage

The sampling plan considered below is the same as outlined in the introduction to this chapter and in Section 4.1 except that the first stage sampling is made with arbitrary selection probabilities and with replacement.

4.2.1 An Estimator of the Total. The estimator considered here is an extension of (2.38) and is given by

[equation (4.27) not recoverable]

This estimator is unbiased; however, the proof will be omitted as it follows closely the development in Sections 2.2.2 and 4.1.1.

4.2.2 Variance of the Estimator, $\hat T_{2E}$. The variance of $\hat T_{2E}$ is derived by an extension of the methods used in Sections 2.2.3 and 4.1.2 and is found to be:

[equation (4.28) not recoverable]

Comparing formula (4.28) with the variance formula for the single-phase two-stage plan given by (2.33), the only differences are in terms composed solely of phase constants, that is, $S^i_u$, $f^i_u$, $w^i$, etc.

4.2.3 Optimum Allocation. In defining a suitable cost function, it is apparent that (4.12) is appropriate to this sample design. However, in obtaining the expected cost we must use the a priori selection probabilities and make use of the fact that first stage sampling is with replacement.
On finding the expected value of the cost as given by (4.12) and taking these modifications into consideration, the following cost function is obtained:

[equation (4.29) not recoverable]

Using Cauchy's Inequality leads to the following solutions via the steps used in Section 4.1.3. For $f^i_2$, $f^i_u$ ($u = 2,3,\ldots,p$) and $w^i$, the solutions are identical with equations (4.19), (4.20), and (4.21), respectively. For $n_i$ we have:

[equation (4.30) not recoverable]

At this point define:

[definition not recoverable]

Then fixing the cost at $C_{2E0}$ the solution for the remaining variable $m$ is as follows:

[equation (4.31) not recoverable]

The least variance attained by the estimator is

[equation not recoverable]

The solution for $m$ in case of fixed $V(\hat T_{2E})$ set, say, at $V_{2E0}$ is:

[equation (4.33) not recoverable]

The least expected cost of the sample plan is

[equation (4.34) not recoverable]

Again, these formulae are extensions of the single-phase formulae when first stage sampling is made with replacement.

4.3 Comments on the Solutions

If we consider the solutions for $f^i_u$ and $w^i$ for the $i$-th first stage unit as given by equations (4.19), (4.20) and (4.21), we find that the $f^i_u$ will be greater than or equal to unity if

[inequality (4.35) not recoverable], $u = 2,3,\ldots,p-1$,

and if

[inequality (4.36) not recoverable]

These conditions are direct extensions of those discovered by El-Badry (1956), when sampling is carried on only at one stage. Also, note that these conditions, as well as the restraint imposed by (4.22) for sampling FSUs without replacement, namely:

[inequality (4.37) not recoverable]

cannot be altered through cost or variance constraints imposed on the solutions. Only through modification of the frame and/or the probability system can these be varied. However, the constraint, $m < M$, can be met by following a procedure similar to that suggested in Section 3.3.
As for the population information required for the calculation of $f^i_u$ and $w^i$, it is apparent from formulae (4.19), (4.20) and (4.21) that one must know $F^i_{(u-2)2}$, $F^i_{(u-1)2}$ and $F^i_{(u-1)1}$ as well as the variances of the non-responding groups in the $(u-1)$-th and $u$-th attempts relative to the non-responding group in the $(u-2)$-th attempt. While a priori knowledge of the relative variances is required, the proportions $F^i_{(u-2)2}$, $F^i_{(u-1)2}$ and $F^i_{(u-1)1}$ can be estimated from the results of the previous two phases (i.e., mail responses).

By considering the optimum variance or cost functions (i.e., (4.24), (4.26), (4.32) or (4.34)) for $(p+1)$ stages, the expected cost or variance can be reduced if the following inequality holds:

[inequality (4.38) not recoverable]

This merely states that the added terms necessitated by using the $(p+1)$-th phase (i.e., those on the left hand side of (4.38)) are less than the term replaced. This too is a direct extension of the results found by El-Badry (1956).

Thus, (4.38) can be evaluated after the $p$-th phase by considering the response information obtained at that attempt and by using any other information to estimate the probability of response at the next phase ($F^i_{(p+1)1}/F^i_{p2}$) and the ratio of the variances of the non-responding groups ($\sigma^{i\,2}_{(p+1)1}/\sigma^{i\,2}_{p2}$). Now (4.38) is not a theoretically sufficient indicator of the phase at which to start personal interviewing. Theoretically, one should consider extensions of (4.38) to include the phases $(p+2), (p+3), \ldots, (p+k)$. However, for practical applications these extensions do not seem to warrant consideration.

4.4 Extension of the Theory to Stratified Sampling

In this section, the estimators (4.1) and (4.27) will be extended to a stratified sampling situation. Since this is accomplished by merely adding a stratification subscript to the earlier notation and then summing the results over all $S$ strata, the estimators retain their property of unbiasedness.
Also, because sampling is independent from stratum to stratum, the variance of the stratified estimators is simply the sum of the variances of the individual strata. An extension of the cost function is also easily accomplished by incorporating strata subscript notation and adding the individual stratum cost functions. This covers all the variable costs implicit in a stratified plan. Because of the straightforward changes outlined above, only the optimum solutions for $m_k$ and the associated minimum costs and variances will be given. All proofs will be omitted and, where results are similar to earlier formulae except for strata subscripts, these earlier results will be cited and the appropriate notational changes indicated.

Each plan outlined below will be used to draw $m_k$ first stage units from stratum $k$ of the $S$ strata in the population. At the second stage $n^i_k$ secondary sampling units are drawn from among $N^i_k$ in FSU $i$, stratum $k$, with equal probability and without replacement.

Plan I. First stage units selected according to arbitrary selection probabilities and without replacement. The probability of including the $i$-th primary unit in stratum $k$ in a sample of size $m_k$ is $P^i_k$.

The estimator is constructed by adding the subscript $k$ to each symbol in formula (4.1). Summing this over $k$ (the stratum identification index) from 1 to $S$ gives the desired estimator. This may also be done with formula (4.11), and the result is the appropriate formula for the variance of the stratified estimator. A similar procedure is carried out on the cost function (4.13). This cost function consists only of variable costs, denoted by $C_{1ES}$.

Under these conditions the optimum solutions for $f^i_{k2}$, $f^i_{ku}$, $w^i_k$ and $n^i_k$ are given by (4.19), (4.20), (4.21) and (4.22), respectively, with the subscript $k$ added.

Defining $\alpha_k$ and $\beta^i_k$ as in Section 4.1.3 with appropriate $k$ subscripts added, the formula for $m_k$ is found to be:

[equation (4.40) not recoverable], $k = 1,2,\ldots,S$,

where $p_k$ sampling phases are carried out in each selected FSU of stratum $k$, and where $V_{S0}$ is the fixed variance required of this estimate. The least expected cost of this plan is given by:

[equation (4.41) not recoverable]

Plan II. First stage units selected with arbitrary selection probabilities and with replacement. The probability of selecting the $i$-th FSU at any draw in stratum $k$ is $P^i_k$.

The estimator is found by adding the subscript $k$ to each symbol in formula (4.27), and summing over $k$ from 1 to $S$. The appropriate variance for this estimator is found by carrying out a similar procedure on formula (4.28). Similarly, we obtain the variable cost portion of the associated cost function.

For this situation the optimum solutions for $f^i_{k2}$, $f^i_{ku}$, $w^i_k$ and $n^i_k$ are given by (4.19), (4.20), (4.21) and (4.22), respectively, with the subscript $k$ added.

Defining $\alpha_{k2}$ as in Section 4.2.3 by adding appropriate $k$ subscripts, the formula for $m_k$ subject to a fixed cost constraint is:

[equation not recoverable], $k = 1,2,\ldots,S$, (4.42)

where $C_{S0}$ is the cost desired. The expected variance of this plan is given by

[equation (4.43) not recoverable]

It should be noted that for Plan I, optimum allocation values for $m_k$ were given for the fixed variance constraint while for Plan II these constants were given for a fixed cost constraint. Either result can be obtained for both plans by interchanging the terms which differ in the equations (4.40) and (4.42) and the equations (4.41) and (4.43).

5.0 SUMMARY AND CONCLUSIONS

5.1 Summary of Results

Estimators of the population mean and variance have been derived appropriate for two stage sampling with non-response at the second stage. Both possess the property of unbiasedness. These were derived for the case of:

(1) sampling without replacement and with arbitrary selection probabilities of the first stage units.
(2) sampling with replacement and with arbitrary selection probabilities of the first stage units.

In all cases the selection of second stage units was with equal probability and without replacement.

For general cost functions appropriate to the above sample designs, optimum solutions for the constituent sample sizes and the sampling rates for non-respondents were derived. These are optimum subject to a fixed expected cost or fixed variance. Results were also extended to the stratified two stage uni-phase case.

The efficiency of single phase plans calling for complete recall of the non-respondents is compared with that for subsampling of non-respondents. It has been shown that the latter procedure is more efficient. An actual example of the application of both methods indicated that subsampling was 11 per cent more efficient.

An extension of multiphase sampling is given following the model proposed by El-Badry (1956). Expressions for optimum first stage, second stage, first phase, second phase, etc., sample sizes have been obtained for both stratified and unstratified cases when first stage units are sampled with or without replacement. The conditions required for the applicability of the procedure are also derived and discussed.

5.2 Summary of Conclusions

As evidenced by the numerical example given in the text, application of the results seems to produce worthwhile gains in efficiency. The restrictions imposed on the frame constants necessary for using this procedure in the single phase case do not appear to detract from its applicability.

For multiphase sampling, the amount of a priori information needed for proper application of the results derived seems to be rather excessive, except for two or three phases.

5.3 Suggestions for Further Research

The research reported in this thesis covers only one approach to the non-response problem. As evidenced by the review of literature, many other approaches have been suggested.
Thus, the suggestions for further research will emphasize extension of the work presented in this thesis, and development of other approaches to the problem of non-response.

Some of the areas of research suggested by this thesis are:

(1) the development of "best" variance estimators for single and multiphase two-stage sampling plans.
(2) methods for utilizing auxiliary population information so as to construct frames to which the procedures suggested by this thesis are amenable.

Researchers have generally restricted their studies to single stage sampling plans. An especially fruitful area for research appears to be the possible development of models for specific populations which utilize auxiliary information to eliminate non-response bias.

LIST OF REFERENCES

Birnbaum, Z. W., and Sirken, M. G. 1950. Bias due to non-availability in sampling surveys. Jour. Amer. Stat. Assoc. 45:99-111.

Cochran, W. G. 1953. Sampling Techniques. John Wiley, Inc., New York.

Deming, W. E. 1953. On a probability mechanism to attain an economic balance between the resultant error of response and the bias of non-response. Jour. Amer. Stat. Assoc. 48:743-772.

Durbin, J. 1955. Non-response and call-backs in surveys. Bull. Int. Inst. Stat. 34:72-86.

Durbin, J. 1957. Sampling theory for estimates based on fewer individuals than the number selected. Bull. Int. Inst. Stat. 36:113.

El-Badry, M. A. 1956. A sampling procedure for mailed questionnaires. Jour. Amer. Stat. Assoc. 51:209-227.

Ferber, R. 1948. The problems of bias in mail returns: a solution. Publ. Opin. Quart. 12:669-676.

Ford, R. N., and Zeisel, H. 1949. Bias in mail surveys. Publ. Opin. Quart. 13:495-501.

Gaudet, H., and Wilson, E. C. 1940. Who escapes the personal investigator? Jour. Appl. Psychol. 24:773-777.

Hansen, M. H., and Hurwitz, W. N. 1946. The problem of non-response in sample surveys. Jour. Amer. Stat. Assoc. 41:517-529.

Hendricks, W. A. 1949. Adjustment for bias caused by non-response in mailed surveys. Ag. Econ. Res. 1:52-56.

Hilgard, E. R., and Payne, S. L. 1944. Those not at home: riddle for pollsters. Publ. Opin. Quart. 8:254-261.

Madow, W. G. 1949. On the theory of systematic sampling II. Ann. Math. Stat. 20:333-354.

Neyman, J. 1938. Contributions to the theory of sampling human populations. Jour. Amer. Stat. Assoc. 33:101-116.

Pace, C. R. 1939. Factors influencing questionnaire returns from former university students. Jour. Appl. Psychol. 23:388-397.

Politz, A., and Simmons, W. 1949. An attempt to get the "not at homes" into the sample without call backs. Jour. Amer. Stat. Assoc. 44:9-31.

Shuttleworth, F. K. 1941. Sampling errors involved in incomplete returns to mail questionnaires. Jour. Appl. Psychol. 25:588-591.

Stanton, F. 1939. Notes on the validity of mail questionnaire returns. Jour. Appl. Psychol. 23:95-104.

Stuart, A. 1954. A simple presentation of optimum sampling results. Jour. Royal Stat. Soc. 16B:239-241.

Suchman, E. A., and McCandless, B. 1940. Who answers questionnaires? Jour. Appl. Psychol. 24:758-769.

Sukhatme, P. V. 1954. Sampling Theory of Surveys with Applications. Iowa State College Press and the Indian Society of Agricultural Statistics, Ames, Iowa.

Vaivanijkul, N. 1961. A comparison of respondents and non-respondents in a sample survey. Unpublished M.S. Thesis, North Carolina State College, Raleigh.

Wallace, D. 1947. Mail questionnaires can produce good samples of homogeneous groups. Jour. Mktg. 12:53-60.

Yates, F. 1949. Sampling Methods for Censuses and Surveys. Second Edition. Charles Griffin and Co., London, 1953.