From 5510737674a79e3f1980ca06706ebeb2e269c12f Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 17 Sep 2025 15:10:08 +1000 Subject: [PATCH 1/8] updates --- lectures/util_rand_resp.md | 201 +++++++++++++++++++++---------------- 1 file changed, 117 insertions(+), 84 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index f6ac37f2c..5341da8ba 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -20,17 +20,11 @@ import numpy as np ## Overview -{doc}`This QuantEcon lecture ` describes randomized response surveys in the tradition of Warner {cite}`warner1965randomized` that are designed to protect respondents' privacy. +{doc}`This QuantEcon lecture ` describes randomized response surveys in the tradition of Warner {cite}`warner1965randomized` that are designed to protect respondents' privacy. +Lars Ljungqvist {cite}`ljungqvist1993unified` analyzed how a respondent's decision about whether to answer truthfully depends on **expected utility**. -Lars Ljungqvist {cite}`ljungqvist1993unified` analyzed how a respondent's decision about whether to answer truthfully depends on **expected utility**. - - - -The lecture tells how Ljungqvist used his framework to shed light on alternative randomized response survey techniques -proposed, for example, by {cite}`lanke1975choice`, {cite}`lanke1976degree`, {cite}`leysieffer1976respondent`, -{cite}`anderson1976estimation`, {cite}`fligner1977comparison`, {cite}`greenberg1977respondent`, -{cite}`greenberg1969unrelated`. +The lecture tells how Ljungqvist used his framework to shed light on alternative randomized response survey techniques proposed, for example, by {cite}`lanke1975choice`, {cite}`lanke1976degree`, {cite}`leysieffer1976respondent`, {cite}`anderson1976estimation`, {cite}`fligner1977comparison`, {cite}`greenberg1977respondent`, {cite}`greenberg1969unrelated`. @@ -57,9 +51,9 @@ $$ (eq:util-rand-one) ## Zoo of Concepts -At this point we describe some concepts proposed by various researchers +At this point we describe some concepts proposed by various researchers. -### Leysieffer and Warner(1976) +### {cite:t}`leysieffer1976randomized` The response $r$ is regarded as jeopardizing with respect to $A$ or $A^{'}$ if @@ -77,7 +71,9 @@ $$ \frac{\text{Pr}(A|r)}{\text{Pr}(A^{'}|r)}\times \frac{(1-\pi_A)}{\pi_A} = \frac{\text{Pr}(r|A)}{\text{Pr}(r|A^{'})} $$ (eq:util-rand-three) -If this expression is greater (less) than unity, it follows that $r$ is jeopardizing with respect to $A$($A^{'}$). Then, the natural measure of jeopardy will be: +If this expression is greater (less) than unity, it follows that $r$ is jeopardizing with respect to $A$($A^{'}$). + +Then, the natural measure of jeopardy will be: $$ \begin{aligned} @@ -116,21 +112,21 @@ $$ \text{Pr}(A|\text{no})=0 $$ -### Lanke(1976) +### {cite:t}`lanke1976degree` -Lanke (1975) {cite}`lanke1975choice` argued that "it is membership in Group A that people may want to hide, not membership in the complementary Group A'." +{cite:t}`lanke1975choice` argued that "it is membership in Group A that people may want to hide, not membership in the complementary Group A'." 
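To make the two conditional probabilities $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$ concrete, here is a small sketch that applies Bayes' rule to an illustrative design; the values of $\pi_A$, $\text{Pr}(\text{yes}|A)$ and $\text{Pr}(\text{yes}|A^{'})$ below are assumptions chosen only for illustration.

```{code-cell} ipython3
# Illustrative assumptions, not a recommended design
π_A = 0.3          # assumed population share of group A
p_yes_A = 0.8      # assumed Pr(yes|A)
p_yes_Ac = 0.2     # assumed Pr(yes|A')

# Bayes' rule gives the two probabilities a respondent cares about
p_yes = π_A * p_yes_A + (1 - π_A) * p_yes_Ac
p_A_yes = π_A * p_yes_A / p_yes                  # Pr(A|yes)
p_A_no = π_A * (1 - p_yes_A) / (1 - p_yes)       # Pr(A|no)

print(f"Pr(A|yes) = {p_A_yes:.3f}, Pr(A|no) = {p_A_no:.3f}")
```

These are exactly the two probabilities that enter the protection criterion proposed next.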
-For that reason, Lanke (1976) {cite}`lanke1976degree` argued that an appropriate measure of protection is to minimize +For that reason, {cite:t}`lanke1976degree` argued that an appropriate measure of protection is to minimize $$ \max \left\{ \text{Pr}(A|\text{yes}) , \text{Pr}(A|\text{no}) \right\} $$ (eq:util-rand-five-a) -Holding this measure constant, he explained under what conditions the smallest variance of the estimate was achieved with the unrelated question model or Warner's (1965) original model. +Holding this measure constant, he explained under what conditions the smallest variance of the estimate was achieved with the unrelated question model or {cite:t}`warner1965randomized` original model. -### 2.3 Fligner, Policello, and Singh +### {cite:t}`fligner1977comparison` -Fligner, Policello, and Singh reached similar conclusion as Lanke (1976). {cite}`fligner1977comparison` +{cite:t}`fligner1977comparison` reached similar conclusion as {cite:t}`lanke1976degree`. They measured "private protection" as @@ -139,11 +135,9 @@ $$ $$ (eq:util-rand-six) -### 2.4 Greenberg, Kuebler, Abernathy, and Horvitz (1977) +### {cite:t}`greenberg1977respondent` -{cite}`greenberg1977respondent` - -Greenberg, Kuebler, Abernathy, and Horvitz (1977) stressed the importance of examining the risk to respondents who do not belong to $A$ as well as the risk to those who do belong to the sensitive group. +{cite:t}`greenberg1977respondent` stressed the importance of examining the risk to respondents who do not belong to $A$ as well as the risk to those who do belong to the sensitive group. They defined the hazard for an individual in $A$ as the probability that he or she is perceived as belonging to $A$: @@ -157,7 +151,7 @@ $$ \text{Pr}(\text{yes}|A^{'})\times \text{Pr}(A|\text{yes})+\text{Pr}(\text{no}|A^{'}) \times \text{Pr}(A|\text{no}) $$ (eq:util-rand-seven-b) -Greenberg et al. (1977) also considered an alternative related measure of hazard that "is likely to be closer to the actual concern felt by a respondent." +{cite:t}`greenberg1977respondent` also considered an alternative related measure of hazard that "is likely to be closer to the actual concern felt by a respondent." The "limited hazard" for an individual in $A$ and $A^{'}$ is @@ -194,9 +188,9 @@ Given $r_i$ and complete privacy, the individual's utility is higher if $r_i$ In terms of a respondent's expected utility as a function of $ \text{Pr}(A|r_i)$ and $r_i$ -- The higher is $ \text{Pr}(A|r_i)$, the lower isindividual $i$'s expected utility. +- The higher is $ \text{Pr}(A|r_i)$, the lower is individual $i$'s expected utility. -- expected utility is higher if $r_i$ represents a truthful answer rather than a lie +- Expected utility is higher if $r_i$ represents a truthful answer rather than a lie. Define: @@ -243,7 +237,7 @@ Constraint {eq}`eq:util-rand-ten-b` holds for sure. Consequently, constraint {eq}`eq:util-rand-ten-a` becomes the single necessary condition for individual $i$ always to answer truthfully. 
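As a quick numerical illustration, the sketch below checks constraints {eq}`eq:util-rand-ten-a` and {eq}`eq:util-rand-ten-b` under an assumed utility of the form $U_i\left(\text{Pr}(A|r_i), \cdot\right) = -\text{Pr}(A|r_i) + f(\cdot)$, the same functional form used for the truth borders drawn later; the truth "bonus" and the two design points are illustrative assumptions.

```{code-cell} ipython3
def U(p_perceived, truthful, truth_bonus=0.4):
    # assumed utility: discomfort -p plus a bonus for telling the truth
    return -p_perceived + (truth_bonus if truthful else 0.0)

for p_A_yes, p_A_no in [(0.6, 0.3), (0.6, 0.1)]:
    truthful_yes = U(p_A_yes, True) >= U(p_A_no, False)    # constraint (10.a)
    truthful_no = U(p_A_no, True) >= U(p_A_yes, False)     # constraint (10.b)
    print(p_A_yes, p_A_no, truthful_yes, truthful_no)
```

With these assumptions {eq}`eq:util-rand-ten-b` holds at both design points, while {eq}`eq:util-rand-ten-a` holds at the first point and fails at the second, where the gap between the two conditional probabilities exceeds the assumed bonus from telling the truth.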
-At equality, constraint $(10.\text{a})$ determines conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": +At equality, constraint {eq}`eq:util-rand-ten-a` determines conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)= U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) @@ -267,13 +261,22 @@ The source of the positive relationship is: We can deduce two things about the truth border: -- The truth border divides the space of conditional probabilities into two subsets: "truth telling" and "lying". Thus, sufficient privacy elicits a truthful answer, whereas insufficient privacy results in a lie. The truth border depends on a respondent's utility function. +- The truth border divides the space of conditional probabilities into two subsets: "truth telling" and "lying". + + - Thus, sufficient privacy elicits a truthful answer, whereas insufficient privacy results in a lie. The truth border depends on a respondent's utility function. - Assumptions in {eq}`eq:util-rand-nine-a` and {eq}`eq:util-rand-nine-a` are sufficient only to guarantee a positive slope of the truth border. The truth border can have either a concave or a convex shape. We can draw some truth borders with the following Python code: ```{code-cell} ipython3 +--- +mystnb: + figure: + caption: | + Three types of truth border + name: fig-truth-borders +--- x1 = np.arange(0, 1, 0.001) y1 = x1 - 0.4 x2 = np.arange(0.4**2, 1, 0.001) @@ -297,12 +300,9 @@ plt.text(0.42, 0.3, "Truth Telling", fontdict={'size':28, 'style':'italic'}) plt.text(0.8, 0.1, "Lying", fontdict={'size':28, 'style':'italic'}) plt.legend(loc=0, fontsize='large') -plt.title('Figure 1.1') plt.show() ``` -Figure 1.1 three types of truth border. - Without loss of generality, we consider the truth border: @@ -310,9 +310,16 @@ $$ U_i(\text{Pr}(A|r_i),\phi_i)=-\text{Pr}(A|r_i)+f(\phi_i) $$ -and plot the "truth telling" and "lying area" of individual $i$ in Figure 1.2: +and plot the "truth telling" and "lying area" of individual $i$ in {numref}`fig-truth-lying-areas`: ```{code-cell} ipython3 +--- +mystnb: + figure: + caption: | + Truth telling and lying areas of individual $i$ + name: fig-truth-lying-areas +--- x1 = np.arange(0, 1, 0.001) y1 = x1 - 0.4 z1 = x1 @@ -331,7 +338,6 @@ plt.text(0.5, 0.4, "Truth Telling", fontdict={'size':28, 'style':'italic'}) plt.text(0.8, 0.2, "Lying", fontdict={'size':28, 'style':'italic'}) plt.legend(loc=0, fontsize='large') -plt.title('Figure 1.2') plt.show() ``` @@ -376,31 +382,31 @@ From expression {eq}`eq:util-rand-thirteen`, {eq}`eq:util-rand-fourteen-a` and { We use Python code to draw iso-variance curves. -The pairs of conditional probabilities can be attained using Warner's (1965) model. +The pairs of conditional probabilities can be attained using {cite:t}`warner1965randomized` model. Note that: - Any point on the iso-variance curves can be attained with the unrelated question model as long as the statistician can completely control the model design. -- Warner's (1965) original randomized response model is less flexible than the unrelated question model. +- {cite:t}`warner1965randomized` original randomized response model is less flexible than the unrelated question model. 
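Before drawing whole curves, it may help to evaluate the variance implied by a single design point. Rearranging the relation that the plotting code below uses for its curves gives the variance as a function of $\left(\text{Pr}(A|\text{yes}), \text{Pr}(A|\text{no})\right)$; the design points in this quick check are illustrative.

```{code-cell} ipython3
def variance_of_design(x, y, π=0.3, n=100):
    # x = Pr(A|yes), y = Pr(A|no); same relation as the curves below
    return π**2 * (1 - π)**2 / (n * (x - π) * (π - y))

# Direct questioning, (x, y) = (1, 0), gives the binomial variance π(1-π)/n
print(variance_of_design(1.0, 0.0))

# A design that hides more raises the variance
print(variance_of_design(0.6, 0.1))
```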
```{code-cell} ipython3 class Iso_Variance: - def __init__(self, pi, n): - self.pi = pi + def __init__(self, π, n): + self.π = π self.n = n def plotting_iso_variance_curve(self): - pi = self.pi + π = self.π n = self.n nv = np.array([0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7]) x = np.arange(0, 1, 0.001) - x0 = np.arange(pi, 1, 0.001) - x2 = np.arange(0, pi, 0.001) - y1 = [pi for i in x0] - y2 = [pi for i in x2] - y0 = 1 / (1 + (x0 * (1 - pi)**2) / ((1 - x0) * pi**2)) + x0 = np.arange(π, 1, 0.001) + x2 = np.arange(0, π, 0.001) + y1 = [π for i in x0] + y2 = [π for i in x2] + y0 = 1 / (1 + (x0 * (1 - π)**2) / ((1 - x0) * π**2)) plt.figure(figsize=(12, 10)) plt.plot(x0, y0, 'm-', label='Warner') @@ -408,7 +414,7 @@ class Iso_Variance: plt.plot(x0, y1,'c:', linewidth=2) plt.plot(y2, x2, 'c:', linewidth=2) for i in range(len(nv)): - y = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8)) + y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') plt.xlim([0, 1]) plt.ylim([0, 0.5]) @@ -417,7 +423,6 @@ class Iso_Variance: plt.legend(loc=0, fontsize='large') plt.text(0.32, 0.28, "High Var", fontdict={'size':15, 'style':'italic'}) plt.text(0.91, 0.01, "Low Var", fontdict={'size':15, 'style':'italic'}) - plt.title('Figure 2') plt.show() ``` @@ -433,10 +438,17 @@ Suppose the parameters of the iso-variance model follow those in Ljungqvist {cit - $n=100$ -Then we can plot the iso-variance curve in Figure 2: +Then we can plot the iso-variance curve in {numref}`fig-iso-variance`: ```{code-cell} ipython3 -var = Iso_Variance(pi=0.3, n=100) +--- +mystnb: + figure: + caption: | + Iso-variance curves for randomized response survey design + name: fig-iso-variance +--- +var = Iso_Variance(π=0.3, n=100) var.plotting_iso_variance_curve() ``` @@ -476,29 +488,37 @@ We can use a utilitarian approach to analyze some privacy measures. We'll enlist Python Code to help us. -### Analysis of Method of Lanke's (1976) +### Analysis of Method of {cite:t}`lanke1976degree` -Lanke (1976) recommends a privacy protection criterion that minimizes: +{cite:t}`lanke1976degree` recommends a privacy protection criterion that minimizes: $$ \max \left\{ \text{Pr}(A|\text{yes}) , \text{Pr}(A|\text{no}) \right\} $$ (eq:util-rand-five-b) -Following Lanke's suggestion, the statistician should find the highest possible $\text{ Pr}(A|\text{yes})$ consistent with truth telling while $\text{ Pr}(A|\text{no})$ is fixed at 0. The variance is then minimized at point $X$ in Figure 3. +Following Lanke's suggestion, the statistician should find the highest possible $\text{ Pr}(A|\text{yes})$ consistent with truth telling while $\text{ Pr}(A|\text{no})$ is fixed at 0. The variance is then minimized at point $X$ in {numref}`fig-lanke-analysis`. 
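A quick calculation, using the same illustrative truth border $\text{Pr}(A|\text{no}) = \text{Pr}(A|\text{yes}) - 0.4$, $\pi_A = 0.3$ and $n = 100$ as in the figure code, makes the comparison concrete; here $X = (0.4, 0)$ and the second point is the one labelled $Z$ in the figure below.

```{code-cell} ipython3
π, n = 0.3, 100

def var_design(x, y):
    # variance implied by (Pr(A|yes), Pr(A|no)), as in the iso-variance curves
    return π**2 * (1 - π)**2 / (n * (x - π) * (π - y))

print(f"variance at X = (0.4, 0):     {var_design(0.4, 0.0):.4f}")
print(f"variance at Z = (0.498, 0.1): {var_design(0.498, 0.1):.4f}")
```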
-However, we can see that in Figure 3, point $Z$ offers a smaller variance that still allows cooperation of the respondents, and it is achievable following our discussion of the truth border in Part III: +However, we can see that in {numref}`fig-lanke-analysis`, point $Z$ offers a smaller variance that still allows cooperation of the respondents, and it is achievable following our discussion of the truth border in Part III: ```{code-cell} ipython3 -pi = 0.3 +--- +mystnb: + figure: + caption: | + Analysis of Lanke's privacy protection method showing optimal design points + name: fig-lanke-analysis +--- + +π = 0.3 n = 100 nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7] x = np.arange(0, 1, 0.001) y = x - 0.4 z = x -x0 = np.arange(pi, 1, 0.001) -x2 = np.arange(0, pi, 0.001) -y1 = [pi for i in x0] -y2 = [pi for i in x2] +x0 = np.arange(π, 1, 0.001) +x2 = np.arange(0, π, 0.001) +y1 = [π for i in x0] +y2 = [π for i in x2] plt.figure(figsize=(12, 10)) plt.plot(x, x, 'c:', linewidth=2) @@ -508,7 +528,7 @@ plt.plot(x, y, 'r-', label='Truth Border') plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='truth telling') plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, label='lying') for i in range(len(nv)): - y = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8)) + y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') @@ -522,13 +542,12 @@ plt.text(0.45, 0.35, "Truth Telling", fontdict={'size':28, 'style':'italic'}) plt.text(0.85, 0.35, "Lying",fontdict = {'size':28, 'style':'italic'}) plt.text(0.515, 0.095, "Optimal Design", fontdict={'size':16,'color':'b'}) plt.legend(loc=0, fontsize='large') -plt.title('Figure 3') plt.show() ``` -### Method of Leysieffer and Warner (1976) +### Method of {cite:t}`leysieffer1976randomized` -Leysieffer and Warner (1976) recommend a two-dimensional measure of jeopardy that reduces to a single dimension when there is no jeopardy in a 'no' answer", which means that +{cite:t}`leysieffer1976randomized` recommend a two-dimensional measure of jeopardy that reduces to a single dimension when there is no jeopardy in a 'no' answer, which means that $$ @@ -543,11 +562,11 @@ $$ This is not an optimal choice under a utilitarian approach. -### Analysis on the Method of Chaudhuri and Mukerjee's (1988) +### Analysis on the Method of {cite:t}`chaudhuri1988randomized` -{cite}`Chadhuri_Mukerjee_88` +{cite}`Chadhuri_Mukerjee_88` argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". -Chaudhuri and Mukerjee (1988) argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". In this situation, the truth border is such that individuals choose to lie whenever the truthful answer is "yes" and +In this situation, the truth border is such that individuals choose to lie whenever the truthful answer is "yes" and $$ \text{Pr}(A|\text{no})=0 @@ -569,7 +588,7 @@ However, under a utilitarian approach there should exist other survey designs th In particular, respondents will choose to answer truthfully if the relative advantage from lying is eliminated. 
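One way to locate such a design is to search along the truth border for the point with the smallest attainable variance. The sketch below does this for the illustrative linear border $\text{Pr}(A|\text{no}) = \text{Pr}(A|\text{yes}) - 0.4$ with the same $\pi_A = 0.3$ and $n = 100$ used above; these parameter choices are assumptions made for illustration.

```{code-cell} ipython3
import numpy as np

π, n = 0.3, 100

# Candidate designs on the truth border y = x - 0.4
x = np.linspace(0.41, 0.69, 1_000)
y = x - 0.4
var = π**2 * (1 - π)**2 / (n * (x - π) * (π - y))

i = np.argmin(var)
print(f"minimum variance ≈ {var[i]:.4f} at (Pr(A|yes), Pr(A|no)) ≈ ({x[i]:.3f}, {y[i]:.3f})")
```

The minimizer is approximately $(0.5, 0.1)$, the neighborhood of the point labelled $Q$ in the figure that follows.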
-We can use Python to show that the optimal model design corresponds to point Q in Figure 4: +We can use Python to show that the optimal model design corresponds to point $Q$ in {numref}`fig-optimal-design`: ```{code-cell} ipython3 def f(x): @@ -580,47 +599,61 @@ def f(x): ``` ```{code-cell} ipython3 -pi = 0.3 +--- +mystnb: + figure: + caption: | + Optimal survey design under utilitarian approach showing computed point $Q$ + name: fig-optimal-design +--- + +π = 0.3 n = 100 + nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7] x = np.arange(0, 1, 0.001) -y = [f(i) for i in x] +y = [truth_border_function(i, π) for i in x] z = x -x0 = np.arange(pi, 1, 0.001) -x2 = np.arange(0, pi, 0.001) -y1 = [pi for i in x0] -y2 = [pi for i in x2] -x3 = np.arange(0.16, 1, 0.001) -y3 = (pow(x3, 0.5) - 0.4)**2 +x0 = np.arange(π, 1, 0.001) +x2 = np.arange(0, π, 0.001) +y1 = [π for i in x0] +y2 = [π for i in x2] + +# Calculate truth border more precisely +threshold = π * (1 - π) / (1 - 2 * π)**2 +x3 = np.arange(threshold, 1, 0.001) +y3 = [(xi**0.5 - (1 - π)**0.5)**2 for xi in x3] plt.figure(figsize=(12, 10)) plt.plot(x, x, 'c:', linewidth=2) plt.plot(x0, y1,'c:', linewidth=2) plt.plot(y2, x2,'c:', linewidth=2) -plt.plot(x3, y3,'b-', label='Truth Border') +plt.plot(x3, y3,'b-', linewidth=2, label='Truth Border') plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='Truth telling') -plt.fill_between(x3, 0, y3,facecolor='green', alpha=0.05, label='Lying') +plt.fill_between(x3, 0, y3, facecolor='green', alpha=0.05, label='Lying') + +# Plot iso-variance curves for i in range(len(nv)): - y = pi - (pi**2 * (1 - pi)**2) / (n * (nv[i] / n) * (x0 - pi + 1e-8)) - plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') -plt.scatter(0.61, 0.146, c='r', marker='*', label='Z', s=150) + y_var = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) + plt.plot(x0, y_var, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') + +# Plot the calculated optimal point +plt.scatter(optimal_x, optimal_y, c='r', marker='*', label='Q (computed)', s=150) plt.xlim([0, 1]) plt.ylim([0, 0.5]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') plt.text(0.45, 0.35, "Truth Telling", fontdict={'size':28, 'style':'italic'}) plt.text(0.8, 0.1, "Lying", fontdict={'size':28, 'style':'italic'}) -plt.text(0.63, 0.141, "Optimal Design", fontdict={'size':16,'color':'r'}) -plt.legend(loc=0, fontsize='large') -plt.title('Figure 4') +plt.text(optimal_x + 0.02, optimal_y + 0.005, f"Optimal Design\n({optimal_x:.3f}, {optimal_y:.3f})", + fontdict={'size':12,'color':'r'}) +plt.legend(loc='upper right', fontsize='medium') plt.show() ``` -### Method of Greenberg et al. (1977) - - {cite}`greenberg1977respondent` +### Method of {cite:t}`greenberg1977respondent` -Greenberg et al. (1977) defined the hazard for an individual in $A$ as the probability that he or she is perceived as belonging to $A$: +{cite:t}`greenberg1977respondent` defined the hazard for an individual in $A$ as the probability that he or she is perceived as belonging to $A$: $$ \text{Pr}(\text{yes}|A)\times \text{Pr}(A|\text{yes})+\text{Pr}(\text{no}|A)\times \text{Pr}(A|\text{no}) @@ -646,7 +679,7 @@ $$ \text{Pr}(\text{yes}|A^{'})\times \text{Pr}(A|\text{yes}) $$ (eq:util-rand-eight-bb) -According to Greenberg et al. (1977), a respondent commits himself or herself to answer truthfully on the basis of a probability in {eq}`eq:util-rand-seven-aa` or {eq}`eq:util-rand-eight-aa` **before** randomly selecting the question to be answered. 
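As a small numerical illustration of these definitions, the sketch below computes the hazard and the limited hazard for an assumed design; the probabilities used here are illustrative rather than a recommended design.

```{code-cell} ipython3
π_A = 0.3                        # assumed share of group A
p_yes_A, p_yes_Ac = 0.8, 0.2     # assumed Pr(yes|A), Pr(yes|A')

p_yes = π_A * p_yes_A + (1 - π_A) * p_yes_Ac
p_A_yes = π_A * p_yes_A / p_yes                  # Pr(A|yes)
p_A_no = π_A * (1 - p_yes_A) / (1 - p_yes)       # Pr(A|no)

# Hazard: probability of being perceived as belonging to A
hazard_A = p_yes_A * p_A_yes + (1 - p_yes_A) * p_A_no
hazard_not_A = p_yes_Ac * p_A_yes + (1 - p_yes_Ac) * p_A_no

# Limited hazard: only a "yes" answer is counted
limited_A = p_yes_A * p_A_yes
limited_not_A = p_yes_Ac * p_A_yes

print(hazard_A, hazard_not_A, limited_A, limited_not_A)
```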
+According to {cite:t}`greenberg1977respondent`, a respondent commits himself or herself to answer truthfully on the basis of a probability in {eq}`eq:util-rand-seven-aa` or {eq}`eq:util-rand-eight-aa` **before** randomly selecting the question to be answered. Suppose that the appropriate privacy measure is captured by the notion of "limited hazard" in {eq}`eq:util-rand-eight-aa` and {eq}`eq:util-rand-eight-bb`. From 519949bfc65001ffda38abab5dfd795c0b20e98b Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 17 Sep 2025 15:27:50 +1000 Subject: [PATCH 2/8] update typos --- lectures/util_rand_resp.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index 5341da8ba..da4a4daa8 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -53,7 +53,7 @@ $$ (eq:util-rand-one) At this point we describe some concepts proposed by various researchers. -### {cite:t}`leysieffer1976randomized` +### {cite:t}`leysieffer1976respondent` The response $r$ is regarded as jeopardizing with respect to $A$ or $A^{'}$ if @@ -545,9 +545,9 @@ plt.legend(loc=0, fontsize='large') plt.show() ``` -### Method of {cite:t}`leysieffer1976randomized` +### Method of {cite:t}`leysieffer1976respondent` -{cite:t}`leysieffer1976randomized` recommend a two-dimensional measure of jeopardy that reduces to a single dimension when there is no jeopardy in a 'no' answer, which means that +{cite:t}`leysieffer1976respondent` recommend a two-dimensional measure of jeopardy that reduces to a single dimension when there is no jeopardy in a 'no' answer, which means that $$ @@ -562,7 +562,7 @@ $$ This is not an optimal choice under a utilitarian approach. -### Analysis on the Method of {cite:t}`chaudhuri1988randomized` +### Analysis on the Method of {cite:t}`Chadhuri_Mukerjee_88` {cite}`Chadhuri_Mukerjee_88` argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". From 96be76d6cef722504a867609ea6a51b0b9e83183 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 17 Sep 2025 15:29:56 +1000 Subject: [PATCH 3/8] move reference below the figure --- lectures/util_rand_resp.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index da4a4daa8..2ad22bc0d 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -588,7 +588,7 @@ However, under a utilitarian approach there should exist other survey designs th In particular, respondents will choose to answer truthfully if the relative advantage from lying is eliminated. -We can use Python to show that the optimal model design corresponds to point $Q$ in {numref}`fig-optimal-design`: +We can use Python to show that the optimal model design ```{code-cell} ipython3 def f(x): @@ -651,6 +651,8 @@ plt.legend(loc='upper right', fontsize='medium') plt.show() ``` +Here the optimal model design corresponds to point $Q$ in {numref}`fig-optimal-design`. 
+ ### Method of {cite:t}`greenberg1977respondent` {cite:t}`greenberg1977respondent` defined the hazard for an individual in $A$ as the probability that he or she is perceived as belonging to $A$: From 77c0bbae4442a5de3e394291cd1d57a22360f356 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 17 Sep 2025 15:53:01 +1000 Subject: [PATCH 4/8] fix minor error --- lectures/util_rand_resp.md | 36 ++++++++++++++---------------------- 1 file changed, 14 insertions(+), 22 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index 2ad22bc0d..4148e9866 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -609,45 +609,37 @@ mystnb: π = 0.3 n = 100 - nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7] x = np.arange(0, 1, 0.001) -y = [truth_border_function(i, π) for i in x] +y = x - 0.4 z = x x0 = np.arange(π, 1, 0.001) x2 = np.arange(0, π, 0.001) y1 = [π for i in x0] y2 = [π for i in x2] -# Calculate truth border more precisely -threshold = π * (1 - π) / (1 - 2 * π)**2 -x3 = np.arange(threshold, 1, 0.001) -y3 = [(xi**0.5 - (1 - π)**0.5)**2 for xi in x3] - plt.figure(figsize=(12, 10)) plt.plot(x, x, 'c:', linewidth=2) -plt.plot(x0, y1,'c:', linewidth=2) -plt.plot(y2, x2,'c:', linewidth=2) -plt.plot(x3, y3,'b-', linewidth=2, label='Truth Border') -plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='Truth telling') -plt.fill_between(x3, 0, y3, facecolor='green', alpha=0.05, label='Lying') - -# Plot iso-variance curves +plt.plot(x0, y1, 'c:', linewidth=2) +plt.plot(y2, x2, 'c:', linewidth=2) +plt.plot(x, y, 'r-', label='Truth Border') +plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='truth telling') +plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, label='lying') for i in range(len(nv)): - y_var = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) - plt.plot(x0, y_var, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') + y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) + plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') -# Plot the calculated optimal point -plt.scatter(optimal_x, optimal_y, c='r', marker='*', label='Q (computed)', s=150) + +plt.scatter(0.498, 0.1, c='b', marker='*', label='Q', s=150) +plt.scatter(0.4, 0, c='y', label='X', s=150) plt.xlim([0, 1]) plt.ylim([0, 0.5]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') plt.text(0.45, 0.35, "Truth Telling", fontdict={'size':28, 'style':'italic'}) -plt.text(0.8, 0.1, "Lying", fontdict={'size':28, 'style':'italic'}) -plt.text(optimal_x + 0.02, optimal_y + 0.005, f"Optimal Design\n({optimal_x:.3f}, {optimal_y:.3f})", - fontdict={'size':12,'color':'r'}) -plt.legend(loc='upper right', fontsize='medium') +plt.text(0.85, 0.35, "Lying",fontdict = {'size':28, 'style':'italic'}) +plt.text(0.515, 0.095, "Optimal Design", fontdict={'size':16,'color':'b'}) +plt.legend(loc=0, fontsize='large') plt.show() ``` From 3c7ae02722cbc868a3eb3d2fe9fa1942bd2f49c2 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Tue, 30 Sep 2025 20:17:49 +1000 Subject: [PATCH 5/8] remove references in the subtitles --- lectures/util_rand_resp.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index 4148e9866..eb5fd964b 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -53,7 +53,7 @@ $$ (eq:util-rand-one) At this point we describe some concepts proposed by various researchers. 
-### {cite:t}`leysieffer1976respondent` +### Leysieffer and Warner (1976) The response $r$ is regarded as jeopardizing with respect to $A$ or $A^{'}$ if @@ -112,7 +112,7 @@ $$ \text{Pr}(A|\text{no})=0 $$ -### {cite:t}`lanke1976degree` +### Lanke (1976) {cite:t}`lanke1975choice` argued that "it is membership in Group A that people may want to hide, not membership in the complementary Group A'." @@ -124,7 +124,7 @@ $$ (eq:util-rand-five-a) Holding this measure constant, he explained under what conditions the smallest variance of the estimate was achieved with the unrelated question model or {cite:t}`warner1965randomized` original model. -### {cite:t}`fligner1977comparison` +### Fligner et al. (1977) {cite:t}`fligner1977comparison` reached similar conclusion as {cite:t}`lanke1976degree`. @@ -135,7 +135,7 @@ $$ $$ (eq:util-rand-six) -### {cite:t}`greenberg1977respondent` +### Greenberg et al. (1977) {cite:t}`greenberg1977respondent` stressed the importance of examining the risk to respondents who do not belong to $A$ as well as the risk to those who do belong to the sensitive group. @@ -484,11 +484,11 @@ Here are some comments about the model design: ## Criticisms of Proposed Privacy Measures -We can use a utilitarian approach to analyze some privacy measures. +We can use a utilitarian approach to analyze some privacy measures. We'll enlist Python Code to help us. -### Analysis of Method of {cite:t}`lanke1976degree` +### Analysis of Method of Lanke (1976) {cite:t}`lanke1976degree` recommends a privacy protection criterion that minimizes: @@ -498,7 +498,7 @@ $$ (eq:util-rand-five-b) Following Lanke's suggestion, the statistician should find the highest possible $\text{ Pr}(A|\text{yes})$ consistent with truth telling while $\text{ Pr}(A|\text{no})$ is fixed at 0. The variance is then minimized at point $X$ in {numref}`fig-lanke-analysis`. -However, we can see that in {numref}`fig-lanke-analysis`, point $Z$ offers a smaller variance that still allows cooperation of the respondents, and it is achievable following our discussion of the truth border in Part III: +However, as shown in {numref}`fig-lanke-analysis`, point $Z$ offers a smaller variance that still allows cooperation of the respondents, and it is achievable following our earlier discussion of the truth border: ```{code-cell} ipython3 --- @@ -545,7 +545,7 @@ plt.legend(loc=0, fontsize='large') plt.show() ``` -### Method of {cite:t}`leysieffer1976respondent` +### Method of Leysieffer and Warner (1976) {cite:t}`leysieffer1976respondent` recommend a two-dimensional measure of jeopardy that reduces to a single dimension when there is no jeopardy in a 'no' answer, which means that @@ -560,9 +560,9 @@ $$ \text{Pr}(A|\text{no})=0 $$ -This is not an optimal choice under a utilitarian approach. +This is not an optimal choice under a utilitarian approach. -### Analysis on the Method of {cite:t}`Chadhuri_Mukerjee_88` +### Analysis on the Method of Chadhuri and Mukerjee (1988) {cite}`Chadhuri_Mukerjee_88` argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". @@ -574,13 +574,13 @@ $$ Here the gain from lying is too high for someone to volunteer a "yes" answer. -This means that +This means that $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)< U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) $$ -in any situation always. +always holds in any situation. As a result, there is no attainable model design. 
@@ -588,7 +588,7 @@ However, under a utilitarian approach there should exist other survey designs th In particular, respondents will choose to answer truthfully if the relative advantage from lying is eliminated. -We can use Python to show that the optimal model design +We can use Python to show the optimal model design ```{code-cell} ipython3 def f(x): @@ -645,7 +645,7 @@ plt.show() Here the optimal model design corresponds to point $Q$ in {numref}`fig-optimal-design`. -### Method of {cite:t}`greenberg1977respondent` +### Method of Greenberg et al. (1977) {cite:t}`greenberg1977respondent` defined the hazard for an individual in $A$ as the probability that he or she is perceived as belonging to $A$: @@ -691,7 +691,7 @@ and it follows that: Even though this hazard can be set arbitrarily close to 0, an individual in $A$ will completely reveal his or her identity whenever truthfully answering the sensitive question. -However, under utilitarian framework, it is obviously contradictory. +However, under a utilitarian framework, this is obviously contradictory. If the individuals are willing to volunteer this information, it seems that the randomized response design was not necessary in the first place. @@ -710,10 +710,10 @@ If a privacy measure is not completely consistent with the rational behavior of A utilitarian approach provides a systematic way to model respondents' behavior under the assumption that they maximize their expected utilities. -In a utilitarian analysis: +In a utilitarian analysis: - A truth border divides the space of conditional probabilities of being perceived as belonging to the sensitive group, $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$, into the truth-telling region and the lying region. - The optimal model design is obtained at the point where the truth border touches the lowest possible iso-variance curve. -A practical implication of the analysis of {cite}`ljungqvist1993unified` is that uncertainty about respondents' demands for privacy can be acknowledged by **choosing $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$ sufficiently close to each other**. +A practical implication of the analysis of {cite}`ljungqvist1993unified` is that uncertainty about respondents' demands for privacy can be acknowledged by **choosing $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$ sufficiently close to each other**. From d10dd4d89abb933914997530b39420d4a473b8ce Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Tue, 30 Sep 2025 20:42:25 +1000 Subject: [PATCH 6/8] update one-sentence rule --- lectures/util_rand_resp.md | 87 +++++++++++++++++++++++--------------- 1 file changed, 53 insertions(+), 34 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index eb5fd964b..c0f50b0e7 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -30,7 +30,7 @@ The lecture tells how Ljungqvist used his framework to shed light on alternative ## Privacy Measures -We consider randomized response models with only two possible answers, "yes" and "no." +We consider randomized response models with only two possible answers, "yes" and "no." The design determines probabilities @@ -49,7 +49,7 @@ $$ $$ (eq:util-rand-one) -## Zoo of Concepts +## Zoo of concepts At this point we describe some concepts proposed by various researchers. 
@@ -100,7 +100,7 @@ An efficient randomized response model is, therefore, any model that attains the As a special example, Leysieffer and Warner considered "a problem in which there is no jeopardy in a no answer"; that is, $g(\text{no}|A^{'})$ can be of unlimited magnitude. -Evidently, an optimal design must have +Evidently, an optimal design must have $$ \text{Pr}(\text{yes}|A)=1 @@ -116,7 +116,7 @@ $$ {cite:t}`lanke1975choice` argued that "it is membership in Group A that people may want to hide, not membership in the complementary Group A'." -For that reason, {cite:t}`lanke1976degree` argued that an appropriate measure of protection is to minimize +For that reason, {cite:t}`lanke1976degree` argued that an appropriate measure of protection is to minimize $$ \max \left\{ \text{Pr}(A|\text{yes}) , \text{Pr}(A|\text{no}) \right\} @@ -167,24 +167,25 @@ $$ (eq:util-rand-eight-b) This measure is just the first term in {eq}`eq:util-rand-seven-a`, i.e., the probability that an individual answers "yes" and is perceived to belong to $A$. -## Respondent's Expected Utility +## Respondent's expected utility -### Truth Border +### Truth border -Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs to $A$ are: +Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs to $A$ are: - **Assumption 1**: Respondents feel discomfort from being thought of as belonging to $A$. -- **Assumption 2**: Respondents prefer to answer questions truthfully than to lie, so long as the cost of doing so is not too high. The cost is taken to be the discomfort in 1. +- **Assumption 2**: Respondents prefer to answer questions truthfully than to lie, so long as the cost of doing so is not too high. + + - The cost is taken to be the discomfort in 1. Let $r_i$ denote individual $i$'s response to the randomized question. $r_i$ can only take values "yes" or "no". -For a given design of a randomized response interview and a given belief about the fraction of the population -that belongs to $A$, the respondent's answer is associated with a conditional probability $ \text{Pr}(A|r_i)$ that the individual belongs to $A$. +For a given design of a randomized response interview and a given belief about the fraction of the population that belongs to $A$, the respondent's answer is associated with a conditional probability $\text{Pr}(A|r_i)$ that the individual belongs to $A$. -Given $r_i$ and complete privacy, the individual's utility is higher if $r_i$ represents a truthful answer rather than a lie. +Given $r_i$ and complete privacy, the individual's utility is higher if $r_i$ represents a truthful answer rather than a lie. In terms of a respondent's expected utility as a function of $ \text{Pr}(A|r_i)$ and $r_i$ @@ -207,19 +208,19 @@ $$ (eq:util-rand-nine-a) and $$ -U_i\left(\text{Pr}(A|r_i),\text{truth}\right)>U_i\left(\text{Pr}(A|r_i),\text{lie}\right) , \text{ for } \text{Pr}(A|r_i) \in [0,1] +U_i\left(\text{Pr}(A|r_i),\text{truth}\right)>U_i\left(\text{Pr}(A|r_i),\text{lie}\right), \text{ for } \text{Pr}(A|r_i) \in [0,1] $$ (eq:util-rand-nine-b) -Suppose now that correct answer for individual $i$ is "yes". +Suppose now that correct answer for individual $i$ is "yes". 
-Individual $i$ would choose to answer truthfully if +Individual $i$ would choose to answer truthfully if $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)\geq U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) $$ (eq:util-rand-ten-a) -If the correct answer is "no", individual $i$ would volunteer the correct answer only if +If the correct answer is "no", individual $i$ would volunteer the correct answer only if $$ U_i\left(\text{Pr}(A|\text{no}),\text{truth}\right)\geq U_i\left(\text{Pr}(A|\text{yes}),\text{lie}\right) @@ -235,15 +236,15 @@ so that a "yes" answer increases the odds that an individual belongs to $A$. Constraint {eq}`eq:util-rand-ten-b` holds for sure. -Consequently, constraint {eq}`eq:util-rand-ten-a` becomes the single necessary condition for individual $i$ always to answer truthfully. +Consequently, constraint {eq}`eq:util-rand-ten-a` becomes the single necessary condition for individual $i$ always to answer truthfully. -At equality, constraint {eq}`eq:util-rand-ten-a` determines conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": +At equality, constraint {eq}`eq:util-rand-ten-a` determines conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)= U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) $$ (eq:util-rand-eleven) -Equation {eq}`eq:util-rand-eleven` defines a "truth border". +Equation {eq}`eq:util-rand-eleven` defines a "truth border". Differentiating {eq}`eq:util-rand-eleven` with respect to the conditional probabilities shows that the truth border has a positive slope in the space of conditional probabilities: @@ -255,17 +256,25 @@ The source of the positive relationship is: - The individual is willing to volunteer a truthful "yes" answer so long as the utility from doing so (i.e., the left side of {eq}`eq:util-rand-eleven`) is at least as high as the utility of lying on the right side of {eq}`eq:util-rand-eleven`. -- Suppose now that $\text{Pr}(A|\text{yes})$ increases. That reduces the utility of telling the truth. To preserve indifference between a truthful answer and a lie, $\text{Pr}(A|\text{no})$ must increase to reduce the utility of lying. +- Suppose now that $\text{Pr}(A|\text{yes})$ increases. + + - This reduces the utility of telling the truth. + + - To preserve indifference between a truthful answer and a lie, $\text{Pr}(A|\text{no})$ must increase to reduce the utility of lying. -### Drawing a Truth Border +### Drawing a truth border We can deduce two things about the truth border: - The truth border divides the space of conditional probabilities into two subsets: "truth telling" and "lying". - - Thus, sufficient privacy elicits a truthful answer, whereas insufficient privacy results in a lie. The truth border depends on a respondent's utility function. + - Thus, sufficient privacy elicits a truthful answer, whereas insufficient privacy results in a lie. + + - The truth border depends on a respondent's utility function. + +- Assumptions in {eq}`eq:util-rand-nine-a` and {eq}`eq:util-rand-nine-a` are sufficient only to guarantee a positive slope of the truth border. -- Assumptions in {eq}`eq:util-rand-nine-a` and {eq}`eq:util-rand-nine-a` are sufficient only to guarantee a positive slope of the truth border. The truth border can have either a concave or a convex shape. +The truth border can have either a concave or a convex shape. 
We can draw some truth borders with the following Python code: @@ -341,9 +350,9 @@ plt.legend(loc=0, fontsize='large') plt.show() ``` -## Utilitarian View of Survey Design +## Utilitarian view of survey design -### Iso-variance Curves +### Iso-variance curves A statistician's objective is @@ -378,7 +387,7 @@ From expression {eq}`eq:util-rand-thirteen`, {eq}`eq:util-rand-fourteen-a` and { - Iso-variance curves are always upward-sloping and concave. -### Drawing Iso-variance Curves +### Drawing iso-variance curves We use Python code to draw iso-variance curves. @@ -452,7 +461,7 @@ var = Iso_Variance(π=0.3, n=100) var.plotting_iso_variance_curve() ``` -### Optimal Survey +### Optimal survey A point on an iso-variance curves can be attained with the unrelated question design. @@ -476,19 +485,27 @@ Here are some comments about the model design: - An equilibrium of the optimal design model is a Nash equilibrium of a noncooperative game. -- Assumption {eq}`eq:util-rand-nine-b` is sufficient to guarantee existence of an optimal model design. By choosing $\text{ Pr}(A|\text{yes})$ and $\text{ Pr}(A|\text{no})$ sufficiently close to each other, all respondents will find it optimal to answer truthfully. The closer are these probabilities, the higher the variance of the estimator becomes. +- Assumption {eq}`eq:util-rand-nine-b` is sufficient to guarantee existence of an optimal model design. -- If respondents experience a large enough increase in expected utility from telling the truth, then there is no need to use a randomized response model. The smallest possible variance of the estimate is then obtained at $\text{ Pr}(A|\text{yes})=1$ and $\text{ Pr}(A|\text{no})=0$ ; that is, when respondents answer truthfully to direct questioning. + - By choosing $\text{ Pr}(A|\text{yes})$ and $\text{ Pr}(A|\text{no})$ sufficiently close to each other, all respondents will find it optimal to answer truthfully. -- A more general design problem would be to minimize some weighted sum of the estimator's variance and bias. It would be optimal to accept some lies from the most "reluctant" respondents. + - The closer are these probabilities, the higher the variance of the estimator becomes. -## Criticisms of Proposed Privacy Measures +- If respondents experience a large enough increase in expected utility from telling the truth, then there is no need to use a randomized response model. + +The smallest possible variance of the estimate is then obtained at $\text{ Pr}(A|\text{yes})=1$ and $\text{ Pr}(A|\text{no})=0$; that is, when respondents answer truthfully to direct questioning. + +- A more general design problem would be to minimize some weighted sum of the estimator's variance and bias. + + - It would be optimal to accept some lies from the most "reluctant" respondents. + +## Criticisms of proposed privacy measures We can use a utilitarian approach to analyze some privacy measures. We'll enlist Python Code to help us. -### Analysis of Method of Lanke (1976) +### Analysis of method of Lanke (1976) {cite:t}`lanke1976degree` recommends a privacy protection criterion that minimizes: @@ -496,7 +513,9 @@ $$ \max \left\{ \text{Pr}(A|\text{yes}) , \text{Pr}(A|\text{no}) \right\} $$ (eq:util-rand-five-b) -Following Lanke's suggestion, the statistician should find the highest possible $\text{ Pr}(A|\text{yes})$ consistent with truth telling while $\text{ Pr}(A|\text{no})$ is fixed at 0. The variance is then minimized at point $X$ in {numref}`fig-lanke-analysis`. 
+Following Lanke's suggestion, the statistician should find the highest possible $\text{ Pr}(A|\text{yes})$ consistent with truth telling while $\text{ Pr}(A|\text{no})$ is fixed at 0. + +The variance is then minimized at point $X$ in {numref}`fig-lanke-analysis`. However, as shown in {numref}`fig-lanke-analysis`, point $Z$ offers a smaller variance that still allows cooperation of the respondents, and it is achievable following our earlier discussion of the truth border: @@ -562,7 +581,7 @@ $$ This is not an optimal choice under a utilitarian approach. -### Analysis on the Method of Chadhuri and Mukerjee (1988) +### Analysis on the method of Chadhuri and Mukerjee (1988) {cite}`Chadhuri_Mukerjee_88` argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". @@ -697,7 +716,7 @@ If the individuals are willing to volunteer this information, it seems that the It ignores the fact that respondents retain the option of lying until they have seen the question to be answered. -## Concluding Remarks +## Concluding remarks The justifications for a randomized response procedure are that From 088f30f2d0781728761eb3e66179811239f28549 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Fri, 17 Oct 2025 15:57:20 +1100 Subject: [PATCH 7/8] updates --- lectures/util_rand_resp.md | 202 +++++++++++++++++++++++-------------- 1 file changed, 128 insertions(+), 74 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index c0f50b0e7..e855a3e51 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -3,18 +3,16 @@ jupytext: text_representation: extension: .md format_name: myst + format_version: 0.13 + jupytext_version: 1.17.1 kernelspec: - display_name: Python 3 - language: python name: python3 + display_name: Python 3 (ipykernel) + language: python --- -```{code-cell} ipython3 -import matplotlib.pyplot as plt -import numpy as np -``` -# Expected Utilities of Random Responses +# Expected utilities of random responses ## Overview @@ -26,9 +24,15 @@ Lars Ljungqvist {cite}`ljungqvist1993unified` analyzed how a respondent's decisi The lecture tells how Ljungqvist used his framework to shed light on alternative randomized response survey techniques proposed, for example, by {cite}`lanke1975choice`, {cite}`lanke1976degree`, {cite}`leysieffer1976respondent`, {cite}`anderson1976estimation`, {cite}`fligner1977comparison`, {cite}`greenberg1977respondent`, {cite}`greenberg1969unrelated`. +We use the following imports: +```{code-cell} ipython3 +import matplotlib.pyplot as plt +import numpy as np +``` -## Privacy Measures + +## Privacy measures We consider randomized response models with only two possible answers, "yes" and "no." @@ -51,7 +55,7 @@ $$ (eq:util-rand-one) ## Zoo of concepts -At this point we describe some concepts proposed by various researchers. +At this point, we describe some concepts proposed by various researchers. ### Leysieffer and Warner (1976) @@ -71,7 +75,7 @@ $$ \frac{\text{Pr}(A|r)}{\text{Pr}(A^{'}|r)}\times \frac{(1-\pi_A)}{\pi_A} = \frac{\text{Pr}(r|A)}{\text{Pr}(r|A^{'})} $$ (eq:util-rand-three) -If this expression is greater (less) than unity, it follows that $r$ is jeopardizing with respect to $A$($A^{'}$). +If this expression is greater (less) than unity, it follows that $r$ is jeopardizing with respect to $A$ ($A^{'}$). 
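For instance, in Warner's original design the sensitive question is asked with probability $p$ and its complement with probability $1-p$, so that truthful answers imply $\text{Pr}(\text{yes}|A) = p$ and $\text{Pr}(\text{yes}|A^{'}) = 1-p$. The short check below, with an assumed value of $p$, shows that a "yes" answer is then jeopardizing with respect to $A$ whenever $p > 1/2$.

```{code-cell} ipython3
p = 0.7   # assumed probability of drawing the sensitive question

ratio_yes = p / (1 - p)      # Pr(yes|A) / Pr(yes|A')
ratio_no = (1 - p) / p       # Pr(no|A) / Pr(no|A')

print(f"yes: {ratio_yes:.2f}  (> 1, jeopardizing with respect to A)")
print(f"no:  {ratio_no:.2f}  (< 1, jeopardizing with respect to A')")
```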
Then, the natural measure of jeopardy will be: @@ -84,7 +88,7 @@ g(r|A^{'})&=\frac{\text{Pr}(r|A^{'})}{\text{Pr}(r|A)} $$ (eq:util-rand-four) -Suppose, without loss of generality, that $\text{Pr}(\text{yes}|A)>\text{Pr}(\text{yes}|A^{'})$, then a yes (no) answer is jeopardizing with respect $A$($A^{'}$), that is, +Suppose, without loss of generality, that $\text{Pr}(\text{yes}|A)>\text{Pr}(\text{yes}|A^{'})$, then a yes (no) answer is jeopardizing with respect to $A$ ($A^{'}$), that is, $$ \begin{aligned} @@ -126,7 +130,7 @@ Holding this measure constant, he explained under what conditions the smallest v ### Fligner et al. (1977) -{cite:t}`fligner1977comparison` reached similar conclusion as {cite:t}`lanke1976degree`. +{cite:t}`fligner1977comparison` reached a similar conclusion as {cite:t}`lanke1976degree`. They measured "private protection" as @@ -211,7 +215,7 @@ $$ U_i\left(\text{Pr}(A|r_i),\text{truth}\right)>U_i\left(\text{Pr}(A|r_i),\text{lie}\right), \text{ for } \text{Pr}(A|r_i) \in [0,1] $$ (eq:util-rand-nine-b) -Suppose now that correct answer for individual $i$ is "yes". +Suppose now that the correct answer for individual $i$ is "yes". Individual $i$ would choose to answer truthfully if @@ -238,7 +242,7 @@ Constraint {eq}`eq:util-rand-ten-b` holds for sure. Consequently, constraint {eq}`eq:util-rand-ten-a` becomes the single necessary condition for individual $i$ always to answer truthfully. -At equality, constraint {eq}`eq:util-rand-ten-a` determines conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": +At equality, constraint {eq}`eq:util-rand-ten-a` determines the conditional probabilities that make the individual indifferent between telling the truth and lying when the correct answer is "yes": $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)= U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) @@ -282,8 +286,9 @@ We can draw some truth borders with the following Python code: --- mystnb: figure: - caption: | - Three types of truth border + caption: 'Three types of truth border + + ' name: fig-truth-borders --- x1 = np.arange(0, 1, 0.001) @@ -293,11 +298,23 @@ y2 = (pow(x2, 0.5) - 0.4)**2 x3 = np.arange(0.4**0.5, 1, 0.001) y3 = pow(x3**2 - 0.4, 0.5) plt.figure(figsize=(12, 10)) -plt.plot(x1, y1, 'r-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)+f(\phi_i)$') +plt.plot( + x1, y1, 'r-', + label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=' + r'-Pr(A|r_i)+f(\phi_i)$' +) plt.fill_between(x1, 0, y1, facecolor='red', alpha=0.05) -plt.plot(x2, y2, 'b-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)^{2}+f(\phi_i)$') +plt.plot( + x2, y2, 'b-', + label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=' + r'-Pr(A|r_i)^{2}+f(\phi_i)$' +) plt.fill_between(x2, 0, y2, facecolor='blue', alpha=0.05) -plt.plot(x3, y3, 'y-', label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-\sqrt{Pr(A|r_i)}+f(\phi_i)$') +plt.plot( + x3, y3, 'y-', + label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=' + r'-\sqrt{Pr(A|r_i)}+f(\phi_i)$' +) plt.fill_between(x3, 0, y3, facecolor='green', alpha=0.05) plt.plot(x1, x1, ':', linewidth=2) plt.xlim([0, 1]) @@ -305,14 +322,15 @@ plt.ylim([0, 1]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') -plt.text(0.42, 0.3, "Truth Telling", fontdict={'size':28, 'style':'italic'}) -plt.text(0.8, 0.1, "Lying", fontdict={'size':28, 'style':'italic'}) +plt.text(0.42, 0.3, "Truth Telling", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.8, 0.1, "Lying", + fontdict={'size': 28, 'style': 
'italic'}) plt.legend(loc=0, fontsize='large') plt.show() ``` - Without loss of generality, we consider the truth border: $$ @@ -325,8 +343,9 @@ and plot the "truth telling" and "lying area" of individual $i$ in {numref}`fig- --- mystnb: figure: - caption: | - Truth telling and lying areas of individual $i$ + caption: 'Truth telling and lying areas of individual $i$ + + ' name: fig-truth-lying-areas --- x1 = np.arange(0, 1, 0.001) @@ -334,17 +353,25 @@ y1 = x1 - 0.4 z1 = x1 z2 = 0 plt.figure(figsize=(12, 10)) -plt.plot(x1, y1,'r-',label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=-Pr(A|r_i)+f(\phi_i)$') +plt.plot( + x1, y1, 'r-', + label=r'Truth Border of: $U_i(Pr(A|r_i),\phi_i)=' + r'-Pr(A|r_i)+f(\phi_i)$' +) plt.plot(x1, x1, ':', linewidth=2) -plt.fill_between(x1, y1, z1, facecolor='blue', alpha=0.05, label='truth telling') -plt.fill_between(x1, z2, y1, facecolor='green', alpha=0.05, label='lying') +plt.fill_between(x1, y1, z1, facecolor='blue', alpha=0.05, + label='truth telling') +plt.fill_between(x1, z2, y1, facecolor='green', alpha=0.05, + label='lying') plt.xlim([0, 1]) plt.ylim([0, 1]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') -plt.text(0.5, 0.4, "Truth Telling", fontdict={'size':28, 'style':'italic'}) -plt.text(0.8, 0.2, "Lying", fontdict={'size':28, 'style':'italic'}) +plt.text(0.5, 0.4, "Truth Telling", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.8, 0.2, "Lying", + fontdict={'size': 28, 'style': 'italic'}) plt.legend(loc=0, fontsize='large') plt.show() @@ -358,7 +385,7 @@ A statistician's objective is - to find a randomized response survey design that minimizes the bias and the variance of the estimator. -Given a design that ensures truthful answers by all respondents, Anderson(1976, Theorem 1) {cite}`anderson1976estimation` showed that the minimum variance estimate in the two-response model has variance +Given a design that ensures truthful answers by all respondents, Anderson (1976, Theorem 1) {cite}`anderson1976estimation` showed that the minimum variance estimate in the two-response model has variance $$ \begin{aligned} @@ -409,29 +436,35 @@ class Iso_Variance: π = self.π n = self.n - nv = np.array([0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7]) + nv = np.array([0.27, 0.34, 0.49, 0.74, 0.92, 1.1, + 1.47, 2.94, 14.7]) x = np.arange(0, 1, 0.001) x0 = np.arange(π, 1, 0.001) x2 = np.arange(0, π, 0.001) y1 = [π for i in x0] y2 = [π for i in x2] - y0 = 1 / (1 + (x0 * (1 - π)**2) / ((1 - x0) * π**2)) + y0 = 1 / (1 + (x0 * (1 - π)**2) / + ((1 - x0) * π**2)) plt.figure(figsize=(12, 10)) plt.plot(x0, y0, 'm-', label='Warner') plt.plot(x, x, 'c:', linewidth=2) - plt.plot(x0, y1,'c:', linewidth=2) + plt.plot(x0, y1, 'c:', linewidth=2) plt.plot(y2, x2, 'c:', linewidth=2) for i in range(len(nv)): - y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) - plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') + y = π - (π**2 * (1 - π)**2) / \ + (n * (nv[i] / n) * (x0 - π + 1e-8)) + plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, + label=f'V{i+1}') plt.xlim([0, 1]) plt.ylim([0, 0.5]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') plt.legend(loc=0, fontsize='large') - plt.text(0.32, 0.28, "High Var", fontdict={'size':15, 'style':'italic'}) - plt.text(0.91, 0.01, "Low Var", fontdict={'size':15, 'style':'italic'}) + plt.text(0.32, 0.28, "High Var", + fontdict={'size': 15, 'style': 'italic'}) + plt.text(0.91, 0.01, "Low Var", + fontdict={'size': 15, 'style': 'italic'}) plt.show() ``` @@ -439,7 +472,7 @@ Properties of iso-variance curves are: - All points on one 
iso-variance curve share the same variance -- From $V_1$ to $V_9$, the variance of the iso-variance curve increase monotonically, as colors brighten monotonically +- From $V_1$ to $V_9$, the variance of the iso-variance curve increases monotonically, as colors brighten monotonically Suppose the parameters of the iso-variance model follow those in Ljungqvist {cite}`ljungqvist1993unified`, which are: @@ -453,8 +486,9 @@ Then we can plot the iso-variance curve in {numref}`fig-iso-variance`: --- mystnb: figure: - caption: | - Iso-variance curves for randomized response survey design + caption: 'Iso-variance curves for randomized response survey design + + ' name: fig-iso-variance --- var = Iso_Variance(π=0.3, n=100) @@ -463,7 +497,7 @@ var.plotting_iso_variance_curve() ### Optimal survey -A point on an iso-variance curves can be attained with the unrelated question design. +A point on iso-variance curves can be attained with the unrelated question design. We now focus on finding an "optimal survey design" that @@ -477,7 +511,7 @@ To construct an optimal design - The point where this set touches the lowest possible iso-variance curve determines an optimal survey design. -Consquently, a minimum variance unbiased estimator is pinned down by an individual who is the least willing to volunteer a truthful answer. +Consequently, a minimum variance unbiased estimator is pinned down by an individual who is the least willing to volunteer a truthful answer. Here are some comments about the model design: @@ -485,7 +519,7 @@ Here are some comments about the model design: - An equilibrium of the optimal design model is a Nash equilibrium of a noncooperative game. -- Assumption {eq}`eq:util-rand-nine-b` is sufficient to guarantee existence of an optimal model design. +- Assumption {eq}`eq:util-rand-nine-b` is sufficient to guarantee the existence of an optimal model design. - By choosing $\text{ Pr}(A|\text{yes})$ and $\text{ Pr}(A|\text{no})$ sufficiently close to each other, all respondents will find it optimal to answer truthfully. @@ -503,11 +537,11 @@ The smallest possible variance of the estimate is then obtained at $\text{ Pr}(A We can use a utilitarian approach to analyze some privacy measures. -We'll enlist Python Code to help us. +We'll enlist Python code to help us. 
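A small helper makes the comparisons below easy to check numerically: it simply rearranges the iso-variance relation used for the curves above, with the same illustrative $\pi_A = 0.3$ and $n = 100$, and evaluates it at the design points labelled in the figures of this section.

```{code-cell} ipython3
def design_variance(x, y, π=0.3, n=100):
    """Variance implied by a design point x = Pr(A|yes), y = Pr(A|no)."""
    return π**2 * (1 - π)**2 / (n * (x - π) * (π - y))

# Points labelled in the figures below
print(f"X = (0.4, 0):     {design_variance(0.4, 0.0):.4f}")
print(f"Z = (0.498, 0.1): {design_variance(0.498, 0.1):.4f}")
```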
-### Analysis of method of Lanke (1976) +### Analysis of the method of Lanke (1976) -{cite:t}`lanke1976degree` recommends a privacy protection criterion that minimizes: +{cite:t}`lanke1976degree` recommends a privacy protection criterion that minimizes: $$ \max \left\{ \text{Pr}(A|\text{yes}) , \text{Pr}(A|\text{no}) \right\} @@ -523,14 +557,16 @@ However, as shown in {numref}`fig-lanke-analysis`, point $Z$ offers a smaller va --- mystnb: figure: - caption: | - Analysis of Lanke's privacy protection method showing optimal design points + caption: 'Analysis of Lanke''s privacy protection method showing optimal design + points + + ' name: fig-lanke-analysis --- - π = 0.3 n = 100 -nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7] +nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, + 14.7] x = np.arange(0, 1, 0.001) y = x - 0.4 z = x @@ -544,22 +580,30 @@ plt.plot(x, x, 'c:', linewidth=2) plt.plot(x0, y1, 'c:', linewidth=2) plt.plot(y2, x2, 'c:', linewidth=2) plt.plot(x, y, 'r-', label='Truth Border') -plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='truth telling') -plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, label='lying') +plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, + label='truth telling') +plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, + label='lying') for i in range(len(nv)): - y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) - plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') + y = π - (π**2 * (1 - π)**2) / \ + (n * (nv[i] / n) * (x0 - π + 1e-8)) + plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, + label=f'V{i+1}') -plt.scatter(0.498, 0.1, c='b', marker='*', label='Z', s=150) +plt.scatter(0.498, 0.1, c='b', marker='*', label='Z', + s=150) plt.scatter(0.4, 0, c='y', label='X', s=150) plt.xlim([0, 1]) plt.ylim([0, 0.5]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') -plt.text(0.45, 0.35, "Truth Telling", fontdict={'size':28, 'style':'italic'}) -plt.text(0.85, 0.35, "Lying",fontdict = {'size':28, 'style':'italic'}) -plt.text(0.515, 0.095, "Optimal Design", fontdict={'size':16,'color':'b'}) +plt.text(0.45, 0.35, "Truth Telling", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.85, 0.35, "Lying", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.515, 0.095, "Optimal Design", + fontdict={'size': 16, 'color': 'b'}) plt.legend(loc=0, fontsize='large') plt.show() ``` @@ -581,9 +625,9 @@ $$ This is not an optimal choice under a utilitarian approach. -### Analysis on the method of Chadhuri and Mukerjee (1988) +### Analysis of the method of Chaudhuri and Mukerjee (1988) -{cite}`Chadhuri_Mukerjee_88` argued that the individual may find that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". +{cite:t}`Chadhuri_Mukerjee_88` argued that since "yes" may sometimes relate to the sensitive group A, a clever respondent may falsely but safely always be inclined to respond "no". In this situation, the truth border is such that individuals choose to lie whenever the truthful answer is "yes" and @@ -599,7 +643,7 @@ $$ U_i\left(\text{Pr}(A|\text{yes}),\text{truth}\right)< U_i\left(\text{Pr}(A|\text{no}),\text{lie}\right) $$ -always holds in any situation. +always holds. As a result, there is no attainable model design. @@ -607,7 +651,7 @@ However, under a utilitarian approach there should exist other survey designs th In particular, respondents will choose to answer truthfully if the relative advantage from lying is eliminated. 
-We can use Python to show the optimal model design +We can use Python to show the optimal model design. ```{code-cell} ipython3 def f(x): @@ -621,14 +665,16 @@ def f(x): --- mystnb: figure: - caption: | - Optimal survey design under utilitarian approach showing computed point $Q$ + caption: 'Optimal survey design under utilitarian approach showing computed point + $Q$ + + ' name: fig-optimal-design --- - π = 0.3 n = 100 -nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, 14.7] +nv = [0.27, 0.34, 0.49, 0.74, 0.92, 1.1, 1.47, 2.94, + 14.7] x = np.arange(0, 1, 0.001) y = x - 0.4 z = x @@ -642,22 +688,30 @@ plt.plot(x, x, 'c:', linewidth=2) plt.plot(x0, y1, 'c:', linewidth=2) plt.plot(y2, x2, 'c:', linewidth=2) plt.plot(x, y, 'r-', label='Truth Border') -plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, label='truth telling') -plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, label='lying') +plt.fill_between(x, y, z, facecolor='blue', alpha=0.05, + label='truth telling') +plt.fill_between(x, 0, y, facecolor='green', alpha=0.05, + label='lying') for i in range(len(nv)): - y = π - (π**2 * (1 - π)**2) / (n * (nv[i] / n) * (x0 - π + 1e-8)) - plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, label=f'V{i+1}') + y = π - (π**2 * (1 - π)**2) / \ + (n * (nv[i] / n) * (x0 - π + 1e-8)) + plt.plot(x0, y, 'k--', alpha=1 - 0.07 * i, + label=f'V{i+1}') -plt.scatter(0.498, 0.1, c='b', marker='*', label='Q', s=150) +plt.scatter(0.498, 0.1, c='b', marker='*', label='Q', + s=150) plt.scatter(0.4, 0, c='y', label='X', s=150) plt.xlim([0, 1]) plt.ylim([0, 0.5]) plt.xlabel('Pr(A|yes)') plt.ylabel('Pr(A|no)') -plt.text(0.45, 0.35, "Truth Telling", fontdict={'size':28, 'style':'italic'}) -plt.text(0.85, 0.35, "Lying",fontdict = {'size':28, 'style':'italic'}) -plt.text(0.515, 0.095, "Optimal Design", fontdict={'size':16,'color':'b'}) +plt.text(0.45, 0.35, "Truth Telling", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.85, 0.35, "Lying", + fontdict={'size': 28, 'style': 'italic'}) +plt.text(0.515, 0.095, "Optimal Design", + fontdict={'size': 16, 'color': 'b'}) plt.legend(loc=0, fontsize='large') plt.show() ``` From 14530ad3641dc0db98accdc8a870d405f909e855 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Fri, 17 Oct 2025 16:18:09 +1100 Subject: [PATCH 8/8] updates --- lectures/util_rand_resp.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index e855a3e51..80b7bff4c 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -175,13 +175,16 @@ This measure is just the first term in {eq}`eq:util-rand-seven-a`, i.e., the pro ### Truth border -Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs to $A$ are: +Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs to $A$ are -- **Assumption 1**: Respondents feel discomfort from being thought of as belonging to $A$. +```{prf:assumption} -- **Assumption 2**: Respondents prefer to answer questions truthfully than to lie, so long as the cost of doing so is not too high. +- *Assumption 1*: Respondents feel discomfort from being thought of as belonging to $A$. + +- *Assumption 2*: Respondents prefer to answer questions truthfully than to lie, so long as the cost of doing so is not too high. - The cost is taken to be the discomfort in 1. +``` Let $r_i$ denote individual $i$'s response to the randomized question. 
@@ -789,4 +792,4 @@ In a utilitarian analysis: - The optimal model design is obtained at the point where the truth border touches the lowest possible iso-variance curve. -A practical implication of the analysis of {cite}`ljungqvist1993unified` is that uncertainty about respondents' demands for privacy can be acknowledged by **choosing $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$ sufficiently close to each other**. +A practical implication of the analysis of {cite}`ljungqvist1993unified` is that uncertainty about respondents' demands for privacy can be acknowledged by *choosing $\text{Pr}(A|\text{yes})$ and $\text{Pr}(A|\text{no})$ sufficiently close to each other*.