Data 102, Spring 2024
Homework 5
Due: 5:00 PM PT Friday, April 12, 2024
Submission Instructions
Homework assignments throughout the course will have a written portion and a code portion.
Please follow the directions below to properly submit both portions.
Written Portion:
• Every answer should contain a calculation or reasoning.
• You may write the written portions on paper or in LaTeX.
• If you type your written responses, please put them in a markdown cell instead
of writing them as comments in a code cell.
• Please start each question on a new page.
• It is your responsibility to check that work on all the scanned pages is legible.
Code Portion:
• You should append any code you wrote to the PDF you submit. You can do so either
by copying and pasting the code into a text file or by converting your Jupyter Notebook to PDF.
• Run your notebook and make sure you print out your outputs from running the code.
• It is your responsibility to check that your code and answers show up in the PDF file.
Submitting:
You will submit a PDF file to Gradescope containing all the work you want graded (including
your math and code).
• When downloading your Jupyter Notebook, make sure you go to File → Save and
Export Notebook As → PDF; do not just print the page from your web browser because
your code and written responses will be cut off.
• Combine the PDFs from the written and code portions into one PDF. Here is a useful
tool for doing so. As a Berkeley student, you get free access to Adobe Acrobat, which
you can use to merge as many PDFs as you want.
• Please see this guide for how to submit your PDF on Gradescope. In particular, for
each question on the assignment, please make sure you understand how to select the
corresponding page(s) that contain your solution (see item 2 on the last page).
Late assignments will count towards your slip days; it is your responsibility to ensure you
have enough time to submit your work.
Data science is a collaborative activity. While you may talk with others about the homework, please write up your solutions individually. If you discuss the homework with your
peers, please include their names on your submission. Please make sure any handwritten
answers are legible, as we may deduct points otherwise.
Simulation Study of Bandit Algorithms
In this problem, we evaluate the performance of two algorithms for the multi-armed bandit
problem. The general protocol for the multi-armed bandit problem with $K$ arms and $n$ rounds
is as follows: in each round $t = 1, \dots, n$ the algorithm chooses an arm $A_t \in \{1, \dots, K\}$ and
then observes reward $X_t$ for the chosen arm. The bandit algorithm specifies how to choose
the arm $A_t$ based on what rewards have been observed so far. In this problem, we consider
a multi-armed bandit for $K = 2$ arms, $n = 50$ rounds, and where the reward at time $t$ is
$X_t \sim N(A_t - 1, 1)$, i.e. $N(0, 1)$ for arm 1 and $N(1, 1)$ for arm 2.
(a) (4 points) Consider the multi-armed bandit where the arm $A_t \in \{1, 2\}$ is chosen according to the explore-then-commit algorithm (below) with $c = 4$. Let $G_n = \sum_{t=1}^{n} X_t$ denote
the total reward after $n = 50$ iterations. Simulate the random variable $G_n$ a total of
$B = 2000$ times and save the values $G_n^{(b)}$, $b = 1, \dots, B$, in a list. Report the empirical
averaged regret $\frac{1}{B} \sum_{b=1}^{B} \left( 50\mu^* - G_n^{(b)} \right)$ (where $\mu^*$ is the mean of the best arm) and plot
a normalized histogram of the rewards.
Algorithm 1 Explore-then-Commit Algorithm
input: Number of initial pulls $c$ per arm
for $t = 1, \dots, cK$ do
    Choose arm $A_t = (t \bmod K) + 1$
end
Let $\hat{A} \in \{1, \dots, K\}$ denote the arm with the highest average reward so far.
for $t = cK + 1, cK + 2, \dots, n$ do
    Choose arm $A_t = \hat{A}$
end
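As a rough illustration of Algorithm 1 and the simulation protocol in part (a), the following is a minimal sketch, assuming NumPy is available. The function name `explore_then_commit`, the seed, and the variable names are illustrative choices rather than anything prescribed by the assignment.

```python
import numpy as np

def explore_then_commit(n=50, K=2, c=4, rng=None):
    """Run one explore-then-commit episode and return the total reward G_n."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(K)   # number of pulls of each arm
    sums = np.zeros(K)     # summed reward of each arm
    committed = None       # arm chosen after the exploration phase
    for t in range(1, n + 1):
        if t <= c * K:
            arm = (t % K) + 1                      # exploration: cycle through arms
        else:
            if committed is None:                  # commit once, right after exploring
                committed = int(np.argmax(sums / counts)) + 1
            arm = committed
        reward = rng.normal(arm - 1, 1.0)          # X_t ~ N(A_t - 1, 1)
        counts[arm - 1] += 1
        sums[arm - 1] += reward
    return sums.sum()

rng = np.random.default_rng(0)                     # seed chosen arbitrarily
B = 2000
totals = np.array([explore_then_commit(rng=rng) for _ in range(B)])
mu_star = 1.0                                      # mean of the best arm when K = 2
regret = np.mean(50 * mu_star - totals)            # empirical averaged regret
print(f"empirical averaged regret: {regret:.3f}")
```

Passing a single shared generator into every episode keeps the `B` repetitions independent while making the whole run reproducible from one seed.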
(b) (4 points) Consider the multi-armed bandit where the arm $A_t \in \{1, 2\}$ is chosen according to the UCB algorithm (below) with $c = 4$ and $n = 50$ rounds. Repeat the simulation
in Part (a) using the UCB algorithm, again reporting the empirical averaged regret and
the histogram of $G_n^{(b)}$ for $b = 1, \dots, B$ with $B = 2000$. How does the empirical averaged
regret compare to your results from part (a)?
Note: If $T_A(t)$ denotes the number of times arm $A$ has been chosen (up to and including
time $t$) and $\hat{\mu}_{A,t}$ is the average reward from choosing arm $A$ (up to and including $t$), then
use the upper confidence bound $\hat{\mu}_{A,T_A(t-1)} + \sqrt{\frac{2 \log(20)}{T_A(t-1)}}$. Note also that this algorithm
is slightly different from the one used in the lab and lecture, as we are using an initial
exploration phase.
Algorithm 2 UCB Algorithm
input: Number of initial pulls $c$ per arm
for $t = 1, \dots, cK$ do
    Choose arm $A_t = (t \bmod K) + 1$
end
for $t = cK + 1, cK + 2, \dots$ do
    Choose arm $A_t$ with the highest upper confidence bound so far.
end
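The arm-selection step of Algorithm 2 can be sketched as follows, assuming NumPy and the upper confidence bound from the note in part (b), i.e. the running average plus $\sqrt{2\log(20)/T_A(t-1)}$. The function name `ucb_episode` and its return values are hypothetical, chosen only for this illustration.

```python
import numpy as np

def ucb_episode(n=50, K=2, c=4, rng=None):
    """Run one UCB episode; return the total reward G_n and the pull counts."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(K)   # T_A(t-1): pulls of each arm so far
    sums = np.zeros(K)     # summed reward of each arm so far
    for t in range(1, n + 1):
        if t <= c * K:
            arm = (t % K) + 1                      # initial exploration phase
        else:
            means = sums / counts                  # average reward per arm
            bonus = np.sqrt(2 * np.log(20) / counts)
            arm = int(np.argmax(means + bonus)) + 1
        reward = rng.normal(arm - 1, 1.0)          # X_t ~ N(A_t - 1, 1)
        counts[arm - 1] += 1
        sums[arm - 1] += reward
    return sums.sum(), counts

rng = np.random.default_rng(0)                     # seed chosen arbitrarily
B = 2000
ucb_totals = np.array([ucb_episode(rng=rng)[0] for _ in range(B)])
ucb_regret = np.mean(50 * 1.0 - ucb_totals)        # mu* = 1 when K = 2
print(f"empirical averaged regret: {ucb_regret:.3f}")
```

Because every arm is pulled `c` times before the bound is first evaluated, the division by `counts` is safe as long as `c >= 1`.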
(c) (1 point) Compare the reward distributions from parts (a) and (b) by plotting both
histograms on the same plot, and briefly justify the salient differences.
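One possible way to overlay two normalized histograms is sketched below, assuming Matplotlib and NumPy. The helper `plot_reward_histograms` is hypothetical, and the arrays passed to it here are placeholder draws that only make the snippet runnable; they are not simulation results and should be replaced by the saved values of $G_n^{(b)}$ from each algorithm.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reward_histograms(rewards_by_label, bins=40, ax=None):
    """Overlay normalized histograms of total rewards using shared bin edges."""
    if ax is None:
        _, ax = plt.subplots()
    all_values = np.concatenate([np.asarray(v) for v in rewards_by_label.values()])
    # Shared edges make the bar heights of the two histograms directly comparable.
    edges = np.linspace(all_values.min(), all_values.max(), bins + 1)
    for label, values in rewards_by_label.items():
        ax.hist(values, bins=edges, density=True, alpha=0.5, label=label)
    ax.set_xlabel("Total reward $G_n$")
    ax.set_ylabel("Density")
    ax.legend()
    return ax

# Placeholder data only: NOT output of either bandit algorithm.
rng = np.random.default_rng(0)
placeholder = {
    "explore-then-commit": rng.normal(0.0, 1.0, size=2000),
    "UCB": rng.normal(0.5, 1.0, size=2000),
}
ax = plot_reward_histograms(placeholder)
```

Setting `density=True` normalizes each histogram to unit area, and `alpha=0.5` keeps both distributions visible where they overlap.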
Markov Decision Process for Robot Soccer
A soccer robot R is on a fast break toward the goal, starting in position 1. From positions
1 through 3, it can either shoot (S) or dribble the ball forward (D). From 4 it can only shoot.
If it shoots, it either scores a goal (state G) or misses (state M). If it dribbles, it either
advances a square or loses the ball, ending up in state M.
In this Markov Decision Process (MDP), the states are 1, 2, 3, 4, G, and M, where G
and M are terminal states. The transition model depends on the parameter y, which is the
probability of dribbling successfully (i.e., advancing a square). Assume a discount of γ = 1.
For $k \in \{1, 2, 3, 4\}$, we have
$$\Pr(G \mid k, S) = \frac{k}{6}, \qquad \Pr(M \mid k, S) = 1 - \frac{k}{6},$$
$$\Pr(k + 1 \mid k, D) = y, \qquad \Pr(M \mid k, D) = 1 - y,$$
$$R(k, S, G) = 1,$$
and rewards are 0 for all other transitions.
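The transition model above can be encoded directly, which may help when checking hand calculations. The following is a minimal sketch using Python's standard `fractions` module; the names `transitions` and `policy_value`, the example policy, and the value $y = 1/2$ are all assumptions made for illustration and are not part of the problem.

```python
from fractions import Fraction

TERMINAL = ("G", "M")

def transitions(state, action, y):
    """Return (probability, next_state, reward) triples for a state-action pair."""
    if action == "S":
        p_goal = Fraction(state, 6)
        return [(p_goal, "G", 1), (1 - p_goal, "M", 0)]
    if action == "D" and state < 4:                # dribbling is not allowed from 4
        return [(y, state + 1, 0), (1 - y, "M", 0)]
    raise ValueError(f"action {action!r} is not available in state {state!r}")

def policy_value(state, policy, y):
    """Expected total reward from `state` under a fixed policy, with discount 1."""
    if state in TERMINAL:
        return 0
    return sum(
        p * (reward + policy_value(next_state, policy, y))
        for p, next_state, reward in transitions(state, policy[state], y)
    )

# Hypothetical policy: dribble from positions 1 to 3, then shoot from 4.
dribble_then_shoot = {1: "D", 2: "D", 3: "D", 4: "S"}
value = policy_value(1, dribble_then_shoot, Fraction(1, 2))
print(value)  # 1/12, i.e. (1/2)^3 * 4/6
```

Exact fractions avoid floating-point rounding, so results can be compared term by term with a symbolic answer in $y$.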
(a) (3 points) Denote by $V^\pi$ the value function for the specific policy $\pi$. What is $V^\pi(1)$ for
the policy $\pi$ that always shoots?
(b) (4 points) Denote by $Q^*(s, a)$ the value of a q-state $(s, a)$, which is the expected utility
when starting with action $a$ at state $s$, and thereafter acting optimally. What is $Q^*(3, D)$
in terms of $y$?
(c) (3 points) For what range of values of $y$ is $Q^*(3, S) \ge Q^*(3, D)$? Interpret your answer
in plain English.