4 Monte Carlo simulation





RECOMMENDED READING

B & H Chapters 1.6 - 1.7.





4.1 Discussion

Monte Carlo simulation is a critical tool that we’ll use throughout the semester to explore and approximate uncertainty. The basic idea is similar to using a physical coin to understand its fairness (ie. the probability of observing Heads):

  • Flip the coin over and over and over.
  • Record the proportion of Heads in these sample flips.
  • The more times we flip the coin, the better these sample proportions approximate the “true” fairness of the coin.

The main difference is that in a Monte Carlo simulation, we use a computer to do the heavy lifting!



History

“Monte Carlo” was the code name used by Stanislav Ulam, John von Neumann, et al for their top secret nuclear weapons project at the Los Alamos National Laboratory in the 1940’s2

The code name was inspired by the opulent Monte Carlo casino in the equally opulent Principality of Monaco on the French Riviera.3 (See other excellent code names.).



4.2 Exercises

There are 10 marbles in an urn: 4 green (\(G\)), 5 red (\(R\)), and 1 white (\(W\)). Thus if you pick a marble at random:

\[P(G) = 0.4, \; P(R) = 0.5, \; P(W) = 0.1\]

We’ll run a Monte Carlo simulation study in RStudio to connect these probabilities to what we’d actually observe in repeated observations of this experiment. In doing so:

  • Open the Rmd (Rmarkdown) template on Moodle.
  • Run the provided code one chunk at a time, taking care to identify what each line achieves!
  • You’re encouraged to take notes in Rmarkdown, but do whatever is most comfortable to you.
  • Some code chunks have blanks for you to fill in. These include an eval = FALSE statement to indicate that they should not yet be evaluated – if they were, they would produce errors. Once you’ve filled in the blanks and your code is working, either eliminate the eval = FALSE statement or change it to eval = TRUE.


  1. Load packages



  1. Review sample_n()
    Set up the marbles.

    1. Use sample_n() to sample 3 marbles with replacement and 3 marbles without replacement:

    2. Run the following chunk multiple times. How does the weight argument impact the results?



  1. Run the marble experiment 100 times
    1. Fill in the blanks to run the marble experiment (with the original probabilities) 100 times. Store the results in a data frame named run_100.

    2. Count up the number of times you picked each color.

    3. Plot the number of times you picked each color.

    4. Check out the table from part b. How close are the observed color proportions to the theoretical probabilities, \(P(G),P(R),P(W)\)?


    Solution



  1. Run the marble experiment 10,000 times
    Repeat the above exercise, this time running the experiment 10,000 times (not 100 times). Now how close are the observed color proportions to their theoretical probabilities?



  1. Reproducibility
    Sampling is inherently random. Your results can differ every time you repeat the experiment. But it’s important to be able to reproduce your results (we’ll chat about this as a class). To this end, you can set the random number generating seed to a positive integer, here 354. Run the following code chunk multiple times. You should get the same answer every time AND the same answer as your neighbor!


    Solution



  1. Conditional probability
    A pig comes along and eats one of your marbles. Unfortunately, you weren’t close by and so could only make out that the marble was not red. In light of this information, the probability that the pig ate a green marble jumps from \(P(G) = 0.4\) to \[P(G|R^c) = 0.8\]
    Let’s support this calculation using the final_run simulation.
    1. In general, conditional information restricts the sample size. In this case, we can use the filter() function to restrict the simulation results to only those that don’t have a red marble. We store this new dataset as not_red:

      Note: != = “not equal to”

    2. Calculate the proportion of each color remaining in the not_red marbles. Does this (approximately) match the 0.8 calculation?



  1. Try again
    Suppose that 60% of Mac students have seen 0 Fast & Furious films, 10% have seen 1, 12% have seen 2, and 18% have seen 3.
    1. Pick a Mac student at random. Simulate the probabilities that they’ve seen 0, 1, 2, or 3 F&F films. Summarize these with a visualization and table.
    2. Use this simulation to approximate the conditional probability that a student has seen 3 F&F films if we learn that they’ve seen more than 0.


    Solution



4.3 Extra practice in a more complicated setting

If you’ve finished and absorbed the exercises above, keep going! Recall our study in the previous activity. A new friend tells you that they have 2 children. To greatly oversimplify things, we’ll assume binary gender identities for both children: boy or girl. Define the following events:

  • \(A\) = both children are boys
  • \(B\) = at least one child is a boy
  • \(C\) = at least one child is a boy that was born in the morning
  • \(D\) = the eldest child is a boy that was born in the morning

We showed that \(P(A) < P(A|B) < P(A|C) < P(A|D)\). Most people find this surprising. In such scenarios we can support and gain insight into our calculations using Monte Carlo simulation.


  1. Supporting our results with simulation
    We’ll start as simply as possible, focusing only on the gender identities of the two children. To this end, we’ll simulate this information for 10,000 pairs of siblings. NOTE: There are many ways to do this. We’ll start with an approach that matches the logical steps in our thinking, though isn’t very efficient. If time allows, you can explore a more computationally efficient approach at the end of the activity.

    1. Use info from the following table to confirm that our calculation that \(P(A) = 0.25\) was reasonable. Take note of how tabyl() can be used to summarize two variables.

    2. Suppose you learn that at least 1 child is a boy (\(B\)). Restrict the simulation data to only the results that match this info. NOTE: | is the RStudio equivalent of “or” or \(\cup\).

    3. Construct a table of these results to confirm your earlier calculation that \(P(A|B) = 1/3\) was reasonable.


    Solution



  1. Incorporating time of day
    Next, let’s incorporate the birth time of day into our analysis.
    1. For each of the 10,000 pairs of simulated children, simulate & store a birth time of day (\(m\) = morning or \(n\) = night). NOTE: we must use a different random number seed than when we generated gender_id_1 and gender_id_2 or else the time variables would be exactly the same as the gender_id variables, just with different numbers!

    2. What does the mutate() function do in part a?
    3. Suppose you learn that at least 1 child was a boy born in the morning (\(C\)). Restrict the simulation data to only the results that match this info and subsequently construct a table from which you can confirm that your earlier calculation that \(P(A|C) = 3/7 \approx 0.429\) is reasonable. NOTE: & is the RStudio equivalent of “and” or \(\cap\).

    4. Similarly, simulate \(P(A|D)\) which we earlier calculated to be 0.5.


    Solutions



  1. Building intuition
    So long as you set the seed, your Monte Carlo approximations should match those below. Comment on the trend in the denominators and explain the significance of this trend.

    \[\begin{split} P(A) & \approx \frac{2470}{10000} = 0.247 \\ P(A|B) & \approx \frac{2470}{7523} \approx 0.328 \\ P(A|C) & \approx \frac{1862}{4420} \approx 0.421 \\ P(A|D) & \approx \frac{1201}{2462} \approx 0.488 \\ \end{split}\]



  1. Even more extra: a more efficient simulation
    Above, we simulated the simulated separate vectors of gender_id and time_of_day for each child, then combined these into a data frame. A more efficient approach is to first define the sample space of all possible combinations of gender_id & time_of_day for a pair of children, then sample from these.

    1. Confirm that the following produces a valid sample space of all possible combinations of gender_id & time_of_day for a pair of children.

    2. We can sample rows from the sample_space using sample_n()!

    3. Use child_sim as you used children above to support your calculations of \(P(A)\) and \(P(A|B)\).


    Solution



  1. Recommended note taking
    To help out your future self, provide a summary of what the following functions do and how they work: head(), sample_n(), set.seed(), tabyl(), filter(), mutate(), ggplot()

  1. A billboard at uranium processing plant in Oak Ridge, TN, a sister site of Los Alamos. (https://simple.wikipedia.org/wiki/Manhattan_Project)

  2. The Monte Carlo Casino in Monaco. image: https://www.flickr.com/photos/nat507/16530961901