{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab2 - Drop Out & Start Up (Cloud Data Insurance)\n",
"\n",
"\n",
"#### Authors:\n",
"\n",
"##### v1.0 (2014 Fall) Rishi Sharma \\*\\*, Sahaana Suri \\*\\*, Kangwook Lee \\*\\*, Kannan Ramchandran \\*\\*\n",
"\n",
"##### v1.1 (2015 Fall) Kabir Chandrasekher \\*, Max Kanwal \\*, Kangwook Lee \\*\\*, Kannan Ramchandran \\*\\*\n",
"\n",
"Your best friend Ben Bitdiddle is convinced he's got a great new start up idea that's going to change the world. He's encouraging you to drop out and start the new company `Bitdiddlers, Inc.` with him. Before joining his start up, you'll want to do some analysis of your own to make sure his ideas are sound.\n",
"\n",
"`Bitdiddlers, Inc.` will be a company in cloud data storage, which is already a pretty crowded field (Dropbox, Box, Google, Microsoft, Amazon, etc.), so you are understandably skeptical. However, there is one wrinkle to Ben's business model which he claims will help lower costs and **#disrupt** the current cloud storage model.\n",
"\n",
"One thing that drives up the cost for cloud storage companies is ensuring that customer data is not lost. You never hear about Dropbox losing your presentation, or Facebook losing your photos from that bachelor party (even though you may want them to), or really any instances of customer data loss associated with cloud storage services. This is because storage companies incur an enormous overhead of replicating customer data many, many times over in order to ensure an extremeley low probability of data loss.\n",
"\n",
"Ben Bitdiddle has decided that this expectation of 100% data integrity is increasingly unreasonable as humanity transitions deeper into a digital era. Digital goods, not unlike physical goods, are prone to permanent loss. Thus, Ben has taken inspiration from insurance companies, which help mitigate the pain of financial loss usually resulting from lost earning potential, goods, crops, shelter, etc.. Rather than promise 100% data integrity, which drives up costs and is inevitably a promise that cloud storage companies **will** be forced to break, Ben proposes that `Bitdiddlers, Inc.` will break ground in the new frontier of Cloud Data Insurance.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Insurance Model\n",
"\n",
"The basic premise of insurance is that customers pay a premium $p$ in order to insure their goods of value $v$ (in this instance, goods = data). In most cases, these goods remain unharmed (data remains secure), so the insurer happily pockets this premium from customers. In a small subset of cases, customer goods are damaged or lost (data is corrupted) and insurers must pay the customer the value $v$ of the goods that were lost. In order for the business to be profitable over the course of a year, the sum over all customer premiums must be larger than the costs of paying customers for lost goods plus the costs of operation.\n",
"\n",
"Let's formalize this idea in terms of the problem at hand. We will make several simplifying assumptions in order to get a cursory look at the problem, rather than explore all of its intricate complexities. The purpose will be to make a back-of-the-hand evaluation of the business. In order to truly evaluate the opportunity, the model would have to be thoroughly refined, but the skeleton given is not a bad place to start."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Insurance Equation:\n",
"\n",
"We first define the following variables.\n",
"\n",
"$E$: annual profit\n",
"\n",
"$P$: total annual premium = sum of all premiums charged to all customers per year\n",
"\n",
"$L$: total loss incurred in one year (total payout to customers with lost data)\n",
"\n",
"$U$: total expenses incurred in one year\n",
"\n",
"Then, the following equation holds. \n",
"\n",
"$\\Large E = P - L - U$\n",
"\n",
"### Replicating Data\n",
"\n",
"You will replicate each customer's data $r$ times, with each copy to be stored on a different disk to protect against failures. Costs will increase as $r$ increases, but the number of people who end up using their coverage will decrease (because data will be less likely to fail). You will see in this lab that $r$ is a business model parameter that you will tune.\n",
"\n",
"For simplicity, we assume the following:\n",
"\n",
"1. All of your (N) customers store data of size $10$GB. From this point on, the stored data will be referred to as the customer's \"file\". The company will be storing the files on a system of several 1 TB hard disks, each of which costs \\$100. Since \\$100/TB = 10 ${\\mathrm{c}\\mkern-6mu{\\mid}}$ /GB, the total cost per customer $C= \\$1$. Thus, we have that $\\large U = r N C$. We ignore all other operating costs in this model.\n",
"\n",
"2. All customer data is valued at $V =$ $\\$10000$. Total loss is a random variable depending on number of customers who lose data $X$. Thus, we have that $\\large L = VX$\n",
"\n",
"3. $N$ customers are willing to pay an annual premium of $P_0=\\$10$. Thus, the total annual premium is $\\large P = NP_0 = 10N $\n",
"\n",
"$\\Large E = P - L - U = NP_0 - rNC - VX$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Loss Model\n",
"\n",
"Let us assume an exceedingly simple model for data loss. Let us say that each disk has an I.I.D probability of failure of $p = 0.01$ on each day. At the end of each day, all disks that have failed are replaced. All content on those failed disks is replaced by a replica of that content.\n",
"\n",
"Suppose you have a file on $r$ disks. If one disk with your file fails on the first day, one of the $r-1$ replicas of the file will be used to restore the file on a replacement disk for the failed disk. You will lose your file if all $r$ disks fail on any given day. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from numpy import random\n",
"import matplotlib.pyplot as plt\n",
"from __future__ import division\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## $\\mathcal{Q}1.$ Disk Failures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### a. Find the probability of losing a file within a year if the file is stored in $r$ different disks. Each disk fails with probability $p = .01$ on each day, and all data on all disks is restored at the end of each day if possible (i.e. at least one disk has not failed)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1a. SOLUTION GOES HERE "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b. Now plot the analytic expression you obtained from part (a) over the range of $r \\in \\{1,...,6\\}$. Some starter code is provided."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Solution here\n",
"r_values = np.r_[1:7]\n",
"fig = plt.figure(figsize=(10,6))\n",
"p_loss_analytic = # FILL IN EXPRESSION <-- YOUR CODE HERE\n",
"plt.plot(r_values, p_loss_analytic)\n",
"plt.xlabel('r')\n",
"plt.title('Analytic Probability of File Loss in One Year')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c. Write a script to simulate the scenario described in part (a) for the same $r$ values at part (b). One trial is a simulation of one full year of disk failures. For each value of $r$, run 1000 trials to approximately determine the probability of losing a file in one year. Do your simulated results match what you expected based on analysis? Plot both the analytic probability and simulated probability on the same graph. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Solution here\n",
"r_values = np.r_[1:7]\n",
"p = 0.01\n",
"k = 1000 # number of trials for each value of r\n",
"\n",
"p_loss_simulated = []\n",
"for r in r_values:\n",
" # YOUR CODE HERE #"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Plotting Code\n",
"fig = plt.figure(figsize=(10,6))\n",
"plt.plot(r_values, p_loss_simulated, 'blue')\n",
"plt.plot(r_values, p_loss_analytic, 'red')\n",
"plt.xlabel('r')\n",
"plt.legend(('Simulated Probability of Loss','Analytic Probability of Loss'))\n",
"plt.title('Analytic & Simulated Probability of File Loss in One Year')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## $\\mathcal{Q}2.$ What do you expect?\n",
"\n",
"### a. What is $\\mathbb{E}[X]$, the expected number of customers who will lose their file per year? Remember, there are $N$ total customers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2a. SOLUTION GOES HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b. What is your expected profit per customer, or $\\frac{\\mathbb{E}[E]}{N}$? Simplify as much as possible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2b. SOLUTION GOES HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c. Plot $\\frac{\\mathbb{E}[E]}{N}$, the analytic expression from part (b), as a function of $r$. Here you can plot values for $r \\in \\{3,...,6\\}$ (you will notice that for $r=1$ or $r=2$, we are expected to lose a lot of money per customer)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Solution here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### d. Given the results you see, what is the optimal number of times you should replicate customer data to maximize profit per customer? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2d. SOLUTION GOES HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Beyond 'Average'\n",
"\n",
"Is an 'average'-based calculation enough? For example, which of the following two scenarios sounds better to you?\n",
"\n",
"- Make \\$1,000 with probability 1\n",
"- Make \\$1,000,000 with probability 0.002 and lose \\$1,000 with probability 0.998\n",
"\n",
"I bet most of you would prefer the first option to the second option. However, if you take a look at the expected profits of two options, you will find the following results.\n",
"\n",
"- E[profit option 1] = \\$1,000\n",
"- E[profit option 2] = \\$1,000,000 $\\times$ 0.002 - \\$1000 $\\times$ 0.998 = \\$1,002\n",
"\n",
"This example illustrates that one should consider more than just the average profit when designing a policy. One thing you definitely should consider is the probability of bankruptcy. Let us model bankruptcy as the event where you lose money in your first year of operation and are forced to shut your business down. While this measure obviously isn't perfect (Twitter still doesn't turn a profit), it's not a bad indicator of failure for an insurance company, which are expected to turn a profit relatively early on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## $\\mathcal{Q}3.$ Funding Runs Dry \n",
"\n",
"### a. What is the probability of your company becoming bankrupt as a function of $r$? You go bankrupt if $E < 0$ at the end of the year. Express your answer in the form of $P(X > \\text{something})$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3a. SOLUTION GOES HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b. Explain why it is difficult to calculate this value $P(X > \\text{something})$ exactly. Also, explain why it was easy to calculate $\\mathbb{E}[X]$ in $\\mathcal{Q}2$a. More precisely, what makes one of these values difficult to calculate and the other easy to calculate?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3b. SOLUTION GOES HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some Useful Bounds\n",
"\n",
"Later in the course, we will study bounding and dealing with expressions like $P(X > \\text{someting})$ in more detail. For now, we will just outline three probabilistic inequalities superficially and employ them here. The important thing to keep in mind with all three of these is that they allow us to analyze statements like $P(X > \\text{someting})$.\n",
"\n",
"### Markov\n",
"For any *non-negative* random variable $Z$, we have: \n",
"$$P[Z \\ge a] \\le \\frac{E[Z]}{a}$$\n",
"\n",
"### Chebyshev\n",
"For an arbitrary random variable $Z$, we have:\n",
"$$P[\\left| Z - E[Z] \\right| \\ge \\epsilon] \\le \\frac{\\text{Var}(Z)}{\\epsilon^2}$$\n",
"\n",
"### Chernoff\n",
"The general Chernoff bound for a random variable $Z$, every $a$, and $s \\ge 0$ is:\n",
"\n",
"$$P[Z \\ge a ] \\le e^{-sa}E\\bigl[e^{sZ}\\bigr]$$\n",
"\n",
"In the special case of the mean of a binomial random variable $Z \\sim \\text{Bin}(n,p)$, we have:\n",
"$$P\\biggl[\\frac{1}{n}\\sum_{i=1}^{n}Z_i \\ge a \\biggr] \\le e^{-nD(a||p)}$$\n",
"where $D(a||p) = a\\ln \\frac{a}{p} + (1-a)\\ln \\frac{1-a}{1-p}$ is the Kullback-Leibler Divergence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c. \n",
"\n",
"i. Use Markov's inequality to put an upper bound of the probability of bankruptcy. Use the fact that $X$ must be an integer to tighten the bound slightly.\n",
"\n",
"ii. Do the same using Chebyshev's inequality.\n",
"\n",
"iii. Do the same using the Chernoff bound.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3c. Your Solution Here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### d. Plot these upper bounds for $r \\in \\{2, 3, 4, 5, 6\\}$. Does your earlier choice for $r$ still seem like a good bet?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### Your code here ###"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### e. (Bonus) Can you use the fact that $P(\\text{bankruptcy}) \\leq 1 - P(X=0)$ to calculate a better upper bound? Plot this as well"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### Code for bonus problem here ###"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the upcoming weeks of the course, we'll look at bounding probabilities of this nature, at which point we'll be able to do more to analyze our probability of going bankrupt. Wrestling with them and visualizing them before diving into the theory will give you a better feel and intuition for what these bounds can be used for and how good each one can be. We encourage you to play around with them! For now, we can be happy with the understanding above, which is esentially saying we can afford to lose one file for every \\$10,000 ($V$) in revenue we generate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where do we go from here?\n",
"\n",
"We made several assumptions while building our model to simplify the analysis and fixed several parameters that need not have been fixed. Nevertheless, we ended up with an optimal number of replications $r$ which is very similar to industry standard. Thus, it is unlikely that this business would succeed if you used the parameters specified. The next step in analysis would be to try to increase premiums or decrease insurance payouts to end up with a more viable business."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"___\n",
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" In case you are unconvinced that cloud data insurance is a very real start up idea and think that the course staff is just blowing smoke in order to ask probability questions, you can educate yourself by reading this article and looking at some of the startups and big players getting involved in the area.\n",
"\n",
"$$$$\n",
"\n",
" By submitting this lab you are agreeing that if you start a company based on any of the ideas presented in this lab, the course staff is entitled to 10% equity in preferred shares. This agreement is in the nature of comic relief, and bears no legal value in any way whatsoever. If you are to start a company, however, please name it `Bitdiddlers, Inc.` to honor the course staff, who are all well reputed diddlers of the bits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] D. Ford, F. Labelle, F. Popovici, M. Stokely, V.-A. Truong,\n",
"L. Barroso, C. Grimes, and S. Quinlan. Availability in globally\n",
"distributed storage systems. In OSDI, 2010"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}