Ms987 optimization for analytics group project write a memo


Optimization for Analytics Group Project

You are recruited for a health analytics project by Health, a data consultancy company developing data-driven predictive analytics and decision making tools for a broad range of clients from healthcare industry, including governmental organizations such as NHS and various pharmaceuticals, as well as for partner companies of Health that primarily involve medical device developers. Health do not only provide specific analysis or solutions to a client's current needs, but they also carry in-house research to investigate opportunities that may give them the competitive advantage of first-entry into market, in particular in the arena of medical devices. The company houses a good number of technical experts, including data scientists, optimization experts and statisticians, and they also employ management consultants and a sales team.

Health have been getting heavily more involved in optimization-related projects than usual, and although they are in the process of recruiting more optimization experts in-house, they have a number of urgent projects with due dates approaching and hence they need your expertise in this area. The two specific projects they require your help are: 1) an in-house project to efficiently diagnose specific cancer patients, and 2) a consultancy project for an external client to help them make various operational and strategic decisions. These two projects are described in detail in the following two parts, including background and expectations for each part.

PART 1: Helping Diagnosis for Cancer Patients

There are various medical methods available for cancer diagnosis. These methods vary not only from simpler and cheaper ones to those requiring more time or resources, but also in their effectiveness and between different types of cancers. Due to these variations as well as a specific patient's needs, one method may be more preferable to use than others (or in combination with another method to be conclusive), in order not to delay the diagnosis - whether it is "benign" and hence releasing the stress on the patient, or "malignant" and hence starting the necessary treatment as soon as possible.

Anant Krishnan and Christine Wright, two data scientists with optimization expertise from Health, are currently working on a specific diagnostic procedure called "Fine-needle aspiration" (FNA), which is a type of biopsy. In this procedure, a thin, hollow needle is inserted into a mass (such as a lump) for sampling of cells, and the sample of cells are examined under a microscope. Using this technique, digitized images are used to compute a range of attributes of the cell nuclei present, such as average/variation in radius, texture, area and compactness.

Anant and Christine obtained some real data on the use of FNA, as it can be seen in the data file called "PatientData.txt" (see "Data File for Part 1" on MyPlace). This file contains data for 569 patients, where each line of the data refers to a single patient, and each line contains 32 entries, first entry being a patient identifier number (anonymized), second entry being the actual diagnosis (B = benign or M = malignant), and the remaining 30 entries showing real-valued data for their measurements for 30 FNA attributes.

Anant and Christine would like to be able to compute a function (defined over the 30-attributes) that can be used to classify each patient as correctly as possible. They worked out that what they are trying to achieve is a "separating plane" (say, f(x) = wx - γ = 0, or equivalently wx = γ, where w and γ are the variables) and they can use linear programming for this purpose. The idea of a separating plane is that you identify a line which separates one class of data (say, benign) from the other (say, malignant), as much as possible. For example, see a very simple example with only 2 attributes (x- and y-axis) and 8 patients (4 in each class, blue for M and red for B) in the figure on the left: here, both the black and red lines show feasible separating planes that are perfectly capable to classify data, since all blue (M) points satisfy f(x) > 0 (as they are above the line) and all red (B) points satisfy f(x) ≤ 0 (as they are below the line). Of course this simplistic example misses two important issues (in addition to being only two-dimensional and hence not covering all 30 attributes), which we discuss next:

337_figure.png

1) What if there was a blue dot in the middle of red dots? Of course in this case neither of the above two lines (nor any other line you can come up with yourself) would have been a "perfect" separating plane. However, remember our aim is not to find a perfect one, but rather the best we could get and either of the two lines above would have still worked for that purpose, with one misclassified data point (i.e., the new one just added) and all other eight points classified correctly. Our objective would be to minimize such misclassified points (more on this later).

2) What if there are multiple separating planes? As we already see in this simplistic example with two easily separable sets of data, there is indeed an infinite number of separating planes you can come up with. Although any of these planes would have worked to classify the data, considering we have limited data on hand, it is wise to pick the plane that is "farthest" from both sets of points (or "most central" in between two data sets), since this will more likely to reduce any misclassifications in the future as it will increase the separation of the two sets (more on this later).

Before addressing these issues, we will define some notations and explain some concepts. First of all, recall the definition of the separating function:

f(x) = wx - γ,

where the data vector x has 30 dimensions (as there are 30 attributes for each patient), and hence decision variable vector w will also have 30 dimensions (whereas γ is only a scalar, i.e., it has only one dimension). Note that we can rewrite this expression as follows:

f(x) = wx - γ = i=130 wixi - y = w1x1 + w2x2+ w3x3 + · · · + w30x30 - γ

Let B and M indicate the set of patients with benign and malignant diagnosis, respectively. For the data vector x of any patient in M, we would like to have f(x) = wx - γ > 0, whereas for the vector x of any patient in B, we would like to have f(x) = wx - γ ≤ 0 (but it is likely these conditions will be not satisfied for some of the patients). Ananth and Christine worked out that to address the first issue on hand, they need to measure an "error" only if a given point is on the wrong side of the line (which should be 0 otherwise), and minimize the sum of all these errors. Working out the maths in order to write this as an LP, they note that:

1) For the data vector x of any patient in M, we will have a constraint in the form: - wx + γ + 1 ≤ y, where y is the nonnegative variable measuring the error for this specific patient;

2) For the data vector x of any patient in B, we will have a constraint in the form: wx - γ + 1 ≤ z, where z is the nonnegative variable measuring the error for this specific patient;

3) The objective function will be the minimization of the summation of all error variables over all patients.

To address the second issue, Ananth and Christine realized that the objective function can be updated simply by adding |w| (i.e. the absolute value of vector w, or i=130|wi| in mathematical terms) to the objective function to be minimized. Of course they are aware that absolute value is not a linear function, but they believe this can be linearized easily by tweaking the LP stated above slightly. (Hint: You might recall the linear regression problem from the class.)

Anant and Christine expect the following actions for this project:

a) You should first read the data for the first 500 patients from the input file using FICO Xpress Mosel. When doing so, ensure you record the diagnosis for each patient, as well as the measurements for 30 attributes (you do not need patient identifiers).

b) Using the data, build an LP model as described above in FICO Xpress Mosel, where you ensure the data for each patient is used in the correct set of constraint, and solve it to find the best values for w and γ. Comment throughout your file where appropriate.

c) Ensure your FICO Xpress model outputs into an output file the values of w and γ as well as error values for each patient (preferable if you print out only error values that are not zero).

d) Write a memo to Ananth and Christine (at most 2 pages with standard margins, 11pt Arial or Calibri, 1.5 line spacing), briefly explaining key aspects of your FICO Xpress model (such as sets you built, how you structured constraints correctly, etc.) and any key messages regarding model outputs. It is perfectly fine to refer to your FICO Xpress model or output file when writing your memo. (Note: A memo is in a way an 'informal letter' written to someone to give key messages; also remember Ananth and Christine are technical experts so you can certainly use technical language if you want.)

e) BONUS: If time permits, Christine and Ananth would like you to test your function on the remaining 69 patients from the input data. This will require you to read their data and then plug in with the optimal values of w and γ to calculate error values, which should be also written into the output file. Note you will not resolve the LP.

PART 2: Improving Operations and Strategic Decision Making for NHS

Health is closely working with various NHS boards and facilities in Scotland, in order to develop analytical solution methods for a number of challenging decision making problems stemming from distribution operations as well as strategic decision making. A team led by Shona Thompson and Dmitri Zotov require your help by addressing some of these challenges, particularly those stemming from NHS GGC (Greater Glasgow & Clyde).

NHS GGC has broadly two separate distribution networks in place. The first network, run by NHS Scotland, primarily helps to distribute bulk/commonplace items to hospitals and clinics on a daily basis. This network does not require a "cold chain", that is, it can be run without a need for refrigerated vehicles. The NHS GCC is simply charged costs per standard box carried per mile, which has currently a rate of £1.20 (for example, if three boxes are carried 2 miles from the depot, then the cost of this operation would be £7.20). The current routes used for these distribution operations as well as their distances in miles can be found in the data file "GGC-network.txt", where travel in both directions on each route is possible (and same mileage is used for the cost purposes of both directions). The key depot for this distribution network is the Gartnavel General Hospital ("GGH"), and the Pharmacy Distribution Centre ("PDC") also serves as a supply point, though rather with a significantly less capacity.

The second distribution network is run by NHS GCC itself, with the primary focus of medicines requiring special care including "cold chain". Similar to the other distribution network in place for bulk/commonplace items, this network also employs standardized boxes (though a different type and size, suitable for the special vehicles used in this network). The main difference in this network is that there are a limited number of customized vehicles, and due to safety concerns and employee regulations, deliveries can happen only either between 9am-12noon or 1pm-4pm. At the start of the day, all vehicles are loaded and ready to leave from the depot at Pharmacy Distribution Centre ("PDC"), they all return to PDC in the middle of the day for loading afternoon deliveries (as well as lunch break for drivers), and following afternoon deliveries, they all return again back to PDC. Due to the 3-hour window for deliveries either in the morning or afternoon, and the fact that unloading at each hospital/clinic takes substantial time, each vehicle can make at most 5 deliveries in the 3- hour window (note that they can go through as many locations as they need to in order to get back to PDC, as long as they do not make any further deliveries). Since Wishaw is not included in this network, most distances are quite small (you can remove all routes from/to Wishaw to accommodate that). A delivery can be made only once to a hospital/clinic due to safety concerns, and hence it cannot be split (whether between morning and afternoon, or between two vehicles at the same time).

NHS GCC is also experiencing some issues in their loading and unloading operations due to the high level of staff time required, and hence they are considering some strategic investment of robotics in some of their facilities, which will potentially help reduce their costs. Shona and Dmitri are informed that there are two possible types of robotics considered for the bulk/commonplace items distribution network, which can be leased only with a full year contract (a year consists of 260 working days). The first type of robotics, designed for loading operations, can be possible installed at GGH and/or PDC, with each installation costing £40,000 for the whole year. NHS GCC is informed by NHS Scotland that for each installation, the current rate of £1.20 per box per mile will be reduced by £0.10 (hence if both sites were installed, then the new rate would be £1.00). On the other hand, there is a possibility to install a type of robotics for unloading operations, for which GRI and/or SGH are considered as candidate locations. This type of robotics have a fixed cost of £15,000 for the whole year per installation, and NHS GCC is informed by NHS Scotland that for each installation, the current rate of £1.20 per box per mile will be further reduced by £0.10 (in addition to any reduction of the rate by the installation of loading type of robotics), but only for the direct routes connected to the facility this robotics is installed (for example, if loading robotics were installed only at GGH, this would reduce the rate to £1.10 per box per mile for the whole network, and if in addition an unloading robotics were installed at GRI, the direct routes from/to GRI, such as GRI to/from Stobhill, will have a rate of £1.00). It is possible to install any combination of these robotics, including the extreme cases of none of them and all of them, and anything in between.

Shona and Dmitri expect you to carry out the following actions for this project:

a) Shona and Dmitri have access to the current daily demands for laboratory items, which you can find in the data file "GCC-demands.txt" for different hospitals and clinics. They are also aware that GGH can currently supply up to 245 boxes a day, and PDC can supply up to 55 boxes a day. Also note that currently the direct routes from/to PDC (such as PDC to/from GRI) have a capacity of 20 boxes a day in either direction (no other route capacities in the network). They believe that one can build a balanced network model using the daily demands as well as distances provided, and hence they expect you to help them to build such a network LP model in FICO Xpress Mosel. Make sure your model reads all input files. Comment throughout your file where appropriate.

b) Ensure your FICO Xpress model from a) outputs into an output file of type.csv (comma-delimited), where the values of flows are written only if they are not zero. Also write a one-page memo to Shona and Dmitri (within standard margins, 11pt Arial or Calibri, 1.5 line spacing) any key messages regarding model outputs, including whether you obtained integer flows or not, and why.

c) BONUS: Shona and Dmitri are aware of some of the future plans of NHS GCC, which includes managing the supply of laboratory items only from a single depot at GGH, and no capacities on any routes including those currently limited from/to PDC. With this future setting, they believe that this current distribution problem will become equivalent to solving only a shortest path problem, where shortest paths from GGH to all other locations will dictate the flow of laboratory items. In order to accommodate this, they expect you to implement, either using Mosel, Python or C++, Dijkstra's shortest path algorithm. Make sure your algorithm can read the input files in their current format, and the output of the algorithm includes the length of the shortest path as well as details of the path. Comment throughout your file where appropriate.

d) Shona and Dmitri are informed that there are 8 vehicles available at PDC for cold chain deliveries, each with a capacity of 20 medicine boxes (note that you will not necessarily use all vehicles). They also have the current daily demands for these cold chain deliveries provided in the data file "GCC-demands.txt". Since they think this problem will be very complicated to model as an integer programming problem (they know how complicated even TSP is and this is with multiple tours and not all connections exist!), they think heuristics would be a more appropriate way to approach this problem. One idea they have is as follows:

Step 0: Label all nodes with demands as "undelivered". Go to Step 1.

Step 1: Start at the depot (PDC) node with a new empty vehicle and go to Step 2, unless all nodes with demands are labelled "delivered".

Step 2: If all neighbour nodes are "delivered", then go to Step 3. Otherwise, check the demands of undelivered neighbour nodes; if all their demands are strictly bigger than the remaining capacity on the vehicle, then go to Step 3. Otherwise, pick the nearest neighbor node i with undelivered demand that is less than or equal to the capacity available on the vehicle, label i as "delivered", increase number of deliveries by 1, reduce the capacity of the vehicle by the demand of i, and keep track of the route. Go to Step 4.

Step 3: Pick a random neighbour node i, such that i is not a node visited previously on this tour. If there is no such node i, then STOP (dead end!), backtrack back to the depot node to complete the tour (if there is at least one delivery, otherwise empty tour and hence no vehicle used) and go to Step 1. Otherwise, record i on the route (but no delivery), and go to Step 2.

Step 4: If the remaining capacity of the vehicle is only 1, or if the number of deliveries on the vehicle are 5, then STOP, backtrack back to the depot node to complete the tour and go to Step 1. Otherwise, go to Step 2.

Although Shona and Dmitri tried their best to ensure this heuristic algorithm works (and indeed finds a number of feasible tours covering all demand nodes), they think there might be scope to improve it and they are open to your ideas (e.g., when "backtracking", it might be sensible to check if a shorter route back to the depot can be found rather than going back exactly the same way). They would like you to implement this algorithm (with any improvements you would suggest) using Mosel, Python or C++, and also run tests for comparison (due to the "random" element, it is likely the algorithm will produce different results on different runs, so you might want to run it a few times to compare e.g. how many vehicles are used, how many miles are travelled, etc. in each case) Also write a memo to Shona and Dmitri (up to four pages within standard margins, 11pt Arial or Calibri, 1.5 line spacing) to discuss any algorithmic improvements you suggest and any key messages regarding model outputs.

e) Shona and Dmitri are informed that by the time the robotics will be installed, there will be no more route capacities currently apparent for routes to/from PDC. They are aware that they can modify the network LP model from part a) with removing these capacity constraints (and also calculating annual demands based on these daily demands), and then run this network model for any possible combination (with updated cost figures, of course). However, this would require solving 16 separate models, in addition to having more chance for errors in calculating cost figures, and this is not their intention if possible (but if you cannot think of any other way, then you can do so, however, with only half of their expectations met).

Instead, they are thinking that one can build a mixed integer programming (MIP) model for this problem: They are aware that the decisions of robotics installation at each candidate location is a binary decision, and they think that they only need to calculate the "savings" at each route by the correct use of these binary variables (and then include these savings into the objective function, among the fixed costs of robotics). For example, consider the route GGH to GU, which has a distance of 1.7. For a shipment amount of X on this route, they know that current charge is 1.7 * 1.2 * X. Then, they think that they can define a linear variable, say S, to represent the savings on this route achieved by one of the robotics investment (say, loading robotics at PDC), which needs to take a value as follows:

0.10 * 1.7 * X if robotics at PDC were installed

0 otherwise

Hence, they know that this saving variable S is limited to at most 0.10 * 1.7 * X. However, multiplying this quantity by the binary variable Y (whether robotics at PDC were installed or not) would result in a nonlinear constraint. Therefore, they think they can add one more constraint for limiting the value of S using a "big-M" number, though they are not sure what would be the right choice for these numbers or whether this approach would work or not. They expect you to help them to build such an MIP model in FICO Xpress Mosel, and write a two-page memo (within standard margins, 11pt Arial or Calibri, 1.5 line spacing), explaining if your model works and how you calculated big-M numbers, and what you recommend them with regards to model outputs.

Attachment:- Assignment Files.rar

Request for Solution File

Ask an Expert for Answer!!
Dissertation: Ms987 optimization for analytics group project write a memo
Reference No:- TGS02693799

Expected delivery within 24 Hours