Data Simulation With IBM SPSS

Introduction

Simulation means imitating a system or process in order to study how it behaves under many different circumstances. It is a widely used approach for handling very sensitive and high-risk processes.

We will use SPSS to model the cost of treating diabetes with Monte Carlo simulation.

A scenario

An insurance provider is interested in estimating the yearly treatment costs for a patient with diabetes to guarantee that premiums are sufficient to cover anticipated expenditures.

Secondly, to reduce risk, they want to ask questions about how costs are distributed across the population of patients with diabetes, moving beyond point estimates of the expenses a specific patient bears.

For instance, below what amount do treatment costs fall for 99% of the population?

In a Monte Carlo simulation, the model's target (or targets) is computed by evaluating the predictive model on the generated values of the simulated inputs together with the given values of any fixed inputs.

A distribution of target values is obtained by repeating this process many times (generally thousands or tens of thousands of times). Each iteration produces a separate data record containing the input and output values.
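The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not what SPSS runs internally: `predict_cost` and its coefficients are hypothetical stand-ins for the saved predictive model, and the input distributions are placeholders chosen only to make the example runnable.

```python
import random


def predict_cost(age, glucose, income):
    """Hypothetical stand-in for the predictive model (invented coefficients)."""
    return 2000.0 + 40.0 * age + 900.0 * glucose - 0.01 * income


def run_simulation(n_iterations, seed=0):
    """Draw the simulated inputs, evaluate the model, keep one record per iteration."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    records = []
    for _ in range(n_iterations):
        # Placeholder input distributions (assumptions, not fitted values)
        age = rng.triangular(13, 65, 40)  # low, high, mode
        glucose = rng.uniform(5, 14)
        income = rng.uniform(20000, 100000)
        cost = predict_cost(age, glucose, income)
        records.append(
            {"age": age, "glucose": glucose, "income": income, "cost": cost}
        )
    return records


records = run_simulation(10000)
print(len(records), "records; first cost:", round(records[0]["cost"], 2))
```

Each dictionary in `records` plays the role of one data record from one iteration; the list of `cost` values is the simulated distribution of the target.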

Most importantly, the insurer needs enough simulated data points to depict the distribution accurately before drawing any conclusions from it.

Data

We will use two files. The first, the dataset, should be downloaded as a CSV file Here; the second, an XML file available Here, contains the model needed to run this analysis.

NOTE: The "Scale" measure should be selected for the fields in the data.

Data Analysis

After importing the data, open the data file diabetes_costs.sav. From the menus, choose Analyse => Simulation. Select Select SPSS Model File, then click Continue, as shown in figure 1.

Figure 1: Selected Simulation for Analysis.

In the Select SPSS Model File window, navigate to the directory where the file is stored and open diabetes costs.xml, as shown in figure 2.

Figure 2: Selecting the saved model for Analysis

All fields used as inputs in the prediction model are listed on the Simulated Fields tab. The model in this example uses the following fields, as shown in figure 3: Age, the age of the person covered by the insurance; Glucose, the person's average glycated haemoglobin level, which reflects the blood glucose level over time; and Income, the person's household income.

Figure 3: Simulated fields in the SPSS model

To run the simulation, each input in the model must either be given a distribution or be marked as fixed and supplied with a fixed value.

In this example all three inputs will be simulated, so each must be given a distribution. When the data used to build the model are available, as they are here, the distribution that most closely matches the data for each input can be found automatically.

To fit distributions to the data automatically, click Fit All.

Figure 4: Distributions fitted to the inputs with Fit All

The Distribution column displays the results. Along with the distribution parameters, the name of the distribution that best describes the data for each input is shown.

For instance, a lognormal distribution with the values a=7.55 and b=0.19 best fits the data for glucose. The distribution function is shown overlaid on a histogram of the data for that input from the current dataset on the graph next to the distribution parameters.

We can review the goodness-of-fit statistics for each fitted distribution, along with all the distributions that were considered when fitting a specific input.

In the Simulated Fields grid, select the row for age, then click Fit Details, as shown in figure 5.

Figure 5: Goodness-of-fit statistics for the age input

The Anderson-Darling test is typically applied to discover the distribution that best matches the data for continuous variables like age. The distribution with the best fit to the data is the one with the lowest value of the Anderson-Darling statistic.

The A value in the Fit Statistics column is the Anderson-Darling statistic, which is 1.29 for the triangular distribution. The K value in the same column is the alternative Kolmogorov-Smirnov statistic.

In this example the Kolmogorov-Smirnov values are displayed, but they are not used to rank the distributions. An option on the Advanced Options page lets us choose which test statistic is used for ranking.
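To make the ranking idea concrete, the sketch below computes both statistics by hand for two candidate distributions and picks the one with the smaller Anderson-Darling value. It is an illustration under stated assumptions: the `ages` sample is synthetic, the mode of 40 is invented, the candidate parameters are fixed rather than estimated from the data, and `fit_statistics` is a hypothetical helper, not an SPSS function.

```python
import math
import random


def triangular_cdf(x, low, high, mode):
    """CDF of a triangular distribution on [low, high] with the given mode."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


def uniform_cdf(x, low, high):
    """CDF of a uniform distribution on [low, high]."""
    return min(1.0, max(0.0, (x - low) / (high - low)))


def fit_statistics(data, cdf):
    """Return (Anderson-Darling A^2, Kolmogorov-Smirnov D) for a candidate CDF."""
    xs = sorted(data)
    n = len(xs)
    eps = 1e-12  # keeps log() finite if a point sits on the support boundary
    f = [min(1.0 - eps, max(eps, cdf(x))) for x in xs]
    a2 = -n - sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    ) / n
    d = max(max((i + 1) / n - f[i], f[i] - i / n) for i in range(n))
    return a2, d


rng = random.Random(1)
ages = [rng.triangular(13, 65, 40) for _ in range(500)]  # synthetic sample

candidates = {
    "triangular": lambda x: triangular_cdf(x, 13, 65, 40),
    "uniform": lambda x: uniform_cdf(x, 13, 65),
}
results = {name: fit_statistics(ages, cdf) for name, cdf in candidates.items()}
best = min(results, key=lambda name: results[name][0])  # rank by A^2
print("best fit by Anderson-Darling:", best)
```

Ranking by the Kolmogorov-Smirnov statistic instead would only require changing the index in the `key` function from 0 to 1.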

When analysing fit statistics, a larger p-value indicates that the data are more consistent with the candidate distribution, while p-values below 0.05 often signify that the distribution may not fit the data well. This holds for both the Kolmogorov-Smirnov and the Anderson-Darling statistics.

We will see that the triangular distribution's p-value, shown by the value of P in the Fit Statistics column, is absent in this case. This is because the p-value for the triangular distribution is not accessible (it is also not available for the beta distribution).

On closer examination, however, the triangular distribution does appear to offer the best match to the data. Click Cancel to close the fit details.

People with diabetes often have glycated hemoglobin values between 5 and 14. Type 5 in the glucose Min field and 14 in the corresponding Max field.

We will just consider the income range of $20,000 to $100,000 for this example. Enter 20000 in the minimum income box and 100000 in the maximum income field.
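One way to picture what the Min and Max fields do is rejection sampling: draw from the fitted distribution and keep only the values that fall inside the bounds. The sketch below assumes that the lognormal parameter a=7.55 is the median (so the log-scale mean is ln a) and that b=0.19 is the log-scale standard deviation. That reading of the parameters, and the rejection approach itself, are assumptions rather than a description of SPSS internals, and `sample_truncated` is a hypothetical helper.

```python
import math
import random


def sample_truncated(draw, lower, upper, max_tries=10000):
    """Return the first draw that falls inside [lower, upper]."""
    for _ in range(max_tries):
        value = draw()
        if lower <= value <= upper:
            return value
    raise RuntimeError("no draw fell inside the bounds")


rng = random.Random(42)


def draw_glucose():
    # Assumed parametrisation: median a = 7.55, log-scale sd b = 0.19
    return rng.lognormvariate(math.log(7.55), 0.19)


# Glucose restricted to the range entered in the Min and Max fields
glucose = [sample_truncated(draw_glucose, 5, 14) for _ in range(5000)]
print("min:", round(min(glucose), 2), "max:", round(max(glucose), 2))
```

The same helper could be applied to income with the bounds 20000 and 100000 once a distribution for income has been chosen.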

According to the triangular distribution fitted to age, the minimum and maximum values of the age variable in the historical diabetes costs_data dataset are 13 and 65, respectively.

This is a sensible maximum to choose for the Simulation as individuals above 65 will likely no longer be policy holders. Likewise, the minimum value of 13 will guarantee that this age group will be represented in the Simulation due to the higher prevalence of type II diabetes in teens.

On the Simulation tab, choose Correlations from the Select an Item list.

Figure 6: Select variable minimum and maximum values

The Correlations panel shows the Pearson correlations between the simulated inputs, computed from the corresponding variables in the active dataset. These known correlations are preserved when data for the inputs are simulated.

Correlations are calculated when Fit All or Fit is clicked on the Simulated Fields panel. We can change any correlation by entering a new value in the corresponding cell of the Correlations table. For this example, we accept the correlations as determined from the data.
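As a rough illustration of what preserving a correlation means, the sketch below generates two standard normal inputs with a chosen Pearson correlation and then checks the correlation of the generated sample. The target value of 0.6 is hypothetical, the inputs are normal rather than the fitted distributions from this example, and the construction shown is a common textbook technique, not necessarily the one SPSS uses.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_normals(n, rho, seed=0):
    """Generate n pairs of standard normals whose correlation is rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mix z1 into the second variable so that corr(x, y) equals rho
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


target_rho = 0.6  # hypothetical value, not taken from the tutorial data
xs, ys = correlated_normals(20000, target_rho)
print("sample correlation:", round(pearson(xs, ys), 3))
```

The sample correlation should land close to the target, which is the property the Correlations panel is designed to maintain for the simulated inputs.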

On the Simulation tab, select Output from the Select an Item list.

Figure 7: Recalculating Correlation between Variables

 

In the Display Formats grid, change the format for cost and income to Dollar and set the corresponding number of decimals to 0. Set the number of decimals for age to 0 as well.

On the Simulation tab, select Save from the Select an Item list.

When the box labelled "Save the plan file for this simulation" is ticked, the specifications for the simulation are recorded in a simulation plan file.

We can later open the simulation plan file in the Simulation Builder or the Run Simulation dialogue, make optional changes, and rerun the simulation without entering all the specifications again. We can also share the plan file so that other users can run the simulation.

Click Browse to choose a location for the simulation plan file, give the file a name, then click Run.

Figure 8: Choosing Simulation Output

 

Interpretation and Charts

The output includes the details of the simulation plan. The plan describes the predictive model on which the simulation is based, the distributions of the simulated inputs, the correlations between those inputs, and several additional settings.

Table 1: Simulation plan and input distributions

Visualisation

The Correlation Tornado Chart displays the Pearson correlation between a target and each of its simulated inputs. In this example, the chart shows the correlation between the target, cost, and the simulated inputs age, income, and glucose. A person's glycated haemoglobin level is strongly correlated with treatment costs.

Figure 9: Pearson correlation between the target and its simulated inputs

The probability density chart shows the distribution of the predictive model's target. Reference lines are positioned at the 5% and 95% points of the distribution. The table below the chart shows the probabilities in the three regions bounded by the reference lines.

We can add static reference lines at predetermined points and adjust the reference lines' positions interactively to calculate the probabilities associated with any distribution area.    
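The reference lines and region probabilities can be reproduced from any list of simulated target values. The sketch below is a minimal example using Python's standard library; the `costs` list is synthetic data generated only for illustration, so the printed amounts say nothing about the diabetes model. It also computes the 99th percentile, which corresponds to the insurer's question about the amount that covers 99% of the population.

```python
import random
import statistics

rng = random.Random(7)
# Synthetic stand-in for the simulated target values (an assumption);
# in practice these would come from the simulation output.
costs = [rng.lognormvariate(9.0, 0.4) for _ in range(2000)]

# 99 cut points: index 4 is the 5th percentile, 94 the 95th, 98 the 99th
cut_points = statistics.quantiles(costs, n=100)
p5, p95, p99 = cut_points[4], cut_points[94], cut_points[98]

n = len(costs)
below = sum(c < p5 for c in costs) / n          # left of the lower line
between = sum(p5 <= c <= p95 for c in costs) / n  # between the two lines
above = sum(c > p95 for c in costs) / n         # right of the upper line

print(f"5%: {p5:,.0f}  95%: {p95:,.0f}  99%: {p99:,.0f}")
print(f"P(below)={below:.3f}  P(between)={between:.3f}  P(above)={above:.3f}")
```

Moving a reference line interactively in the chart is equivalent to replacing `p5` or `p95` with another threshold and recomputing the three proportions.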

Double-click the Probability Density chart to activate it.

Figure 10: Probabilities in the three regions bounded by the reference lines

 

Conclusion

Simulation is a good way to study your data before committing resources. I strongly recommend that you apply this technique to other datasets, such as the dataset used in this blog: SPSS for Data Analytics

By simulating data for the inputs to a prediction model of treatment costs and evaluating the model with the simulated data, we were able to develop a distribution of treatment costs for diabetic patients.

We have used that distribution to answer questions about treatment costs for the whole population that the model represents. Because we saved the simulation plan, we can run the simulation again from the plan file to get more information on the distribution.

 


Comments

Bongajum Victor Jan 29, 2023

Thank you for the practice guide. Please do the steps on how to create simulated data by modeling a new equation under the Simulation Builder.

