Introduction

TouringPlans.com created this data set to give programmers and statisticians who might someday be interested in working with the company some real data, compiled by TouringPlans.com, to practice on. It was initially provided to students at the University of Central Florida in 2017. A friend of mine is one of the creators of TouringPlans.com, and he graciously provided it to me to explore as well.

A total of four data sets were provided, each consisting of wait time data for one of the four Walt Disney World theme parks (Magic Kingdom, Epcot, Hollywood Studios, and Animal Kingdom) for a particular time period. I have chosen to use the Magic Kingdom (“MK”) data set for this project.

To keep the project scope from getting away from me, I did some cleaning on the set before importing it into RStudio: I narrowed the date range down to the two most recent complete years for which data was provided (2015 and 2016) and removed some columns that were included but contained no data.

For each date, the standby wait time in minutes (“SPOSTMIN”) at certain attractions in the park was recorded several times a day (the list of attractions was not provided to me). The purpose of this project is to perform an exploratory data analysis on these wait times to look for possible patterns and to see which factors (e.g. number of hours the park is open, number of hours other parks are open, mean minimum temperature for the day, season, etc.) have a noticeable impact on those wait times.

There are some quirks and limitations to the data set that I wish did not exist. The two primary ones are that, for some reason, Epcot’s park hours were not included, and instead of the mean temperature for each day, the mean minimum temperature for each day was provided. I think the mean temperature would be more useful, but the mean minimum is better than nothing.

Now, let’s load the data set & packages and get a look at the structure.

## [1] "D:/Documents/School - WGU/Term 4/C751/Project"
## 'data.frame':    73222 obs. of  24 variables:
##  $ SPOSTMIN          : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ WGT               : num  0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 ...
##  $ DATE              : chr  "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
##  $ TIMEPART          : chr  "7:51:00 AM" "8:02:00 AM" "8:09:00 AM" "8:16:00 AM" ...
##  $ WDW_TICKET_SEASON : chr  "peak" "peak" "peak" "peak" ...
##  $ DAYOFWEEK         : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ DAYOFYEAR         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WEEKOFYEAR        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MONTHOFYEAR       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ YEAR              : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ SEASON            : chr  "CHRISTMAS PEAK" "CHRISTMAS PEAK" "CHRISTMAS PEAK" "CHRISTMAS PEAK" ...
##  $ HOLIDAYPX         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HOLIDAYM          : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ INSESSION         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NEWNESS           : chr  "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" ...
##  $ SUNSET_WDW        : chr  "12:05:22 PM" "12:05:22 PM" "12:05:22 PM" "12:05:22 PM" ...
##  $ MKHOURS           : num  17 17 17 17 17 17 17 17 17 17 ...
##  $ HSHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ AKHOURS           : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ IAHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ UFHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ WDWMINTEMP_MEAN   : num  55.2 55.2 55.2 55.2 55.2 ...
##  $ CAPACITYLOST_MK   : int  352865 352865 352865 352865 352865 352865 352865 352865 352865 352865 ...
##  $ CAPACITYLOSTWGT_MK: int  34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 ...
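The code that produced the output above is not shown. As a minimal sketch of how a file like this might be loaded and inspected in base R, the example below writes a tiny temporary CSV purely for illustration (the three rows reuse values from the first entries of the structure listing) and reads it back. The variable name `mk`, the temporary file, and the added `DATETIME` column are assumptions, not part of the original data set. Because `str()` reports DATE and TIMEPART as `chr`, the sketch also shows one way to parse them.

```r
# Illustrative only: write a tiny CSV that mimics a few columns of the MK file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "SPOSTMIN,DATE,TIMEPART,MKHOURS",
  "5,1/1/2015,7:51:00 AM,17",
  "5,1/1/2015,8:02:00 AM,17",
  "5,1/1/2015,8:09:00 AM,17"
), tmp)

# stringsAsFactors = FALSE keeps text columns as chr, as in the listing above.
mk <- read.csv(tmp, stringsAsFactors = FALSE)
str(mk)

# Combine the two character columns and parse them into a date-time (POSIXct).
# DATETIME is a hypothetical helper column; the time zone is an assumption.
mk$DATETIME <- as.POSIXct(
  paste(mk$DATE, mk$TIMEPART),
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "America/New_York"
)

# Convert the character DATE column into a proper Date.
mk$DATE <- as.Date(mk$DATE, format = "%m/%d/%Y")
```

With real dates and times in place, grouping by day, month, or hour becomes straightforward later in the analysis. Note that `%p` (AM/PM) parsing depends on the session locale.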

Analysis and Exploration

This data set consists of 73,222 observations of 24 variables, for a total of 1,757,328 data points.

Not all of these variables are relevant to a summary, so here is the summary with only those that are. These may or may not all end up being useful later in the analysis.

##     SPOSTMIN           WGT            MKHOURS         HSHOURS     
##  Min.   :  0.00   Min.   :0.8100   Min.   : 8.00   Min.   : 8.00  
##  1st Qu.: 10.00   1st Qu.:0.9000   1st Qu.:14.00   1st Qu.:11.00  
##  Median : 35.00   Median :0.9000   Median :15.00   Median :12.00  
##  Mean   : 36.02   Mean   :0.9321   Mean   :14.27   Mean   :12.02  
##  3rd Qu.: 55.00   3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.:13.00  
##  Max.   :175.00   Max.   :1.0000   Max.   :23.00   Max.   :16.00  
##     AKHOURS        IAHOURS         UFHOURS      WDWMINTEMP_MEAN
##  Min.   : 8.0   Min.   : 8.00   Min.   : 8.00   Min.   :46.79  
##  1st Qu.: 9.0   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:58.51  
##  Median :10.0   Median :11.00   Median :12.00   Median :67.78  
##  Mean   :10.6   Mean   :11.45   Mean   :11.63   Mean   :66.12  
##  3rd Qu.:12.0   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:73.68  
##  Max.   :14.0   Max.   :15.00   Max.   :16.50   Max.   :76.20
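A summary restricted to selected columns, like the one above, can be produced by subsetting the data frame before calling `summary()`. The sketch below uses a toy data frame whose values are invented for illustration; only the column names come from the data set, and `keep` and `mk_summary` are assumed names.

```r
# Toy stand-in for the MK data; the values are invented for illustration.
mk <- data.frame(
  SPOSTMIN = c(5, 10, 35, 55, 175),
  MKHOURS = c(17, 14, 15, 16, 8),
  WDWMINTEMP_MEAN = c(55.2, 58.5, 67.8, 73.7, 76.2),
  SEASON = c("CHRISTMAS PEAK", "WINTER", "SPRING", "SUMMER", "FALL"),
  stringsAsFactors = FALSE
)

# Numeric columns worth summarising (a subset of those shown above).
keep <- c("SPOSTMIN", "MKHOURS", "WDWMINTEMP_MEAN")

# summary() on a data frame reports min, quartiles, median, mean and max
# for each selected numeric column.
mk_summary <- summary(mk[, keep])
print(mk_summary)
```

Leaving out character columns such as SEASON keeps the printed table compact, since `summary()` reports only length and class for them.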

It’s time to start looking at some visualizations of the data. We’ll begin with a simple histogram showing the distribution of wait times.

This is somewhat surprising. Considering what I thought I knew of theme park wait times, and considering the summary shown above, I would have expected a roughly normal distribution centered around 36 minutes. Instead, we have a positively skewed distribution whose most common value, by a wide margin, is in the 5-minute range. (With a median of 35 minutes, these short waits cannot be a majority of the observations, but no other single value comes close.) There are definitely some outliers, too, and I’m sure we’ll see those again. Given what I have personally experienced in theme park wait times, those outliers are legitimate data, so I will generally keep them in.
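A mean and median that sit close together do not rule out a skewed distribution with a very different mode. The sketch below illustrates this with invented wait times (not the real data): it finds the most frequent value and computes a simple moment-based skewness coefficient. The `skewness` helper is defined here for illustration and is not taken from any package.

```r
# Invented wait times (minutes): many 5-minute readings plus a long right tail.
waits <- c(rep(5, 30), rep(20, 10), rep(35, 15), rep(50, 15),
           rep(70, 10), rep(100, 5), 175)

# Most frequent value (the mode), via a frequency table.
freq <- table(waits)
mode_wait <- as.numeric(names(freq)[which.max(freq)])

# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

cat("mode:", mode_wait,
    " median:", median(waits),
    " mean:", round(mean(waits), 2),
    " skewness:", round(skewness(waits), 2), "\n")
```

In this toy vector the mode is far below the median and mean, while the skewness coefficient is positive, which is the same qualitative pattern described above.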

Scaling the y-axis to a log10 scale and limiting the x-axis to 0 to 90 minutes, we can see that 5 minutes is still the most prominent wait time recorded, and we still don’t have a normal distribution. Instead, it remains positively skewed. Interesting.
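The plotting code behind this figure is not shown, so the following is only a sketch of one way to draw a histogram with a log10 count axis and a 0 to 90 minute range in base R. The wait times are invented, and the 5-minute bin width is an assumption.

```r
# Invented wait times (minutes), for illustration only.
waits <- c(rep(5, 300), rep(15, 40), rep(30, 60), rep(45, 50),
           rep(60, 30), rep(85, 10), rep(120, 3))

# Keep the 0-90 minute range and bin in 5-minute steps without drawing yet.
in_range <- waits[waits >= 0 & waits <= 90]
h <- hist(in_range, breaks = seq(0, 90, by = 5), plot = FALSE)

# A log axis cannot show zero counts, so drop empty bins before plotting.
nonzero <- h$counts > 0
plot(h$mids[nonzero], h$counts[nonzero], type = "h", lwd = 8, log = "y",
     xlim = c(0, 90), xlab = "Standby wait (minutes)",
     ylab = "Count (log10 scale)")
```

A log scale compresses the tallest bar so that the less frequent wait times become visible, but it does not change the shape of the underlying distribution, which is why the skew remains.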

Now, we’ll get a snapshot of many of the data pairs.
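As a rough sketch of what a pairwise snapshot involves, the example below computes a correlation matrix and draws a scatterplot matrix with base R's `cor()` and `pairs()`. The data are randomly generated, and the relationship between hours and waits exists here only by construction; it says nothing about the real data set. The name `toy` is an assumption.

```r
set.seed(1)

# Invented data: waits increase with park hours purely by construction.
n <- 50
toy <- data.frame(
  MKHOURS = sample(8:17, n, replace = TRUE),
  WDWMINTEMP_MEAN = runif(n, 47, 76)
)
toy$SPOSTMIN <- pmax(0, round(3 * toy$MKHOURS + rnorm(n, sd = 10)))

# Pairwise Pearson correlations between every pair of numeric columns.
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Scatterplot matrix of the same columns.
pairs(toy, pch = 20)
```

With tens of thousands of rows, drawing a random sample of the data first can keep a scatterplot matrix readable and quick to render.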