Introduction

TouringPlans.com created this data set to give programmers and statisticians who might someday want to work with them some real data, compiled by the company, to practice on. It was initially provided to students at the University of Central Florida in 2017. A friend of mine is one of the creators of TouringPlans.com, and he graciously provided it to me to explore as well.

A total of four data sets were provided, each consisting of wait time data for one of the four Walt Disney World theme parks (Magic Kingdom, Epcot, Hollywood Studios, and Animal Kingdom) for a particular time period. I have chosen to use the Magic Kingdom (“MK”) data set for this project.

To keep the project scope from getting away from me, before importing the set into RStudio, I did some cleaning on it–specifically, I narrowed the date range down to the most recent two complete years for which data was provided (2015 and 2016) and removed some columns that were included but had no data.
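The same two cleaning steps could also be done in R after import. Here is a minimal sketch on a made-up four-row data frame. The EMPTYCOL column name is hypothetical, and only the month/day/year DATE format is taken from the real data.

```r
# Hypothetical miniature of the raw file. EMPTYCOL stands in for the
# columns that were present but held no data.
raw <- data.frame(
  SPOSTMIN = c(5, 20, 35, 10),
  DATE     = c("12/31/2014", "1/1/2015", "7/4/2016", "1/1/2017"),
  EMPTYCOL = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings so the year can be extracted.
dates <- as.Date(raw$DATE, format = "%m/%d/%Y")
years <- as.integer(format(dates, "%Y"))

# Keep only the two complete years.
cleaned <- raw[years %in% c(2015, 2016), ]

# Drop every column in which all values are missing.
all_missing <- vapply(cleaned, function(col) all(is.na(col)), logical(1))
cleaned <- cleaned[, !all_missing, drop = FALSE]

print(cleaned)
```

Filtering on the parsed year rather than on the raw text avoids surprises from single-digit months and days in the DATE strings.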

For each date, the standby wait time in minutes (“SPOSTMIN”) at certain attractions in the park (the list of attractions was not provided to me) was recorded several times a day. The purpose of this project is to perform an exploratory data analysis on these wait times to look for possible patterns and to see what factors (e.g. number of hours the park is open, number of hours other parks are open, mean minimum temperature for the day, season, etc.) have a noticeable impact on those wait times.

There are some quirks and limitations to the data set that I wish did not exist. The two primary ones are that, for some reason, Epcot’s park hours were not included, and instead of the mean temperature for each day, the mean minimum temperature for each day was provided. I think the mean temperature would be more useful, but the mean minimum is better than nothing.

 

 

Now, let’s load the data set & packages and get a look at the structure.

## [1] "D:/Documents/School - WGU/Term 4/C751/Project"
## 'data.frame':    73222 obs. of  24 variables:
##  $ SPOSTMIN          : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ WGT               : num  0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 0.81 ...
##  $ DATE              : chr  "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
##  $ TIMEPART          : chr  "7:51:00 AM" "8:02:00 AM" "8:09:00 AM" "8:16:00 AM" ...
##  $ WDW_TICKET_SEASON : chr  "peak" "peak" "peak" "peak" ...
##  $ DAYOFWEEK         : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ DAYOFYEAR         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ WEEKOFYEAR        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MONTHOFYEAR       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ YEAR              : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ SEASON            : chr  "CHRISTMAS PEAK" "CHRISTMAS PEAK" "CHRISTMAS PEAK" "CHRISTMAS PEAK" ...
##  $ HOLIDAYPX         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HOLIDAYM          : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ INSESSION         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NEWNESS           : chr  "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" "MAJOR REFURB WITHIN 18 MONTHS" ...
##  $ SUNSET_WDW        : chr  "12:05:22 PM" "12:05:22 PM" "12:05:22 PM" "12:05:22 PM" ...
##  $ MKHOURS           : num  17 17 17 17 17 17 17 17 17 17 ...
##  $ HSHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ AKHOURS           : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ IAHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ UFHOURS           : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ WDWMINTEMP_MEAN   : num  55.2 55.2 55.2 55.2 55.2 ...
##  $ CAPACITYLOST_MK   : int  352865 352865 352865 352865 352865 352865 352865 352865 352865 352865 ...
##  $ CAPACITYLOSTWGT_MK: int  34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 34661635 ...

 

Analysis and Exploration

This data set consists of 73,222 observations of 24 variables, for a total of 1,757,328 data points.

Not all of these variables are relevant to a summary, so here is the summary with only those that are. These may or may not all end up being useful later in the analysis.

##     SPOSTMIN           WGT            MKHOURS         HSHOURS     
##  Min.   :  0.00   Min.   :0.8100   Min.   : 8.00   Min.   : 8.00  
##  1st Qu.: 10.00   1st Qu.:0.9000   1st Qu.:14.00   1st Qu.:11.00  
##  Median : 35.00   Median :0.9000   Median :15.00   Median :12.00  
##  Mean   : 36.02   Mean   :0.9321   Mean   :14.27   Mean   :12.02  
##  3rd Qu.: 55.00   3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.:13.00  
##  Max.   :175.00   Max.   :1.0000   Max.   :23.00   Max.   :16.00  
##     AKHOURS        IAHOURS         UFHOURS      WDWMINTEMP_MEAN
##  Min.   : 8.0   Min.   : 8.00   Min.   : 8.00   Min.   :46.79  
##  1st Qu.: 9.0   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:58.51  
##  Median :10.0   Median :11.00   Median :12.00   Median :67.78  
##  Mean   :10.6   Mean   :11.45   Mean   :11.63   Mean   :66.12  
##  3rd Qu.:12.0   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:73.68  
##  Max.   :14.0   Max.   :15.00   Max.   :16.50   Max.   :76.20

It’s time to start looking at some visualizations of the data. We’ll begin with a simple histogram showing the distribution of wait times.

This is somewhat surprising. Considering what I thought I knew of theme park wait times, and considering the summary shown above, I would have expected a more normal distribution with most wait times centered around 36 minutes. Instead, we have a positively skewed distribution in which the 5-minute bin is by far the tallest. There are definitely some outliers, too, and I’m sure we’ll see those again. Given what I have personally experienced in theme park wait times, those outliers are legitimate data, so generally I will keep them included.

Scaling the y-axis to a log10 scale and limiting the x-axis to 0 - 90, we can see 5 minutes is still the most prominent wait time recorded, and we still don’t have a normal distribution. Instead, it remains positively skewed. Interesting.
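For readers who want to reproduce this kind of view, here is a minimal base R sketch. It uses simulated waits rather than the real SPOSTMIN column, and it plots log10 of the bin counts directly instead of switching the axis itself to a log scale, so it is an approximation of the plot described, not the code behind it.

```r
set.seed(1)
# Simulated stand-in for mk$SPOSTMIN: waits posted in 5-minute steps,
# with a large spike at 5 minutes and a long right tail.
wait <- c(rep(5, 400), 5 * rpois(600, lambda = 7))

# Keep 0-90 minutes only, mirroring the x-axis limit described above.
shown <- wait[wait <= 90]

# Bin in 5-minute steps centred on 0, 5, 10, ... without drawing yet.
h <- hist(shown, breaks = seq(-2.5, 92.5, by = 5), plot = FALSE)

# log10(0) is -Inf, so empty bins are set to NA and left blank.
counts <- ifelse(h$counts > 0, h$counts, NA)

barplot(log10(counts), names.arg = h$mids,
        xlab = "Posted standby wait (minutes)",
        ylab = "log10(number of observations)",
        main = "Simulated wait times, log10 counts")
```

On a log scale the tall 5-minute bar no longer flattens everything else, which makes the shape of the right tail easier to judge.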

Now, we’ll get a snapshot of many of the data pairs.

This is a lot of visuals in a relatively small space. A larger version is available here.

This is a good quick snapshot of many of the possible data pairs. Looking at the correlation coefficients, some non-correlations are obvious and expected: Magic Kingdom park hours and the mean minimum temperature for a particular day are completely independent of one another, so the correlation coefficient of -0.009 is no surprise.

However, there are also some stronger correlations. Since wait times are the primary variable of interest, as I look at this visualization, I see a moderately strong correlation between wait times and these other variables:

  • Holiday Metric (HOLIDAYM) … 0.311
  • Percentage of Schools in Session (INSESSION) … -0.305
  • Total Open Hours for Magic Kingdom (MKHOURS) … 0.302

Surprisingly, the two strongest correlations are between wait times and Universal Studios’ Islands of Adventure total open hours (0.341) and between wait times and Universal Studios total open hours (0.320).
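The coefficients quoted above can be pulled out of a pairs matrix by eye, but it is often easier to compute and rank them directly. This is a small sketch on simulated data; the column names match the data set, while the values and the resulting coefficients are invented and will not match the figures above.

```r
set.seed(42)
n <- 500
# Simulated stand-in for a few columns of the mk data frame.
toy <- data.frame(
  HOLIDAYM  = sample(0:5, n, replace = TRUE),
  INSESSION = runif(n),
  MKHOURS   = sample(8:18, n, replace = TRUE)
)
toy$SPOSTMIN <- 20 + 3 * toy$HOLIDAYM - 30 * toy$INSESSION +
  2 * toy$MKHOURS + rnorm(n, sd = 20)

# Pearson correlation of wait time with each candidate predictor.
predictors <- c("HOLIDAYM", "INSESSION", "MKHOURS")
r <- sapply(toy[predictors], function(x) cor(toy$SPOSTMIN, x))

# Rank by absolute strength so negative correlations are not buried.
print(round(r[order(abs(r), decreasing = TRUE)], 3))
```

Sorting on the absolute value matters here because INSESSION is negatively related to wait times and would otherwise fall to the bottom of the list.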

Let’s take a closer look at some individual plots to see what we can learn, starting with a box plot of wait times by WDW_TICKET_SEASON.

There are a couple of things I notice right away about these box plots. First, as expected, the mean wait times are longest during peak season and decrease from there. Somewhat surprisingly, peak season has a lot of outliers, but considering that “peak season” dates include Spring Breaks, Summer, Christmas, and other holidays, I expect a lot of variability in that one. The outliers in regular season may warrant more scrutiny. We may be able to see this with a scatterplot of wait times vs. month of the year.

Sure enough, the season varies from month to month for much of the year. Only June and October are each consistently one season (peak and regular, respectively). Also, there are relatively few outliers–though the ones at the bottom of the plot are confusing and need more scrutiny, just perhaps not in this project.

This time, we are looking at the wait times by day of the week, again color-coded by ticket season. These findings definitely surprise me. I would expect both weekend days to be the busiest, but Sunday is noticeably lower than the other days. Let’s take a look at the summary for each day to find out more.

## mk$DAYOFWEEK: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   30.00   31.09   45.00  135.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   40.00   40.05   60.00  175.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   30.00   36.19   55.00  150.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   35.00   38.29   55.00  165.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   35.00   35.81   55.00  165.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   30.00   33.71   50.00  130.00 
## ------------------------------------------------------------ 
## mk$DAYOFWEEK: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      15      35      37      55     155

We can see from this table that Sunday (day 1) is, in fact, the lowest (or tied for the lowest) at the minimum, first quartile, median, and third quartile, and its mean is the lowest of all seven days. Only its maximum (135) is not the lowest; Friday’s is 130. It can be hard to really see how all 7 days compare on a table like this, though. Let’s visualize it as a box plot.
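Output shaped like the table above is what base R’s by() prints when summary() is applied to each group. Here is a sketch on made-up rows, assuming the DAYOFWEEK coding runs from 1 = Sunday to 7 = Saturday (consistent with 1/1/2015, a Thursday, being coded 5 in the structure listing). The tapply() lines are an extra I find easier to scan than seven separate summaries.

```r
# Made-up rows standing in for the mk data frame.
mk_toy <- data.frame(
  SPOSTMIN  = c(5, 10, 30, 45, 15, 40, 60, 20, 35, 55),
  DAYOFWEEK = c(1, 1, 1, 1, 2, 2, 2, 7, 7, 7)
)

# Six-number summary of wait times for each day of the week.
print(by(mk_toy$SPOSTMIN, mk_toy$DAYOFWEEK, summary))

# The same comparison condensed to one row per statistic.
day_mean   <- tapply(mk_toy$SPOSTMIN, mk_toy$DAYOFWEEK, mean)
day_median <- tapply(mk_toy$SPOSTMIN, mk_toy$DAYOFWEEK, median)
print(rbind(mean = day_mean, median = day_median))
```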

Sure enough, Sunday’s median and mean are lower, as are the quantiles. If we remove the outliers, this becomes even more obvious.

Going by the summary table, Sunday’s median (30) is lower than that of four days and the same as the other two. The quartiles for Sunday are definitely smaller. And the mean is lower than all the other days. Why would this be? Wouldn’t weekend days be highest?

Not necessarily. WDW is a destination vacation, not generally a locals’ park (like Disneyland is). Because it is a destination, people are more likely to arrive on the weekend. In my experience, guests often like to finish up their packing and preparations on Saturday, then travel Sunday and stay 7 nights. These stays are Sunday - Sunday.

That makes Sunday the arrival/departure day for a larger-than-average percentage of guests, which then also makes it the day they are least likely to visit the theme parks. If we had observations for arrival/departure day, we could confirm this, but that is my educated guess.

I would like to explore this further. Let’s see if the day-of-week pattern in wait times holds across ticket seasons.

There is definitely more variation between seasons.

## season_peak$DAYOFWEEK: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    15.0    35.0    36.3    50.0   120.0 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0    20.0    45.0    45.7    70.0   175.0 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   50.00   48.86   70.00  150.00 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   20.00   45.00   47.16   70.00  165.00 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    20.0    45.0    46.5    70.0   165.0 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   15.00   40.00   39.85   60.00  130.00 
## ------------------------------------------------------------ 
## season_peak$DAYOFWEEK: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   20.00   40.00   40.65   60.00  155.00
## season_regular$DAYOFWEEK: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   30.00   28.81   45.00  135.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   40.00   38.76   60.00  125.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    5.00   25.00   30.93   45.00  145.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   35.00   35.38   50.00  155.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   30.00   30.78   45.00  105.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   30.00   31.77   50.00  120.00 
## ------------------------------------------------------------ 
## season_regular$DAYOFWEEK: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   35.00   35.41   55.00  115.00
## season_value$DAYOFWEEK: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00   20.00   23.47   35.00   80.00 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   30.00   29.97   45.00   90.00 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       5      20      24      35      85 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   25.00   26.62   40.00   85.00 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   25.00   24.22   35.00  105.00 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00   15.00   19.93   30.00   85.00 
## ------------------------------------------------------------ 
## season_value$DAYOFWEEK: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   10.00   30.00   31.27   50.00   90.00

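Having 3 summaries of 7 days each is a little hard to follow. Still, as we view these three summaries, we can see that the mean is lowest on Sundays in peak and regular season, but not in value season, where Friday (day 6) has the lowest mean (19.93 vs. 23.47 for Sunday). The median is lowest for Sunday only in peak season; in regular season Tuesday’s is lowest, and in value season Friday’s is. I’m not sure why that is. We may come back and try to understand it later.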
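One way to make the three summaries easier to follow is to collapse them into a single season-by-day table. This is a sketch on invented rows; the column names are the real ones, but the numbers are chosen only to show the mechanics.

```r
# Made-up rows; the real data frame has the same three columns.
mk_toy <- data.frame(
  SPOSTMIN          = c(30, 40, 50, 60, 10, 20, 30, 40, 25, 35, 5, 15),
  DAYOFWEEK         = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 6, 6),
  WDW_TICKET_SEASON = rep(c("peak", "regular", "value"), each = 4)
)

# Rows are ticket seasons, columns are days of the week.
# Season/day combinations with no rows come out as NA.
mean_tbl <- with(mk_toy,
                 tapply(SPOSTMIN, list(WDW_TICKET_SEASON, DAYOFWEEK), mean))
print(mean_tbl)

# Which day has the lowest mean wait within each season?
lowest_day <- apply(mean_tbl, 1, function(row) names(which.min(row)))
print(lowest_day)
```

Swapping mean for median in the tapply() call gives the matching table of medians, so both statistics can be checked season by season without scrolling through 21 separate blocks.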

For now, there are other aspects of this data set I want to explore, so let’s move on.

Here we are viewing the wait times vs. the mean minimum temperatures for each day.

Well, that doesn’t look right. Assuming WDWMINTEMP_MEAN is what it sounds like it is, this means that the days with the highest mean minimum temperature have the longest wait times. That wouldn’t be too surprising, except this is central Florida, and a mean minimum temperature of 70° - 75° indicates a mean high temperature of 90°+. Why would these days have the longest wait times? Wouldn’t they be the days people would want to avoid?

Let’s validate this data by comparing the WDWMINTEMP_MEAN to the month of the year to see which months are the highest minimum temperatures.

Ah, OK, this does make some sense now. Those months with the highest temperatures also have the highest concentration of value season days. A lot of guests are willing to brave the hot weather to pay the lower price. So it seems the real driving factor here is not the temperatures, but what months / ticket seasons they fall in.

That was interesting, but there is much more to see and explore.

This is unexpected. Here we have a visualization of wait times vs. the “HOLIDAYM” metric, which is a number that indicates a given date’s proximity to a holiday combined with how “major” that holiday is considered when it comes to travelers. The fact that there are relatively few data points at 5 (the highest HOLIDAYM value) surprises me, but that may be due to the difficulty in gathering the data at major holiday times.

What is surprising to me is the lack of any predictable pattern. I would have expected the wait times to either increase more-or-less consistently from 0 - 5 or to decrease in that same way, but they do neither. 0 is the lowest and 5 is the highest, but in between come 2, 1, 4, and 3, in that order. I really have no idea why this is.

Considering the fact that the data pairs matrix showed a correlation coefficient of 0.311, I’m even more surprised to see apparent unpredictability, as HOLIDAYM should be a moderately strong predictor of wait times. Unfortunately, I don’t think I have enough data to explain this in more detail. We may return to it later, though.

Interesting and again not as expected. One would assume that the more schools are not in session (an INSESSION number closer to 0), the higher the average wait time. But that doesn’t appear to be the case here. However, there is so much data, there is a great deal of overplotting, so let’s see what we can do with that.

That’s better. I had really expected a stronger relationship here, given the relatively high (actually moderate, but higher than most others) inverse correlation.

From what I see here, it seems that during peak season, the mean wait times are within a fairly narrow range (40-50 minutes) until we get to about 65% of schools in session. This seems to show that during the peak season, while most schools are not in session, the percentage of schools in session doesn’t have much of an impact. Then we see a bump–which may be related to which schools are still not in session and may indicate a desire of those families to get in a trip when most kids are back in school. Then, predictably, it drops off dramatically.

During the regular season, the impact of schools being in session is clearer, and more what I would expect. For value season, there are so few data points of schools not being in session that the graph is inconclusive.

If we stack these on top of each other, we can see that the combined chart very closely mirrors the peak season chart.

So far, we’ve been looking at the park as a whole. What if we explore wait times subset by NEWNESS? Do attractions that have had major refurbishments recently have longer median or mean wait times?

This plot shows the same general trend for both the mean and median–the longest wait times are for attractions having a minor refurbishment within 6 months. The only reason I can think of for this is that the attractions with minor refurbishments within 6 months are also the “headliners” and/or guest favorites and so they naturally generate higher wait times. Without a list of what attractions are in each category, though, I can’t say this for certain.

A major refurbishment within 18 months appears to have no effect.

That was interesting, but not too revealing. Let’s see what happens when we look at wait times vs. opening hours for each park (except Epcot, which I have no data for), including the two Universal Studios parks.

There is definitely an outlier here. But which one? (Or is there more than one?)

##    Length     Class      Mode 
##       188 character character
## [1] "5/22/2015"

May 22, 2015. That makes sense. That was the date of a 24-hour celebration. Let’s run that same visualization again, but exclude that date.
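Tracking down an outlier like this takes only a couple of lines. Here is a sketch on a made-up four-row frame, assuming the outlier is the day with the largest MKHOURS value; the hours shown for the neighboring dates are invented.

```r
# Made-up rows; only the 5/22/2015 date is taken from the output above.
mk_toy <- data.frame(
  DATE    = c("5/21/2015", "5/22/2015", "5/22/2015", "5/23/2015"),
  MKHOURS = c(16, 23, 23, 15),
  stringsAsFactors = FALSE
)

# Dates of every observation recorded on the longest operating day.
longest <- mk_toy$DATE[mk_toy$MKHOURS == max(mk_toy$MKHOURS)]
print(summary(longest))  # how many observations share that maximum
print(unique(longest))   # which calendar date(s) they fall on

# Exclude that date before redrawing the plot.
mk_no_outlier <- mk_toy[!mk_toy$DATE %in% unique(longest), ]
print(mk_no_outlier)
```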

There aren’t very many observations for the 8- and 9-hour days, but that makes sense. Disney wants to keep its parks open as long as possible. Surprisingly, the ones that are there are mostly regular and peak days. The two main reasons I can think of for this are that some of those days are likely during months with fewer daylight hours so the park may close earlier, and/or days with special events in the evening (primarily Mickey’s Not-So-Scary Halloween Party and Mickey’s Very Merry Christmas Party). We could determine that more conclusively by plotting park hours vs. time of year (maybe by SEASON), and park hours vs. SUNSET_WDW … but we’re not going to.

Once we get to 12 hours, we start to have a pretty solid upward trend. Hours generally get longer as crowds get heavier (not causation but correlation), so it makes sense to see this increase. The spike at 18 hours also makes sense as they are only open that long at the very busiest times of year.

Do the operating hours at any of the other parks (excluding Epcot) relate to Magic Kingdom wait times? Let’s find out.

For some reason, the strongest relationship of all of these is between the Magic Kingdom wait times and Universal’s Islands of Adventure hours. It is a nearly linear relationship. As Islands of Adventure’s hours get longer, wait times at Magic Kingdom increase. I have no idea why that is. I know it’s not causal, since there is no reason either one would directly cause the other, but I don’t know what it is. Digging into this could be an entire project by itself, but it’s beyond the scope I’ve set for this one.

Moving on now. We’ve looked quite a bit at how various factors compare or relate to WDW_TICKET_SEASON (peak, regular, or value), but how does SEASON (e.g. Christmas, Halloween, Spring Break, etc.) compare?

Wow, these are all over the place. There is very little consistency between them. While the significant variety is interesting, it would require more subsetting and exploring than there is time remaining for in this project. It would be fun to revisit another time, though.

Another aspect we’ve not yet explored is the intra-day wait times. So far, we’ve only looked at aggregate wait times for entire days. Let’s put an end to that right now and see how wait times vary during the course of a day. Initially we will explore this for the entire date range of the sample data.

The profile across the day looks roughly bell-shaped. The bars show the maximum wait time for each time of day, while the lines show the mean, median, 10th percentile, and 95th percentile. We can see that even the early morning hours (park opening until 9 a.m.) can have long wait times, but they’re still shorter than any other time of day. Let’s get a closer look at the mean and median wait times to learn a bit more about how wait times are typically distributed, rather than at these extremes.

That paints a better everyday picture for us. Wait times start very low and climb so they are at their longest in the early afternoon. Then they decline in the late afternoon and, for those night owls, are low again at night (7:00 or after). On those nights where the park is open after midnight, median wait times approximate early morning times, though the mean is higher.

I wonder how this would look if we used the actual time (in hours) rather than the category groupings. Let’s find out, shall we?

This is in order by time of day, from midnight to 11:59 p.m., and some of the hours (specifically 3 a.m. - 6 a.m.) have very few data points. Looking at the other times, though, it is clear that wait times remain very low until 9:00, then they increase dramatically until about 1:00 or 2:00 in the afternoon. They drop for a few hours, then increase for a couple of hours starting around 5:00, and then drop off precipitously for most of the remainder of the night.
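Getting from the TIMEPART strings to an hour of the day is the only fiddly part of this view. Below is a minimal sketch, assuming TIMEPART always looks like "7:51:00 AM" as it does in the structure listing. The helper name to_hour24 and the sample rows are mine, for illustration only.

```r
# Made-up rows in the same TIMEPART format as the real data.
mk_toy <- data.frame(
  TIMEPART = c("7:51:00 AM", "8:02:00 AM", "12:15:00 PM", "1:30:00 PM",
               "1:45:00 PM", "11:05:00 PM", "12:10:00 AM"),
  SPOSTMIN = c(5, 10, 45, 60, 50, 15, 10),
  stringsAsFactors = FALSE
)

# Convert a 12-hour clock string to an hour in 0-23 without relying on
# locale-dependent AM/PM parsing. (Hypothetical helper.)
to_hour24 <- function(timepart) {
  hour12 <- as.integer(sub(":.*$", "", timepart))  # text before first colon
  is_pm  <- grepl("PM$", timepart)
  (hour12 %% 12) + ifelse(is_pm, 12, 0)            # 12 AM -> 0, 12 PM -> 12
}

mk_toy$HOUR <- to_hour24(mk_toy$TIMEPART)

# Mean and median wait for each hour, ordered from midnight onward.
hour_mean   <- tapply(mk_toy$SPOSTMIN, mk_toy$HOUR, mean)
hour_median <- tapply(mk_toy$SPOSTMIN, mk_toy$HOUR, median)
print(cbind(mean = hour_mean, median = hour_median))
```

The modulo step is what keeps the two 12 o’clock hours in the right place: 12 a.m. becomes hour 0 and 12 p.m. stays at 12.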

The lesson here is that if you’re planning to visit the Magic Kingdom, get there early! Then, leave the park if you can for the late morning and lunch time. If you can stay up late enough and the park is open late enough, stay away until after dinner and come back for the rest of the night. (This will also serve you well as it avoids the hottest part of the day and the typical summer afternoon storms).

 

Final Plots and Summary

Final Plot 1 -

I chose this one because it revealed something that at first glance was unexpected. I was surprised by the fact that Sunday was so low compared to the other days, so I really dug into this one. It helped me get a better understanding of the data and what it represented, and clarified a gap between perception and reality, and (I think) enables the reader to digest large amounts of information.

 

 

Final Plot 2 -

I chose these two together because, once again, they surprised me. I really had to wrap my head around why wait times would be higher when temperatures were higher. I plotted months vs. temperature just to make sure the data seemed valid and reasonable. It did, but it still didn’t make sense with wait times. It wasn’t until I added the colors for ticket seasons that it became clear.

 

 

Final Plot 3 -

I chose this one because for the rest of this analysis, we’d been looking at daily values, but when it comes to understanding wait times, the intra-day times are just as important. This was my chance to dip my toe in the water that direction–though looking at the hourly data could have certainly been its own analysis project.

Before moving on to Reflections, let’s look at a data model. Realistically, the correlations are so low within the data we have that this model isn’t likely to predict much. I just want to see, based on some of the most promising-seeming criteria, how useful it might be.

 

 

Linear Data Model

## 
## Calls:
## m1: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON), data = mk)
## m2: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON) + WDWMINTEMP_MEAN, 
##     data = mk)
## m3: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON) + WDWMINTEMP_MEAN + 
##     HOLIDAYM, data = mk)
## m4: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON) + WDWMINTEMP_MEAN + 
##     HOLIDAYM + INSESSION, data = mk)
## m5: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON) + WDWMINTEMP_MEAN + 
##     HOLIDAYM + INSESSION + MKHOURS, data = mk)
## m6: lm(formula = I(SPOSTMIN) ~ I(WDW_TICKET_SEASON) + WDWMINTEMP_MEAN + 
##     HOLIDAYM + INSESSION + MKHOURS + IAHOURS, data = mk)
## 
## ==========================================================================================================================
##                                            m1            m2            m3            m4            m5            m6       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                            43.305***     -7.548***     -9.363***      0.965       -35.755***    -70.514***  
##                                          (0.149)       (0.797)       (0.785)       (0.957)       (1.247)       (1.801)    
##   I(WDW_TICKET_SEASON): regular/peak    -10.155***     -9.072***     -5.528***     -2.989***     -0.946***      0.099     
##                                          (0.200)       (0.196)       (0.206)       (0.246)       (0.247)       (0.249)    
##   I(WDW_TICKET_SEASON): value/peak      -17.559***    -21.016***    -10.738***     -8.027***     -6.884***     -5.836***  
##                                          (0.290)       (0.287)       (0.352)       (0.379)       (0.375)       (0.375)    
##   WDWMINTEMP_MEAN                                       0.769***      0.633***      0.541***      0.606***      0.574***  
##                                                        (0.012)       (0.012)       (0.013)       (0.013)       (0.013)    
##   HOLIDAYM                                                            3.586***      3.076***      1.824***      1.435***  
##                                                                      (0.073)       (0.078)       (0.082)       (0.083)    
##   INSESSION                                                                        -7.410***     -5.474***      3.109***  
##                                                                                    (0.395)       (0.392)       (0.506)    
##   MKHOURS                                                                                         2.291***      2.152***  
##                                                                                                  (0.051)       (0.051)    
##   IAHOURS                                                                                                       2.931***  
##                                                                                                                (0.110)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                               0.059         0.110         0.138         0.142         0.165         0.173     
##   N                                   73222         73222         73222         73222         73222         73222         
## ==========================================================================================================================
##   Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05

As expected, there isn’t much here. Even the largest model (m6) has an R-squared of only 0.173, which is very weak. Oh well, it was worth looking at.
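For anyone who wants to build a similar ladder of nested models, here is a minimal base R sketch on simulated data. The column names follow the data set, but the coefficients used to simulate the waits are invented and the results will not match the table above. The sketch stops at three models and reports only R-squared rather than reproducing the full comparison table.

```r
set.seed(7)
n <- 1000
# Simulated stand-in for the mk data frame.
toy <- data.frame(
  WDW_TICKET_SEASON = factor(sample(c("peak", "regular", "value"), n,
                                    replace = TRUE)),
  WDWMINTEMP_MEAN   = runif(n, 47, 76),
  HOLIDAYM          = sample(0:5, n, replace = TRUE)
)
season_shift <- c(peak = 0, regular = -10, value = -18)
toy$SPOSTMIN <- 10 +
  unname(season_shift[as.character(toy$WDW_TICKET_SEASON)]) +
  0.6 * toy$WDWMINTEMP_MEAN + 2 * toy$HOLIDAYM + rnorm(n, sd = 25)

# Start with ticket season only, then add one predictor at a time.
m1 <- lm(SPOSTMIN ~ WDW_TICKET_SEASON, data = toy)
m2 <- update(m1, . ~ . + WDWMINTEMP_MEAN)
m3 <- update(m2, . ~ . + HOLIDAYM)

# R-squared for each model: the share of variance in wait time explained.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))
```

Because each model contains all the terms of the one before it, R-squared can only stay level or rise as predictors are added, so the useful question is how much each addition buys, not whether the number goes up.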

 

Reflection

This was a fun project. This is a data set I’ve been wanting to analyze for a long time, so this was highly enjoyable in spite of (or sometimes because of) the challenges.

I ran into difficulties in the analysis a few times. One was when I encountered missing data, such as the first few months of 2016 or the missing Epcot park hours. Also, there was some additional investigation I wanted to do but the variables to do so were not included (e.g. mean daily temps).

In conducting the analysis itself, there were some types of visualizations I wanted to do, but didn’t know how, like plotting max wait times for each part of the day as a bar plot, with the mean, median, 10th percentile, and 90th percentile overlaid as lines. Some of the work took hours of trying different combinations and approaches, and twice I had to ask a course mentor for help.

I am proud of my success, though, in that every type of visualization I wanted to include I eventually learned how to do and was able to include. Also, I don’t generally consider myself a very creative person, but I think I was quite successful in coming up with some creative comparisons and analyses.

The analysis can be enriched in future work by digging more into the patterns and insights from looking at the intra-day times. Also, having the mean daily temperatures, Epcot park hours, and the missing observations could make it even better and more interesting. In fact, since I have so much of this built, I may go back to my friend at TouringPlans.com and see if he has a more updated and complete data set he could share with me so that I can run this analysis on that data, and then possibly take it even further.