Skip to the content.

Exploratory Data Analysis of Major Outages

By Diego Silva (d1silva@ucsd.edu). My multiclass classification model on this dataset can be found here.

Introduction

In this project, I cleaned and analyzed a data set containing major outages reported by different states in the United States from January 2000-July 2016. The main question I want to answer in this analysis, is there a significant difference between the outages distributions of the seasons in the SPP Region and the Overall outages distributions of the seasons of the NERC Regions? This data set and analysis provide an understanding of major outage patterns and how in the future they can be avoided to improve our national electrical infrastructure. Moving forward I will reference the data set as Outages.  

Cleaning and EDA

In my data cleaning process, I first looked at the raw data set to assess what steps needed to be done. I looked for any unnecessary rows and columns, checked the column names, and anything else that looked out of the ordinary for a data set. In my specific case, I noticed there were a couple of columns and rows that were unnecessary and the column names were incorrect. So I dropped said columns and rows, set the columns to their proper respective names, and reset the index of the data frame. Lastly, I ensured the data types of the columns were the best possible type that allowed me to properly analyze the data frame.  

YEAR MONTH U.S._STATE POSTAL.CODE NERC.REGION CLIMATE.REGION ANOMALY.LEVEL CLIMATE.CATEGORY CAUSE.CATEGORY CAUSE.CATEGORY.DETAIL HURRICANE.NAMES OUTAGE.DURATION DEMAND.LOSS.MW CUSTOMERS.AFFECTED RES.PRICE COM.PRICE IND.PRICE TOTAL.PRICE RES.SALES COM.SALES IND.SALES TOTAL.SALES RES.PERCEN COM.PERCEN IND.PERCEN RES.CUSTOMERS COM.CUSTOMERS IND.CUSTOMERS TOTAL.CUSTOMERS RES.CUST.PCT COM.CUST.PCT IND.CUST.PCT PC.REALGSP.STATE PC.REALGSP.USA PC.REALGSP.REL PC.REALGSP.CHANGE UTIL.REALGSP TOTAL.REALGSP UTIL.CONTRI PI.UTIL.OFUSA POPULATION POPPCT_URBAN POPPCT_UC POPDEN_URBAN POPDEN_UC POPDEN_RURAL AREAPCT_URBAN AREAPCT_UC PCT_LAND PCT_WATER_TOT PCT_WATER_INLAND OUTAGE.START OUTAGE.RESTORATION
2011 7 Minnesota MN MRO East North Central -0.3 normal severe weather nan nan 3060 nan 70000 11.6 9.18 6.81 9.28 2.33292e+06 2.11477e+06 2.11329e+06 6.56252e+06 35.5491 32.225 32.2024 2308736 276286 10673 2595696 88.9448 10.644 0.411181 51268 47586 1.07738 1.6 4802 274182 1.75139 2.2 5348119 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2011-07-01 17:00:00 2011-07-03 20:00:00
2014 5 Minnesota MN MRO East North Central -0.1 normal intentional attack vandalism nan 1 nan nan 12.12 9.71 6.49 9.28 1.58699e+06 1.80776e+06 1.88793e+06 5.28423e+06 30.0325 34.2104 35.7276 2345860 284978 9898 2640737 88.8335 10.7916 0.37482 53499 49091 1.08979 1.9 5226 291955 1.79 2.2 5457125 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2014-05-11 18:38:00 2014-05-11 18:39:00
2010 10 Minnesota MN MRO East North Central -1.5 cold severe weather heavy wind nan 3000 nan 70000 10.87 8.19 6.07 8.15 1.46729e+06 1.80168e+06 1.9513e+06 5.22212e+06 28.0977 34.501 37.366 2300291 276463 10150 2586905 88.9206 10.687 0.392361 50447 47287 1.06683 2.7 4571 267895 1.70627 2.1 5310903 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2010-10-26 20:00:00 2010-10-28 22:00:00
2012 6 Minnesota MN MRO East North Central -0.1 normal severe weather thunderstorm nan 2550 nan 68200 11.79 9.25 6.71 9.19 1.85152e+06 1.94117e+06 1.99303e+06 5.78706e+06 31.9941 33.5433 34.4393 2317336 278466 11010 2606813 88.8954 10.6822 0.422355 51598 48156 1.07148 0.6 5364 277627 1.93209 2.2 5380443 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2012-06-19 04:30:00 2012-06-20 23:00:00
2015 7 Minnesota MN MRO East North Central 1.2 warm severe weather nan nan 1740 250 250000 13.07 10.16 7.74 10.43 2.02888e+06 2.16161e+06 1.77794e+06 5.97034e+06 33.9826 36.2059 29.7795 2374674 289044 9812 2673531 88.8216 10.8113 0.367005 54431 49844 1.09203 1.7 4873 292023 1.6687 2.2 5489594 73.27 15.28 2279 1700.5 18.2 2.14 0.6 91.5927 8.40733 5.47874 2015-07-18 02:00:00 2015-07-19 07:00:00

 

The bar plot below presents the count of outages that occurred in each U.S. state. One interesting note is that California seems to have a significantly higher count of outages compared to the rest of the states, even though it is not the biggest state by land area.

The scatter plot below presents the relationship between the OUTAGE DURATION and DEMAND LOSS MW columns of Outages. The plot suggests that short Outage Duration times results in a higher Demand Loss (MW) and longer Outage Duration times results in a lower Demand Loss (MW).

 

This table shows gives a break down of Outage CATEGORY CAUSE for each state, showing which Outage ‘CATEGORY.CAUSE’ value is the most and least common among each state. Also gives a general idea of how severe the problem of outages is in each state.  

U.S._STATE equipment failure fuel supply emergency intentional attack islanding public appeal severe weather system operability disruption
Alabama 0 0 1 0 0 5 0
Alaska 1 0 0 0 0 0 0
Arizona 4 0 18 0 0 4 2
Arkansas 1 0 6 1 7 10 0
California 21 17 24 28 9 70 41
Colorado 0 1 5 1 0 4 4
Connecticut 0 0 8 0 0 10 0
Delaware 1 0 37 0 0 2 1
District of Columbia 1 0 0 0 0 9 0
Florida 4 0 2 0 3 26 10

 

Assessment of Missingness

NMAR Analysis:

In the Outages data frame, the ‘CAUTEGORY.CAUSE.DETAIL’ column is supposed to give a specific reason for the ‘CATEGORY.CAUSE’ of the outage. This column could be NMAR because the reason why a value would be missing can be due to negligence of the person logging the data because they may have felt the specific reason may not be significant enough or if an investigation was not done. This column would be MAR if another column provided information on whether an investigation was done to figure out the specific cause of the outage.

Missingness Dependency:

I wanted to find out a column in which the missingness of ‘CUSTOMERS.AFFECTED’ was dependent and a column in which its missingness was not dependent. For this investigation, I chose the ‘YEAR’ and ‘PC.REALGSP.CHANGE’ columns to test the missingness dependency of these columns.  

Missingness dependency on the ‘YEAR’ column

For my test of the missingness dependency on the ‘YEAR’ column, I first created a distribution plot as seen below. This plot shows the distribution of ‘YEAR’ by missingness of ‘CUSTOMERS.AFFECTED’. To investigate whether the differences in the distributions were significant, I decided to perform a permutation test.  

 

Null Hypothesis: The distribution of ‘YEAR’ when ‘CUSTOMERS’ is missing is the same as the distribution of ‘YEAR’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Alternative Hypothesis: The distribution of ‘YEAR’ when ‘CUSTOMERS’ is missing is not the same as the distribution of ‘YEAR’ when ‘CUSTOMERS.AFFECTED’ is not missing.

I decided to use the total variation distance (TVD) as my test statistic, the observed TVD value I saw was 0.306. Additionally, I chose a significance level of 0.05 as a cut-off for my p-value, since a p-value smaller than 0.05 indicates strong evidence against my null hypothesis. The plot below shows the results of my permutation test. It displays the empirical distribution of the generated TVDs under the null. The green line shows the observed value. The p-value I calculated was essentially 0.  

In conclusion, we reject the null, since my p-value, 0, is smaller than the 0.05 significance level. There is strong enough evidence to suggest the distribution of ‘YEAR’ when ‘CUSTOMERS.AFFECTED’ is missing is not the same as the distribution of ‘YEAR’ when ‘CUSTOMERS.AFFECTED’ is not missing. As a result, the evidence suggests that the ‘CUSTOMERS.AFFECTED’ column may be dependent on the ‘YEAR’ column.  

Missingness Dependency on the ‘PC.REALGSP.CHANGE’ Column

For my test of the missingness dependency on the ‘PC.REALGSP.CHANGE’ column, I first created a kernel density plot as seen below. This plot shows the kernel density of ‘PC.REALGSP.CHANGE’ by missingness of ‘CUSTOMERS.AFFECTED’. Since the difference of means of the plots was 0.0577, which tells us the means of the two distributions are similar. But as seen in the plot the shapes look different. As a result, I decided to test whether the differences in shape were significant or not by using the Kolmogorov-Smirnov (ks) test statistic as my test statistic for this permutation test.

Null Hypothesis: The shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is missing is the same as the shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Alternative hypothesis: The shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is missing is not the same as the shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Additionally, I chose a significance level of 0.05 as a cut-off for my p-value, since a p-value smaller than 0.05 indicates strong evidence against my null hypothesis.

After performing the permutation test with the ks as my test statistic, I got a p-value of 0.279.

Since the p-value is greater than the 0.05 significance level, we fail to reject the null. There is not enough evidence to suggest that the shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is missing is not the same as the shape of the distribution of ‘PC.REALGSP.CHANGE’ when ‘CUSTOMERS.AFFECTED’ is not missing. Which suggests the missingness of ‘CUSTOMERS.AFFECTED’ may not be dependent on ‘PC.REALGSP.CHANGE’.

Overall Conclusion For Missingness Dependency Investigation:

In my missingness dependency investigation, I concluded that there is strong enough evidence to suggest that the missingness of the ‘CUSTOMERS.AFFECTED’ column may be dependent on the ‘YEAR’ but the same can not be said for the ‘PC.REALGSP.CHANGE’. Since my investigation shows that there is not enough evidence to suggest that the missingness of ‘CUSTOMERS.AFFECTED’ may not be dependent on ‘PC.REALGSP.CHANGE’.  

Hypothesis Testing

Is there significant difference between the distribution of outages of the seasons in the SPP NERC Region compared to the Overall distribution of outages in the NERC Regions?

To answer my main question I decided to compare the distributions of the outages in each season of the SPP Region and the Overall NERC Region distribution of outages in each season. To create these distributions, I first created a new column called ‘Season’, which stated the season in which the outage occurred. Once I did that I then created a table that displays the distributions, the table below displays a couple of the NERC Regions.  

NERC.REGION Fall Spring Summer Winter
ECAR 0.235294 0.264706 0.411765 0.0882353
FRCC 0.325581 0.186047 0.325581 0.162791
FRCC, SERC 0 0 0 1
HECO 0.666667 0 0.333333 0
HI 1 0 0 0

 

The plot below displays the distributions of the outages in each season in the SPP Region and the Overall NERC Region distribution of outages in each season. I noticed that there does seem to be a difference between the distributions but to test whether the differences are significant or not I decided to perform a hypothesis test. Like my previous tests, I chose a significance level of 0.05 as a cut-off for my p-value, since a p-value smaller than 0.05 indicates strong evidence against my null hypothesis.  

 

Null Hypothesis: There is not a difference between the outages distributions of the seasons in the SPP Region and the Overall outages distributions of the seasons of the NERC Regions

Alternative Hypothesis: There is a difference between the outages distributions of the seasons in the SPP Region and the Overall outages distributions of the seasons of the NERC Regions

For my hypothesis test, I will use total variation distance (TVD) as my test statistic, since I am comparing the distributions of outages in each season which are categorical distributions. My observed TVD was 0.123.

After performing my hypothesis, I plotted my results in the plot below. The plot displays the empirical distributions of the generated TVDs under the null. The green line in the plot represents my observed TVD. Additionally, my calculated p-value was 0. Which can be seen in the graph since there are essentially no TVDs generated in the hypothesis test under the null that were equal to or greater than my observed TVD.

In conclusion, the probability that the observed TVD came from the distrubtion of TVDs under that the assumption that null is true is essentially 0. There is strong enough evidence to suggest that there is a difference between the outages distributions of the seasons in the SPP Region and the Overall outages distributions of the seasons of the NERC Regions