Major Outage Cause Category Classifier
By Diego Silva (d1silva@ucsd.edu). My exploratory data analysis on this dataset can be found here.
Introduction
For this project, I will be using a dataset containing major power outages reported by different states in the United States from January 2000 to July 2016. The dataset contains 1534 rows and 55 columns. My goal is to predict the cause category that resulted in a major outage. Since I am trying to predict the category of the cause of a major outage, this is a multiclass classification problem, so I will be predicting the cause category of future major outages using a Decision Tree Classifier (DTC). My response variable is the cause category. I decided to predict the cause category of major outages because I wanted to see whether, given a set of conditions, it is possible to predict the cause of a major outage that occurred.

The data I will have access to includes geographical, time, weather, and demographic data. Specifically, I will be using the Month, U.S. State, NERC Region, Climate Region, Hurricane Name, Population, and the percentage of the U.S. state's total population that is urban. Since I am trying to predict the cause category of a major outage that already occurred, this data would already have been recorded and available for me to use.

I will be assessing the quality of my model using accuracy, to see how well it predicts the cause category of a major outage. I chose accuracy as my evaluation metric because I want to see how many predictions my model gets correct overall. I am not worried about false positives or false negatives, and for the scope of my problem, precision and recall matter little, since there are no serious consequences to any false positives or false negatives my model might make. For the scope of my problem, overall accuracy is more important than knowing the breakdown of false predictions.
Data Cleaning
In my data cleaning process, I first looked at the raw dataset to assess what steps needed to be done. I looked for any unnecessary rows and columns, checked the column names, and anything else that looked out of the ordinary. In my case, I noticed a couple of columns and rows that were unnecessary, and the column names were incorrect. So I dropped those columns and rows, set the columns to their proper names, and reset the index of the DataFrame. I then converted each column to the most appropriate data type, which allowed me to properly analyze the DataFrame. Lastly, I combined the ‘OUTAGE.START.DATE’ and ‘OUTAGE.START.TIME’ columns into a single column called ‘OUTAGE.START’, and the ‘OUTAGE.RESTORATION.DATE’ and ‘OUTAGE.RESTORATION.TIME’ columns into a single column called ‘OUTAGE.RESTORATION’, as sketched below.
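A minimal sketch of that last step, assuming the DataFrame is named `outages`, the date columns have already been parsed as datetimes, and the time columns are strings like '17:00:00':

```python
import pandas as pd

# Combine the separate date and time columns into single datetime columns.
# errors="coerce" turns any row with a missing date or time into NaT
# instead of raising.
outages["OUTAGE.START"] = pd.to_datetime(
    outages["OUTAGE.START.DATE"].dt.strftime("%Y-%m-%d")
    + " "
    + outages["OUTAGE.START.TIME"].astype(str),
    errors="coerce",
)
outages["OUTAGE.RESTORATION"] = pd.to_datetime(
    outages["OUTAGE.RESTORATION.DATE"].dt.strftime("%Y-%m-%d")
    + " "
    + outages["OUTAGE.RESTORATION.TIME"].astype(str),
    errors="coerce",
)
```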
Outages DataFrame
YEAR | MONTH | U.S._STATE | POSTAL.CODE | NERC.REGION | CLIMATE.REGION | ANOMALY.LEVEL | CLIMATE.CATEGORY | OUTAGE.START.DATE | OUTAGE.START.TIME | OUTAGE.RESTORATION.DATE | OUTAGE.RESTORATION.TIME | CAUSE.CATEGORY | CAUSE.CATEGORY.DETAIL | HURRICANE.NAMES | OUTAGE.DURATION | DEMAND.LOSS.MW | CUSTOMERS.AFFECTED | RES.PRICE | COM.PRICE | IND.PRICE | TOTAL.PRICE | RES.SALES | COM.SALES | IND.SALES | TOTAL.SALES | RES.PERCEN | COM.PERCEN | IND.PERCEN | RES.CUSTOMERS | COM.CUSTOMERS | IND.CUSTOMERS | TOTAL.CUSTOMERS | RES.CUST.PCT | COM.CUST.PCT | IND.CUST.PCT | PC.REALGSP.STATE | PC.REALGSP.USA | PC.REALGSP.REL | PC.REALGSP.CHANGE | UTIL.REALGSP | TOTAL.REALGSP | UTIL.CONTRI | PI.UTIL.OFUSA | POPULATION | POPPCT_URBAN | POPPCT_UC | POPDEN_URBAN | POPDEN_UC | POPDEN_RURAL | AREAPCT_URBAN | AREAPCT_UC | PCT_LAND | PCT_WATER_TOT | PCT_WATER_INLAND |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 | 7 | Minnesota | MN | MRO | East North Central | -0.3 | normal | 2011-07-01 00:00:00 | 17:00:00 | 2011-07-03 00:00:00 | 20:00:00 | severe weather | nan | nan | 3060 | nan | 70000 | 11.6 | 9.18 | 6.81 | 9.28 | 2.33292e+06 | 2.11477e+06 | 2.11329e+06 | 6.56252e+06 | 35.5491 | 32.225 | 32.2024 | 2308736 | 276286 | 10673 | 2595696 | 88.9448 | 10.644 | 0.411181 | 51268 | 47586 | 1.07738 | 1.6 | 4802 | 274182 | 1.75139 | 2.2 | 5348119 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 |
2014 | 5 | Minnesota | MN | MRO | East North Central | -0.1 | normal | 2014-05-11 00:00:00 | 18:38:00 | 2014-05-11 00:00:00 | 18:39:00 | intentional attack | vandalism | nan | 1 | nan | nan | 12.12 | 9.71 | 6.49 | 9.28 | 1.58699e+06 | 1.80776e+06 | 1.88793e+06 | 5.28423e+06 | 30.0325 | 34.2104 | 35.7276 | 2345860 | 284978 | 9898 | 2640737 | 88.8335 | 10.7916 | 0.37482 | 53499 | 49091 | 1.08979 | 1.9 | 5226 | 291955 | 1.79 | 2.2 | 5457125 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 |
2010 | 10 | Minnesota | MN | MRO | East North Central | -1.5 | cold | 2010-10-26 00:00:00 | 20:00:00 | 2010-10-28 00:00:00 | 22:00:00 | severe weather | heavy wind | nan | 3000 | nan | 70000 | 10.87 | 8.19 | 6.07 | 8.15 | 1.46729e+06 | 1.80168e+06 | 1.9513e+06 | 5.22212e+06 | 28.0977 | 34.501 | 37.366 | 2300291 | 276463 | 10150 | 2586905 | 88.9206 | 10.687 | 0.392361 | 50447 | 47287 | 1.06683 | 2.7 | 4571 | 267895 | 1.70627 | 2.1 | 5310903 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 |
2012 | 6 | Minnesota | MN | MRO | East North Central | -0.1 | normal | 2012-06-19 00:00:00 | 04:30:00 | 2012-06-20 00:00:00 | 23:00:00 | severe weather | thunderstorm | nan | 2550 | nan | 68200 | 11.79 | 9.25 | 6.71 | 9.19 | 1.85152e+06 | 1.94117e+06 | 1.99303e+06 | 5.78706e+06 | 31.9941 | 33.5433 | 34.4393 | 2317336 | 278466 | 11010 | 2606813 | 88.8954 | 10.6822 | 0.422355 | 51598 | 48156 | 1.07148 | 0.6 | 5364 | 277627 | 1.93209 | 2.2 | 5380443 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 |
2015 | 7 | Minnesota | MN | MRO | East North Central | 1.2 | warm | 2015-07-18 00:00:00 | 02:00:00 | 2015-07-19 00:00:00 | 07:00:00 | severe weather | nan | nan | 1740 | 250 | 250000 | 13.07 | 10.16 | 7.74 | 10.43 | 2.02888e+06 | 2.16161e+06 | 1.77794e+06 | 5.97034e+06 | 33.9826 | 36.2059 | 29.7795 | 2374674 | 289044 | 9812 | 2673531 | 88.8216 | 10.8113 | 0.367005 | 54431 | 49844 | 1.09203 | 1.7 | 4873 | 292023 | 1.6687 | 2.2 | 5489594 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 |
Baseline Model
Features
Since I will be building a Decision Tree Classifier, I picked two features that I believe will be good predictors of the cause category of future major outages: ‘MONTH’ and ‘NERC.REGION’. I chose these features because, after performing a quick exploratory data analysis, I noticed that a significant share of major outages were caused by severe weather, so I chose two variables that reflect that finding. Because these features give insight into the time of year and the region in which each outage occurred, they should provide useful information for deciding whether an outage was caused by severe weather or not. Both of my features are nominal categorical features.
MONTH | NERC.REGION |
---|---|
7 | MRO |
5 | MRO |
10 | MRO |
6 | MRO |
7 | MRO |
11 | MRO |
7 | MRO |
6 | MRO |
3 | MRO |
6 | MRO |
Feature Engineering & Preprocessing
My categorical features need to be transformed before they can be used in my DTC model. Since both of my categorical features are nominal, I will one-hot encode them, turning each categorical feature into numerical columns that my model can use. A minimal sketch of this step is shown below.
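This sketch uses sklearn's OneHotEncoder; the `handle_unknown="ignore"` setting and the `.dropna()` call are assumptions I make so that rows with missing months and categories unseen in training don't break the transformer:

```python
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the two nominal baseline features. Each distinct month
# and NERC region becomes its own 0/1 column.
ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(outages[["MONTH", "NERC.REGION"]].dropna())
print(ohe.get_feature_names_out()[:3])  # e.g. ['MONTH_1.0' 'MONTH_2.0' 'MONTH_3.0']
```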
Baseline Pipeline
I will be splitting my data into two sections, training and testing data, with 75% of my data dedicated to training and 25% dedicated to testing. My model will be trained purely on the training data, but I will assess the accuracy of the model on both the training and testing data to compare the two results. To build my model, I will use a pipeline that preprocesses the categorical features into numerical ones and then trains a Decision Tree Classifier, roughly as follows.
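A sketch of the baseline pipeline under those assumptions (column names come from the dataset; the split is not seeded here, so exact scores will vary run to run):

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Drop the few rows with missing values in the columns we need,
# then do a 75/25 train/test split on the two baseline features.
data = outages[["MONTH", "NERC.REGION", "CAUSE.CATEGORY"]].dropna()
X = data[["MONTH", "NERC.REGION"]]
y = data["CAUSE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# One-hot encode both nominal features, then fit the decision tree.
baseline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["MONTH", "NERC.REGION"]),
    ])),
    ("tree", DecisionTreeClassifier()),
])
baseline.fit(X_train, y_train)
print(baseline.score(X_train, y_train))  # training accuracy
print(baseline.score(X_test, y_test))    # testing accuracy
```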
Summary of Results
My current baseline model has an accuracy of 0.605 on the training data and 0.536 on the testing data. This tells me my model does not predict the training data very well, and it performs even worse on new, never-before-seen data. The model is slightly overfit to the training data and needs more features, since it is only accurate 53.6% of the time when tested on new data. So I would not consider my baseline model “good.”
Final Model
New Features
As previously mentioned, to improve my baseline model I needed more features that give more insight into the cause category of each major outage. I did some more exploratory data analysis and found that the second- and third-highest counts of cause categories of major outages were intentional attacks and system operability disruptions. These cause categories are more general than severe weather, so I looked for more features that could reflect that. The new features I chose were ‘U.S._STATE’, ‘CLIMATE.REGION’, ‘POPULATION’, ‘HURRICANE.NAMES’, ‘AREAPCT_URBAN’, and ‘PCT_WATER_TOT’. Since ‘U.S._STATE’, ‘POPULATION’, and ‘AREAPCT_URBAN’ reflect the demographics of the people affected by an outage, I believe they will better explain the high number of major outages due to intentional attacks and system operability disruptions. As for ‘CLIMATE.REGION’, ‘HURRICANE.NAMES’, and ‘PCT_WATER_TOT’, since severe weather is still the most common cause of major outages, I believe these features give my model better information about weather and land conditions, so that it can more accurately predict whether a major outage was caused by severe weather or not.
Preprocessing Pipeline Additions
- A custom function transformer that binarizes the ‘HURRICANE.NAMES’ column.
- One-hot encoding of the ‘U.S._STATE’ and ‘CLIMATE.REGION’ columns.
- A transformation of the ‘POPULATION’ column into quantiles.
- The ‘AREAPCT_URBAN’ and ‘PCT_WATER_TOT’ columns left as is (see the sketch after this list).
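Putting those pieces together, here is a rough sketch of the final preprocessor and pipeline. The binarizer (hurricane name → 1, missing → 0) is my reading of the custom function transformer, and sklearn's QuantileTransformer is one reasonable implementation of the quantile step:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   QuantileTransformer)
from sklearn.tree import DecisionTreeClassifier

# Maps a present hurricane name to 1 and a missing value to 0.
binarize = FunctionTransformer(lambda df: df.notna().astype(int))

preprocessor = ColumnTransformer(
    transformers=[
        ("hurricane", binarize, ["HURRICANE.NAMES"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["MONTH", "U.S._STATE", "NERC.REGION", "CLIMATE.REGION"]),
        ("quantiles", QuantileTransformer(n_quantiles=100), ["POPULATION"]),
    ],
    remainder="passthrough",  # AREAPCT_URBAN and PCT_WATER_TOT pass through as is
)

final_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("tree", DecisionTreeClassifier()),
])
```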
Decision Tree Classifier Hyperparameter Fine-Tuning
To further improve my model, I decided to fine-tune two hyperparameters of the Decision Tree Classifier: the maximum depth and the minimum samples per split. I chose to tune the max depth because I want my tree to be more expressive; since there are multiple cause categories, I want my tree to be a bit more complex. I chose to tune the minimum samples per split because my tree is not guaranteed to be built symmetrically. Tuning this value controls how small a node can be before it stops splitting, which lets some branches grow deeper than others while limiting splits that would only fit noise, thereby improving generalization performance.
Summary of Final Model and Results
Final Model Breakdown
In conclusion, the final model I chose was a Decision Tree Classifier. The features I chose for my model were ‘MONTH’, ‘U.S._STATE’, ‘NERC.REGION’, ‘CLIMATE.REGION’, ‘POPULATION’, ‘HURRICANE.NAMES’, ‘AREAPCT_URBAN’, and ‘PCT_WATER_TOT’. Five of those features were nominal categorical features, while three were numerical. I one-hot encoded four of the nominal categorical features to turn them into numerical ones, and I binarized ‘HURRICANE.NAMES’ using a custom function transformer I made. As for the numerical features, I turned ‘POPULATION’ into quantiles and left ‘AREAPCT_URBAN’ and ‘PCT_WATER_TOT’ as is. For the hyperparameters of the Decision Tree Classifier, I ended up choosing a max depth of 10 and a minimum samples split of 5. I found these values by performing a grid search: the grid search took my unfit pipeline and a dictionary of candidate values for the hyperparameters I wanted to fine-tune, then performed k-fold cross-validation to find the combination of hyperparameters with the best average validation performance.
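A sketch of that search, building on the `final_pipeline` from earlier; the candidate value ranges here are my assumption, but the best combination found was a max depth of 10 and a minimum samples split of 5:

```python
from sklearn.model_selection import GridSearchCV

# Search over both hyperparameters with 5-fold cross-validation.
# Assumes X_train/X_test now hold all eight feature columns.
param_grid = {
    "tree__max_depth": [2, 5, 10, 15, 20, None],
    "tree__min_samples_split": [2, 5, 10, 20, 50],
}
search = GridSearchCV(final_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
# e.g. {'tree__max_depth': 10, 'tree__min_samples_split': 5}
```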
Results Breakdown
As for the results, I saw a significant improvement in accuracy on both the training and testing data: 0.804 on the training data and 0.589 on the testing data. This tells me that the new features I included and engineered, alongside the hyperparameters I fine-tuned, improved generalization performance, allowing for better predictions of the cause category of major outages on both the training and unseen data. My baseline model was too narrow, as it only had access to two features that relate mostly to severe weather causes but not to the other causes. My final model had more features to make better predictions with, since they gave more insight into both the weather conditions and the demographics of the population affected by each outage.
Fairness Analysis
To assess the fairness of my final model, I want to see whether my model is fair when predicting the cause category of major outages for low-population versus high-population states. I will continue to use accuracy as my evaluation metric for the fairness analysis. Since there is no exact definition of low- and high-population states, I define them here: I first created a new column with the population quantile of each row, then defined low-population states as those with a population quantile of 3 or lower, and high-population states as those with a population quantile greater than 3. I will be using the absolute difference in accuracy between the two groups as my test statistic. Additionally, I chose a significance level of 0.05 as the cut-off for my p-value, since a p-value smaller than 0.05 indicates strong evidence against my null hypothesis. Lastly, to conduct my fairness analysis, I will use a permutation test to test my hypotheses.
Hypotheses:
Null Hypothesis: The classifier’s accuracy is the same for both low-population and high-population states, and any differences are due to chance.
Alternative Hypothesis: There is a difference in accuracy between low-population and high-population states.
Observed Absolute Difference in Accuracy: 0.0411
Set Up
To begin my fairness analysis, I first needed to create a new column containing the population quantile of the state in which each major outage occurred. I then turned that column into a Boolean column: rows with a quantile of 3 or lower were cast as True, and all others as False. With my two groups defined, low- and high-population states, I was able to begin my analysis, sketched below.
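A sketch of the permutation test, assuming the population column is cut into quartiles labeled 1-4 (so quantile ≤ 3 marks the low-population group) and using the fitted `search` model from the grid search above:

```python
import numpy as np
import pandas as pd

# Per-row correctness of the final model on the test set.
preds = search.predict(X_test)
correct = (preds == y_test).to_numpy()

# Boolean group labels: True for low-population states (quantile <= 3).
quantiles = pd.qcut(X_test["POPULATION"], 4, labels=[1, 2, 3, 4]).astype(int)
is_low_pop = (quantiles <= 3).to_numpy()

# Observed test statistic: absolute difference in group accuracies.
observed = abs(correct[is_low_pop].mean() - correct[~is_low_pop].mean())

# Shuffle the group labels to simulate the null, recomputing the statistic.
diffs = []
for _ in range(10_000):
    perm = np.random.permutation(is_low_pop)
    diffs.append(abs(correct[perm].mean() - correct[~perm].mean()))

p_value = (np.array(diffs) >= observed).mean()
```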
Summary of Results
The plot below shows the results of my permutation test. It displays the empirical distribution of the generated absolute differences in accuracy under the null hypothesis, with the red line marking the observed value. The p-value I calculated was 0.46.
Since the p-value is greater than the significance level, 0.46 > 0.05, we fail to reject the null hypothesis. There is not enough evidence to suggest a difference in my model’s accuracy between low- and high-population states.