Pandas is an open-source Python package that provides users with high-performing and flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular and quintessential tools leveraged by data scientists when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phase. An accurate model requires good predictors and, in order to acquire them, the user must understand the raw data. Through Pandas’ numerous data wrangling and analysis tools, this important step can easily be achieved. The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The data set used for this demo consists of a supermarket chain’s sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item’s sale.
Reading the Data
When starting a new Python script, modules required for the analysis, including Pandas, must be imported into the environment:
The Pandas package is then used to load the data into the environment in a specific Pandas data frame structure. Pandas can read from and write to a wide array of format types such as CSV, JSON, and Parquet. For this example, the data is contained in a CSV file that is saved in the current working directory folder for the Python environment:
Once the data frame is created, there are a variety of viewing and inspecting tools available in order to achieve a better understanding of the raw data. The df.head(n) and df.tail(n) functions allow users to examine the first and last rows respectively:
The df.shape and df.info provide information about the number of rows and columns in a data frame, the data types, and missing data:
Basic statistical values, such as the minimum and maximum of a column, can be calculated by wrapping the appropriate function around the df.column_name in parentheses:
Data Cleaning and Feature Engineering
After getting a general overview and understanding of the data, the next step toward successful model development is cleaning the data and creating new, possibly more influential variables from the existing raw data.
The variable Item_Identifier follows a labeling pattern of letters per each product (i.e., ‘FD’ for food, ‘DR’ for drinks and ‘NC’ for n) followed by a three-digit code. It may be more useful to have a group type variable with just these two letters rather than the entire code. To achieve this, the .map() function applies a selection of only the first two values in the item identifier and returns as a new column labeled Item_Group_Type.
In order to analyze the outlet establishment year as a numerical variable, a new column entitled Outlet_Age can be calculated by subtracting the outlet’s year by the the max year value of the dataset plus one (assuming this data’s collection ended the prior year). The Outlet_Establishment_Year variable can then be dropped using the Pandas df.drop() function.
Since most machine learning models in Python will return errors if null values exist within the data, identifying their existence and rectifying the issue is a crucial step. The code below counts the number of null values for each column in the store sales data frame:
There are a variety of methods to address null values within a data set. Some common approaches include simply removing the rows containing the null values, forward or backward filling of values for timeseries data, or replacing the null value with a calculated value. This data set contains a significant number of null values. Removing the entire subject could lead to a lack of complete data, therefore filling in the missing values is a more appropriate method.
Since the Item_Weight variable is numeric, replacing a null value with the item’s average weight is a logical approach. The first line of the code uses Pandas df.groupby() function combined with the .agg function to find the mean weight of each unique item and store the results in another Pandas data frame. In lines two and three, setting the index of the original data frame to match the index of the new Item_Identifier_Mean data frame allows the null values to be easily imputed with their matching mean values. It is then necessary to check to see if this method resolved all the null values. Line 5 of the code reveals that four rows still contain null values.
To investigate the cause of this persisting issue, the new mean imputed data frame is merged on Item_Idenitifier with the original data frame using Pandas df.merge(). The merge reveals that the items with null values in the new data frame only appeared once in the original data and had no weight information, therefore a mean could not be calculated.
This same method is then applied to the Outlet_Size variable, except the mode of the outlet type is used as the imputed value since Outlet_Size is categorical. To calculate the mode, the outlet types are converted into a Pandas structure called a series, then the mode is applied to each of these series using the .apply function. From there, a similar merge as before can take place. However, since the merge is on a column name instead of the index, both Outlet_Size columns are retained in the new dataset with “_x” and “_y” appended to the names. Only one of these columns is needed, therefore renaming to remove the “_x” and dropping the Outlet_Size_y column is conducted in line 4.
Examining the new cleaned data shows that Outlet_Type no longer contains null values.
Returning to the null Item_Weight values, since there were four items with only one record each, these values can simply be dropped. Checking the final cleaned data set reveals that all null values have been corrected:
Dividing the features into categorical and numerical sets allows for further examination of any other possible incongruities in the data. Using the df.select_dtypes() function, two data frames are created containing only the specified data types:
Through the use of a for loop and the .unique() function, the number of unique values for each categorical feature and the label can be displayed. As is fairly common with categorical variables recorded from different sources, the labeling technique of Item_Fat_Content is inconsistent. “Low Fat” is represented as both “low fat” and “LF” while “Regular” is also recorded as “reg”. This type of discrepancy will cause issues when creating dummy variables.
To fix the fat content labeling issue, the .replace() function applies a new uniform naming schema to the Item_Fat_Content variable:
With the data cleaning and feature engineering completed, an even closer examination of the data and its relations can be conducted using Pandas in conjunction with the Matplotlib library. The histogram below shows that the target variable, Item_Outlet_Sales, is right skewed:
Applying the .corr() function to the numerical features data frame created earlier, provides a symmetrical data frame of the variables’ correlations to each other. Plotting a heatmap of this correlation data frame highlights that Item_MRP has the strongest, positive correlation with the target variable while Item_Visibility has a negative correlation.
Using the .pairplot() function allows for visualization of relationships among the all the numeric variables at once:
The categorical variables’ relationships with the target variable can also be examined through the use of df.pivot_table(). This function operates like pivot tables in Excel by creating an index and applying an aggregation function over a specified value. Plotting these pivot tables then allows for easy trend visualization. From the Impact of Item Type chart, it appears that there are too many unique groups for a significant difference in sales to exist. The Impact of Outlet Size chart reveals that medium-sized stores have, on average, more sales.
This data analysis and exploration will aid in guiding feature, model, and parameter selection during the model development phase.
Finally, the creation of dummy variables is required in order to use the categorical variables in the modeling process. Pandas has an easy to use function, pd.get_dummies(), that converts each of the specified columns into binary variables based on their unique values. For instance, the Outlet_Size variable is now decomposed into three separate variables: Outlet_Size_High, Outlet_Size_Medium, Outlet_Size_Small.
With pre-processing, cleaning, and data exploration complete, the final phase of modeling can now take place. Sklearn is a commonly used machine learning library in Python that contains multiple modeling and evaluation tools. The first step is to enable the train_test_split() function of this package to divide the cleaned data frame into two separate data frames. The larger data frame, which will represent 85% of the entire data, will be used to train the model, while the remaining 15% will be used to evaluate and determine whether the model is appropriate.
Next, the train and test data frames are each divided into two separate data frames, one containing the desired predictors and the other containing the target variable:
Since the target variable is continuous, a simple, yet standard approach is to test a linear regression model. Once imported from the sklearn package, the function is applied to the train data using the model.fit() function. The predictions are then stored in an array using model.predict(). Model evaluation is conducted by using a variety of the metric functions from sklearn, along with plotting the actual vs. predicted values. From these results, it appears linear regression may not be the best model for this data.
The next step is to try a different model to see if it produces better results. XGBoost is an implementation of gradient boosted decision trees. It is a commonly applied model that is designed for speed and high performance. As in the linear regression modeling, once the XGBRegressor tool is imported, it is applied to the train data and predictions are stored to evaluate the model. The metrics and plot both reveal higher performance than the linear regression model:
This model then needs to be evaluated against the test data to determine if it is in fact a good model. The test data reveals that there is still some relevance to the model but further parameter tuning, and possibly other model selections may lead to better results.
Overall, as evident by this demo, the actual fitting and tuning of a model is a small step compared to the entire machine learning process. In order to even obtain a data set for modeling, pre-preprocessing, data cleaning, feature engineering and exploration are all required. These stages are simplified and easily executable using the Pandas package. Pandas enables data scientists to not only understand and analyze the data themselves but also display the results of their analysis and modeling to their audience, making this library imperative for successful machine learning in Python.