Comparing PyTorch to R's lm() For Prediction

Scope

In a previous blog post, I talked about using the lm() function in R to model the in-store sales of a small retail clothing store. This is the story of me doing more or less the same thing using Python, and in particular the Pytorch neural network library. A word of caution: this is not really comparing R's lm() and PyTorch, this is actually comparing my ability to use R's lm() and my ability to use PyTorch. Also, it is just for one system (the retail store), and even if it were a fair comparison of the two ways of predicting future results for this store, you might get different results with other kinds of systems.

The code is all available on github here, in case you want to take a look, along with the data as a csv file.

What The Data Looks Like

It's pretty much the date, some freeform notes, and the total sales for the day, excluding websales, and also scaled so that "1" equals "the highest value in the history of the store thus far".

salesdate	notes	total_sales
2010-01-02		0.19223003042097855
2010-01-03		0.0278870039985653
2010-01-04		0.06632171845317959

We transform the date to give separate columns for day of the week, day of the month, month of the year, a number of dummy columns for holidays and special events, and so forth. There are also some columns for things like the fact that the store moved twice, and how we handle it when they sell tickets for concerts at a local venue.

I should probably also add here that this is all so much easier because of pandas, which is a splendid library for using Python in data analysis. For the most part I'm not going to talk too much about the data cleanup, because I went over all that in the previous post, and I'm pretty much doing the same thing except using pandas instead of R. Just to give one example though, here's how I used pandas to create the day-of-the-week columns:


    dayOfWeek={0:'1Monday', 1:'2Tuesday', 2:'3Wednesday', 3:'4Thursday', 4:'5Friday', 5:'6Saturday', 6:'7Sunday'}
    df['weekday']              = df['salesdatetime'].dt.dayofweek.map(dayOfWeek)
    df['weekday']              = df['weekday'].astype('category')
    df,cat_cols_to_drop        = convert_categorical(df,'weekday',cat_cols_to_drop)

The function "convert_categorical" which I call in that last line looked like this:


def convert_categorical(df,col_name,cat_cols_to_drop):
    values = df[col_name].unique()
    print(col_name,df.groupby(col_name).salesdate.nunique())
    one_hot = pd.get_dummies(df[col_name])
    for v in values:
        df[col_name+':'+v] = one_hot[v]
    cat_cols_to_drop.append(col_name)
    return df,cat_cols_to_drop

A "one_hot" is a term for a group of columns with 1's and 0's to represent a single factor, as for example 7 columns to represent the days of the week. The "cat_cols_to_drop" was a list of columns that I wanted to drop from the dataframe once I had created all of the new columns (e.g. for holidays). For comparison, here's how I handled the weekday categories in R:


so_data$weekday              <- weekdays(as.Date(so_data$salesdate))
so_data$weekday              <- factor(so_data$weekday,
                                   levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"),
                                   labels=c("1Mon","2Tue","3Wed","4Thu","5Fri","6Sat","7Sun"))

It seems slightly shorter in R, but in practice the "convert_categorical" code in python gets used by every categorical factor (month, holidays, events etc.) so the total amount of code is not that different.

Train, Test, and Test For Real

In the first script, after transforming the input data in various ways, I save it as two separate dataframes, using a cutoff date. This allows me to simulate training a neural network using the pre-cutoff data, and then testing it on the post-cutoff data. But, within the second script that actually creates and trains the network, I split the data again (this time randomly) into "test" and "train" subsets. The reason why has to do with "overfitting", which has been described as the original sin of machine learning.

Imagine that you knew about four days: 0.1 on Monday the 3rd of January, 0.5 on Saturday the 8th of January, 0.2 on Monday the 10th of January, and 0.6 on Saturday the 15th of January. A valid model would be something like 0.15 if it's a Monday, 0.55 if it's a Saturday. On the other hand, an invalid, overfitted model would be something like 0.1 if it's a Monday, 0.1 if it's the 10th of the month, 0.5 if it's a Saturday, and 0.1 if it's the 15th of the month. However, if we were to calculate an R-squared, the second (incorrect) model would do better. We have not just fitted the model to the data, we have "overfitted".

There are various ways to fix this, but one simple way is to hold back a fraction of the data, called the "test" data, and only modify the model coefficients using the other part, called the "train" data. The test data is only used to determine when you have stopped improving, so that you can stop the training at that point. Thus hopefully, you will fit the model using the training data, but not overfit.

Pytorch seems to be the most convenient python-based neural network library out there, at least that I have seen thus far. Having written neural network code myself back in the late 90's (in QBASIC, and I shudder to think what that code looked like), I have checked in on that field every few years for the last two decades or so. Pytorch was the first time when I thought, "hey, using this looks easier than just writing it myself".

The heart of the pytorch-related code is here:


    N, D_in, H, D_out = len(x_train), len(x_train[0]), 25, 1
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )

For those who care, this is a three-layer network, so it has one "hidden" (i.e. neither input nor output) layer. The input and output layers are linear, but the hidden layer is "ReLU", or "rectified linear unit". This means essentially that a node in the hidden layer computes the weighted sum of the inputs, and returns the greater of 0 and that weighted sum. So, for values greater than 0 it is a linear unit, but it is rectified to never return a negative result. There are a lot of other options here, and there are a bajillion papers written about how this should be done, but ReLU is commonly used in PyTorch tutorials and it seems to work. More traditional neural networks used a sigmoid function, either from -1 to +1 or 0 to +1, but the ReLU seems to work fine.

This is one of about fifty million parameters which you could decide to tweak, from the learning rate to the loss function to the number of hidden layers to the presence of momentum factors, etc. etc. ad infinitum. My experience is that if you are getting too deep into this, you're doing it wrong. The most important factor in how your model does is how good your input data is, and whether it's coded/transformed in a way that makes it a tractable problem for the machine learning algorithm to train on. If you get substantially different results because you used a different value of one of these parameters for the network, I would worry that your results will be brittle, and not generalize to new data or especially new systems. If it is a choice between time spent optimizing across network parameters, or time spent coding and transforming your data so that it represents the problem space better (e.g. "day of the week seems important to retail, maybe I should make that a separate input column"), spend your time on the latter, it is far more likely to pay off in better real-world results.

The actual training happens here:


    loss_fn                       = torch.nn.MSELoss(size_average=False)
    best_so_far                   = None
    learning_rate                 = 1e-4
    for t in range(inputs['max_epochs']):
        y_train_pred              = model(x_train)
        loss                      = loss_fn(y_train_pred, y_train)
        #...we'll talk about the lines that go here a little bit later
        model.zero_grad()
        loss.backward()
        for param in model.parameters():
            param.data -= learning_rate * param.grad.data

An "epoch" means a pass through all of our training data. The "loss function" means how we calculate "was that better or worse?". Here we use the "mean squared error". For each pass through the training data, we calculate the loss function, and "back propagate", which is to say the bigger our error the more we tweak the model parameters, especially for the neurons which were firing a lot for that set of inputs. There are lots of math details there, but if you have halfway plausible choices it should work, and pytorch docs and tutorials help guide you to reasonable choices.

I skipped some code in the middle of that for loop above, the one we do for each "epoch". It is included here for separate attention and discussion.


        if t != 0 and (t & (t - 1) == 0): #i.e. t is a power of 2, check if we should stop
            y_test_pred           = model(x_test)
            loss_test             = loss_fn(y_test_pred,y_test)
            print('epoch',t,'loss on train data',loss.data[0],' and on test data ',loss_test.data[0])
            y_test_pred_values    = y_test_pred.data.numpy()[:,0]
            y_test_values         = y_test.data.numpy()[:,0]
            if not best_so_far or loss_test.data[0] < best_so_far:
                best_so_far = loss_test.data[0]
            elif t < 100: #i.e. we have just gotten started here
                pass
            else: #i.e. it has not improved for the last 50% of t, meaning we should stop
                break

The purpose of this code is to check if we have either gotten stuck, or are now no longer actually improving, but just overfitting. We calculate the "loss" (i.e. error) using the loss_fn, but we do it using our test data rather than our train data. This means that, if we have begun overfitting, we are not overfitting to this data, because it is not included in our training set. Then, we check if this is the best result so far. If it is, we are still making progress, and should keep going (up to some maximum which I normally set to 10,000 epoch because that seems to be enough). If it is not, but we have not even done 100 epochs yet, then let's not be so hasty as to stop yet. For the intermediate case, between 100 and 10,000 epochs, we will only keep going if we have improved since the last time we checked.

So, how often do we check? Here we use a nifty trick based on the fact that binary math is, you know, based on powers of two:

if t != 0 and (t & (t - 1) == 0): #i.e. t is a power of 2

This is not terribly clear code, hence the comment. Basically you are taking a bitwise "AND" of two binary numbers, one of which is 1 less than the other. This will only be 0 in the case where it is a power of 2, for example 256 = "100000000" and 255 = "011111111". If we have not improved since the last time we checked, and we are checking every (power of 2) epochs, that means that for the last 50% of the epochs we have made no real progress. For example, on epoch 1024 we are no better than we were at epoch 512. In this case, it's pretty clear we are stuck, or overfitting, so we stop.

I should note that it is entirely possible for neural networks to get stuck. The initial weights are assigned small random numbers, to avoid certain weird situations that can happen when you use all the same number. Usually, you will end up with something roughly equivalent if you are running it multiple times against the same data. It is not impossible, however, to get caught in a "local minimum", where any change to the weights results in a worse result. There are a variety of tricks to make this less likely, some of which are called 'annealing' out of a metaphorical comparison to what happens with atoms in a cooling metal. It appears from my experience that the pytorch folks have done a good job at all of this, because I don't notice us getting stuck in local minima very often, but it can happen, because it is still a stochastic process. It's worth running this whole process a few times, just to see if you get substantially different results.

So, when I create and train my neural network, I get these results:


epoch 1 loss on train data 37.865196228027344  and on test data  8.005111694335938
epoch 2 loss on train data 29.715255737304688  and on test data  6.018046855926514
epoch 4 loss on train data 26.748794555664062  and on test data  5.298993110656738
epoch 8 loss on train data 24.492658615112305  and on test data  4.786937713623047
epoch 16 loss on train data 21.734256744384766  and on test data  4.188896656036377
epoch 32 loss on train data 18.938934326171875  and on test data  3.6367855072021484
epoch 64 loss on train data 16.49313735961914  and on test data  3.2282917499542236
epoch 128 loss on train data 14.67071533203125  and on test data  2.9947779178619385
epoch 256 loss on train data 13.651124000549316  and on test data  2.914315700531006
epoch 512 loss on train data 13.050300598144531  and on test data  2.8654191493988037
epoch 1024 loss on train data 12.370577812194824  and on test data  2.8132970333099365
epoch 2048 loss on train data 11.320158958435059  and on test data  2.773655652999878
epoch 4096 loss on train data 10.35883903503418  and on test data  2.715595245361328
epoch 8192 loss on train data 9.743453979492188  and on test data  2.752704620361328
r_value 0.528460251656 p_value 6.84713809868e-60 std_err 0.0455443863372

But, if I do it again on precisely the same data, I get:


epoch 1 loss on train data 27.38541030883789  and on test data  5.716619968414307
epoch 2 loss on train data 25.527198791503906  and on test data  5.264286994934082
epoch 4 loss on train data 24.2283935546875  and on test data  4.976038932800293
epoch 8 loss on train data 22.383073806762695  and on test data  4.5771260261535645
epoch 16 loss on train data 20.08486557006836  and on test data  4.097776412963867
epoch 32 loss on train data 17.666545867919922  and on test data  3.6203649044036865
epoch 64 loss on train data 15.615555763244629  and on test data  3.217402935028076
epoch 128 loss on train data 14.272446632385254  and on test data  2.9643828868865967
epoch 256 loss on train data 13.463448524475098  and on test data  2.84104061126709
epoch 512 loss on train data 12.743260383605957  and on test data  2.773592233657837
epoch 1024 loss on train data 11.818140983581543  and on test data  2.732182741165161
epoch 2048 loss on train data 10.91415786743164  and on test data  2.730459451675415
epoch 4096 loss on train data 10.29919719696045  and on test data  2.7701609134674072
r_value 0.519299414092 p_value 1.51873460912e-57 std_err 0.0480106654575

The results are similar, but not identical. In this case, it even resulted in the network deciding to pack it in after 4096 epochs, whereas the first time it continued until 8192 (for quite meager gains, btw). In both cases, the R-squared between the network's predictions and the actual store sales was around 0.52, on the test sample. But remember, this test sample was just randomly selected from the same population that we took the training sample from. In the real world, we would take our training from a given period of time, and then hope to be able to make predictions about the system's behavior at a later point in time. Predictions about the future, as Yogi Berra might have said.

In the end, as with R, the only really reliable test of whether or not your system can predict future results, is using it to try to predict future results. There is really no substitute, because even if you have done everything perfectly the system you're trying to predict may simply have changed, and a model (whether linear or neural network or any other kind) which is based on past data may not be able to make good predictions. But, assuming that the system in question is more or less the same as before, pandas + pytorch seems to be a pretty painless way to turn your past history into a trained model.

So, Which Is Better?

Points in favor of R's lm()

While the model coefficients can sometimes be easy to misinterpret (see previous post for more discussion of this), it is possible to learn something from looking at them. The model weights in a neural network, on the other hand, are for all practical purposes opaque. If you are looking for insight from your model, and not just from its predictions, then lm() is better than pytorch.
If you have reason to believe that the system in question changes slowly over time, lm() has an easy way to weight the data towards more recent dates (using the "weights" argument to lm). Pytorch probably has some way to do it, but if it is as easy as this I have not yet found it.
Categorical inputs (aka 'factors') are certainly possible to use in pytorch, and not especially difficult, but it does seem like in R it is a bit less code and fewer steps.

Points in favor of pytorch

Since lm() is not stochastic, and the model weights are calculated more or less all at once, it is not as straightforward to use a split between test and train data to keep your model from overfitting. To put it another way, a test/train split can help you recognize that your model is overfitted, but not help you to stop before that happens, because it happens all at once. Pytorch can be trained up until the test/train split indicates you have gotten all the real signal there is to get, and stop before you fit to any noise.
There are fundamental mathematical limits to how well you can simulate a system with R's linear model. A neural network, with at least one hidden layer, can in theory learn just about anything (and if you add more layers it becomes more than just theoretically anything). If you have a system that you suspect has some highly nonlinear features, it is more likely that a multilayer neural network can learn it, than a linear model such as lm(). In practice, though, most problems which are not image recognition, can be recognized well enough with either one.

Having done the same problem (predicting a small clothing retailer's in-store sales) with both R's lm() and pytorch, which do I prefer? Well, first of all one should admit that both get the job done. If you are comfortable with one and not with the other, then use the one you are comfortable with. But better yet, is to become comfortable with both.