Feature Engineering

You want your features to have four properties: information, linear independence, resilience, and speed.

Example: you want to figure out who someone is.

Obviously, your features should carry enough information that the model can accurately answer the question. A good feature here is the person's name.

You don't want redundant features. The first character of the name is highly linearly dependent on the name itself. Not perfectly, since a typo, a misclick, or similar noise can creep in, but very nearly so. So the first character of the name adds little if the full name is already a feature.

You also don't want a system so brittle that noise in any one feature throws the model off. So you don't want the name as your only feature, because people change their names, use nicknames, and so forth. If all you've got is the name, and Thomas is going by Tom, you've got a problem. A problem that the first letter of the name, as a separate feature, might solve.

Computers don't know the connection between Thomas and Tom, much less Elizabeth and Liz, so you've got to make judgment calls. But balancing the three aspects so that you maximize their combined norm is a good direction to start in.

Of course, you also want your model to run on finite hardware, so speed is a factor too. In short: maximize norm(I, LI, R) * speed.
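As a toy illustration of that heuristic, here is a sketch in Matlab. The feature names and all the scores are made-up numbers; in practice you'd have to estimate each one for your own problem.

% Hypothetical per-feature scores, each in [0, 1]:
% I = information, LI = linear independence from the other features,
% R = resilience to noise, speed = cheapness to compute.
features = ["name", "first_letter", "birthdate"];
I     = [0.9, 0.3, 0.7];
LI    = [0.8, 0.2, 0.9];
R     = [0.4, 0.8, 0.9];
speed = [0.9, 1.0, 0.8];

% Score each candidate: the norm of (I, LI, R), scaled by speed.
score = vecnorm([I; LI; R]) .* speed;
[~, order] = sort(score, "descend");
disp(features(order))  % candidates ranked by the heuristic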

DSP

I spent pretty much all summer cleaning up a signal. While the deep learning models have been training, I've also significantly improved the inputs: the features are cleaner, with less noise. It's still going to take a few weeks to get the input to the model.

I'm considering, and will probably go with, a TensorFlow model running on Google's Coral board. The issue right now is that I don't know how to make an Intel MAX 10 board talk to a Coral board. The problem with Matlab is that I can't get it running on a deployable board, so it just isn't a meaningful option.

Sequence Training for Neural Networks in Matlab

I had 77 pairs of sequences and sequence responses in Matlab: two cell arrays, sequences and responses, each of dimension 77×1, with each cell holding a 10×10,000 array. I created options, layers, and hyperparameters, and executed

[net, info] = trainNetwork(sequences, responses, layers, options);

The network trained. Things worked.
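The layers and options aren't shown above, so for context, a minimal sequence-to-sequence regression setup that matches the 10×10,000 shape might look like this. The specific layer and option choices here are my assumption, not necessarily what was actually used.

% Assumed architecture: sequence-to-sequence regression on
% 10-channel sequences. The real layers/options aren't shown above.
layers = [
    sequenceInputLayer(10)
    lstmLayer(64, "OutputMode", "sequence")
    fullyConnectedLayer(10)
    regressionLayer];

options = trainingOptions("adam", ...
    "MaxEpochs", 30, ...
    "Shuffle", "every-epoch", ...
    "Verbose", false);

[net, info] = trainNetwork(sequences, responses, layers, options);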

I got more data, hundreds of pairs. I could still train the network, but I was rapidly approaching the memory limit on my HPC allocation. I wanted to use datastores.

S(cripts) 1
save('sequences.mat','sequences');
save('responses.mat','responses');

S2
AData = fileDatastore('sequences.mat','ReadFcn',@load);
BData = fileDatastore('responses.mat','ReadFcn',@load);
CData = combine(AData, BData);

… %stuff
[net, info] = trainNetwork(CData, layers, options);

Error using trainNetwork (line 184)
Invalid training data. Predictors must be a N-by-1 cell array of sequences, where N is the number of sequences. All sequences
must have the same feature dimension and at least one time step.

Error in S2 (line ##)
[net, info] = trainNetwork(CData, layers, options);

Or in English, it did not work.

Using preview, I got this:

ans =

1×2 cell array

{1×1 struct} {1×1 struct}

First, the load function creates a struct, so I needed a de-struct-ing function.

With that fixed, I got this.

preview(CData)

ans =

1×2 cell array

{259×1 cell} {259×1 cell}

Second, the combine function creates another cell array, meaning I had a 1×2 cell array (CData), each cell holding a 259×1 cell array (from AData and BData), and those cells in turn held the underlying arrays rather than exposing them directly.

The solution to the latter was saving each cell as an individual file with one variable per file, that variable being the 10×10,000 array itself, NOT A CELL.

S3
%file manip, mkdir, addpath, etc.

for n=1:length(sequences)
    sequence1 = sequences{n,1};
    response1 = responses{n,1};
    save(strcat('sequences',string(n),'.mat'),'sequence1');
    save(strcat('responses',string(n),'.mat'),'response1');
end

AND THEN running S4

%file manip, preprocessing, etc.

getVarFromStruct = @(strct,varName) strct.(varName);
xds = fileDatastore("sequences*.mat","ReadFcn",@(fname) getVarFromStruct(load(fname),"sequence1"),"FileExtensions",".mat");
yds = fileDatastore("responses*.mat","ReadFcn",@(fname) getVarFromStruct(load(fname),"response1"),"FileExtensions",".mat");

%options, layers, hp, etc.

CData = combine(xds, yds);
[net, info] = trainNetwork(CData, layers, options);

And it worked.

In the preview of CData up there, combine produced a cell array OF CELL ARRAYS. trainNetwork doesn't want cells of cells; it wants the data. That extra layer of cells caused all those errors.
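For contrast, after the per-file rewrite, each read from the combined datastore hands trainNetwork the raw arrays instead of nested cells. A quick sanity check along these lines (the exact printed display varies by Matlab version, so I'm only checking shapes and types):

% Each read from the combined datastore should now be a 1x2 cell
% whose entries are the raw 10x10,000 arrays, not nested cells.
CData = combine(xds, yds);
sample = preview(CData);
assert(iscell(sample) && numel(sample) == 2);
assert(isnumeric(sample{1}) && isnumeric(sample{2}));
size(sample{1})  % expect 10 by 10000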

That took me weeks, and someone else had to explain it.