Symbolic sequence prediction with machine learning

Machine learning with symbols

Given a sequence of symbols, how would you use machine learning to predict the symbols that follow? An intuitive approach is to transform the symbols into numerical labels, choose an appropriate window size (lag) for the input features, and then pose the task as a classification problem. slearn builds a pipeline for this process and provides a user-friendly API.
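
The steps described above (label encoding, windowing, and framing the task as classification) can be sketched in plain Python. This is only an illustration of the idea, not slearn's internal code; the helper name make_windows is hypothetical.

```python
def make_windows(sequence, ws):
    """Turn a symbol sequence into (features, targets) for classification."""
    # Map each distinct symbol to an integer label, in sorted order.
    symbols = sorted(set(sequence))
    to_label = {s: i for i, s in enumerate(symbols)}
    labels = [to_label[s] for s in sequence]
    # Each feature row holds ws consecutive labels;
    # the target is the label that immediately follows the window.
    n_rows = len(labels) - ws
    features = [labels[i:i + ws] for i in range(n_rows)]
    targets = [labels[i + ws] for i in range(n_rows)]
    return features, targets, symbols


features, targets, symbols = make_windows("aaaabbbccd", ws=3)
for row, target in zip(features, targets):
    print(row, "->", target)
```

Each row of features paired with its target is one training example, which is the input shape a scikit-learn-style classifier expects in fit.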

First import the package:

from slearn import symbolicML

We can predict any symbolic sequence with one of the supported classifiers, most of which come from scikit-learn. slearn currently supports:

Classifier	classifier_name value
Multi-layer Perceptron	'MLPClassifier'
K-Nearest Neighbors	'KNeighborsClassifier'
Gaussian Naive Bayes	'GaussianNB'
Decision Tree	'DecisionTreeClassifier'
Support Vector Classification	'SVC'
Radial-basis Function Kernel	'RBF'
Logistic Regression	'LogisticRegression'
Quadratic Discriminant Analysis	'QuadraticDiscriminantAnalysis'
AdaBoost classifier	'AdaBoostClassifier'
Random Forest	'RandomForestClassifier'
LightGBM	'LGBM'

Now we predict a simple synthetic symbolic sequence:

string = 'aaaabbbccd'

First, we create the model, specifying ws (the window size, or lag) and a classifier_name from the table above:

sbml = symbolicML(classifier_name="MLPClassifier", ws=3, random_seed=0, verbose=0)

Then we use the encode method to split the sequence into features and targets for training, and the forecast method to predict the next symbols:

x, y = sbml.encode(string)
pred = sbml.forecast(x, y, step=5, hidden_layer_sizes=(10,10), learning_rate_init=0.1)

The parameters x, y, and step are common to every classifier; the remaining keyword arguments depend on the classifier you specify, and their settings are described in the scikit-learn documentation. For a neural network you can set hidden_layer_sizes and learning_rate_init, while for a support vector machine you might set C.
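
One common way to forecast several steps ahead with a windowed classifier is to feed each prediction back into the input window. The sketch below illustrates that recursion under loud assumptions: a simple frequency table stands in for the classifier, the function names are hypothetical, and whether slearn's forecast works exactly this way is not confirmed by this document.

```python
from collections import Counter, defaultdict


def fit_window_table(symbols, ws):
    """Count which symbol follows each window of ws symbols (stand-in classifier)."""
    table = defaultdict(Counter)
    for i in range(len(symbols) - ws):
        table[tuple(symbols[i:i + ws])][symbols[i + ws]] += 1
    return table


def recursive_forecast(symbols, ws, step, table):
    """Predict `step` future symbols, feeding each prediction back in."""
    history = list(symbols)
    for _ in range(step):
        window = tuple(history[-ws:])
        if window in table:
            # Most frequent successor seen in training for this window.
            nxt = table[window].most_common(1)[0][0]
        else:
            # Unseen window: fall back to repeating the last symbol.
            nxt = history[-1]
        history.append(nxt)
    # Return only the newly predicted symbols.
    return history[len(symbols):]


seq = list("abcabcab")
table = fit_window_table(seq, ws=2)
forecast = recursive_forecast(seq, ws=2, step=3, table=table)
print("".join(forecast))
```

Because every predicted symbol becomes part of the next window, early mistakes propagate, so long forecasts are usually less reliable than short ones.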

Generating symbols

The slearn library also contains functions for generating strings of tunable complexity, using LZW compression as a basis for approximating Kolmogorov complexity.

from slearn import *
df_strings = LZWStringLibrary(symbols=3, complexity=[3, 9])
df_strings
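
To give an intuition for how LZW compression can serve as a complexity proxy, the sketch below counts the codes that a textbook LZW encoder emits for a string: repetitive strings compress into fewer codes than varied ones. This is a generic illustration with a hypothetical function name, not the measure slearn computes internally.

```python
def lzw_complexity(string):
    """Return the number of codes a basic LZW encoder emits for `string`."""
    # Start with a dictionary containing only the single symbols present.
    dictionary = {s: i for i, s in enumerate(sorted(set(string)))}
    current = ""
    codes = []
    for symbol in string:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending the phrase while it is already known.
            current = candidate
        else:
            # Emit the code for the known phrase and learn the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = symbol
    if current:
        codes.append(dictionary[current])
    return len(codes)


print(lzw_complexity("aaaaaaaa"))  # highly repetitive string
print(lzw_complexity("abcdefgh"))  # no repeated phrases
```

Under this measure the repetitive string needs fewer codes than the string with no repeated phrases, which matches the intuition that it is less complex.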

You can also run RNN tests on the strings you generate:

import pandas as pd

frames = []
for i in range(len(df_strings)):
    # Every column except the last is passed on as keyword arguments;
    # the last column holds the generated string itself.
    kwargs = df_strings.iloc[i, :-1].to_dict()
    seed_string = df_strings.iloc[i, -1]
    df_iter = RNN_Iteration(seed_string, iterations=2, architecture='LSTM', **kwargs)
    # Record the string's properties alongside the results of each run.
    for key, value in kwargs.items():
        df_iter[key] = value
    frames.append(df_iter)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end.
df_iters = pd.concat(frames, ignore_index=True)
print(df_iters)