Feature Extraction and scikit-learn

I highly recommend using conda for managing environments.

conda create -n myenv python=3
conda activate myenv
pip install scikit-learn

Official scikit-learn tutorials

Consistent interface:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(trainX, trainY)
output = clf.predict(testX)

You can easily swap out models! (just need to import)

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

clf = DecisionTreeClassifier()
clf = MultinomialNB()
clf = RandomForestClassifier()
clf = MLPClassifier(solver='adam', hidden_layer_sizes=(10, 10))

See all the models scikit-learn supports here.

The input $X$ is a 2D array with one row per example; $y$ is a 1D array with one label per example.

X = [[1,2,3,4],
     [2,4,1,6],
     [3,2,0,1]]
y = [0,1,0]

Each $x \in X$ is a feature vector. Your job in homework 1 is to do feature extraction. These models can't understand raw text, so you need to convert it into something they can understand, i.e. a vector of numbers.
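Putting the pieces together, a minimal end-to-end sketch using the toy $X$ and $y$ above (the feature values here are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix: 3 examples, 4 numeric features each
X = [[1, 2, 3, 4],
     [2, 4, 1, 6],
     [3, 2, 0, 1]]
y = [0, 1, 0]  # one label per example

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict labels for feature vectors at test time
print(clf.predict([[1, 2, 3, 4], [2, 4, 1, 6]]))  # [0 1]
```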

Feature extraction for NLP tasks

Homework 1 focuses on feature extraction. You can use structural features (e.g. line length) or pattern matching with regex. Here are some other techniques for extracting features (will help with Homework 2):

Bag of words: convert each sentence to a vector the size of your vocabulary (unique words). The values in the vector are the term frequencies (counts) for that word.

"the cat ate the rat" becomes [..., 1, ..., 1, ..., 2, ..., 1, ...]

Lowercase everything: can cut down on your vocabulary size, but may conflate words

"The cat ate the rat" $=$ "the cat ate the rat"

"Apple" $=$ "apple"

Remove stopwords: common words that don't contribute much to the sentence. You can get a stopword list from NLTK or find one online.

import nltk
nltk.download('stopwords')  # one-time download
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

"the cat ate the rat" $=$ "a cat ate a rat"

n-grams: moving window of words/characters

def ngrams(seq, n):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(ngrams(['how', 'are', 'you', 'today', '?'], 2))  # [['how', 'are'], ['are', 'you'], ['you', 'today'], ['today', '?']]
print(ngrams('information', 3))  # ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tio', 'ion']

For more, see the lecture on term vocabulary.

Saving your model

If your model takes a while to train (probably not for hw1), you might want to save the trained model to a file and load it at test time.

import joblib

# save model
joblib.dump(clf, 'model.joblib')

# load model
clf = joblib.load('model.joblib')