Feature Extraction and scikit-learn
I highly recommend using conda for managing environments.
conda create -n myenv python=3
conda activate myenv
pip install scikit-learn
Official scikit-learn tutorials
Consistent interface:
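from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(trainX, trainY)        # train on feature vectors and their labels
output = clf.predict(testX)    # predict labels for unseen feature vectors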
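You can easily swap out models; you only need to import the class you want:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
# pick one of:
clf = DecisionTreeClassifier()
clf = MultinomialNB()
clf = RandomForestClassifier()
clf = MLPClassifier(solver='adam', hidden_layer_sizes=(10, 10))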
See all the models scikit-learn supports here.
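As a rough sketch of what the shared interface buys you, the loop below trains several classifiers with identical code. The toy data and the names `train_X`, `train_y`, `test_X`, and `predictions` are made up for illustration, and `n_estimators=10` and `random_state=0` are arbitrary choices to keep the run small and repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): 4 examples, 3 count-style features each.
train_X = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]]
train_y = [0, 1, 0, 1]
test_X = [[1, 0, 1], [0, 4, 0]]

models = [
    DecisionTreeClassifier(random_state=0),
    MultinomialNB(),  # expects non-negative features such as counts
    RandomForestClassifier(n_estimators=10, random_state=0),
]

predictions = {}
for clf in models:
    # Every estimator exposes the same fit/predict methods.
    clf.fit(train_X, train_y)
    predictions[type(clf).__name__] = clf.predict(test_X).tolist()

for name, output in predictions.items():
    print(name, output)
```

Only the list of models changes when you want to try a different classifier; the training and prediction code stays the same.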
The input $X$ is a 2D array with one row per example and one column per feature; $y$ is a 1D array with one label per row of $X$.
X = [[1, 2, 3, 4],
     [2, 4, 1, 6],
     [3, 2, 0, 1]]
y = [0, 1, 0]
Each $x \in X$ is a feature vector. Your job in Homework 1 is to do feature extraction. These models can't understand raw text, so you need to convert it into something they can understand, i.e. a vector of numbers.
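Here is a minimal sketch of turning raw text into a feature vector. The function name `extract_features` and the particular features (lengths, digit counts, two simple patterns) are assumptions chosen for illustration, not the features the homework expects.

```python
import re

def extract_features(line):
    """Map one line of raw text to a fixed-length list of numbers."""
    return [
        len(line),                             # line length in characters
        len(line.split()),                     # number of whitespace-separated tokens
        sum(ch.isdigit() for ch in line),      # number of digit characters
        int(bool(re.search(r"\d{4}", line))),  # 1 if a run of 4 digits (e.g. a year) appears
        int(line.rstrip().endswith("?")),      # 1 if the line ends with a question mark
    ]

lines = ["How are you today?", "Published in 2019 by ACME"]
X = [extract_features(line) for line in lines]
print(X)  # [[18, 4, 0, 0, 1], [25, 5, 4, 1, 0]]
```

Every line maps to a vector of the same length, so the resulting `X` has the 2D shape that `fit` and `predict` expect.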
Feature extraction for NLP tasks
Homework 1 focuses on feature extraction. You can use structural features (e.g. line length) or pattern matching with regex. Here are some other techniques for extracting features (will help with Homework 2):
Bag of words: convert each sentence to a vector whose length is the size of your vocabulary (the set of unique words). Each value in the vector is the term frequency (count) of the corresponding word in the sentence.
"the cat ate the rat" becomes [..., 1, ..., 1, ..., 2, ..., 1, ...]
Lowercase everything: can cut down on your vocabulary size, but may conflate words
"The cat ate the rat" $=$ "the cat ate the rat"
"Apple" $=$ "apple"
Remove stopwords, common words that don't contribute much to the sentence. You can get stopwords from NLTK or find a list online.
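import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = stopwords.words('english')
"the cat ate the rat" $=$ "a cat ate a rat" (both reduce to "cat ate rat")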
n-grams: moving window of words/characters
def ngrams(seq, n):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]
print(ngrams(['how', 'are', 'you', 'today', '?'], 2)) # [['how', 'are'], ['are', 'you'], ['you', 'today'], ['today', '?']]
print(ngrams('information', 3)) # ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tio', 'ion']
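The techniques in this list can be combined into a single feature extractor. The sketch below lowercases, drops stopwords, adds word bigrams, and builds count vectors using only the standard library. The three-word `STOP_WORDS` set and the helper names `tokenize`, `extract_terms`, `build_vocabulary`, and `vectorize` are assumptions for illustration; in practice you would likely use a fuller stopword list such as NLTK's.

```python
from collections import Counter

# Tiny hand-picked stopword set, an assumption for illustration only.
STOP_WORDS = {"a", "an", "the"}

def ngrams(seq, n):
    """Moving window of length n over a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in sentence.lower().split() if t not in STOP_WORDS]

def extract_terms(sentence):
    """Unigrams plus word bigrams (joined with a space) for one sentence."""
    tokens = tokenize(sentence)
    return tokens + [" ".join(g) for g in ngrams(tokens, 2)]

def build_vocabulary(sentences):
    """Assign each unique term an index, in sorted order."""
    terms = sorted({t for s in sentences for t in extract_terms(s)})
    return {term: i for i, term in enumerate(terms)}

def vectorize(sentence, vocab):
    """Bag-of-words count vector; terms outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for term, count in Counter(extract_terms(sentence)).items():
        if term in vocab:
            vec[vocab[term]] = count
    return vec

sentences = ["The cat ate the rat", "A dog ate a cat"]
vocab = build_vocabulary(sentences)
X = [vectorize(s, vocab) for s in sentences]

print(list(vocab))  # ['ate', 'ate cat', 'ate rat', 'cat', 'cat ate', 'dog', 'dog ate', 'rat']
print(X)            # [[1, 0, 1, 1, 1, 0, 0, 1], [1, 1, 0, 1, 0, 1, 1, 0]]
```

Build the vocabulary from the training sentences only and reuse it at test time, so that training and test vectors have the same columns.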
For more, see the lecture on term vocabulary.
Saving your model
If your model takes a while to train (probably not the case for Homework 1), you might want to save the trained model to a file and load it at test time.
import joblib
# save model
joblib.dump(clf, 'model.joblib')
# load model
clf = joblib.load('model.joblib')