We assume that you’ve made a local copy of http://cs.jhu.edu/~jason/465/hw-prob/ (for example, by
downloading and unpacking the zipfile there) and that you’re currently
in the code/ subdirectory.
If you’re working on the ugrad network, activate a
“conda environment” that has the Python packages you’ll need:
conda activate nlp-class
If this worked, then your prompt should be prefixed by the environment name, like this:
(nlp-class) arya@ugradx:~/hw-prob/code$
This means that various third-party packages are now available for you to “import” in your Python scripts. It also guarantees that you’re using the same versions of everything as the autograder. To return to the default environment, do
conda deactivate
Alternatively, you can set this up on your own machine by installing Miniconda, a tool for installing and managing Python environments. Miniconda and its big sibling Anaconda (which has swallowed more Python packages) are all the rage in NLP and deep learning. Install Miniconda by following the platform-specific instructions here. Then create the environment with
conda env create -f nlp-class.yml
after which you can activate the environment as explained above.
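To sanity-check the new environment, you can start Python inside it and try importing PyTorch (assuming the environment provides it, as the next section suggests):

import torch                  # should succeed inside the nlp-class environment
print(torch.__version__)      # shows which PyTorch version was installed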
HW2 and some subsequent assignments rely on PyTorch, a vector math library that has become popular in the deep learning community. Since you may not have seen or used it before, this note serves as a whirlwind intro. It assumes familiarity with Python, but it won’t teach you about neural networks, which this library makes easier to build.
You can start by reading PyTorch’s own tutorial. For HW2, you can skip the section “Computation Graphs and Automatic Differentiation”, although you’ll need it for HW3.
PyTorch Tensor objects are more like the fixed-length
arrays from C++ or Java than the dynamically sized
ArrayList and list classes in Java and Python.
It’s slow to append elements to a tensor (e.g. using
torch.concat), because this is a non-destructive operation
that allocates a new, larger tensor and copies all the old elements
over. So, if you’re reading in an embedding matrix from a file, either
read it into a dynamic structure and then convert to a tensor, or else
pre-allocate a tensor of the final size and fill it row-by-row.
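Here is a minimal sketch of both strategies (the filename embeddings.txt and the whitespace-separated file format are made up for illustration):

import torch

# Strategy 1: accumulate rows in a Python list, then convert once.
rows = []
with open("embeddings.txt") as f:
    for line in f:
        rows.append([float(x) for x in line.split()])
matrix = torch.tensor(rows)   # one allocation and copy at the end

# Strategy 2: pre-allocate if you know the final size, then fill row by row.
n, d = len(rows), len(rows[0])
matrix2 = torch.empty(n, d)
for i, row in enumerate(rows):
    matrix2[i] = torch.tensor(row)

assert (matrix == matrix2).all()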
Another thing to keep in mind (once you’ve looked at the tutorial) is the relationship between vectorized operations (which are fast for several reasons) and their looping counterparts (which are slooooow). PyTorch can apply an operation to a whole tensor at once, and can even “broadcast” an operation across tensors of different but compatible shapes.
import torch
a = torch.tensor([0.1, 0.2, 0.3, 0.4])
b = torch.tensor([11.2, 33.4, 55.6, 77.8])
# Addition: The looping version
c = torch.empty_like(a) # same size as a
for j in range(len(c)):
    c[j] = a[j] + b[j]
# Addition: The vectorized version
d = a + b
# Sanity check
assert (c == d).all() # All entries match.
Another example, using a custom function:
e = torch.tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
def complicated_function(a):
    """f(a) = 1 / (1 + e^(-a))"""
    return 1 / (1 + a.neg().exp())
# Complicated function: The looping version
f = torch.empty_like(e)
for j in range(len(e)):
    for k in range(len(e[0])):  # Length of the next dimension
        f[j, k] = complicated_function(e[j, k])
# Complicated function: The vectorized version
g = complicated_function(e)
# Sanity check
assert (f == g).all()
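Vectorized operations are also where broadcasting comes in. A small sketch, reusing the 2×3 tensor e from above:

row = torch.tensor([10.0, 20.0, 30.0])  # shape (3,)
h = e + row            # row is added to each of e's two rows
assert h.shape == e.shape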
(You should probably use more descriptive variable names in real code.)
Useful debugging tools:
my_tensor.shape     # Is it the size I thought?
my_tensor.dtype     # Did I accidentally store ints?
type(my_variable)   # Generic; works on all types.
                    # Provides less info on Tensor objects, though ...
breakpoint()        # Pauses here and drops you into the pdb debugger.
log.debug(f"{my_var=}")
log.debug(f"{some_expression(involving_some+values)=}")
                    # Easiest way to construct a message; it’ll even
                    # include the name of the variable or expression
                    # alongside its value.
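The log.debug calls above assume that you’ve created a logger; here’s one minimal way to set that up with Python’s standard logging module (the logger name is arbitrary):

import logging
logging.basicConfig(level=logging.DEBUG)   # actually display DEBUG-level messages
log = logging.getLogger("findsim")

my_var = 42
log.debug(f"{my_var=}")   # logs "my_var=42"; the = syntax needs Python >= 3.8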
You’ll complete the findsim.py starter code, which imports PyTorch as well as our Integerizer class (implemented and documented in integerize.py).