The goal of TAGAM is to streamline the use of word embeddings in text analysis. The package takes raw textual descriptions and transforms them into numeric features for use in your analysis. It also includes functions for fitting a three-stage generalized additive model framework.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("scottmanski/TAGAM")

Text Conversion

To convert the words to vectors, we must download and format the word embeddings. Here, we choose to use the 300-dimensional GloVe word embeddings. The TAGAM::formatWordEmbeddings function takes the raw word-embedding matrix and converts it into a named list, where each entry is a numeric vector of length 300. For example, we would use word_embeddings[["the"]] to extract the numeric vector for the word the.

library(TAGAM)
temp <- tempfile()
download.file("http://nlp.stanford.edu/data/wordvecs/glove.6B.zip", temp)

embedding_matrix <- read.table(unz(temp, "glove.6B.300d.txt"), quote = "",
                               comment.char = "", stringsAsFactors = FALSE)
word_embeddings <- TAGAM::formatWordEmbeddings(embedding_matrix, normalize = TRUE)

Cosine Similarity

To measure the similarity between two words, we use the cosine similarity between the vector representations of the words. That is, for two words with vector representations a and b, respectively, the cosine similarity is defined as,
$$\textrm{sim}_{\textrm{cos}} ({\bf a}, {\bf b}) = \frac{{\bf a}\cdot {\bf b}}{\lVert {\bf a}\lVert_2\cdot\lVert{\bf b}\lVert_2}.$$
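As a concrete illustration of this formula (a base-R sketch only, not the TAGAM implementation), the cosine similarity of two numeric vectors can be computed directly:

```r
# Illustrative base-R sketch of cosine similarity (not TAGAM's internal code).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors: parallel vectors give similarity 1, orthogonal vectors give 0.
cosine_similarity(c(1, 0), c(2, 0))
#> [1] 1
cosine_similarity(c(1, 0), c(0, 1))
#> [1] 0
```

Note that when the embeddings are normalized to unit length (normalize = TRUE above), the denominator is 1 and the cosine similarity reduces to the dot product.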

word_embeddings <- TAGAM::formatWordEmbeddings(embedding_matrix_example, normalize = TRUE, verbose = TRUE)
#> Normalizing word embeddings.
#> Formatting word embeddings.

a <- "statistic"
b <- "statistics"
TAGAM::cs(a, b, word_embeddings)
#>            statistic
#> statistics 0.6047858

For a given phrase, let D = (b1, ..., bS), where each bs, s ∈ {1, ..., S}, is a word in the phrase. Then define the cosine similarity between a word a and a phrase D as
$$\textrm{sim}_{\textrm{cos}} ({\bf a}, D) = \underset{s=1,...,S}{\max}\left\{ \textrm{sim}_{\textrm{cos}} ({\bf a}, {\bf b}_s)\right\}.$$

In this way, we construct an n × pn matrix of cosine similarities X, where n is the number of observations and pn is the number of unique words used. Let W be the vector of unique words, with length pn, and let D be the list of descriptions, with length n, where each element Di of the list is a vector containing the words used in description i. Then
Xij = simcos(Wj, Di)    for    i ∈ {1, ..., n}, j ∈ {1, ..., pn}.

Each value in the matrix is now continuous and restricted to [-1, 1].
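The construction of X can be sketched in base R with made-up 2-dimensional embeddings (illustration only; TAGAM::cs.matrix performs this for real embeddings, as shown next):

```r
# Toy 2-D "embeddings" (made up for illustration; real GloVe vectors are 300-D).
emb <- list(statistics = c(1, 0), dog = c(0, 1), fluffy = c(0.6, 0.8))

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

W <- c("statistics", "dog")                     # unique words (length pn)
D <- list(c("dog", "fluffy"), c("statistics"))  # tokenized descriptions (length n)

# X[i, j] = max over words b in description i of cos_sim(W[j], b)
X <- sapply(W, function(w) {
  sapply(D, function(d) max(sapply(d, function(b) cos_sim(emb[[w]], emb[[b]]))))
})
X
#>      statistics dog
#> [1,]        0.6 1.0
#> [2,]        1.0 0.0
```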

library(tidyverse)
library(tidytext)
D <- c("Statistics is great!",
       "My dog is fluffy",
       "What is your favorite class?")
x <- tibble(line = 1:length(D), text = D) %>%
  unnest_tokens(word, text)

w <- c("statistics", "dog", "fluffy", "favorite", "class")

TAGAM::cs.matrix(x, words = w, word_embeddings)
#> 3 x 5 Matrix of class "dgeMatrix"
#>      statistics       dog     fluffy  favorite     class
#> [1,]  1.0000000 0.2670640 0.01818783 0.3341204 0.2700382
#> [2,]  0.1796945 1.0000000 1.00000000 0.4486072 0.2700382
#> [3,]  0.2170957 0.3788118 0.13183853 1.0000000 1.0000000