The goal of TAGAM is to streamline the use of word embeddings in text analysis. The package takes raw textual descriptions and transforms them into numeric features for use in your analysis. It also includes functions for fitting a three-stage generalized additive model framework.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("scottmanski/TAGAM")

Text Conversion

To convert the words to vectors, we must download and format the word embeddings. Here, we choose to use the 300-dimensional GloVe word embeddings. The TAGAM::formatWordEmbeddings function takes the raw word-embedding matrix and converts it into a named list, where each entry is a numeric vector of length 300. For example, we would use word_embeddings[["the"]] to extract the numeric vector for the word the.

library(TAGAM)
temp <- tempfile()
download.file("http://nlp.stanford.edu/data/wordvecs/glove.6B.zip", temp)

embedding_matrix <- read.table(unz(temp, "glove.6B.300d.txt"), quote = "",
                               comment.char = "", stringsAsFactors = FALSE)
word_embeddings <- TAGAM::formatWordEmbeddings(embedding_matrix, normalize = TRUE)

Cosine Similarity

To measure the similarity between two words, we use the cosine similarity between the vector representations of the words. That is, for two words with vector representations a and b, respectively, the cosine similarity is defined as,
$$\textrm{sim}_{\textrm{cos}} ({\bf a}, {\bf b}) = \frac{{\bf a}\cdot {\bf b}}{\lVert {\bf a}\lVert_2\cdot\lVert{\bf b}\lVert_2}.$$
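As a concrete illustration of this formula (a base-R sketch only, not the TAGAM implementation), the cosine similarity of two numeric vectors can be computed directly:

```r
# Illustrative base-R sketch of cosine similarity (not TAGAM's internal code).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors: parallel vectors give similarity 1, orthogonal vectors give 0.
cosine_similarity(c(1, 0), c(2, 0))
#> [1] 1
cosine_similarity(c(1, 0), c(0, 1))
#> [1] 0
```

Note that when the embeddings are normalized to unit length (normalize = TRUE above), the denominator is 1 and the cosine similarity reduces to the dot product.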

word_embeddings <- TAGAM::formatWordEmbeddings(embedding_matrix_example, normalize = TRUE, verbose = TRUE)
#> Normalizing word embeddings.
#> Formatting word embeddings.

a <- "statistic"
b <- "statistics"
TAGAM::cs(a, b, word_embeddings)
#>            statistic
#> statistics 0.6047858

For a given phrase, let D = (b1, ..., bS), where each bs, s ∈ {1, ..., S}, is a word in the phrase. Then define the cosine similarity between a word a and a phrase D as
$$\textrm{sim}_{\textrm{cos}} ({\bf a}, D) = \underset{s=1,...,S}{\max}\left\{ \textrm{sim}_{\textrm{cos}} ({\bf a}, {\bf b}_s)\right\}.$$

In this way, we construct an n × pn matrix of cosine similarities X, where n is the number of observations and pn is the number of unique words used. Let W be the vector of unique words, with length pn, and let D be the list of descriptions, with length n, where each element Di of the list is a vector containing the words used in description i. Then
Xij = simcos(Wj, Di)    for    i ∈ {1, ..., n}, j ∈ {1, ..., pn}.

Each value in the matrix is now continuous and restricted to [-1, 1].
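The construction of X can be sketched in base R with made-up 2-dimensional embeddings (illustration only; TAGAM::cs.matrix performs this for real embeddings, as shown next):

```r
# Toy 2-D "embeddings" (made up for illustration; real GloVe vectors are 300-D).
emb <- list(statistics = c(1, 0), dog = c(0, 1), fluffy = c(0.6, 0.8))

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

W <- c("statistics", "dog")                     # unique words (length pn)
D <- list(c("dog", "fluffy"), c("statistics"))  # tokenized descriptions (length n)

# X[i, j] = max over words b in description i of cos_sim(W[j], b)
X <- sapply(W, function(w) {
  sapply(D, function(d) max(sapply(d, function(b) cos_sim(emb[[w]], emb[[b]]))))
})
X
#>      statistics dog
#> [1,]        0.6 1.0
#> [2,]        1.0 0.0
```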

library(tidyverse)
library(tidytext)
D <- c("Statistics is great!",
       "My dog is fluffy",
       "What is your favorite class?")
x <- tibble(line = 1:length(D), text = D) %>%
  unnest_tokens(word, text)

w <- c("statistics", "dog", "fluffy", "favorite", "class")

TAGAM::cs.matrix(x, words = w, word_embeddings)
#> 3 x 5 Matrix of class "dgeMatrix"
#>      statistics       dog     fluffy  favorite     class
#> [1,]  1.0000000 0.2670640 0.01818783 0.3341204 0.2700382
#> [2,]  0.1796945 1.0000000 1.00000000 0.4486072 0.2700382
#> [3,]  0.2170957 0.3788118 0.13183853 1.0000000 1.0000000