Cosine Similarity Matrix

Creates the design matrix of cosine similarities from textual observations and a vector of words.

cs.matrix(
  x,
  words,
  word_embeddings,
  method = "max",
  parallel = FALSE,
  n.cluster = NULL,
  sparse = FALSE
)

Arguments

x	a tibble with two columns, `line` and `word`. The `line` column gives the number of the observation in which the word in the `word` column appears. See 'Examples'.
words	a character vector of words that will represent the columns of the resulting matrix.
word_embeddings	named list of word embeddings. See `formatWordEmbeddings`.
method	function used to aggregate the cosine similarities within each line, giving one value per column. Options include `c("max", "sum", "mean")`.
parallel	logical, indicating if the matrix should be calculated in parallel.
n.cluster	integer, the number of clusters to use if `parallel=TRUE`.
sparse	logical, indicating if a sparse matrix should be returned.

Value

a (sparse) matrix of cosine similarities

Details

A function to create a design matrix of cosine similarities from textual observations and a vector of words. The resulting matrix has `length(unique(x$line))` rows and `length(words)` columns.

Consider two words with word embedding representations $a$ and $b$. Their cosine similarity is defined as $$\mathrm{sim}_{\cos}(a,b)=\frac{a \cdot b}{\| a \|_2 \, \| b \|_2}.$$
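The definition above can be checked by hand on small vectors. The following is a minimal sketch in base R; `cosine_sim` is a hypothetical helper written for illustration and is not a function exported by this package.

```r
# Hypothetical helper (not part of the package): cosine similarity of two
# numeric vectors of equal length, following the formula above.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors with made-up values.
a <- c(1, 0, 1)
b <- c(1, 1, 0)

# Dot product is 1 and both norms are sqrt(2), so the similarity is 1 / 2.
sim_ab <- cosine_sim(a, b)
print(sim_ab)
```

If the embeddings are already scaled to unit length (for instance, presumably, via `normalize = TRUE` in `formatWordEmbeddings`), the denominator is 1 and the cosine similarity reduces to the dot product.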

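If `method = "max"`, then for a given line with $m$ words, the entry in column $j$ of that line's row is $\max_{i=1,\dots,m} \mathrm{sim}_{\cos}(a_j, b_i)$, where $a_j$ is the embedding of the $j$-th element of `words` and $b_i$ is the embedding of the $i$-th word in the line. `method = "sum"` and `method = "mean"` are defined analogously.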
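To make the aggregation concrete, here is a rough base-R sketch of the computation described above. It is not the package's implementation: `cosine_sim` and `aggregate_sims` are hypothetical helpers, the embeddings are made-up two-dimensional vectors, and the aggregation is passed as a function (`max`, `sum`, `mean`) rather than as a string, purely for brevity.

```r
# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Made-up embeddings, stored as a named list of numeric vectors.
emb <- list(
  stats = c(1, 0), cat = c(0, 1),
  statistics = c(1, 0), is = c(1, 1), great = c(-1, 0), dog = c(0, 1)
)

# Tokenised input in the same shape as `x`: one row per (line, word) pair.
x <- data.frame(
  line = c(1, 1, 1, 2, 2),
  word = c("statistics", "is", "great", "dog", "is"),
  stringsAsFactors = FALSE
)
words <- c("stats", "cat")

# Hypothetical sketch: one row per unique line, one column per target word.
aggregate_sims <- function(x, words, emb, method = max) {
  lines <- unique(x$line)
  out <- matrix(NA_real_, nrow = length(lines), ncol = length(words),
                dimnames = list(as.character(lines), words))
  for (r in seq_along(lines)) {
    line_words <- x$word[x$line == lines[r]]
    for (j in seq_along(words)) {
      # Similarity of target word j to every word in this line.
      sims <- vapply(line_words,
                     function(w) cosine_sim(emb[[words[j]]], emb[[w]]),
                     numeric(1))
      out[r, j] <- method(sims)
    }
  }
  out
}

res_max <- aggregate_sims(x, words, emb, method = max)
res_mean <- aggregate_sims(x, words, emb, method = mean)
print(res_max)
```

In this toy input, line 1 contains a word whose embedding coincides with that of "stats", so the "max" entry for that column is 1, while the "mean" entry is pulled down by the other words in the line. The sketch assumes every word has an embedding; how the package handles words missing from `word_embeddings` is not covered here.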

References

Goldberg, Y. (2017) Neural Network Methods for Natural Language Processing. San Rafael, CA: Morgan & Claypool Publishers.

Examples

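if (FALSE) {
library(dplyr)
library(tidytext)

# Convert the example embedding matrix into a named list of word vectors
word_embeddings <- formatWordEmbeddings(embedding_matrix_example, normalize = TRUE)

# Three short text observations
sentences <- data.frame("Description" = c("Statistics is great!",
                                          "My dog is fluffy.",
                                          "What is your favorite class?"),
                        stringsAsFactors = FALSE)

# Tokenise into one row per (line, word) pair, the format expected for `x`
x <- tibble(line = seq_len(nrow(sentences)), text = sentences$Description) %>%
  unnest_tokens(word, text)

# One row per line, one column per target word
cs.matrix(x, words = c("stats", "cat"), word_embeddings)
}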
