Creates the design matrix of cosine similarities from textual observations and a vector of words.
cs.matrix( x, words, word_embeddings, method = "max", parallel = FALSE, n.cluster = NULL, sparse = FALSE )
Argument | Description
---|---
`x` | a tibble containing 2 columns, `line` and `word`. The `line` column contains the observation number in which the word from the `word` column appears. See 'Examples'.
`words` | a character vector of words that will form the columns of the resulting matrix.
`word_embeddings` | a named list of word embeddings, such as that returned by `formatWordEmbeddings`.
`method` | the function to apply across each column; one of `"max"` (the default), `"sum"`, or `"mean"`.
`parallel` | logical, indicating whether the matrix should be calculated in parallel.
`n.cluster` | integer, the number of clusters to use when `parallel = TRUE`.
`sparse` | logical, indicating whether a sparse matrix should be returned.
a (sparse) matrix of cosine similarities
A function to create a design matrix of cosine similarities from textual observations and a vector of words. The resulting matrix has dimension \(length(unique(x$line)) \times length(words)\): one row per observation and one column per word in `words`.
Consider two words with word embedding representations \(a\) and \(b\). Their cosine similarity is defined as $$\mathrm{sim}_{\cos}(a, b) = \frac{a \cdot b}{\| a \|_2 \, \| b \|_2}.$$
If `method = "max"`, then for a given line containing \(m\) words with embeddings \(b_1, \dots, b_m\), the entry in the column for word \(j\) (with embedding \(a_j\)) is \(\max_{i=1,\dots,m} \mathrm{sim}_{\cos}(a_j, b_i)\). `method = "sum"` and `method = "mean"` are defined in a similar fashion, replacing the maximum with the sum and the mean of the \(m\) similarities, respectively.
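As a language-agnostic sketch of the computation above (not the package's implementation), the following Python/numpy snippet computes the matrix entry for one line and one column word under each `method`. The three-dimensional embedding vectors are made up for illustration only.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (illustrative values, not from a real model).
embeddings = {
    "stats": np.array([0.9, 0.1, 0.0]),
    "is":    np.array([0.1, 0.8, 0.3]),
    "great": np.array([0.4, 0.4, 0.6]),
}

def cos_sim(a, b):
    # sim_cos(a, b) = (a . b) / (||a||_2 * ||b||_2)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One "line" with m = 2 words, compared against a single column word a_j.
line_words = ["is", "great"]
a_j = embeddings["stats"]
sims = [cos_sim(a_j, embeddings[w]) for w in line_words]

# The matrix entry for this (line, column-word) pair under each method:
entry_max  = max(sims)               # method = "max"
entry_sum  = sum(sims)               # method = "sum"
entry_mean = sum(sims) / len(sims)   # method = "mean"
```

The full design matrix simply repeats this per-entry computation over every observation in `x` and every word in `words`.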
Goldberg, Y. (2017) Neural Network Methods for Natural Language Processing. San Rafael, CA: Morgan & Claypool Publishers.
if (FALSE) {
  require(dplyr)
  require(tidytext)

  word_embeddings <- formatWordEmbeddings(embedding_matrix_example,
                                          normalize = TRUE)

  sentences <- data.frame(
    "Description" = c("Statistics is great!",
                      "My dog is fluffy.",
                      "What is your favorite class?"),
    stringsAsFactors = FALSE
  )

  x <- tibble(line = 1:nrow(sentences), text = sentences$Description) %>%
    unnest_tokens(word, text)

  cs.matrix(x, words = c("stats", "cat"), word_embeddings)
}