Creates the design matrix of cosine similarities from textual observations and a vector of words.

cs.matrix(
  x,
  words,
  word_embeddings,
  method = "max",
  parallel = FALSE,
  n.cluster = NULL,
  sparse = FALSE
)

Arguments

x

a tibble containing 2 columns: line and word. The 'line' column contains the number of the observation in which the word from the 'word' column appears. See 'Examples'.

words

a character vector of words that will represent the columns of the resulting matrix.

word_embeddings

named list of word embeddings. See formatWordEmbeddings.

method

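aggregation function applied, for each column, to the cosine similarities between that column's word and the words of an observation. One of "max", "sum", or "mean". See 'Details'.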

parallel

logical, indicating if the matrix should be calculated in parallel.

n.cluster

integer, the number of clusters to use if parallel=TRUE.

sparse

logical, indicating if a sparse matrix should be returned.

Value

a (sparse) matrix of cosine similarities

Details

A function to create a design matrix of cosine similarities from textual observations and a vector of words. The resulting matrix has one row per observation and one column per word, i.e. it is of dimension \(length(unique(x$line)) \times length(words)\).

Consider 2 words with word embedding representations \(a\) and \(b\). Then the cosine similarity is defined as $$\mathrm{sim}_{\cos}(a,b)=\frac{a \cdot b}{\| a \|_2 \, \| b \|_2}$$
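The definition above can be checked numerically with a few lines of base R. This is a minimal sketch, not the package's internal code: the helper name cosine_similarity and the toy vectors are assumptions made for illustration.

```r
# Hypothetical helper: cosine similarity between two numeric vectors
# of equal length, following the definition above.
cosine_similarity <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embeddings (made up for illustration)
a <- c(1, 0, 1)
b <- c(1, 1, 0)

cosine_similarity(a, b)              # 1 / (sqrt(2) * sqrt(2)) = 0.5
cosine_similarity(c(1, 0), c(0, 1))  # orthogonal vectors give 0
```

Because the denominator rescales both vectors to unit length, the result depends only on the angle between the embeddings, not on their magnitudes.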

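If method = "max", then for a given line with \(m\) words, the entry in that line's row and in the column of the \(j\)-th word of words is \(\max_{i=1,\ldots,m} \mathrm{sim}_{\cos}(a_j, b_i)\), where \(a_j\) is the embedding of the \(j\)-th word of words and \(b_i\) is the embedding of the \(i\)-th word in the line. method = "sum" and method = "mean" are defined analogously, with the maximum replaced by the sum or the mean over \(i\).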
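The aggregation described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation used by cs.matrix: the function cs_matrix_sketch, the toy two-dimensional embeddings, and the use of a plain data frame in place of a tibble are all hypothetical, and the sketch assumes every word has an embedding.

```r
# Hypothetical sketch of the aggregation; not the package's own code.
# Assumes every word in `x$word` and `words` is a name in `embeddings`.
cs_matrix_sketch <- function(x, words, embeddings,
                             method = c("max", "sum", "mean")) {
  method <- match.arg(method)
  agg <- switch(method, max = max, sum = sum, mean = mean)
  cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

  lines <- sort(unique(x$line))
  out <- matrix(NA_real_, nrow = length(lines), ncol = length(words),
                dimnames = list(as.character(lines), words))

  for (r in seq_along(lines)) {
    line_words <- x$word[x$line == lines[r]]
    for (j in seq_along(words)) {
      # Similarity of the j-th column word to each word in the line
      sims <- vapply(
        line_words,
        function(w) cos_sim(embeddings[[words[j]]], embeddings[[w]]),
        numeric(1)
      )
      out[r, j] <- agg(sims)
    }
  }
  out
}

# Toy two-dimensional embeddings (made up for illustration)
embeddings <- list(
  stats = c(1, 0), cat = c(0, 1),
  statistics = c(3, 4), great = c(1, 0),
  dog = c(0, 2), fluffy = c(4, 3)
)
x <- data.frame(
  line = c(1, 1, 2, 2),
  word = c("statistics", "great", "dog", "fluffy"),
  stringsAsFactors = FALSE
)

cs_matrix_sketch(x, words = c("stats", "cat"), embeddings, method = "max")
cs_matrix_sketch(x, words = c("stats", "cat"), embeddings, method = "mean")
```

With these toy values, row 1 under method = "max" is (1, 0.8) and under method = "mean" is (0.8, 0.4); the result has length(unique(x$line)) rows and length(words) columns.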

References

Goldberg, Y. (2017) Neural Network Methods for Natural Language Processing. San Rafael, CA: Morgan & Claypool Publishers.

Examples

if (FALSE) {
  require(dplyr)
  require(tidytext)

  # Convert the example embedding matrix to a named list of embeddings
  word_embeddings <- formatWordEmbeddings(embedding_matrix_example, normalize = TRUE)

  sentences <- data.frame(
    "Description" = c("Statistics is great!", "My dog is fluffy.", "What is your favorite class?"),
    stringsAsFactors = FALSE
  )

  # One row per (line, word) pair
  x <- tibble(line = 1:nrow(sentences), text = sentences$Description) %>%
    unnest_tokens(word, text)

  cs.matrix(x, words = c("stats", "cat"), word_embeddings)
}