source package¶
Submodules¶
source.cnn_embeddings module¶
source.emb_entropy module¶
Analytical vs. estimated values are illustrated for normal random variables.
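A minimal sketch of that comparison, assuming a 1-D normal variable and SciPy's scipy.stats.differential_entropy as the estimator (the module itself presumably relies on ITE estimators rather than SciPy); the closed-form entropy of a normal distribution with variance sigma^2 is 0.5*ln(2*pi*e*sigma^2):
>>> import numpy as np
>>> from scipy.stats import differential_entropy
>>> sigma = 2.0
>>> samples = np.random.default_rng(0).normal(scale=sigma, size=10000)
>>> analytical = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)  # closed-form entropy
>>> estimated = differential_entropy(samples)                 # sample-based estimate
>>> bool(abs(analytical - estimated) < 0.1)  # random
True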
- source.emb_entropy.multivariate_normal(mean, cov, size=None, check_valid='warn', tol=1e-8)¶
Draw random samples from a multivariate normal distribution.
The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.
- Parameters
mean : 1-D array_like, of length N
Mean of the N-dimensional distribution.
cov : 2-D array_like, of shape (N, N)
Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
size : int or tuple of ints, optional
Given a shape of, for example, (m,n,k), m*n*k samples are generated, and packed in an m-by-n-by-k arrangement. Because each sample is N-dimensional, the output shape is (m,n,k,N). If no shape is specified, a single (N-D) sample is returned.
check_valid : { ‘warn’, ‘raise’, ‘ignore’ }, optional
Behavior when the covariance matrix is not positive semidefinite.
tol : float, optional
Tolerance when checking the singular values in covariance matrix. cov is cast to double before the check.
- Returns
out : ndarray
The drawn samples, of shape size, if that was provided. If not, the shape is (N,). In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.
The mean is a coordinate in N-dimensional space, which represents the location where samples are most likely to be generated. This is analogous to the peak of the bell curve for the one-dimensional or univariate normal distribution.
Covariance indicates the level to which two variables vary together. From the multivariate normal distribution, we draw N-dimensional samples, \(X = [x_1, x_2, ... x_N]\). The covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\) (i.e. its “spread”).
Instead of specifying the full covariance matrix, popular approximations include:
- Spherical covariance (cov is a multiple of the identity matrix)
- Diagonal covariance (cov has non-negative elements, and only on the diagonal)
This geometrical property can be seen in two dimensions by plotting generated data-points:
>>> mean = [0, 0]
>>> cov = [[1, 0], [0, 100]]  # diagonal covariance
Diagonal covariance means that points are oriented along x or y-axis:
>>> import matplotlib.pyplot as plt
>>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
>>> plt.plot(x, y, 'x')
>>> plt.axis('equal')
>>> plt.show()
Note that the covariance matrix must be positive semidefinite (a.k.a. nonnegative-definite). Otherwise, the behavior of this method is undefined and backwards compatibility is not guaranteed.
References
[1] Papoulis, A., “Probability, Random Variables, and Stochastic Processes,” 3rd ed., New York: McGraw-Hill, 1991.
[2] Duda, R. O., Hart, P. E., and Stork, D. G., “Pattern Classification,” 2nd ed., New York: Wiley, 2001.
Examples
>>> mean = (1, 2)
>>> cov = [[1, 0], [0, 1]]
>>> x = np.random.multivariate_normal(mean, cov, (3, 3))
>>> x.shape
(3, 3, 2)
The following is probably true, given that 0.6 is roughly twice the standard deviation:
>>> list((x[0,0,:] - mean) < 0.6)
[True, True]  # random
- source.emb_entropy.rand(d0, d1, ..., dn)¶
Random values in a given shape.
Note
This is a convenience function for users porting code from Matlab, and wraps numpy.random.random_sample. That function takes a tuple to specify the size of the output, which is consistent with other NumPy functions like numpy.zeros and numpy.ones.
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
- Parameters
d0, d1, …, dn : int, optional
The dimensions of the returned array, must be non-negative. If no argument is given a single Python float is returned.
- Returns
out : ndarray, shape (d0, d1, ..., dn)
Random values.
See also: random
>>> np.random.rand(3,2)
array([[ 0.14022471, 0.96360618],  # random
       [ 0.37601032, 0.25528411],  # random
       [ 0.49313049, 0.94909878]])  # random
source.emb_information module¶
Module for (Shannon) mutual information estimators.
- source.emb_information.benchmark(dim=10, cost_name='MIShannon_DKL', num_of_samples=-1, max_num_of_samples=10000)[source]¶
Plot estimated vs analytical Mutual Information for random matrices.
- param dim
Data dimension (number of columns of the matrices)
- param cost_name
MI estimation algorithm, e.g., ‘BIHSIC_IChol’, ‘MIShannon_DKL’, ‘MIShannon_HS’ (for more see ite.cost)
- param num_of_samples
If -1, the number of data points is increased by 1000 until max_num_of_samples; if > -1, the running time is printed for this number of data points (matrix row number).
- param max_num_of_samples
Maximum number of data points when plotting for a series of sample sizes.
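A usage sketch, assuming the module is importable as source.emb_information; the call simply reproduces the defaults documented above and is not a prescribed workflow:
from source.emb_information import benchmark

# Sweep the sample size up to max_num_of_samples (num_of_samples=-1) and plot
# estimated vs. analytical MI for 10-dimensional random matrices.
benchmark(dim=10, cost_name='MIShannon_DKL', num_of_samples=-1, max_num_of_samples=10000)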
- source.emb_information.estimate_embeddings_mi(datadir: str, vecs_names=[], mm_embs_of=[], cost_name='MIShannon_DKL', pca_n_components=None)[source]¶
Return estimated Mutual Information for embeddings with vecs_names in datadir.
- param datadir
Path to directory which contains embedding data.
- param vecs_names
List[str] Names of embeddings
- param mm_embs_of
List of str tuples, where the tuples contain names of embeddings which are to be concatenated into a multi-modal mid-fusion embedding.
- param cost_name
MI estimation algorithm, e.g., ‘BIHSIC_IChol’, ‘MIShannon_DKL’, ‘MIShannon_HS’ (for more see ite.cost)
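An illustrative call, where the directory and embedding names are hypothetical placeholders rather than shipped data:
from source.emb_information import estimate_embeddings_mi

# Hypothetical embedding names; the tuple requests a concatenated
# (mid-fusion) multi-modal embedding of the two.
mi = estimate_embeddings_mi(
    datadir='data/embeddings',
    vecs_names=['textual_emb', 'visual_emb'],
    mm_embs_of=[('textual_emb', 'visual_emb')],
    cost_name='MIShannon_DKL',
)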
- source.emb_information.multivariate_normal(mean, cov, size=None, check_valid='warn', tol=1e-8)¶
Draw random samples from a multivariate normal distribution.
The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.
- Parameters
mean : 1-D array_like, of length N
Mean of the N-dimensional distribution.
cov : 2-D array_like, of shape (N, N)
Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
size : int or tuple of ints, optional
Given a shape of, for example, (m,n,k), m*n*k samples are generated, and packed in an m-by-n-by-k arrangement. Because each sample is N-dimensional, the output shape is (m,n,k,N). If no shape is specified, a single (N-D) sample is returned.
check_valid : { ‘warn’, ‘raise’, ‘ignore’ }, optional
Behavior when the covariance matrix is not positive semidefinite.
tol : float, optional
Tolerance when checking the singular values in covariance matrix. cov is cast to double before the check.
- Returns
out : ndarray
The drawn samples, of shape size, if that was provided. If not, the shape is (N,). In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.
The mean is a coordinate in N-dimensional space, which represents the location where samples are most likely to be generated. This is analogous to the peak of the bell curve for the one-dimensional or univariate normal distribution.
Covariance indicates the level to which two variables vary together. From the multivariate normal distribution, we draw N-dimensional samples, \(X = [x_1, x_2, ... x_N]\). The covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\) (i.e. its “spread”).
Instead of specifying the full covariance matrix, popular approximations include:
- Spherical covariance (cov is a multiple of the identity matrix)
- Diagonal covariance (cov has non-negative elements, and only on the diagonal)
This geometrical property can be seen in two dimensions by plotting generated data-points:
>>> mean = [0, 0]
>>> cov = [[1, 0], [0, 100]]  # diagonal covariance
Diagonal covariance means that points are oriented along x or y-axis:
>>> import matplotlib.pyplot as plt
>>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
>>> plt.plot(x, y, 'x')
>>> plt.axis('equal')
>>> plt.show()
Note that the covariance matrix must be positive semidefinite (a.k.a. nonnegative-definite). Otherwise, the behavior of this method is undefined and backwards compatibility is not guaranteed.
References
[1] Papoulis, A., “Probability, Random Variables, and Stochastic Processes,” 3rd ed., New York: McGraw-Hill, 1991.
[2] Duda, R. O., Hart, P. E., and Stork, D. G., “Pattern Classification,” 2nd ed., New York: Wiley, 2001.
Examples
>>> mean = (1, 2)
>>> cov = [[1, 0], [0, 1]]
>>> x = np.random.multivariate_normal(mean, cov, (3, 3))
>>> x.shape
(3, 3, 2)
The following is probably true, given that 0.6 is roughly twice the standard deviation:
>>> list((x[0,0,:] - mean) < 0.6)
[True, True]  # random
- source.emb_information.plot_for_freqranges(file_path, vis_names=['vecs3lem1', 'google_resnet152'], quantity=-1, legend=True, fname='', suffix='')[source]¶
- source.emb_information.plot_for_quantities(file_path, vis_names=['vecs3lem1', 'google_resnet152'], legend=True, fname='', suffix='')[source]¶
- source.emb_information.plots(file_pattern, vis_names=['vecs3lem1', 'google_resnet152'], fqrng_quantity=-1, legend=True, suffix='')[source]¶
- source.emb_information.rand(d0, d1, ..., dn)¶
Random values in a given shape.
Note
This is a convenience function for users porting code from Matlab, and wraps numpy.random.random_sample. That function takes a tuple to specify the size of the output, which is consistent with other NumPy functions like numpy.zeros and numpy.ones.
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
- Parameters
d0, d1, …, dn : int, optional
The dimensions of the returned array, must be non-negative. If no argument is given a single Python float is returned.
- Returns
out : ndarray, shape (d0, d1, ..., dn)
Random values.
See also: random
>>> np.random.rand(3,2)
array([[ 0.14022471, 0.96360618],  # random
       [ 0.37601032, 0.25528411],  # random
       [ 0.49313049, 0.94909878]])  # random
source.embedding module¶
- class source.embedding.LossLogger(show=False)[source]¶
Bases: gensim.models.callbacks.CallbackAny2Vec
Callback to print loss after each epoch.
- on_batch_end(model)[source]¶
Method called at the end of each batch.
- Parameters
model : BaseWordEmbeddingsModel
Current model.
- on_epoch_begin(model)[source]¶
Method called at the start of each epoch.
- Parameters
model : BaseWordEmbeddingsModel
Current model.
source.process_embeddings module¶
- class source.process_embeddings.Embeddings(datadir: str, vecs_names, ling_vecs_names=None)[source]¶
Bases: object
Data class for storing embeddings.
- embeddings = typing.List[numpy.ndarray]¶
- fasttext_vss = {'crawl': 'crawl-300d-2M.vec', 'crawl_sub': 'crawl-300d-2M-subword', 'w2v13': '', 'wikinews': 'wiki-news-300d-1M.vec', 'wikinews_sub': 'wiki-news-300d-1M-subword.vec'}¶
- load_vecs(vecs_name: str, datadir: str, filter_vocab=[])[source]¶
Load .npy vector files and vocab files. If they are not present, try loading a gensim model.
- vecs_labels = typing.List[str]¶
- vecs_names = typing.List[str]¶
- vocabs = typing.List[typing.List[str]]¶
- source.process_embeddings.agg_img_embeddings(values: dict, maxnum: int = 10) → numpy.ndarray[source]¶
Aggregate image vectors from a dictionary into numpy embeddings and a vocabulary. The embedding is a numpy array of shape (vocab size, vector dim); the vocabulary is a text file with one word per line.
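The aggregation step is not spelled out in the docstring; the sketch below assumes element-wise averaging of up to maxnum image vectors per word, which is an assumption rather than the documented behaviour:
import numpy as np

def agg_img_embeddings_sketch(values: dict, maxnum: int = 10):
    """Assumed behaviour: average up to `maxnum` image vectors per word.

    Assumes each dictionary value is a list/array of image vectors.
    """
    vocab = sorted(values.keys())
    embeddings = np.vstack([np.asarray(values[w])[:maxnum].mean(axis=0) for w in vocab])
    return embeddings, vocab  # (vocab size, vector dim) array and word list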
- source.process_embeddings.divide_vocab_by_freqranges(distribution_file, num_groups=3, save=False)[source]¶
- source.process_embeddings.filter_by_vocab(vecs, vocab, filter_vocab)[source]¶
Filter a numpy array and the corresponding vocab so that they contain only words and vectors for words in filter_vocab.
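A minimal sketch of such a filter, under the assumption that the rows of vecs align with the entries of vocab; the helper name is hypothetical:
import numpy as np

def filter_by_vocab_sketch(vecs, vocab, filter_vocab):
    """Keep only rows of `vecs` whose corresponding word is in `filter_vocab`."""
    allowed = set(filter_vocab)
    keep = [i for i, w in enumerate(vocab) if w in allowed]
    return np.asarray(vecs)[keep], [vocab[i] for i in keep]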
- source.process_embeddings.filter_for_freqranges(datadir, fqvocabs_file, file_patterns=None)[source]¶
Filter embedding files with the given file pattern.
- source.process_embeddings.mid_fusion(embeddings, vocabs, labels, padding: bool, combnum: int = 2) -> (typing.List[numpy.ndarray], typing.List[numpy.ndarray], typing.List[str])[source]¶
Concatenate embeddings pairwise for words in the intersection or union (with padding) of their vocabularies.
- param embeddings
List[np.ndarray] or List[Tuple[np.ndarray]]
- param vocabs
List[np.ndarray] or List[Tuple[np.ndarray]]
- param labels
List[np.ndarray] or List[Tuple[np.ndarray]]
- param padding
If True, all the vectors are kept from the embeddings’ vocabularies. The vector parts without a vector from another modality are padded with zeros.
- param combnum
Number of modalities concatenated in the final multi-modal vector.
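A sketch of the intersection (padding=False) case for two modalities; the function and variable names are hypothetical, and the padded union variant is omitted:
import numpy as np

def mid_fusion_intersection_sketch(emb1, vocab1, emb2, vocab2):
    """Concatenate two embedding arrays for words in the intersection of their vocabularies."""
    idx1 = {w: i for i, w in enumerate(vocab1)}
    idx2 = {w: i for i, w in enumerate(vocab2)}
    common = [w for w in vocab1 if w in idx2]
    fused = np.hstack([emb1[[idx1[w] for w in common]],
                       emb2[[idx2[w] for w in common]]])
    return fused, common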
- source.process_embeddings.serialize2npy(filepath: str, savedir: str, maxnum: int = 10)[source]¶
Save embedding files from a pickle containing a dictionary of {word: np.ndarray} into embedding.npy and embedding.vocab for evaluation. The embedding is a numpy array of shape (vocab size, vector dim); the vocabulary is a text file with one word per line.
- param filepath
Path to a pickle file containing a dict of either {word: <image embedding list>} or {word: <image embedding>} (‘descriptors’ suffix in mmfeat file names)
source.process_vg module¶
- source.process_vg.description_corpus(region_descriptions, lemmatise)[source]¶
Return all descriptions as a corpus in the form of a list of strings (sentences).
- source.process_vg.vg_pmis(words_file, datadir='/Users/anitavero/projects/data/visualgenome', bigram_file='bigram_vg.pkl', variants=['ppmi'])[source]¶
Save PMI scores for bigrams that include words listed in words_file.
- param words_file
JSON file name in datadir, containing a list of strings.
- param datadir
Path to the data directory.
source.process_wiki module¶
- Module for processing a Wikipedia dump previously extracted using WikiExtractor (https://github.com/attardi/wikiextractor).
- source.process_wiki.contexts_for_quantity(data_dir, save_dir, num, filename_suffix='', contexts_pattern='', window=5, vocab=[], processes=1)[source]¶
Load a given number of randomly chosen context files and concatenate them into one file. If there are no .contexts files under the data_dir/* subdirectories, but one .contexts file exists directly under data_dir, just return that file name.
- source.process_wiki.create_context_files(data_dir=None, jsons=None, window=5, vocab=[], processes=1, merge=False, filename_suffix='')[source]¶
- source.process_wiki.distribution(data_dir, format='json', file_suffix='')[source]¶
Count word frequencies from text files or JSON files containing lists of str lists.
- source.process_wiki.get_pmi_for_words(words_file, data_dir, process=False, bigram_file=None, variants=['pmi'])[source]¶
Save PMI scores for bigrams that include words listed in words_file.
- param words_file
JSON file name in data_dir, containing a list of strings.
- param data_dir
Path to the data directory.
- param process
bool; if True, preprocess the wiki files, if False, load preprocessed JSONs.
- source.process_wiki.process_files(data_dir)[source]¶
Sentence-tokenize and stop-word-filter all text files and save the tokenized texts as JSON files into the ‘tokenized’ directory.
- source.process_wiki.w2v_for_quantities(data_dir, save_dir, w2v_dir, sample_num, trfile_num, size=300, min_count=10, workers=4, negative=15, exp_name='', contexts_pattern='', window=5, vocab=[])[source]¶
Train several Word2Vec models in parallel for the same data quantity, multiple times on random subsets.
- param data_dir
‘tokenized’ directory with subdirectories of jsons.
- param save_dir
directory where we save the model and log files.
- param sample_num
number of random trainings for the same number of files.
- param trfile_num
number of sampled training files. If num <= 0 we train on the whole corpus.
Rest are Word2Vec training parameters.
- source.process_wiki.w2v_for_quantity(data_dir, save_dir, w2v_dir, num, size=300, min_count=10, workers=4, negative=15, filename_suffix='', contexts_pattern='', window=5, vocab=[])[source]¶
Train Word2Vec on a given number of randomly chosen tokenized JSON files.
- param data_dir
‘tokenized’ directory with subdirectories of .context files.
source.run_infogain_analysis module¶
source.run_infogain_experiments module¶
source.task_eval module¶
- class source.task_eval.DataSets(datadir: str)[source]¶
Bases: object
Class for storing evaluation datasets and linguistic embeddings.
- datasets = {}¶
- fmri_vocab = ['airplane', 'ant', 'apartment', 'arch', 'arm', 'barn', 'bear', 'bed', 'bee', 'beetle', 'bell', 'bicycle', 'bottle', 'butterfly', 'car', 'carrot', 'cat', 'celery', 'chair', 'chimney', 'chisel', 'church', 'closet', 'coat', 'corn', 'cow', 'cup', 'desk', 'dog', 'door', 'dress', 'dresser', 'eye', 'fly', 'foot', 'glass', 'hammer', 'hand', 'horse', 'house', 'igloo', 'key', 'knife', 'leg', 'lettuce', 'pants', 'pliers', 'refrigerator', 'saw', 'screwdriver', 'shirt', 'skirt', 'spoon', 'table', 'telephone', 'tomato', 'train', 'truck', 'watch', 'window']¶
- normalizers = {}¶
- source.task_eval.compute_correlations(scores: (<class 'numpy.ndarray'>, <class 'list'>), name_pairs: List[Tuple[str, str]] = None, common_subset: bool = False, leave_out=False)[source]¶
Compute correlation between score series.
- param scores
Structured array of scores with embedding/ground_truth names.
- param name_pairs
Pairs of scores to correlate. If None, every pair is computed; if ‘gt’, everything is plotted against the ground truth.
- param leave_out
Leave out a 1/leave_out portion of pairs, chosen randomly. Nothing is left out if it is False.
- source.task_eval.compute_scores(actions, embeddings, scores, datasets, pairs, brain_scores=None, pre_score_files: str = None, ling_vecs_names=[], vecs_names=[], mm_lingvis=False, mm_embs_of: List[Tuple[str]] = None, mm_padding=False, common_subset=False)[source]¶
Compute scores on all evaluation datasets.
- source.task_eval.divide_eval_vocab_by_freqranges(distribution_file, eval_data_dir, dataset_name, num_groups=3, save=False)[source]¶
- source.task_eval.eval_concreteness(scores: numpy.ndarray, word_pairs, num=100, gt_divisor=10, vecs_names=None, tablefmt='simple')[source]¶
Eval dataset instances based on WordNet synsets.
- source.task_eval.eval_dataset(dataset: List[Tuple[str, str, float]], dataset_name: str, embeddings: List[numpy.ndarray], vocabs: List[List[str]], labels: List[str]) -> (<class 'numpy.ndarray'>, <class 'list'>)[source]¶
- source.task_eval.highlight(val, conditions: dict, tablefmt)[source]¶
Highlight value in a table column.
- param val
number, value
- param conditions
dict of {colour: condition}
- param tablefmt
‘simple’ is terminal, ‘latex-raw’ is LaTeX
- source.task_eval.main(datadir, embdir: str = None, vecs_names=[], savepath=None, loadpath=None, actions=['plotcorr'], plot_orders=['ground_truth'], plot_vecs=[], ling_vecs_names=[], pre_score_files: str = None, mm_embs_of: List[Tuple[str]] = None, mm_lingvis=False, mm_padding=False, print_corr_for=None, common_subset=False, tablefmt: str = 'simple', concrete_num=100, pair_score_agg='sum', quantity=-1)[source]¶
- Parameters
actions –
Choose from the following:
‘printcorr’: Print correlations in tables on MEN and SimLex.
‘plotscores’: Plot correlations on MEN and SimLex.
‘concreteness’: Scores on caption_comsub Semantic Similarity dataset splits, ordered by pair_score_agg of WordNet concreteness scores of the two words in every word pair. Optional: mm_padding.
‘coverage’: Save coverages on similarity/relatedness/brain data.
‘compscores’: Compute scores on similarity/relatedness evaluation datasets.
‘compbrain’: Compute scores on brain evaluation datasets.
‘brainwords’: Plot qualitative analysis on words in the brain data.
‘printbraincorr’: Print correlations on brain data.
‘plot_quantity’: Plot similarity/relatedness results for text quantity ranges.
‘plot_freqrange’: Plot similarity/relatedness results for word frequency ranges.
pair_score_agg – ‘sum’ or ‘diff’ of concreteness scores of word pairs.
mm_lingvis – if True, create multi-modal embeddings, otherwise specific embedding pairs should be given.
tablefmt – printed table format. ‘simple’ - terminal, ‘latex_raw’ - latex table.
concrete_num – Plot of WordNet concreteness splits by concrete_num number of pairs.
datadir – Path to directory which contains evaluation data (and embedding data if embdir is not given)
vecs_names – List[str] Names of embeddings
embdir – Path to directory which contains embedding files.
savepath – Full path to the file to save scores without extension. None if there’s no saving.
loadpath – Full path to the files to load scores and brain results from without extension. If None, they’ll be computed.
plot_orders – Performance plot ordered by similarity scores of these datasets or embeddings.
plot_vecs – List[str] Names of embeddings to plot scores for.
ling_vecs_names – List[str] Names of linguistic embeddings.
pre_score_files – Previously saved score file path without extension, which the new scores will be merged with
mm_embs_of – List of str tuples, where the tuples contain names of embeddings which are to be concatenated into a multi-modal mid-fusion embedding.
mm_padding – Default False. Multi-modal mid-fusion method. If true, all the vectors are kept from the embeddings’ vocabularies. Vector representations without a vector from another modality are padded with zeros.
print_corr_for – ‘gt’ prints correlations scores for ground truth, ‘all’ prints scores between all pairs of scores.
common_subset – action printcorr: Print results for subsets of the eval datasets which are covered by all embeddings’ vocabularies. action compbrain: Compute brain scores for the intersection of vocabularies.
- source.task_eval.plot_brain_words(brain_scores, plot_order)[source]¶
Plot hit counts for words in the brain data.
- param brain_scores
brain score dict
- param plot_order
‘concreteness’ orders words by WordNet concreteness; <emb_name> orders the plot by that embedding’s scores.
- source.task_eval.plot_by_concreteness(scores: numpy.ndarray, word_pairs, ax1, ax2, common_subset=False, vecs_names=None, concrete_num=100, title_prefix='', pair_score_agg='sum', show=False)[source]¶
Plot scores for data splits with increasing concreteness.
- source.task_eval.plot_for_freqranges(scores: numpy.ndarray, gt_divisor, quantity=-1, common_subset=False, pair_num=None, split_num=None, ds_name=None)[source]¶
- source.task_eval.plot_for_quantities(scores: numpy.ndarray, gt_divisor, common_subset=False, legend=False, pair_num=None)[source]¶
- source.task_eval.plot_scores(scores: numpy.ndarray, gt_divisor=10, vecs_names=None, labels=None, colours=None, linestyles=None, title=None, type='plot', alphas=None, xtick_labels=None, ax=None, show=True, swapaxes=False)[source]¶
Scatter plot of a structured array.
- source.task_eval.print_brain_scores(brain_scores, tablefmt: str = 'simple', caption='', suffix='', label='')[source]¶
- source.task_eval.print_correlations(scores: numpy.ndarray, name_pairs='gt', common_subset: bool = False, tablefmt: str = 'simple', caption='', label='')[source]¶
- source.task_eval.print_subsampled_correlations(scores: numpy.ndarray, name_pairs='gt', common_subset: bool = False, tablefmt: str = 'simple', caption='', label='', n_sample=3)[source]¶
- source.task_eval.wn_concreteness(word, similarity_fn=<bound method WordNetCorpusReader.path_similarity of <WordNetCorpusReader in '/Users/anitavero/nltk_data/corpora/wordnet'>>)[source]¶
WordNet distance of a word from its root hypernym.
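A plausible re-implementation sketch, assuming the score is the path similarity between a word’s first synset and its root hypernym (taking the first synset is an assumption):
from nltk.corpus import wordnet as wn

def wn_concreteness_sketch(word, similarity_fn=wn.path_similarity):
    """Path similarity between the word's first synset and its root hypernym."""
    synset = wn.synsets(word)[0]        # assumption: take the first synset
    root = synset.root_hypernyms()[0]   # e.g. entity.n.01 for nouns
    return similarity_fn(synset, root)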
- source.task_eval.wn_concreteness_for_pairs(word_pairs, synset_agg: str, similarity_fn=<bound method WordNetCorpusReader.path_similarity of <WordNetCorpusReader in '/Users/anitavero/nltk_data/corpora/wordnet'>>, pair_score_agg='sum') -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]¶
Sort scores by the first and second word’s concreteness scores.
- param pair_score_agg
‘sum’ adds scores for the two words, ‘diff’ computes their absolute difference.
- return (ids, scores)
sorted score indices and concreteness scores.
source.text_process module¶
- class source.text_process.BigramPMIVariants[source]¶
Bases: nltk.metrics.association.BigramAssocMeasures
- source.text_process.concatenate_files(data_dir, file_pattern, outfile)[source]¶
Concatenate files into one big file.
- source.text_process.context_pairs(text, contexts_file, lang='english')[source]¶
Prepare contexts for word2vecf without using their context format: a textual file of word-context pairs. Each pair takes a separate line. The format of a pair is “<word> <context>”, i.e. space delimited, where <word> and <context> are strings. The context is all non-stop words in the same sentence.
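A sketch of how such pairs could be generated; only the “<word> <context>” line format comes from the docstring, while the tokenisation, lower-casing and alphabetic filtering are assumptions (requires the NLTK punkt and stopwords data):
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def context_pairs_sketch(text, lang='english'):
    """Yield '<word> <context>' lines pairing each non-stop word with the
    other non-stop words of its sentence."""
    stops = set(stopwords.words(lang))
    for sent in sent_tokenize(text, language=lang):
        tokens = [t.lower() for t in word_tokenize(sent, language=lang)
                  if t.isalpha() and t.lower() not in stops]
        for i, w in enumerate(tokens):
            for j, c in enumerate(tokens):
                if i != j:
                    yield f"{w} {c}"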
- source.text_process.hapax_legomena(text)[source]¶
Return words that occur only once within a text.
- param text
str list or Counter
- source.text_process.pmi_for_words(words, finder_file, token_list=None, document_list=None, variants=['pmi'])[source]¶
Return PMI scores for words in a given tokenized corpus.
- param words
string list.
- param token_list
string list.
- param document_list
list of string lists
- source.text_process.text2gensim(text, lang)[source]¶
Tokenize and filter stop words. Return a list of str lists (the standard gensim format), where each str list is a sentence and each text is a list of these lists.
- source.text_process.text2w2vf(corpus_tup, data_dir, window=5, vocab=[], processes=1, merge=False, filename_suffix='')[source]¶
Prepare contexts for word2vecf using their context format: a textual file of word-context pairs. Each pair takes a separate line. The format of a pair is “<word> <context>”, i.e. space delimited, where <word> and <context> are strings. The context is all non-stop words in the same sentence, or around the token if the text is not sentence-tokenized.
- param corpus_tup
list with elements of: token (str) list or sentence list (list of str lists)
- param data_dir
directory to write context pairs to
- param window
Window for w2v. If 0 and the text is a sentence list, the context of each word is all the other words in the same sentence.
- param vocab
list of str, vocab to filter with in extract_neighbours.
source.train_word2vecf module¶
- source.train_word2vecf.train(contexts_file, save_dir, w2v_dir, filename_suffix='', min_count=10, size=300, negative=15, threads=4)[source]¶
Perform the steps to train word2vecf on a given corpus:
1. Create word and context vocabularies:
./myword2vec/count_and_filter -train dep.contexts -cvocab cv -wvocab wv -min-count 100
This will count the words and contexts in dep.contexts, discard either words or contexts appearing < 100 times, and write the counted words to wv and the counted contexts to cv.
2. Train the embeddings:
./myword2vec/word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10
This will train 200-dim embeddings based on dep.contexts, wv and cv (lines in dep.contexts with a word not in wv or a context not in cv are ignored).
The -dumpcv flag can be used in order to dump the trained context-vectors as well:
./myword2vec/word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10 -dumpcv dim200context-vecs
3. Convert the embeddings to numpy-readable format.
source.unsupervised_metrics module¶
- source.unsupervised_metrics.avg_cluster_wordfrequency(datadir='/Users/anitavero/projects/data/', clmethod='agglomerative')[source]¶
- source.unsupervised_metrics.cluster_similarities(order='default', clmethod='agglomerative', plot=True)[source]¶
- source.unsupervised_metrics.cluster_sizes_avgfreq(clusters, cl_freqs, embtype=None, method=None, barfontsize=20, suffix='')[source]¶
Histogram of cluster sizes.
- source.unsupervised_metrics.compute_cluster_similarities(emb_clusters1, emb_clusters2, compare, order, clmethod, plot)[source]¶
Compute cluster similarities between two embedding cluster structures.
- param emb_clusters1
- param emb_clusters2
- param compare
comparison based on ‘cross’ or ‘dot’ product.
- param order
‘clustermap’ or ‘avgfreq’(average corpus frequency of cluster element words). Default: ‘avgfreq’.
- param clmethod
‘kmeans’ or ‘agglomerative’.
- param plot
bool. If True, a similarity plot is created.
- return
jaccard_similarities: dict {<embedding pair label>: similarity matrix}.
- source.unsupervised_metrics.get_clustering_labels_metrics(vecs_names=[], datadir='/anfs/bigdisc/alv34/wikidump/extracted/models/', savedir='/anfs/bigdisc/alv34/wikidump/extracted/models/results/', cluster_method='kmeans', n_clusters=3, random_state=1, eps=0.5, min_samples=90, workers=4, suffix='', linkage='ward')[source]¶
- source.unsupervised_metrics.get_n_nearest_neighbors(words: numpy.ndarray, E: numpy.ndarray, vocab: numpy.ndarray, n: int = 10)[source]¶
n nearest neighbors for words based on cosine distance in Embedding E.
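A minimal sketch of a cosine-similarity neighbour lookup, under the assumption that the rows of E align with vocab; the helper name is hypothetical:
import numpy as np

def nearest_neighbors_sketch(words, E, vocab, n=10):
    """Return the n vocabulary words most cosine-similar to each query word."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab)}
    neighbours = {}
    for w in words:
        sims = E_norm @ E_norm[index[w]]
        top = np.argsort(-sims)[1:n + 1]   # skip the query word itself
        neighbours[w] = [vocab[t] for t in top]
    return neighbours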
- source.unsupervised_metrics.inspect_clusters(cluster_label_filepath)[source]¶
Convert a cluster label file containing a {word: label} dict to a {cluster_id: wordlist} dict, ordered by the number of cluster members.
- Parameters
cluster_label_filepath – Path to cluster label file.
- source.unsupervised_metrics.jaccard_similarity_score(x, y)[source]¶
Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|
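The formula above translates directly into a set operation; a sketch (the guard for two empty inputs is an assumption):
def jaccard_similarity_sketch(x, y):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if a | b else 0.0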
- source.unsupervised_metrics.label_clusters_with_wordnet(depth=3, max_label_num=3)[source]¶
First max_label_num most common synset names.
- source.unsupervised_metrics.n_nearest_neighbors(data_dir, model_name, words=[], n: int = 10)[source]¶
n nearest neighbors for words based on model <vecs_names>.
- source.unsupervised_metrics.order_words_by_centroid_distance(clusters, cluster_label_filepath)[source]¶
Order words by their distance from the centroid.
- source.unsupervised_metrics.plot_cluster_results(resdir='/Users/anitavero/projects/data/wikidump/models/')[source]¶
- source.unsupervised_metrics.pmi_comparison(datadir='/Users/anitavero/projects/data/wikidump/models/results/', pmi_th=5, variants='ppmi', format='latex')[source]¶
- source.unsupervised_metrics.print_cluster_results(resdir='/Users/anitavero/projects/data/wikidump/models/')[source]¶
- source.unsupervised_metrics.print_clusters(clusters_WN_filepath, tablefmt, barfontsize=20)[source]¶
- Parameters
clusters_WN_filepath –
tablefmt – printed table format. ‘simple’ - terminal, ‘latex_raw’ - latex table.
barfontsize – font size in the figure.
- Returns
clusters, printed table
- source.unsupervised_metrics.run_clustering(model, cluster_method, n_clusters=3, random_state=1, eps=0.5, min_samples=5, workers=4, linkage='ward')[source]¶
- source.unsupervised_metrics.run_clustering_experiments(datadir='/anfs/bigdisc/alv34/wikidump/extracted/models/', savedir='/anfs/bigdisc/alv34/wikidump/extracted/models/results/', vecs_names=[], mm_embs_of=[], cluster_method='dbscan', n_clusters=-1, random_state=1, eps=0.5, min_samples=90, workers=4, suffix='', linkage='ward')[source]¶
- source.unsupervised_metrics.save_closest_words_to_centroids()[source]¶
Save words from each cluster which are closest to the centroid.
source.utils module¶
- source.utils.latex_table_post_process(table, bottomrule_row_ids: List[int] = [], title='', fit_to_page=False, label='')[source]¶
Add separator lines and align width to page.
- param bottomrule_row_ids
Row indices (without header) below which we put a separator line.
source.vecs2nps module¶
Script to create vecs.npy and vecs.vocab from files with the following format:
<row_num> <dim>
<word_1> <vector_1>
…
<word_n> <vector_n>
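A sketch of the conversion the script presumably performs; the function name is hypothetical and the output names follow the description above:
import numpy as np

def convert_sketch(txt_path, out_prefix='vecs'):
    """Read '<row_num> <dim>' then '<word> <vector>' lines;
    write <out_prefix>.npy and <out_prefix>.vocab."""
    words, vectors = [], []
    with open(txt_path, encoding='utf-8') as f:
        n_rows, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:dim + 1], dtype=np.float32))
    np.save(out_prefix + '.npy', np.vstack(vectors))
    with open(out_prefix + '.vocab', 'w', encoding='utf-8') as f:
        f.write('\n'.join(words))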
source.visualise module¶
- source.visualise.tensorboard_emb(data_dir, model_name, output_path, tn_label='wn_clusters', label_name='clusters')[source]¶
Visualise embeddings using TensorBoard. Code from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
:param model_name: name of numpy array files: embedding (.npy) and vocab (.vocab)
:param output_path: str, directory
:param tn_label: label dictionary file path or options: {“wn_clusters”, “None”}
:param label_name: str, title for the labeling (e.g.: Cluster)
- Usage on a remote server with port forwarding:
When you ssh into the machine, use the -L option to forward port 6006 of the remote server to, for instance, port 16006 of your local machine:
ssh -L 16006:127.0.0.1:6006 alv34@yellowhammer
Everything on port 6006 of the server (127.0.0.1:6006) is then forwarded to your machine on port 16006.
You can then launch TensorBoard on the remote machine using a standard tensorboard --logdir log with the default 6006 port.
On your local machine, go to http://127.0.0.1:16006 and enjoy your remote TensorBoard.