source package

Submodules

source.cnn_embeddings module

source.cnn_embeddings.create_index_from_fnames(image_dir, savepath)[source]
source.cnn_embeddings.get_cnn(image_dir, word_index_file, savedir=None, cnn_model='resnet18', agg_maxnum=10, gpu=False, filename_prefix='')[source]

Extract CNN representations for images in a directory and save them into a dictionary file.
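
A minimal usage sketch (the directory and file names below are placeholders, not paths shipped with the project):

>>> from source.cnn_embeddings import get_cnn
>>> get_cnn('data/images', 'data/word_index.json', savedir='data/cnn_embs',
...         cnn_model='resnet18', agg_maxnum=10, gpu=False)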

source.emb_entropy module

Analytical vs. estimated values are illustrated for normal random variables.

source.emb_entropy.benchmark(dim, round_num, num_of_samples=10000)[source]
source.emb_entropy.multivariate_normal(mean, cov, size=None, check_valid='warn', tol=1e-8)

Draw random samples from a multivariate normal distribution.

The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.

mean : 1-D array_like, of length N

Mean of the N-dimensional distribution.

cov : 2-D array_like, of shape (N, N)

Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.

size : int or tuple of ints, optional

Given a shape of, for example, (m,n,k), m*n*k samples are generated, and packed in an m-by-n-by-k arrangement. Because each sample is N-dimensional, the output shape is (m,n,k,N). If no shape is specified, a single (N-D) sample is returned.

check_valid : { ‘warn’, ‘raise’, ‘ignore’ }, optional

Behavior when the covariance matrix is not positive semidefinite.

tol : float, optional

Tolerance when checking the singular values in covariance matrix. cov is cast to double before the check.

out : ndarray

The drawn samples, of shape size, if that was provided. If not, the shape is (N,).

In other words, each entry out[i,j,...,:] is an N-dimensional value drawn from the distribution.

The mean is a coordinate in N-dimensional space, which represents the location where samples are most likely to be generated. This is analogous to the peak of the bell curve for the one-dimensional or univariate normal distribution.

Covariance indicates the level to which two variables vary together. From the multivariate normal distribution, we draw N-dimensional samples, \(X = [x_1, x_2, ... x_N]\). The covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\) (i.e. its “spread”).

Instead of specifying the full covariance matrix, popular approximations include:

  • Spherical covariance (cov is a multiple of the identity matrix)

  • Diagonal covariance (cov has non-negative elements, and only on the diagonal)

This geometrical property can be seen in two dimensions by plotting generated data-points:

>>> mean = [0, 0]
>>> cov = [[1, 0], [0, 100]]  # diagonal covariance

Diagonal covariance means that points are oriented along x or y-axis:

>>> import matplotlib.pyplot as plt
>>> x, y = np.random.multivariate_normal(mean, cov, 5000).T
>>> plt.plot(x, y, 'x')
>>> plt.axis('equal')
>>> plt.show()

Note that the covariance matrix must be positive semidefinite (a.k.a. nonnegative-definite). Otherwise, the behavior of this method is undefined and backwards compatibility is not guaranteed.

[1] Papoulis, A., “Probability, Random Variables, and Stochastic Processes,” 3rd ed., New York: McGraw-Hill, 1991.

[2] Duda, R. O., Hart, P. E., and Stork, D. G., “Pattern Classification,” 2nd ed., New York: Wiley, 2001.

>>> mean = (1, 2)
>>> cov = [[1, 0], [0, 1]]
>>> x = np.random.multivariate_normal(mean, cov, (3, 3))
>>> x.shape
(3, 3, 2)

Since each component has unit standard deviation, each entry of x[0,0,:] - mean falls below 0.6 about 73% of the time, so the following is likely, though not guaranteed, to hold:

>>> list((x[0,0,:] - mean) < 0.6)
[True, True] # random
source.emb_entropy.rand(d0, d1, ..., dn)

Random values in a given shape.

Note

This is a convenience function for users porting code from Matlab, and wraps numpy.random.random_sample. That function takes a tuple to specify the size of the output, which is consistent with other NumPy functions like numpy.zeros and numpy.ones.

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

d0, d1, …, dn : int, optional

The dimensions of the returned array; must be non-negative. If no argument is given, a single Python float is returned.

out : ndarray, shape (d0, d1, ..., dn)

Random values.

See also: random

>>> np.random.rand(3,2)
array([[ 0.14022471,  0.96360618],  #random
       [ 0.37601032,  0.25528411],  #random
       [ 0.49313049,  0.94909878]]) #random
source.emb_entropy.run_benchmark(dim, k, num_of_samples=10000)[source]
Parameters
  • dim – dimension of the distribution

  • k – number of nearest neighbours

  • num_of_samples – number of data points
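
A minimal usage sketch for a 5-dimensional Gaussian with 3 nearest neighbours (the argument values are arbitrary examples):

>>> from source.emb_entropy import run_benchmark
>>> run_benchmark(dim=5, k=3, num_of_samples=10000)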

source.emb_information module

Module for (Shannon) mutual information estimators.

source.emb_information.benchmark(dim=10, cost_name='MIShannon_DKL', num_of_samples=-1, max_num_of_samples=10000)[source]
Plot estimated vs analytical Mutual Information for random matrices.
param dim

Data dimension (number of columns of the matrices)

param cost_name

MI estimation algorithm, e.g., ‘BIHSIC_IChol’, ‘MIShannon_DKL’, ‘MIShannon_HS’ (for more see ite.cost)

param num_of_samples

If -1, increase the number of data points by 1000 up to max_num_of_samples; if > -1, print the running time for this number of data points (matrix row count).

param max_num_of_samples

Maximum number of data points when plotting for a series of sample sizes.

source.emb_information.estimate_embeddings_mi(datadir: str, vecs_names=[], mm_embs_of=[], cost_name='MIShannon_DKL', pca_n_components=None)[source]
Return estimated Mutual Information for Embeddings with vecs_names in datadir.
param datadir

Path to directory which contains embedding data.

param vecs_names

List[str] Names of embeddings

param mm_embs_of

List of str tuples, where the tuples contain names of embeddings which are to be concatenated into a multi-modal mid-fusion embedding.

param cost_name

MI estimation algorithm, e.g., ‘BIHSIC_IChol’, ‘MIShannon_DKL’, ‘MIShannon_HS’ (for more see ite.cost)
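
A minimal usage sketch (the data directory is a placeholder and the embedding names are assumed to exist as files there; the tuple in mm_embs_of asks for the two embeddings to also be concatenated into a multi-modal one before estimation):

>>> from source.emb_information import estimate_embeddings_mi
>>> estimate_embeddings_mi('data/embeddings',
...                        vecs_names=['wikinews', 'google_resnet152'],
...                        mm_embs_of=[('wikinews', 'google_resnet152')],
...                        cost_name='MIShannon_DKL')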

source.emb_information.multivariate_normal(mean, cov, size=None, check_valid='warn', tol=1e-8)

Draw random samples from a multivariate normal distribution.

This is the same numpy.random.multivariate_normal function re-exported in this module; see the full parameter descriptions, notes, references and examples under source.emb_entropy.multivariate_normal above.
source.emb_information.plot_for_freqranges(file_path, vis_names=['vecs3lem1', 'google_resnet152'], quantity=-1, legend=True, fname='', suffix='')[source]
source.emb_information.plot_for_quantities(file_path, vis_names=['vecs3lem1', 'google_resnet152'], legend=True, fname='', suffix='')[source]
source.emb_information.plots(file_pattern, vis_names=['vecs3lem1', 'google_resnet152'], fqrng_quantity=-1, legend=True, suffix='')[source]
source.emb_information.rand(d0, d1, ..., dn)

Random values in a given shape.

This is the same numpy.random.rand function re-exported in this module; see the full description and examples under source.emb_entropy.rand above.
source.emb_information.run_mi_experiments(exp_names='quantity', cost_name='MIShannon_DKL', pca_n_components=None, exp_suffix='', blas_n_threads=None)[source]
Parameters

cost_name – MI estimation algorithm, e.g., HSIC kernel method: ‘BIHSIC_IChol’; KNN-based linear: ‘MIShannon_DKL’

source.emb_information.run_pca(X, n_components)[source]

source.embedding module

class source.embedding.LossLogger(show=False)[source]

Bases: gensim.models.callbacks.CallbackAny2Vec

Callback to print loss after each epoch.

on_batch_end(model)[source]

Method called at the end of each batch.

model : BaseWordEmbeddingsModel

Current model.

on_epoch_begin(model)[source]

Method called at the start of each epoch.

model : BaseWordEmbeddingsModel

Current model.

on_epoch_end(model)[source]

Method called at the end of each epoch.

model : BaseWordEmbeddingsModel

Current model.

on_train_end(model)[source]

Method called at the end of the training process.

model : BaseWordEmbeddingsModel

Current model.

source.embedding.train(corpus, save_path, load_path=None, size=300, window=5, min_count=10, workers=4, epochs=5, max_vocab_size=None, show_loss=False, save_loss=False)[source]
Train w2v.
param corpus

list of str lists

param save_path

Model file path

return

trained model
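
A minimal usage sketch with a toy corpus (a list of tokenized sentences); the save path is a placeholder:

>>> from source.embedding import train
>>> corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]
>>> model = train(corpus, 'models/toy_w2v.model', size=50, window=2,
...               min_count=1, epochs=5)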

source.process_embeddings module

class source.process_embeddings.Embeddings(datadir: str, vecs_names, ling_vecs_names=None)[source]

Bases: object

Data class for storing embeddings.

embeddings = typing.List[numpy.ndarray]
fasttext_vss = {'crawl': 'crawl-300d-2M.vec', 'crawl_sub': 'crawl-300d-2M-subword', 'w2v13': '', 'wikinews': 'wiki-news-300d-1M.vec', 'wikinews_sub': 'wiki-news-300d-1M-subword.vec'}
static get_emb_type_label(fn)[source]
static get_label(name)[source]

Return a printable label for embedding names.

static get_labels(name_list)[source]
load_fasttext(fname: str) → Tuple[numpy.ndarray, numpy.ndarray][source]
load_vecs(vecs_name: str, datadir: str, filter_vocab=[])[source]

Load .npy vector files and vocab files. If they are not present, try loading a gensim model.

vecs_labels = typing.List[str]
vecs_names = typing.List[str]
vocabs = typing.List[typing.List[str]]
source.process_embeddings.agg_img_embeddings(values: dict, maxnum: int = 10) → numpy.ndarray[source]
Aggregate image vectors from a dictionary into numpy embeddings and a vocabulary.

The embedding is a numpy array of shape (vocab size, vector dim). The vocabulary is a text file containing words separated by newlines.

source.process_embeddings.divide_vocab_by_freqranges(distribution_file, num_groups=3, save=False)[source]
source.process_embeddings.filter_by_vocab(vecs, vocab, filter_vocab)[source]
Filter a numpy array and the corresponding vocab so they contain only words and vectors for words in filter_vocab.

source.process_embeddings.filter_for_freqranges(datadir, fqvocabs_file, file_patterns=None)[source]

Filter embedding files with the given file pattern.

source.process_embeddings.mid_fusion(embeddings, vocabs, labels, padding: bool, combnum: int = 2) -> (typing.List[numpy.ndarray], typing.List[numpy.ndarray], typing.List[str])[source]
Concatenate embeddings pairwise for words in the intersection or union (with padding) of their vocabulary.
param embeddings

List[np.ndarray] or List[Tuple[np.ndarray]]

param vocabs

List[np.ndarray] or List[Tuple[np.ndarray]]

param labels

List[np.ndarray] or List[Tuple[np.ndarray]]

param padding

If True, all the vectors are kept from the embeddings’ vocabularies. Vector parts without a counterpart from the other modality are padded with zeros.

param combnum

number of modalities concatenated in the final multi-modal vector
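
A minimal sketch with toy arrays (real inputs typically come from the Embeddings class; the shapes and vocabularies below are made up):

>>> import numpy as np
>>> from source.process_embeddings import mid_fusion
>>> ling, vis = np.random.rand(3, 4), np.random.rand(2, 6)
>>> ling_vocab = np.array(['cat', 'dog', 'car'])
>>> vis_vocab = np.array(['dog', 'car'])
>>> mm_embs, mm_vocabs, mm_labels = mid_fusion(
...     [ling, vis], [ling_vocab, vis_vocab], ['ling', 'vis'], padding=False)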

source.process_embeddings.serialize2npy(filepath: str, savedir: str, maxnum: int = 10)[source]
Save embedding files from a pickle containing a dictionary of {word: np.ndarray} into embedding.npy and embedding.vocab for evaluation.

The embedding is a numpy array of shape (vocab size, vector dim); the vocabulary is a text file containing words separated by newlines.
param filepath

Path to a pickle file containing a dict of either {word: <image embedding list>} or {word: <image embedding>} (‘descriptors’ suffix in mmfeat file names)

source.process_vg module

source.process_vg.description_corpus(region_descriptions, lemmatise)[source]

Return all descriptions as a corpus in the form of a list of strings (sentences).

source.process_vg.save_description_corpus(datadir, lemmatise=True)[source]
source.process_vg.vg_dists(datadir='/Users/anitavero/projects/data/visualgenome')[source]
source.process_vg.vg_pmis(words_file, datadir='/Users/anitavero/projects/data/visualgenome', bigram_file='bigram_vg.pkl', variants=['ppmi'])[source]
Save PMI scores for bigrams containing words listed in words_file.
param words_file

json file name in datadir, consisting of a str list

param datadir

path to directory with data

source.process_wiki module

Module for processing a Wikipedia dump previously extracted using WikiExtractor (https://github.com/attardi/wikiextractor).

source.process_wiki.contexts_for_quantity(data_dir, save_dir, num, filename_suffix='', contexts_pattern='', window=5, vocab=[], processes=1)[source]
Load a given number of randomly chosen context files and concatenate them into one file.

If there are no .contexts files under data_dir/* subdirectories, but one .contexts file exists under data_dir directly, it will just return this file name.

source.process_wiki.create_context_files(data_dir=None, jsons=None, window=5, vocab=[], processes=1, merge=False, filename_suffix='')[source]
source.process_wiki.distribution(data_dir, format='json', file_suffix='')[source]

Count word frequencies from text files or json files, containing list of str lists.

source.process_wiki.get_pmi_for_words(words_file, data_dir, process=False, bigram_file=None, variants=['pmi'])[source]
Save PMI scores for bigrams containing words listed in words_file.
param words_file

json file name in data_dir, consisting of a str list

param data_dir

path to directory with data

param process

bool; if True, preprocess wiki files; if False, load preprocessed jsons.

source.process_wiki.plot_distribution(dist_file, logscale=True)[source]
source.process_wiki.process_files(data_dir)[source]

Sentence-tokenize and stop-word-filter all text files and save the tokenized texts as json files in the ‘tokenized’ directory.

source.process_wiki.w2v_for_quantities(data_dir, save_dir, w2v_dir, sample_num, trfile_num, size=300, min_count=10, workers=4, negative=15, exp_name='', contexts_pattern='', window=5, vocab=[])[source]
Train several Word2Vecs in parallel for the same data quantity, multiple times on random subsets.
param data_dir

‘tokenized’ directory with subdirectories of jsons.

param save_dir

directory where we save the model and log files.

param sample_num

number of random trainings for the same number of files.

param trfile_num

number of sampled training files. If num <= 0 we train on the whole corpus.

The remaining arguments are Word2Vec training parameters.

source.process_wiki.w2v_for_quantity(data_dir, save_dir, w2v_dir, num, size=300, min_count=10, workers=4, negative=15, filename_suffix='', contexts_pattern='', window=5, vocab=[])[source]
Train Word2Vec on a random number of tokenized json files.
param data_dir

‘tokenized’ directory with subdirectories of .context files.

source.run_analysis module

source.run_analysis.main(action)[source]

source.run_experiments module

source.run_experiments.main(exp_name)[source]

source.run_infogain_analysis module

source.run_infogain_analysis.main(actions='printcorr', name='', tablefmt='simple')[source]

source.run_infogain_experiments module

source.run_infogain_experiments.main(exp_name, filter_pattern='', pre_score_files=None, subdir='')[source]

source.task_eval module

class source.task_eval.DataSets(datadir: str)[source]

Bases: object

Class for storing evaluation datasets and linguistic embeddings.

datasets = {}
fmri_vocab = ['airplane', 'ant', 'apartment', 'arch', 'arm', 'barn', 'bear', 'bed', 'bee', 'beetle', 'bell', 'bicycle', 'bottle', 'butterfly', 'car', 'carrot', 'cat', 'celery', 'chair', 'chimney', 'chisel', 'church', 'closet', 'coat', 'corn', 'cow', 'cup', 'desk', 'dog', 'door', 'dress', 'dresser', 'eye', 'fly', 'foot', 'glass', 'hammer', 'hand', 'horse', 'house', 'igloo', 'key', 'knife', 'leg', 'lettuce', 'pants', 'pliers', 'refrigerator', 'saw', 'screwdriver', 'shirt', 'skirt', 'spoon', 'table', 'telephone', 'tomato', 'train', 'truck', 'watch', 'window']
normalizers = {}
class source.task_eval.PlotColour[source]

Bases: object

static colour_by_modality(labels)[source]
static get_legend()[source]
source.task_eval.compute_correlations(scores: (<class 'numpy.ndarray'>, <class 'list'>), name_pairs: List[Tuple[str, str]] = None, common_subset: bool = False, leave_out=False)[source]
Compute correlation between score series.
param scores

Structured array of scores with embedding/ground_truth names.

param name_pairs

Pairs of scores to correlate. If None, every pair will be computed. If ‘gt’, everything will be plotted against the ground truth.

param leave_out

Leave out a 1/leave_out portion of pairs, chosen randomly. Nothing is left out if it is False.

source.task_eval.compute_dists(vecs)[source]
source.task_eval.compute_scores(actions, embeddings, scores, datasets, pairs, brain_scores=None, pre_score_files: str = None, ling_vecs_names=[], vecs_names=[], mm_lingvis=False, mm_embs_of: List[Tuple[str]] = None, mm_padding=False, common_subset=False)[source]

Compute scores on all evaluation datasets.

source.task_eval.coverage(vocabulary, data)[source]
source.task_eval.covered(dataset, vocab)[source]
source.task_eval.dataset_vocab(dataset: str) → list[source]
source.task_eval.divide_eval_vocab_by_freqranges(distribution_file, eval_data_dir, dataset_name, num_groups=3, save=False)[source]
source.task_eval.eval_concreteness(scores: numpy.ndarray, word_pairs, num=100, gt_divisor=10, vecs_names=None, tablefmt='simple')[source]

Eval dataset instances based on WordNet synsets.

source.task_eval.eval_dataset(dataset: List[Tuple[str, str, float]], dataset_name: str, embeddings: List[numpy.ndarray], vocabs: List[List[str]], labels: List[str]) -> (<class 'numpy.ndarray'>, <class 'list'>)[source]
source.task_eval.highlight(val, conditions: dict, tablefmt)[source]
Highlight value in a table column.
param val

number, value

param conditions

dict of {colour: condition}

param tablefmt

‘simple’ is terminal, ‘latex-raw’ is LaTeX

source.task_eval.latex_escape(string)[source]
source.task_eval.main(datadir, embdir: str = None, vecs_names=[], savepath=None, loadpath=None, actions=['plotcorr'], plot_orders=['ground_truth'], plot_vecs=[], ling_vecs_names=[], pre_score_files: str = None, mm_embs_of: List[Tuple[str]] = None, mm_lingvis=False, mm_padding=False, print_corr_for=None, common_subset=False, tablefmt: str = 'simple', concrete_num=100, pair_score_agg='sum', quantity=-1)[source]
Parameters
  • actions

    Choose from the following:
    ‘printcorr’: Print correlation tables on MEN and SimLex.
    ‘plotscores’: Plot correlations on MEN and SimLex.
    ‘concreteness’: Scores on caption_comsub Semantic Similarity dataset splits, ordered by pair_score_agg of WordNet concreteness scores of the two words in every word pair. Optional: mm_padding.
    ‘coverage’: Save coverages on similarity/relatedness/brain data.
    ‘compscores’: Compute scores on similarity/relatedness evaluation datasets.
    ‘compbrain’: Compute scores on brain evaluation datasets.
    ‘brainwords’: Plot qualitative analysis of words in the brain data.
    ‘printbraincorr’: Print correlations on brain data.
    ‘plot_quantity’: Plot similarity/relatedness results for text quantity ranges.
    ‘plot_freqrange’: Plot similarity/relatedness results for word frequency ranges.

  • pair_score_agg – ‘sum’ or ‘diff’ of concreteness scores of word pairs.

  • mm_lingvis – if True, create multi-modal embeddings, otherwise specific embedding pairs should be given.

  • tablefmt – printed table format. ‘simple’ - terminal, ‘latex_raw’ - latex table.

  • concrete_num – Plot of WordNet concreteness splits by concrete_num number of pairs.

  • datadir – Path to directory which contains evaluation data (and embedding data if embdir is not given)

  • vecs_names – List[str] Names of embeddings

  • embdir – Path to directory which contains embedding files.

  • savepath – Full path to the file to save scores without extension. None if there’s no saving.

  • loadpath – Full path to the files to load scores and brain results from without extension. If None, they’ll be computed.

  • plot_orders – Performance plot ordered by similarity scores of these datasets or embeddings.

  • plot_vecs – List[str] Names of embeddings to plot scores for.

  • ling_vecs_names – List[str] Names of linguistic embeddings.

  • pre_score_files – Previously saved score file path without extension, which the new scores will be merged with

  • mm_embs_of – List of str tuples, where the tuples contain names of embeddings which are to be concatenated into a multi-modal mid-fusion embedding.

  • mm_padding – Default False. Multi-modal mid-fusion method. If true, all the vectors are kept from the embeddings’ vocabularies. Vector representations without a vector from another modality are padded with zeros.

  • print_corr_for – ‘gt’ prints correlations scores for ground truth, ‘all’ prints scores between all pairs of scores.

  • common_subset – action printcorr: Print results for subsets of the eval datasets which are covered by all embeddings’ vocabularies. action compbrain: Compute brain scores for the intersection of vocabularies.

source.task_eval.mm_over_uni(name, score_dict)[source]
source.task_eval.neighbors(words, vocab, vecs, n=10)[source]
source.task_eval.plot_brain_words(brain_scores, plot_order)[source]
Plot hit counts for words in brain data.
param brain_scores

brain score dict

param plot_order

‘concreteness’ orders words by WordNet concreteness; <emb_name> orders the plot by that embedding’s scores.

source.task_eval.plot_by_concreteness(scores: numpy.ndarray, word_pairs, ax1, ax2, common_subset=False, vecs_names=None, concrete_num=100, title_prefix='', pair_score_agg='sum', show=False)[source]

Plot scores for data splits with increasing concreteness.

source.task_eval.plot_for_freqranges(scores: numpy.ndarray, gt_divisor, quantity=-1, common_subset=False, pair_num=None, split_num=None, ds_name=None)[source]
source.task_eval.plot_for_quantities(scores: numpy.ndarray, gt_divisor, common_subset=False, legend=False, pair_num=None)[source]
source.task_eval.plot_scores(scores: numpy.ndarray, gt_divisor=10, vecs_names=None, labels=None, colours=None, linestyles=None, title=None, type='plot', alphas=None, xtick_labels=None, ax=None, show=True, swapaxes=False)[source]

Scatter plot of a structured array.

source.task_eval.print_brain_scores(brain_scores, tablefmt: str = 'simple', caption='', suffix='', label='')[source]
source.task_eval.print_correlations(scores: numpy.ndarray, name_pairs='gt', common_subset: bool = False, tablefmt: str = 'simple', caption='', label='')[source]
source.task_eval.print_subsampled_correlations(scores: numpy.ndarray, name_pairs='gt', common_subset: bool = False, tablefmt: str = 'simple', caption='', label='', n_sample=3)[source]
source.task_eval.wn_concreteness(word, similarity_fn=<bound method WordNetCorpusReader.path_similarity of <WordNetCorpusReader in '/Users/anitavero/nltk_data/corpora/wordnet'>>)[source]

WordNet distance of a word from its root hypernym.

source.task_eval.wn_concreteness_for_pairs(word_pairs, synset_agg: str, similarity_fn=<bound method WordNetCorpusReader.path_similarity of <WordNetCorpusReader in '/Users/anitavero/nltk_data/corpora/wordnet'>>, pair_score_agg='sum') -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Sort scores by first and second word’s concreteness scores.
param pair_score_agg

‘sum’ adds scores for the two words, ‘diff’ computes their absolute difference.

return (ids, scores)

sorted score indices and concreteness scores.

source.text_process module

class source.text_process.BigramPMIVariants[source]

Bases: nltk.metrics.association.BigramAssocMeasures

classmethod ppmi(*marginals)[source]

Scores ngrams by positive pointwise mutual information.

classmethod w_ppmi(*marginals, alpha=0.75)[source]

Scores ngrams by weighted positive pointwise mutual information.

source.text_process.concatenate_files(data_dir, file_pattern, outfile)[source]

Concatenate files into one big file.

source.text_process.context_pairs(text, contexts_file, lang='english')[source]
Prepare contexts for word2vecf without their context format:

A textual file of word-context pairs. Each pair takes a separate line. The format of a pair is “<word> <context>”, i.e. space delimited, where <word> and <context> are strings. The context is all non-stop words in the same sentence.
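
A minimal usage sketch, assuming text is passed as a raw string and the output file name is a placeholder; with stop words removed, the resulting file would contain pairs such as “quick fox” and “fox quick”:

>>> from source.text_process import context_pairs
>>> context_pairs('The quick fox jumps over the lazy dog.',
...               'data/pairs.contexts', lang='english')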

source.text_process.extract_neighbours(tokens, contexts_file, vocab=[], window=5)[source]
source.text_process.hapax_legomena(text)[source]
Return words that occur only once within a text.
param text

str list or Counter

source.text_process.pmi_for_words(words, finder_file, token_list=None, document_list=None, variants=['pmi'])[source]
Return PMI scores for words in a given tokenized corpus.
param words

string list.

param token_list

string list.

param document_list

list of string lists

source.text_process.text2gensim(text, lang)[source]
Tokenize and filter stop words. Return a list of str lists (the standard gensim format), where each str list is a sentence and each text is a list of these lists.

source.text_process.text2w2vf(corpus_tup, data_dir, window=5, vocab=[], processes=1, merge=False, filename_suffix='')[source]

Prepare contexts for word2vecf using their context format: a textual file of word-context pairs. Each pair takes a separate line. The format of a pair is “<word> <context>”, i.e. space delimited, where <word> and <context> are strings. The context is all non-stop words in the same sentence, or around the token if the text is not sent_tokenized.

param corpus_tup

list with elements of: token (str) list or sentence list (list of str lists)

param data_dir

directory to write context pairs to

param window

Window size for w2v. If 0 and the text is a sentence list, the context of each word is all the other words in the same sentence.

param vocab

list of str, vocab to filter with in extract_neighbours.

source.text_process.tokenize(text, lang)[source]
Lowercase, tokenize, and filter punctuation and stopwords.
param text

str

param lang

{hungarian|english|hunglish}

return

str list iterator
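
A minimal usage sketch; the return value is an iterator, so it is materialised with list() (the expected result is roughly the content words of the sentence, e.g. ‘quick’, ‘fox’, ‘jumps’):

>>> from source.text_process import tokenize
>>> tokens = list(tokenize('The quick fox jumps over the lazy dog.', 'english'))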

source.train_word2vecf module

source.train_word2vecf.train(contexts_file, save_dir, w2v_dir, filename_suffix='', min_count=10, size=300, negative=15, threads=4)[source]

Perform the steps to train word2vecf on a given corpus:

  1. Create word and context vocabularies:

    ./myword2vec/count_and_filter -train dep.contexts -cvocab cv -wvocab wv -min-count 100

This will count the words and contexts in dep.contexts, discard either words or contexts appearing < 100 times, and write the counted words to wv and the counted contexts to cv.

  2. Train the embeddings:

    ./myword2vec/word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10

This will train 200-dim embeddings based on dep.contexts, wv and cv (lines in dep.contexts with word not in wv or context not in cv are ignored).

The -dumpcv flag can be used in order to dump the trained context-vectors as well.

./myword2vec/word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10 -dumpcv dim200context-vecs

  3. Convert the embeddings to numpy-readable format.

source.unsupervised_metrics module

source.unsupervised_metrics.agglomerative_clustering(model, n_clusters=3, linkage='ward')[source]
source.unsupervised_metrics.avg_cluster_wordfrequency(datadir='/Users/anitavero/projects/data/', clmethod='agglomerative')[source]
source.unsupervised_metrics.cluster_eval(vectors, labels)[source]

Unsupervised clustering metrics.

source.unsupervised_metrics.cluster_method_from_filename(fn)[source]
source.unsupervised_metrics.cluster_similarities(order='default', clmethod='agglomerative', plot=True)[source]
source.unsupervised_metrics.cluster_sizes_avgfreq(clusters, cl_freqs, embtype=None, method=None, barfontsize=20, suffix='')[source]

Histogram of cluster sizes.

source.unsupervised_metrics.compute_cluster_similarities(emb_clusters1, emb_clusters2, compare, order, clmethod, plot)[source]
Compute cluster similarities between two embedding cluster structures.
param emb_clusters1

param emb_clusters2

param compare

comparison based on ‘cross’ or ‘dot’ product.

param order

‘clustermap’ or ‘avgfreq’ (average corpus frequency of cluster element words). Default: ‘avgfreq’.

param clmethod

‘kmeans’ or ‘agglomerative’.

param plot

bool. If True, a similarity plot is created.

return

jaccard_similarities: dict {<embedding pair label>: similarity matrix}.

source.unsupervised_metrics.dbscan_clustering(model, eps=0.5, min_samples=90, n_jobs=4)[source]
source.unsupervised_metrics.distances_from_centroids(emb, vocab, label_dict, centroids)[source]
source.unsupervised_metrics.emb_labels(fn)[source]
source.unsupervised_metrics.get_clustering_labels_metrics(vecs_names=[], datadir='/anfs/bigdisc/alv34/wikidump/extracted/models/', savedir='/anfs/bigdisc/alv34/wikidump/extracted/models/results/', cluster_method='kmeans', n_clusters=3, random_state=1, eps=0.5, min_samples=90, workers=4, suffix='', linkage='ward')[source]
source.unsupervised_metrics.get_n_nearest_neighbors(words: numpy.ndarray, E: numpy.ndarray, vocab: numpy.ndarray, n: int = 10)[source]

n nearest neighbors for words based on cosine distance in Embedding E.
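
A minimal sketch with a toy embedding matrix whose rows are aligned with vocab (all values below are made up):

>>> import numpy as np
>>> from source.unsupervised_metrics import get_n_nearest_neighbors
>>> vocab = np.array(['cat', 'dog', 'car', 'bus'])
>>> E = np.random.rand(4, 5)  # 4 words, 5-dimensional vectors
>>> get_n_nearest_neighbors(np.array(['cat']), E, vocab, n=2)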

source.unsupervised_metrics.inspect_clusters(cluster_label_filepath)[source]
Convert a cluster label file containing a {word: label} dict to a {cluster_id: wordlist} dict, ordered by the number of cluster members.

Parameters

cluster_label_filepath – Path to cluster label file.

source.unsupervised_metrics.jaccard_similarity_score(x, y)[source]

Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|
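
For illustration, the same quantity computed on two toy label sets with plain Python set operations (a hypothetical re-statement of the formula, not a call to the function above):

>>> a, b = {'cat', 'dog', 'car'}, {'dog', 'car', 'bus'}
>>> len(a & b) / len(a | b)
0.5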

source.unsupervised_metrics.kmeans(model, n_clusters=3, random_state=1, n_jobs=4)[source]
source.unsupervised_metrics.label_clusters_with_wordnet(depth=3, max_label_num=3)[source]

First max_label_num most common synset names.

source.unsupervised_metrics.n_nearest_neighbors(data_dir, model_name, words=[], n: int = 10)[source]

n nearest neighbors for words based on model <vecs_names>.

source.unsupervised_metrics.order_clusters_by_avgfreq(clusters, datapath, clfile)[source]
source.unsupervised_metrics.order_words_by_centroid_distance(clusters, cluster_label_filepath)[source]

Order words by their distance from the centroid

source.unsupervised_metrics.plot_cluster_results(resdir='/Users/anitavero/projects/data/wikidump/models/')[source]
source.unsupervised_metrics.pmi_comparison(datadir='/Users/anitavero/projects/data/wikidump/models/results/', pmi_th=5, variants='ppmi', format='latex')[source]
source.unsupervised_metrics.print_cluster_results(resdir='/Users/anitavero/projects/data/wikidump/models/')[source]
source.unsupervised_metrics.print_clusters(clusters_WN_filepath, tablefmt, barfontsize=20)[source]
Parameters
  • clusters_WN_filepath

  • tablefmt – printed table format. ‘simple’ - terminal, ‘latex_raw’ - latex table.

  • barfontsize – font size in the figure.

Returns

clusters, printed table

source.unsupervised_metrics.run_clustering(model, cluster_method, n_clusters=3, random_state=1, eps=0.5, min_samples=5, workers=4, linkage='ward')[source]
source.unsupervised_metrics.run_clustering_experiments(datadir='/anfs/bigdisc/alv34/wikidump/extracted/models/', savedir='/anfs/bigdisc/alv34/wikidump/extracted/models/results/', vecs_names=[], mm_embs_of=[], cluster_method='dbscan', n_clusters=-1, random_state=1, eps=0.5, min_samples=90, workers=4, suffix='', linkage='ward')[source]
source.unsupervised_metrics.run_inspect_clusters()[source]
source.unsupervised_metrics.run_print_clusters(barfontsize=25)[source]
source.unsupervised_metrics.save_closest_words_to_centroids()[source]

Save words from each cluster, which are closest to the centroid.

source.unsupervised_metrics.similar_cluster_nums(clmethod='agglomerative')[source]
source.unsupervised_metrics.similarity_clustermap(V, xticks, yticks, title_embs)[source]
source.unsupervised_metrics.similarity_heatmap(V, xticks, yticks, title_embs, order)[source]
source.unsupervised_metrics.synset_closures(word, depth=3, get_names=False)[source]
source.unsupervised_metrics.wn_category(word)[source]

Map a word to categories based on WordNet closures.

source.unsupervised_metrics.wn_label_for_words(words, depth=3)[source]

source.utils module

source.utils.create_dir(directory)[source]
source.utils.dict2struct_array(d)[source]

Convert dict to structured array.

source.utils.get_file_name(path)[source]
source.utils.get_vec(word, embeddings, vocab)[source]
source.utils.hr_time(time, round_n=2)[source]

Human readable time.

source.utils.join_struct_arrays(arrays)[source]
source.utils.latex_table_post_process(table, bottomrule_row_ids: List[int] = [], title='', fit_to_page=False, label='')[source]
Add separator lines and align width to page.
param bottomrule_row_ids

Row indices (without header) below which we put a separator line.

source.utils.latex_table_wrapper(table, title, fit_to_page, label)[source]
source.utils.pfont(fonts: List[str], value: str, format)[source]
Wrap string in font code.
param format

PrintFont or LaTeXFont

param fonts

list of font names, e.g. [‘red’, ‘bold’]

param value

string to wrap in font

source.utils.pkl2json(pkl_file, savedir)[source]
source.utils.read_jl(path)[source]
source.utils.suffixate(s)[source]
source.utils.tuple_list(arg)[source]
List[Tuple[str]] argument type.

Format: whitespace-separated str lists, separated by ‘|’, e.g. ‘embs1 embs2 | embs2 embs3 embs4’
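
A minimal usage sketch; given the format above, the call below should yield something like [('embs1', 'embs2'), ('embs2', 'embs3', 'embs4')]:

>>> from source.utils import tuple_list
>>> tuple_list('embs1 embs2 | embs2 embs3 embs4')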

source.vecs2nps module

Script to create vecs.npy and vecs.vocab from files with the following format: <row_num> <dim> <word_1> <vector_1> … <word_n> <vector_n>

source.vecs2nps.main(input_file, output_file)[source]

source.visualise module

source.visualise.tensorboard_emb(data_dir, model_name, output_path, tn_label='wn_clusters', label_name='clusters')[source]

Visualise embeddings using TensorBoard. Code from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
param model_name

name of numpy array files: embedding (.npy) and vocab (.vocab)

param output_path

str, directory

param tn_label

label dictionary file path or options: {“wn_clusters”, “None”}

param label_name

str, title for the labeling (e.g. Cluster)

Usage on a remote server with port forwarding:
  • When you ssh into the machine, use the -L option to forward port 6006 of the remote server to, for instance, port 16006 of your local machine:

  • ssh -L 16006:127.0.0.1:6006 alv34@yellowhammer. Everything on port 6006 of the server (127.0.0.1:6006) will then be forwarded to your local machine on port 16006.

  • You can then launch TensorBoard on the remote machine using a standard tensorboard --logdir log with the default 6006 port.

  • On your local machine, go to http://127.0.0.1:16006 and enjoy your remote TensorBoard.

source.visutils module

source.visutils.crop_bbox(image, x, y, w, h)[source]
Crop a bounding box out of an image.
param image

PIL Image

param x, y, w, h

left, upper, right, lower coordinates

return

PIL Image

source.visutils.save_crop(image, x, y, w, h, fname, savedir, skip_existing=True)[source]