Using TensorBoard Projector with Non-TensorFlow Data

TensorBoard lets us visualize a TensorFlow model, and quite possibly its most interesting feature is embedding visualization. Unfortunately, all the current tutorials require your data to be in TensorFlow tensor format. I’m going to show you how to visualize external vector data using the TensorBoard projector.

Actually, the TensorBoard projector is available standalone as well.

Including it on your website is as simple as copying the source code of this page, adjusting some paths, and converting your word vectors to its simple data format.

Typically there are two files used per model:

  • tensors.bytes: for storing raw float32 bytes
  • labels.tsv: for storing labels of the data (can be multiple columns)

There are other possible files such as bookmarks as well, but I have not experimented with these yet.
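To make the two-file layout concrete, here is a minimal sketch (the file names and the tiny 3x2 embedding are made up for illustration) that writes a matching pair of files with NumPy:

import numpy as np

# hypothetical toy data: 3 points in 2 dimensions
vectors = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]], dtype='<f4')  # little-endian float32
labels = ['apple', 'banana', 'cherry']

# tensors.bytes: the raw matrix, row by row, with no header of any kind
vectors.tofile('toy_tensors.bytes')

# labels.tsv: one row per data point (single column, so no header -- see below)
with open('toy_labels.tsv', 'w') as f:
    f.write('\n'.join(labels) + '\n')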

Projector demo-mode global configuration file:

  • oss_demo_projector_config.json: used to populate the listbox of all possible models

If we look at the source code for how the tensors are read, we can determine that the tensors are stored as a flat series of little-endian float32 values, row by row.
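You can verify this by reading a .bytes file back with NumPy (the file name and shape here refer to the toy example above):

import numpy as np

# '<f4' = little-endian float32; reshape to (num_rows, dim)
vecs = np.fromfile('toy_tensors.bytes', dtype='<f4').reshape(3, 2)
print(vecs)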

Also, the metadata TSV file needs to be formatted as described on the TensorBoard Embedding Visualization page: it is tab-delimited (TSV), with no header row if there is a single column of data, and a header row if there are multiple columns.
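For example (with toy values), a single-column labels file is just one label per line:

apple
banana
cherry

while a multi-column file needs a tab-separated header row:

word	count
apple	42
banana	17
cherry	8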

Converting your word vectors

Gensim Binary Format

Please refer to this sample code to read a gensim word2vec/doc2vec file and convert it to Google format. Because Python2-created word2vec models don’t load in Python3, this script was made to be compatible with both Python2 and Python3 (hence the messy try/except blocks). Adjust as necessary. The output labels include the word and its frequency.

'''
Output Little-Endian bytes and labels file from gensim model
Also outputs necessary json config file portion
For use with TensorBoard
'''

import struct
from gensim.models import Word2Vec, Doc2Vec

model = Word2Vec.load('/mnt/deeplearn/corpus/wordvec')
# have to use python2 for some old models
#model = Doc2Vec.load('/mnt/deeplearn/corpus/wiki/enwiki_dbow/doc2vec.bin')

# note: newer gensim versions keep the vocabulary under model.wv
# (model.wv.vocab, or model.wv.key_to_index in gensim 4+)
num_rows = len(model.vocab)
dim = model.vector_size

tensor_out_fn = 'enwiki_wordvec_%d_%dd_tensors.bytes' % (num_rows, dim)
labels_out_fn = 'enwiki_wordvec_%d_%dd_labels.tsv' % (num_rows, dim)

tensor_out = open(tensor_out_fn, 'wb')

# Python3 open() takes an encoding argument; Python2's does not,
# so fall back to a plain file and encode manually below
try:
    labels_out = open(labels_out_fn, 'w', encoding='utf-8')
except TypeError:
    labels_out = open(labels_out_fn, 'w')

labels_out.write('word\tcount\n')

for wd in model.vocab:
    floatvals = model[wd].tolist()
    assert dim == len(floatvals)
    # a tab character in the word would corrupt the TSV labels file
    assert '\t' not in wd

    # '<f' = little-endian float32, the layout the projector expects
    for f in floatvals:
        tensor_out.write(struct.pack('<f', f))

    try:
        labels_out.write('%s\t%s\n' % (wd, model.vocab[wd].count))
    except UnicodeEncodeError:
        # Python2 file objects need the unicode string encoded manually
        labels_out.write(('%s\t%s\n' % (wd, model.vocab[wd].count)).encode('utf-8'))

tensor_out.close()
labels_out.close()

print('''{
  "embeddings": [
    {
      "tensorName": "EnWiki WordVec",
      "tensorShape": [%d, %d],
      "tensorPath": "%s",
      "metadataPath": "%s"
    }
  ],
  "modelCheckpointPath": "Demo datasets"
}''' % (num_rows, dim, tensor_out_fn, labels_out_fn))
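As a quick sanity check (my addition, appended to the script above), the .bytes file size is fully determined by the shape:

import os

# each value is a 4-byte little-endian float32
assert os.path.getsize(tensor_out_fn) == num_rows * dim * 4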


Mikolov Text Format

The original word2vec C tool’s text output looks like the following:

3640 200
the -0.204804 -0.052040 0.043982 -0.004476 0.047488 -0.117330 0.098833 0.057791 0.034016 -0.029288 0.033779 0.019669 0.039926 0.013505 -0.028944 -0.046688 0.051126 -0.179796 -0.038449 0.105646 -0.062454 0.159780 0.098374 -0.066799 0.043988 0.078421 -0.023859 -0.064381 -0.049186 0.098745 -0.065501 -0.064357 -0.033404 -0.075993 -0.155364 -0.141063 -0.112118 -0.083261 -0.042881 -0.161020 -0.065910 -0.006419 0.122346 0.029589 0.005771 0.034997 0.132771 0.133731 -0.236707 0.003596 0.065436 -0.029172 0.092861 0.000899 -0.044415 0.044571 -0.130939 -0.009306 -0.013035 0.041786 -0.140875 0.068760 0.015598 -0.159848 -0.002214 0.091743 0.160617 -0.062522 -0.154163 0.051791 -0.017152 -0.072281 -0.004665 0.098094 -0.082602 0.129818 -0.034611 0.095771 -0.158212 -0.118649 -0.018298 0.057409 0.035082 0.101609 0.137820 0.119146 0.022341 0.088665 0.063133 -0.093622 -0.199239 0.090779 0.029341 0.339419 0.036050 0.055952 0.001389 -0.120094 0.040512 0.008463 0.019696 -0.016201 -0.046048 -0.041638 0.042387 0.143988 0.032240 0.026590 -0.019424 0.252759 -0.059701 0.102749 -0.020249 0.017367 -0.052863 0.016690 0.051070 0.092457 -0.022582 -0.071467 -0.055920 0.102884 -0.032308 0.003930 0.113994 -0.089827 -0.047654 -0.019138 0.112558 0.003431 -0.017003 -0.026772 -0.143089 -0.029886 -0.012663 0.249865 0.031429 0.053664 0.082787 0.022943 0.017508 0.066460 -0.099233 0.022793 0.198909 -0.023253 0.037466 0.040887 -0.021186 -0.094322 -0.034299 0.029712 0.077449 0.081060 -0.154708 0.037450 -0.108886 0.121766 -0.087976 -0.079284 -0.089787 -0.103210 -0.011010 -0.112984 0.060790 0.226235 -0.139126 -0.009340 0.156620 -0.035243 -0.043227 0.065795 0.131128 -0.071456 0.010347 0.078871 0.013810 -0.150217 -0.040919 -0.026854 0.037979 0.092218 -0.074310 -0.024670 -0.057915 -0.048576 0.013131 0.127952 0.036090 -0.167510 -0.175556 -0.051837 -0.105541 -0.025793 0.066858 -0.094733 -0.029453 0.099344 -0.205130 0.075324

The first line is <number of words> <number of dimensions per word>; each following line is one word and its vector, so the number of words corresponds to the number of remaining lines in the file.

To convert it to Google format, we can use the following script:

'''
Output Little-Endian bytes and labels file from Mikolov-style text-output word2vec model
Also outputs necessary json config file portion
For use with TensorBoard
'''

import struct

input_file = open('/home/andy/Downloads/XlingualEmb-master/en.it.word.emb', 'r', encoding='utf-8')

input_string = input_file.read()
input_file.close()

ln_num = 0

num_rows = None
dim = None

for ln in input_string.split('\n'):
    ln = ln.strip()
    if not ln:
        continue
    ln_num += 1

    if ln_num == 1:
        num_rows = int(ln.split()[0])
        dim = int(ln.split()[1])

        tensor_out_fn = 'enit_xling_%d_%dd_tensors.bytes' % (num_rows, dim)
        labels_out_fn = 'enit_xling_%d_%dd_labels.tsv' % (num_rows, dim)

        tensor_out = open(tensor_out_fn, 'wb')
        labels_out = open(labels_out_fn, 'w', encoding='utf-8')
        # single-column metadata (just the word itself) -- TensorBoard
        # requires omitting the header in that case, and Mikolov-format
        # files don't include word counts, so there is only one column
        #labels_out.write('word\n')

    else:
        # split off the last `dim` whitespace-separated tokens;
        # whatever is left at index 0 is the word (it may contain spaces)
        parts = ln.rsplit(None, dim)
        word = parts[0]
        floatvals = [float(x) for x in parts[1:]]
        assert len(floatvals) == dim
        # disabled in case a word contains a tab
        #assert '\t' not in word

        for f in floatvals:
            tensor_out.write(struct.pack('<f', f))

        labels_out.write('%s\n' % (word))

tensor_out.close()
labels_out.close()

print('''{
  "embeddings": [
    {
      "tensorName": "En-It XLing WordVec",
      "tensorShape": [%d, %d],
      "tensorPath": "%s",
      "metadataPath": "%s"
    }
  ],
  "modelCheckpointPath": "Demo datasets"
}''' % (num_rows, dim, tensor_out_fn, labels_out_fn))

# line num should be number of rows plus header
assert ln_num == num_rows+1
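For extra assurance beyond the line count, a quick round-trip check (my sketch; it assumes the first data line immediately follows the header) reads the bytes back and compares the first vector:

import numpy as np

# read the matrix back and spot-check row 0 against the original text
vecs = np.fromfile(tensor_out_fn, dtype='<f4').reshape(num_rows, dim)
first_data_line = input_string.split('\n')[1].strip()
expected = [float(x) for x in first_data_line.rsplit(None, dim)[1:]]
assert np.allclose(vecs[0], expected)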

Configuring Projector

The scripts above also print a sample stanza to place in the configuration file.

Here’s the configuration file I use.

{
  "embeddings": [
    {
      "tensorName": "KoWiki Word2Vec",
      "tensorShape": [424786, 100],
      "tensorPath": "oss_data/kowiki_wordvec_424786_100d_tensors.bytes",
      "metadataPath": "oss_data/kowiki_wordvec_424786_100d_labels.tsv"
    },
    {
      "tensorName": "Word2Vec 10K",
      "tensorShape": [10000, 200],
      "tensorPath": "oss_data/word2vec_10000_200d_tensors.bytes",
      "metadataPath": "oss_data/word2vec_10000_200d_labels.tsv"
    },
    {
      "tensorName": "Word2Vec All",
      "tensorShape": [71291, 200],
      "tensorPath": "oss_data/word2vec_full_200d_tensors.bytes",
      "metadataPath": "oss_data/word2vec_full_200d_labels.tsv",
      "bookmarksPath": "oss_data/word2vec_full_bookmarks.txt"
    },
    {
      "tensorName": "Mnist with images",
      "tensorShape": [10000, 784],
      "tensorPath": "oss_data/mnist_10k_784d_tensors.bytes",
      "metadataPath": "oss_data/mnist_10k_784d_labels.tsv",
      "sprite": {
        "imagePath": "oss_data/mnist_10k_sprite.png",
        "singleImageDim": [28, 28]
      }
    },
    {
      "tensorName": "Iris",
      "tensorShape": [150, 4],
      "tensorPath": "oss_data/iris_tensors.bytes",
      "metadataPath": "oss_data/iris_labels.tsv"
    }
  ],
  "modelCheckpointPath": "Demo datasets"
}

(I am not sure what the final modelCheckpointPath entry is used for.)

If you still can’t figure out how the paths are set up, feel free to use the Chrome Web Inspector to examine how my sample site works; you can also use it to debug path issues on your own site. For reference, my directory structure is as follows (under the Apache www area):

projector$ find
.
./oss_demo_bin.js
./oss_data
./oss_data/mnist_10k_784d_tensors.bytes
./oss_data/iris_tensors.bytes
./oss_data/mnist_10k_784d_labels.tsv
./oss_data/oss_demo_projector_config.json
./oss_data/kowiki_wordvec_424786_100d_tensors.bytes
./oss_data/word2vec_10000_200d_labels.tsv
./oss_data/word2vec_full_bookmarks.txt
./oss_data/iris_labels.tsv
./oss_data/word2vec_10000_200d_tensors.bytes
./oss_data/kowiki_wordvec_424786_100d_labels.tsv
./oss_data/mnist_10k_sprite.png
./index.html
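A quick way to catch path mistakes (a sketch of my own, assuming you run it from the projector directory shown above) is to check that every file the config references actually exists:

import json, os

with open('oss_data/oss_demo_projector_config.json') as f:
    config = json.load(f)

for emb in config['embeddings']:
    for key in ('tensorPath', 'metadataPath', 'bookmarksPath'):
        if key in emb:
            status = 'OK' if os.path.exists(emb[key]) else 'MISSING'
            print(emb['tensorName'], key, emb[key], status)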

Inspector: (screenshot omitted)

Basically, just copy the .bytes and .tsv files and adjust the configuration file as necessary. Here’s a working example.


Sorry, but due to stringent bandwidth limitations I could only leave the Iris example up.
