Separate large pure text stream into HDF5 format
”’ Separate text stream into hdf5 array format for far more efficient processing Memory-efficient Example: $ python3 text_to_hdf5.py /tmp/wikicomp-2014_arko.xml.bz2 “<articlePair id” –split-token-end “</articlePair>” ”’ import os import psutil import sys import h5py import gzip import bz2 import logging import argparse process = psutil.Process(os.getpid()) logging.basicConfig(format=’%(asctime)s : %(levelname)s : %(message)s’, \ level=logging.INFO) Read More