Month: April 2018

Separate large pure text stream into HDF5 format

    '''
    Separate text stream into hdf5 array format for far more efficient processing
    Memory-efficient

    Example:
    $ python3 text_to_hdf5.py /tmp/wikicomp-2014_arko.xml.bz2 "<articlePair id" --split-token-end "</articlePair>"
    '''
    import os
    import psutil
    import sys
    import h5py
    import gzip
    import bz2
    import logging
    import argparse

    process = psutil.Process(os.getpid())
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
    ...
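The excerpt stops before the splitting logic, so the following is only a minimal sketch of the idea, not the post's actual code. It assumes records are delimited line by line, from a line containing the start token to a line containing the end token, and that each record is stored as a variable-length UTF-8 string. The names `split_records`, `write_hdf5`, the dataset name `records`, and the `batch_size` parameter are hypothetical.

```python
import bz2


def split_records(lines, start_token, end_token):
    """Yield one string per record, from a line containing start_token
    through the next line containing end_token (line-based assumption)."""
    buf = []
    inside = False
    for line in lines:
        if not inside and start_token in line:
            inside = True
        if inside:
            buf.append(line)
            if end_token in line:
                yield "".join(buf)
                buf, inside = [], False


def _append(ds, batch):
    """Grow a 1-D resizable dataset and write the batch at its end."""
    n = ds.shape[0]
    ds.resize((n + len(batch),))
    ds[n:] = batch


def write_hdf5(records, out_path, batch_size=1000):
    """Write records in batches so only batch_size records sit in memory."""
    import h5py  # third-party; imported here so the splitter works without it

    dt = h5py.string_dtype(encoding="utf-8")
    with h5py.File(out_path, "w") as f:
        # "records" is a hypothetical dataset name
        ds = f.create_dataset("records", shape=(0,), maxshape=(None,),
                              dtype=dt, chunks=True)
        batch = []
        for rec in records:
            batch.append(rec)
            if len(batch) >= batch_size:
                _append(ds, batch)
                batch = []
        if batch:
            _append(ds, batch)
```

With the file from the example command, this could be driven by opening it with `bz2.open(path, "rt", encoding="utf-8")` and passing `split_records(fh, "<articlePair id", "</articlePair>")` to `write_hdf5`. Because both the reader and the splitter are lazy, memory use is bounded by the batch size rather than by the size of the input.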

Separate large JSON text streams into HDF5 format

    '''
    Separate json stream(s) into hdf5 array format for far more efficient processing

    Example:
    $ python3 json_to_hdf5.py /tmp/jsonfile1.json.gz /tmp/jsonfile2.json.gz /tmp/jsonfile3.json.gz
    '''
    import os
    import json
    import psutil
    import sys
    import h5py
    import gzip
    import bz2
    import logging
    import argparse

    process = psutil.Process(os.getpid())
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
    logger ...
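Since the excerpt is cut off before the conversion itself, here is a hedged sketch of one way several JSON streams could be merged into a single HDF5 array. It assumes each input holds one JSON document per line (the post does not say so), chooses the decompressor by file extension, and silently drops lines that fail to parse. `open_text`, `iter_json_records`, `write_hdf5`, and the dataset name `json` are hypothetical names.

```python
import bz2
import gzip
import json


def open_text(path):
    """Open a plain, .gz, or .bz2 file as UTF-8 text, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def iter_json_records(paths):
    """Yield one compact JSON string per valid, non-empty input line."""
    for path in paths:
        with open_text(path) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue  # assumption: malformed lines are skipped
                yield json.dumps(obj, ensure_ascii=False,
                                 separators=(",", ":"))


def write_hdf5(records, out_path, batch_size=10000):
    """Append records to a resizable variable-length string dataset."""
    import h5py  # third-party; imported lazily so the reader works alone

    dt = h5py.string_dtype(encoding="utf-8")
    with h5py.File(out_path, "w") as f:
        # "json" is a hypothetical dataset name
        ds = f.create_dataset("json", shape=(0,), maxshape=(None,),
                              dtype=dt, chunks=True)
        batch = []
        for rec in records:
            batch.append(rec)
            if len(batch) >= batch_size:
                n = ds.shape[0]
                ds.resize((n + len(batch),))
                ds[n:] = batch
                batch = []
        if batch:
            n = ds.shape[0]
            ds.resize((n + len(batch),))
            ds[n:] = batch
```

Calling `write_hdf5(iter_json_records(paths), "out.h5")` would then store every record as a string that can be read back by index and parsed with `json.loads` on demand, without rescanning the compressed inputs.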

Fix broken Korean filenames

Fix broken Korean filenames (e.g., Áö´ÉÇüÇüżҺм®±â.zip -> 지능형형태소분석기.zip)
Fixes all filenames in the current directory

    #!/usr/bin/python3
    from os import listdir, rename
    from os.path import isfile, isdir, join, exists

    mypath = '.'
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f)) or isdir(join(mypath, f))]
    for f in onlyfiles:
        origF = f
        ...
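The excerpt ends before the renaming step, so the repair rule the post uses is not visible. A common cause of this kind of breakage is CP949/EUC-KR bytes being decoded as Latin-1; the sketch below assumes exactly that and reverses it. It will not repair names damaged in other ways. `fix_name`, `fix_directory`, and the `dry_run` flag are hypothetical names.

```python
import os


def fix_name(name, wrong="latin-1", right="cp949"):
    """Re-encode with the wrongly applied codec, then decode as Korean.
    Names that do not fit this pattern are returned unchanged."""
    try:
        return name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


def fix_directory(path=".", dry_run=True):
    """Return (old, new) pairs; rename on disk only when dry_run is False."""
    renamed = []
    for old in os.listdir(path):
        new = fix_name(old)
        if new == old:
            continue
        src = os.path.join(path, old)
        dst = os.path.join(path, new)
        if os.path.exists(dst):
            continue  # never overwrite an existing file or directory
        if not dry_run:
            os.rename(src, dst)
        renamed.append((old, new))
    return renamed
```

Running `fix_directory(".")` first with the default `dry_run=True` lists the planned renames without touching anything, which is a cheap safeguard because a wrong guess about the original encoding would otherwise replace one broken name with another.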