Day: April 21, 2018

Convert huge SQL files to CSV files (experimental)

Experimental code to convert humongous compressed sql.gz files full of SQL commands into usable CSV files, memory-efficiently. The script uses gzip, html, sys, string, unicodedata, pickle, and html.parser's HTMLParser. … Read More
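The full script is behind the link, but the streaming idea can be sketched in a few lines: read the gzipped dump line by line so memory use stays flat, pull the value tuples out of each INSERT statement, and write them as CSV. Everything below (the regex, the command-line arguments, the assumption that "),(" never occurs inside a quoted value) is a simplified stand-in for the post's code, not the code itself.

#!/usr/bin/python3
'''
Minimal sketch, not the post's code: stream a gzipped MySQL dump and
emit one table's rows as CSV with flat memory use.
'''
import csv
import gzip
import re
import sys

INSERT_RE = re.compile(r"^INSERT INTO `?(\w+)`?.*?VALUES\s*\((.*)\);")

def rows_from_dump(path, table):
    '''Yield value tuples of `table`, one INSERT statement at a time.'''
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            m = INSERT_RE.match(line)
            if not m or m.group(1) != table:
                continue
            # Naive tuple split; real dumps need a proper SQL tokenizer.
            for tup in m.group(2).split('),('):
                yield next(csv.reader([tup], quotechar="'"))

if __name__ == '__main__':
    # Usage: ./sql_to_csv.py dump.sql.gz tablename > table.csv
    csv.writer(sys.stdout).writerows(rows_from_dump(sys.argv[1], sys.argv[2]))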

CoNLL File Parsing Module

Module to help parse CoNLL files (including optional special functionality for handling Korean characters), for use with POS tagging and dependency parsing. Depends on the well-formed/projectivize filters. The conll_utils module provides a set of classes to handle input and output of CoNLL-U files (http://universaldependencies.org/docs/format.html); the Parsed* classes are useful for storing extra properties needed during … Read More
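As a rough illustration of the kind of record such a module handles, here is a minimal CoNLL-U reader. The field names come from the CoNLL-U spec; the function itself and its skipping rules are a simplified sketch, not conll_utils.

CONLLU_FIELDS = ('ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
                 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC')

def read_conllu(path):
    '''Yield sentences as lists of {field: value} dicts.'''
    sent = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:                    # blank line ends a sentence
                if sent:
                    yield sent
                sent = []
            elif not line.startswith('#'):  # skip comment lines
                cols = line.split('\t')
                # Skip multiword-token ranges (1-2) and empty nodes (1.1).
                if '-' not in cols[0] and '.' not in cols[0]:
                    sent.append(dict(zip(CONLLU_FIELDS, cols)))
    if sent:
        yield sent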

Arc projectivize and well-formed filters for dependency parsing in pure Python (translated from SyntaxNet code)

Projectivize and well-formed filters translated to pure Python from the SyntaxNet code (document_filters.cc). The 'sentence' object should be similar to that in SyntaxNet, with HEAD and other properties coming from the CoNLL file; see the CoNLL File Parsing Module for the definition of the 'sentence' structure. The projectivize filter checks whether the given sentence is projective … Read More
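The core projectivity test is compact enough to sketch here. A tree is projective iff, for every arc, each word strictly between head and dependent is a descendant of that head. The version below is my own minimal rendering of that definition, not the post's translation of document_filters.cc.

def is_projective(heads):
    '''heads[i] is the 1-based head of token i+1, with 0 for the root
    (the CoNLL-U HEAD column). Assumes heads encode a valid tree.
    Returns True iff no arc crosses another.'''
    n = len(heads)
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        for k in range(lo + 1, hi):
            node = k
            while node not in (0, head):  # climb k's ancestor chain
                node = heads[node - 1]
            if node != head:              # reached root without passing head
                return False
    return True

assert is_projective([2, 0, 2])          # no crossing arcs
assert not is_projective([0, 3, 1, 2])   # arc 2->4 crosses arc 1->3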

Convert Sejong POS-tagged corpus format to CoNLL-U format

Convert Sejong POS-tagged corpus format to CoNLL-U format (http://universaldependencies.org/docs/format.html), useful for training Google SyntaxNet. Outputs training, testing, and tuning sets (60-20-20 ratio, randomly chosen). Usage: ./corpus_to_conll.py <sejong_corpus_dir> <output_dir>, where sejong_corpus_dir should … Read More
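The 60-20-20 split reduces to a shuffled slice; a minimal sketch follows. The sentence-level granularity and the fixed seed are my assumptions, not taken from the post.

import random

def split_60_20_20(sentences, seed=0):
    '''Shuffle once, then slice into train/test/tune at the 60%/80% marks.'''
    items = list(sentences)
    random.Random(seed).shuffle(items)
    a, b = int(len(items) * 0.6), int(len(items) * 0.8)
    return items[:a], items[a:b], items[b:]

train, test, tune = split_60_20_20(range(10))
assert (len(train), len(test), len(tune)) == (6, 2, 2)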

Align TED English and Korean xml corpuses (requires preprocessing/normalizing/tokenizing code)

This is a script for aligning the English and Korean TED XML corpuses obtainable at https://wit3.fbk.eu/mono.php?release=XML_releases&tinfo=cleanedhtml_ted. It also outputs vocab files for the TensorFlow NMT training example, and depends on the preprocessing/normalizing/tokenizing code (normalizer.Normalizer, tokenizer.Tokenizer). The script defines a character threshold for pairing … Read More
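The pairing step might look like the sketch below. The XML layout (<file id=...> elements with a <content> child) and the ratio-based character threshold are assumptions about the WIT3 release and the post's truncated "# character threshold" comment, not the post's actual code.

import xml.etree.ElementTree as ET

def talks(path):
    '''Map talk id -> transcript text for one monolingual WIT3 XML file.
    Assumes <file id=...><content>...</content></file> elements.'''
    return {f.get('id'): (f.findtext('content') or '').strip()
            for f in ET.parse(path).getroot().iter('file')}

def align(en_path, ko_path, max_ratio=9.0):
    '''Pair talks present in both languages, dropping pairs whose
    character-length ratio suggests a bad transcript.'''
    en, ko = talks(en_path), talks(ko_path)
    for tid in sorted(en.keys() & ko.keys()):
        e, k = en[tid], ko[tid]
        if e and k and max(len(e), len(k)) <= max_ratio * min(len(e), len(k)):
            yield e, k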