Scripts

Below are various code samples and other goodies. Not all of the code is production-ready; some scripts may need small environment tweaks or changes to hard-coded values before they run. The scripts on GitHub should be cleaner.

My GitHub

ash-parser: SyntaxNet in pure Python with GPU support

nn-nlp-skeleton: Foundational code for NLP and Neural Network training tasks

nmt: Add GPU support to encoder portion of TensorFlow NMT sample code

mt7610u-linksys-ae6000-wifi-fixes: Update of MT7610U driver for modern Linux kernels

NLP-Related

Fix broken Korean filenames (e.g., Áö´ÉÇüÇüÅÂ¼ÒºÐ¼®±â.zip -> 지능형형태소분석기.zip)

Force-convert all legacy-encoded Korean EUC-KR files to UTF-8 and save them in a separate folder (WARNING: destructive)

Auto-detect different Korean encodings in current directory (UTF-8, CP949, EUC-KR)

Visualize Gensim word2vec model in TensorBoard

Compress large JSON text streams into one indexed zip file for efficiency

Separate large JSON text streams into HDF5 format

Separate large pure text stream into HDF5 format

Preprocess, normalize, and tokenize dirty Unicode input text

Align TED English and Korean XML corpora (requires preprocessing/normalizing/tokenizing code)

Convert Sejong POS-tagged corpus format to CoNLL-U format

Arc projectivize and well-formed filters for dependency parsing in pure Python (translated from SyntaxNet code)

CoNLL file parsing module

Train word embeddings from CoNLL corpus file

Convert huge SQL files to CSV files (experimental)
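
For the Korean filename and encoding items above, here is a minimal Python sketch of the general idea. It is not the code behind the linked scripts. The function names fix_mojibake and detect_korean_encoding are hypothetical. The sketch assumes the broken names came from CP949 bytes that were wrongly decoded as Latin-1, and it guesses an encoding simply by trying strict decodes from the narrowest codec to the widest.

```python
from typing import Optional


def fix_mojibake(name: str, wrong: str = "latin-1", right: str = "cp949") -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them correctly.

    Assumption: the name was produced by decoding CP949 bytes as Latin-1.
    Raises UnicodeError if the name does not fit that assumption.
    """
    return name.encode(wrong).decode(right)


def detect_korean_encoding(data: bytes) -> Optional[str]:
    """Return the first codec that decodes the data without errors, else None.

    The order matters: CP949 is a superset of EUC-KR, so it is tried last.
    Pure ASCII input is reported as UTF-8.
    """
    for codec in ("utf-8", "euc_kr", "cp949"):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        return codec
    return None


if __name__ == "__main__":
    original = "지능형형태소분석기.zip"
    # Simulate the breakage, then repair it.
    garbled = original.encode("cp949").decode("latin-1")
    print(garbled, "->", fix_mojibake(garbled))
    print(detect_korean_encoding(original.encode("euc_kr")))
```

A strict-decode check like this is only a heuristic: short or ASCII-heavy files can decode cleanly under more than one codec, so treat the result as a best guess and keep backups before renaming or converting anything.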

Miscellaneous

Disable light on ASUS ROG STRIX IMPACT mouse under Linux