Day: April 21, 2018

Convert huge SQL files to CSV files (experimental)

Experimental code to convert humongous compressed sql.gz files full of SQL commands into usable CSV files, memory-efficiently. The script uses gzip, html, sys, string, unicodedata, pickle, and html.parser's HTMLParser. … Read More
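The full script is behind the link, but the streaming idea can be sketched in a few lines: read the gzipped dump line by line so memory use stays flat, pull the value tuples out of each INSERT statement, and write them as CSV. Everything below (the regex, the command-line arguments, the assumption that "),(" never occurs inside a quoted value) is a simplified stand-in for the post's code, not the code itself.

#!/usr/bin/python3
'''
Minimal sketch, not the post's code: stream a gzipped MySQL dump and
emit one table's rows as CSV with flat memory use.
'''
import csv
import gzip
import re
import sys

INSERT_RE = re.compile(r"^INSERT INTO `?(\w+)`?.*?VALUES\s*\((.*)\);")

def rows_from_dump(path, table):
    '''Yield value tuples of `table`, one INSERT statement at a time.'''
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            m = INSERT_RE.match(line)
            if not m or m.group(1) != table:
                continue
            # Naive tuple split; real dumps need a proper SQL tokenizer.
            for tup in m.group(2).split('),('):
                yield next(csv.reader([tup], quotechar="'"))

if __name__ == '__main__':
    # Usage: ./sql_to_csv.py dump.sql.gz tablename > table.csv
    csv.writer(sys.stdout).writerows(rows_from_dump(sys.argv[1], sys.argv[2]))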

CoNLL File Parsing Module

Module to help parse CoNLL files (including optional special functionality for handling Korean characters), for use with POS tagging and dependency parsing. Depends on the well-formed/projectivize filters. The conll_utils module provides a set of classes to handle input and output of CoNLL-U files (http://universaldependencies.org/docs/format.html); the Parsed* classes are useful for storing extra properties needed during … Read More
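As a rough illustration of the kind of record such a module handles, here is a minimal CoNLL-U reader. The field names come from the CoNLL-U spec; the function itself and its skipping rules are a simplified sketch, not conll_utils.

CONLLU_FIELDS = ('ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
                 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC')

def read_conllu(path):
    '''Yield sentences as lists of {field: value} dicts.'''
    sent = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:                    # blank line ends a sentence
                if sent:
                    yield sent
                sent = []
            elif not line.startswith('#'):  # skip comment lines
                cols = line.split('\t')
                # Skip multiword-token ranges (1-2) and empty nodes (1.1).
                if '-' not in cols[0] and '.' not in cols[0]:
                    sent.append(dict(zip(CONLLU_FIELDS, cols)))
    if sent:
        yield sent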

Arc projectivize and well-formed filters for dependency parsing in pure Python (translated from SyntaxNet code)

Projectivize and well-formed filters translated to pure Python from the SyntaxNet code (document_filters.cc). The 'sentence' object should be similar to that in SyntaxNet, with HEAD and other properties coming from the CoNLL file; see the CoNLL File Parsing Module for the definition of the 'sentence' structure. The projectivize filter checks whether the given sentence is projective … Read More
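The core projectivity test is compact enough to sketch here. A tree is projective iff, for every arc, each word strictly between head and dependent is a descendant of that head. The version below is my own minimal rendering of that definition, not the post's translation of document_filters.cc.

def is_projective(heads):
    '''heads[i] is the 1-based head of token i+1, with 0 for the root
    (the CoNLL-U HEAD column). Assumes heads encode a valid tree.
    Returns True iff no arc crosses another.'''
    n = len(heads)
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        for k in range(lo + 1, hi):
            node = k
            while node not in (0, head):  # climb k's ancestor chain
                node = heads[node - 1]
            if node != head:              # reached root without passing head
                return False
    return True

assert is_projective([2, 0, 2])          # no crossing arcs
assert not is_projective([0, 3, 1, 2])   # arc 2->4 crosses arc 1->3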

Convert Sejong POS-tagged corpus format to CoNLL-U format

Convert Sejong POS-tagged corpus format to CoNLL-U format (http://universaldependencies.org/docs/format.html), useful for training Google SyntaxNet. Outputs training, testing, and tuning sets (60-20-20 ratio, randomly chosen). Usage: ./corpus_to_conll.py <sejong_corpus_dir> <output_dir>, where sejong_corpus_dir should … Read More
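The 60-20-20 split reduces to a shuffled slice; a minimal sketch follows. The sentence-level granularity and the fixed seed are my assumptions, not taken from the post.

import random

def split_60_20_20(sentences, seed=0):
    '''Shuffle once, then slice into train/test/tune at the 60%/80% marks.'''
    items = list(sentences)
    random.Random(seed).shuffle(items)
    a, b = int(len(items) * 0.6), int(len(items) * 0.8)
    return items[:a], items[a:b], items[b:]

train, test, tune = split_60_20_20(range(10))
assert (len(train), len(test), len(tune)) == (6, 2, 2)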

Align TED English and Korean xml corpuses (requires preprocessing/normalizing/tokenizing code)

This is a script for aligning the English and Korean TED XML corpuses obtainable at https://wit3.fbk.eu/mono.php?release=XML_releases&tinfo=cleanedhtml_ted. It also outputs vocab files for the TensorFlow NMT training example, and depends on the preprocessing/normalizing/tokenizing code (normalizer.Normalizer, tokenizer.Tokenizer). The script defines a character threshold for pairing … Read More
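The pairing step might look like the sketch below. The XML layout (<file id=...> elements with a <content> child) and the ratio-based character threshold are assumptions about the WIT3 release and the post's truncated "# character threshold" comment, not the post's actual code.

import xml.etree.ElementTree as ET

def talks(path):
    '''Map talk id -> transcript text for one monolingual WIT3 XML file.
    Assumes <file id=...><content>...</content></file> elements.'''
    return {f.get('id'): (f.findtext('content') or '').strip()
            for f in ET.parse(path).getroot().iter('file')}

def align(en_path, ko_path, max_ratio=9.0):
    '''Pair talks present in both languages, dropping pairs whose
    character-length ratio suggests a bad transcript.'''
    en, ko = talks(en_path), talks(ko_path)
    for tid in sorted(en.keys() & ko.keys()):
        e, k = en[tid], ko[tid]
        if e and k and max(len(e), len(k)) <= max_ratio * min(len(e), len(k)):
            yield e, k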