Korean Dependency Parsing Demonstration

This demonstration combines the Komoran3 POS tagger with the Google SyntaxNet dependency parser to parse Korean sentences.

Demo here
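As an illustrative sketch of the handoff in a pipeline like this (not the demo's actual code), a POS tagger's output of (morpheme, tag) pairs can be rendered as CoNLL-U input lines for the dependency parser to fill in heads and relations. The sample morphemes and Sejong-style tags below are hypothetical, not real Komoran3 output:

```python
# Hedged sketch: turn tagger output into CoNLL-U lines a dependency
# parser can consume. Tags and morphemes below are illustrative only.

def tagged_to_conllu(tagged_morphs):
    """Convert (form, tag) pairs into 10-column CoNLL-U lines."""
    lines = []
    for i, (form, tag) in enumerate(tagged_morphs, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        # HEAD and DEPREL are left as '_' for the parser to fill in.
        cols = [str(i), form, '_', '_', tag, '_', '_', '_', '_', '_']
        lines.append('\t'.join(cols))
    return '\n'.join(lines)

sample = [('나', 'NP'), ('는', 'JX'), ('학교', 'NNG'),
          ('에', 'JKB'), ('가', 'VV'), ('ㄴ다', 'EF')]
print(tagged_to_conllu(sample))
```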

About

I'm Andrew, currently at Korea University, where I research neural networks and language understanding technology. I focus on technology for parsing morphologically complex languages like Korean. I also assist with side projects in the lab, such as Computational Thinking. This site showcases my various experiments in the field of neural networks and language processing.

Blog

Convert huge SQL files to CSV files (experimental)

Experimental code to convert sql.gz files full of SQL commands into usable CSV files, memory-efficiently. ''' Convert humongous compressed sql.gz files full of SQL commands into usable CSV files, memory-efficiently. ''' import gzip import html import sys import string import unicodedata import pickle from html.parser import HTMLParser #gzfile = gzip.open('enwiki-latest-langlinks.sql.gz') #outfile = open('enwiki-latest-langlinks.csv', 'w', …

Train word embeddings from CoNLL corpus file

Train word embeddings using CoNLL corpus as input. Depends on: CoNLL Utils train_word_embeddings # take a CoNLL corpus and train word/doc embeddings import argparse import os import sys from conll_utils import * from gensim.models.word2vec import * # random from random import shuffle # necessary for seeing logs import logging logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) …
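The input side of such a script can be sketched with the standard library alone: read a CoNLL-U file and yield each sentence as a list of surface forms, which is the shape gensim's `Word2Vec(sentences=...)` expects. The actual post relies on its own `conll_utils` module; this reader is a simplified stand-in:

```python
# Stdlib-only sketch: iterate a CoNLL-U file as lists of FORM tokens.

def conll_sentences(lines):
    """Yield one list of FORM tokens per CoNLL-U sentence."""
    sent = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('#'):               # sentence-level comments
            continue
        if not line:                           # blank line ends a sentence
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split('\t')
        if '-' in cols[0] or '.' in cols[0]:   # skip multiword/empty tokens
            continue
        sent.append(cols[1])                   # column 2 is FORM
    if sent:
        yield sent

# The token lists could then be fed to gensim, e.g.:
# from gensim.models.word2vec import Word2Vec
# model = Word2Vec(sentences=list(conll_sentences(open(path))))
```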

CoNLL File Parsing Module

Module to help parse CoNLL (including optional special functionality for handling Korean characters). For use with POS tagging and dependency parsing. Depends on: Well-formed/projectivize filters conll_utils ''' A set of classes to handle input and output of CoNLL-U files http://universaldependencies.org/docs/format.html The Parsed* classes are useful to store extra properties needed during the parsing process that are …
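As a hedged sketch of what one token entry in such a module might look like, the ten tab-separated CoNLL-U columns can be parsed into named fields. The field names follow the CoNLL-U specification; the module's actual Parsed* classes carry extra parsing-time state not shown here:

```python
# Sketch of a CoNLL-U token record: 10 columns -> named attributes.

CONLLU_FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc')

class ConllToken:
    def __init__(self, line):
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 10:
            raise ValueError('expected 10 CoNLL-U columns, got %d' % len(cols))
        for name, value in zip(CONLLU_FIELDS, cols):
            setattr(self, name, value)

    def __repr__(self):
        return 'ConllToken(id=%s, form=%s, head=%s, deprel=%s)' % (
            self.id, self.form, self.head, self.deprel)

tok = ConllToken('3\t학교\t학교\tNOUN\tNNG\t_\t5\tobl\t_\t_')
```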