NLP from scratch | The War Of Mine

最近实习在学NLP，总结一下综述性的内容吧

介绍

NLP实际上可以分为5个问题

分类
匹配
翻译
结构预测
马尔科夫决策过程

以此为划分，可以解决非常多的问题

Pipeline

NLP的pipeline基本是一样的

预处理
获取embedding
各种应用

文本预处理

预处理主要是为了处理成可以用处分析的形式

一般分为3步

噪音移除（一种可以用词袋，二者可以用正则表达式，枚举去除没用的单词比如is, the, a, am）
词典规范化（又包括stemming，词干提取，以及词性还原，lemmatization）这个可以用NLTK的包来实现。
对象标准化(用来处理在常见的词汇库不存在的单词，比如俚语，网络用语等，最简单的可以用字典枚举实现)

噪音移除

# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")
>>> "sample text"

# Sample code to remove a regex pattern 
import re 

def _remove_regex(input_text, regex_pattern):
    urls = re.finditer(regex_pattern, input_text) 
    for i in urls: 
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = "#[\w]*"  

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)
>>> "remove this  from analytics vidhya"

词汇标准化

from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 
lem.lemmatize(word, "v")
>> "multiply" 
stem.stem(word)
>> "multipli"

目标标准化

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love", "..."}
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) new_text = " ".join(new_words) 
        return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")
>> "Retweet this is a retweeted tweet by Shivam Bansal"

文本 embedding

常用的embedding包括word2vec,GloVe

主要应用

计算相似度
- 寻找相似词
- 信息检索
作为 SVM/LSTM 等模型的输入
- 中文分词
- 命名体识别
句子表示
- 情感分析
文档表示
- 文档主题判别
Text Classification
Text Summarization – Given a text article or paragraph, summarize it automatically to produce most important and relevant sentences in order.
Machine Translation – Automatically translate text from one human language to another by taking care of grammar, semantics and information about the real world, etc.
Natural Language Generation and Understanding – Convert information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.
Optical Character Recognition – Given an image representing printed text, determine the corresponding text.
Document to Information – This involves parsing of textual data present in documents (websites, files, pdfs and images) to analyzable and clean format.

常见的NLP处理库

Scikit-learn: Machine learning in Python
Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
Pattern – A web mining module for the with tools for NLP and machine learning.
TextBlob – Easy to use nl p tools API, built on top of NLTK and Pattern.
spaCy – Industrial strength N LP with Python and Cython.
Gensim – Topic Modelling for Humans
Stanford Core NLP – NLP services and packages by Stanford NLP Group.