Jason Hao's Blog

NLP Tools

Posted on 2021-07-10, edited on 2021-08-24, in Research Method

General Tools

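NLTK - the Natural Language Toolkit :+1:
spaCy - a high-performance natural language processing library built with Python and Cython :+1:
gensim - a library for unsupervised semantic modeling of plain text, supporting algorithms such as word2vec :+1:
StanfordNLP - a multilingual NLP library, available for Java and Python :+1:
OpenNLP - a machine-learning-based natural language processing toolkit, written in Java :+1:
TextBlob - provides a consistent API for diving into common natural language processing (NLP) tasks
Jieba - a powerful Python library for Chinese word segmentation :+1:
HanLP - a multilingual natural language processing toolkit for production environments
SnowNLP - a Python package for Chinese natural language processing; it does not use NLTK, and all of its algorithms are implemented from scratch
FudanNLP - a Java library for Chinese text processing
THULAC - provides Chinese word segmentation and part-of-speech tagging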

Term Extraction

Bag of What? Simple Noun Phrase Extraction for Text Analysis (2016). The accompanying tool, phrasemachine, is a pattern-based phrase extractor available in Python and R.

Basic usage of phrasemachine

pip install phrasemachine

import phrasemachine
text = "Barack Obama supports expanding social security."
print(phrasemachine.get_phrases(text))
# {'num_tokens': 7, 'counts': Counter({'barack obama': 1, 'social security': 1})}

phrasemachine can also use a more accurate part-of-speech tagger, such as spaCy or Stanford CoreNLP, in place of its default tagger.
The token positions of the extracted phrases can also be obtained.
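
To make "pattern-based" concrete, here is a minimal standard-library sketch of the idea: reduce each token to a coarse part-of-speech tag, then keep every token span whose tag sequence matches a noun-phrase regular expression. This is not phrasemachine's actual code. The single-letter tag set, the `extract_phrases` helper, the length limits, and the hand-tagged input are all assumptions made for illustration; in practice the tags would come from a real tagger.

```python
import re
from collections import Counter

# Coarse one-letter tags assumed for this sketch (not phrasemachine's own tag set):
# A = adjective, N = noun, P = preposition, D = determiner, V = verb, O = other.
# The pattern reads: optional adjectives/nouns ending in a noun, optionally
# followed by prepositional attachments such as "of the president".
NP_PATTERN = re.compile(r"(A|N)*N(PD*(A|N)*N)*")


def extract_phrases(tagged_tokens, min_len=2, max_len=8):
    """Count every token span whose tag sequence fully matches NP_PATTERN."""
    words = [word.lower() for word, _ in tagged_tokens]
    tags = "".join(tag for _, tag in tagged_tokens)  # one character per token
    counts = Counter()
    for start in range(len(words)):
        last_end = min(start + max_len, len(words))
        for end in range(start + min_len, last_end + 1):
            # fullmatch with pos/endpos tests only the tags in [start, end)
            if NP_PATTERN.fullmatch(tags, start, end):
                counts[" ".join(words[start:end])] += 1
    return {"num_tokens": len(words), "counts": counts}


# Hand-tagged version of the sentence used above (the tags are an assumption).
tagged = [
    ("Barack", "N"), ("Obama", "N"), ("supports", "V"), ("expanding", "V"),
    ("social", "A"), ("security", "N"), (".", "O"),
]
result = extract_phrases(tagged)
print(result["counts"])
```

Because every matching span is counted, not only the longest one, a longer noun phrase would also contribute its matching sub-phrases.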

Ontology Query Endpoints

Wikidata SPARQL online query service
SparqlEndpoints list (some endpoints are not accessible)
Peking University gStore SPARQL Endpoint (DBpedia, Freebase, etc.)
http://dbpedia.org/sparql
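
An endpoint such as the DBpedia one above can be queried over plain HTTP. The sketch below uses only the Python standard library and assumes the endpoint follows the SPARQL 1.1 Protocol (a `query` URL parameter, with JSON results requested through the `Accept` header). The helper names `build_sparql_request` and `run_sparql` and the example query are hypothetical, chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: the English label of one DBpedia resource.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Barack_Obama> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_sparql_request(endpoint, query):
    """Build a GET request carrying the query in the 'query' URL parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_sparql(endpoint, query, timeout=30):
    """Send the query and return the result bindings (requires network access)."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_sparql(DBPEDIA_ENDPOINT, EXAMPLE_QUERY)` should return a list of bindings, each mapping a variable name such as `label` to its value, provided the endpoint is reachable. The same helper should work for other endpoints in the list if they implement the standard protocol.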
Automated Phrase Mining from Massive Text Corpora (2017). This tool can be run easily from a single .sh script, but it requires g++ and Java as back-end dependencies.

References

Bag of What? Simple Noun Phrase Extraction for Text Analysis
Automated Phrase Mining from Massive Text Corpora

Post author: Jason Hao
Post link: https://jason-huanghao.github.io/2021/07/10/Research Method/NLP-Tools/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.