Here, we collect useful datasets for different tasks.
General Datasets
Datasets across domains: awesome data
- Google Datasets contains 2.5 million datasets that can be searched by keyword. It collects datasets from a wide range of domains.
- Hugging Face Datasets (spare GitHub link) includes many datasets for NLP tasks.
- Kaggle Datasets is a well-known machine learning dataset collection.
- Papers with Code Datasets contains 4075 machine learning datasets. It connects papers with their code and datasets.
- Reddit Datasets is a well-known community that supports discussion of each dataset.
- CLUE Datasets is a large Chinese NLP dataset collection.
- Some other datasets:
- https://www.datasetlist.com/
- https://github.com/awesomedata/awesome-public-datasets
- https://tinyletter.com/data-is-plural
- https://jupyter-tutorial.readthedocs.io/en/latest/data/index.html
- https://www.openml.org/search?type=data
- https://github.com/InsaneLife/ChineseNLPCorpus
NLP
nlp-datasets - a good collection of natural language datasets; The Big Bad NLP Database; CLUEDatasetSearch
- Automatic Keyphrase Extraction
- The Big Bad NLP Database [fixme]
- Blizzard Challenge Speech - The speech + text data comes from
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus [fixme]
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - Structured data from Wikipedia
- Dirty Words - With millions of images in our library and billions of
- Flickr Personal Taxonomies [fixme]
- Freebase of people, places, and things [fixme]
- German Political Speeches Corpus - Collection of political speeches from
- Google Books Ngrams (2.2TB)
- Google MC-AFP - Generated based on the public available Gigaword dataset
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List [fixme]
- Hansards text chunks of Canadian Parliament [fixme]
- LJ Speech - Speech dataset consisting of 13,100 short audio clips of a
- M-AILabs Speech - The M-AILABS Speech Dataset is the first large dataset [fixme]
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Making Sense of Microposts 2013 - Concept Extraction [fixme]
- Making Sense of Microposts 2016 - Named Entity rEcognition and Linking
- Multi-Domain Sentiment Dataset (version 2.0)
- Noisy speech database for training speech enhancement algorithms and TTS [fixme]
- Open Multilingual Wordnet
- POS/NER/Chunk annotated data
- Personae Corpus [fixme]
- SMS Spam Collection in English
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- Stanford Question Answering Dataset (SQuAD)
- USENET postings corpus of 2005~2011
- Universal Dependencies
- Webhose - News/Blogs in multiple languages
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
- WorldTree Corpus of Explanation Graphs for Elementary Science Questions
Ontology Learning - Concept Formation
SimLex-999 provides a gold standard for measuring the similarity of word pairs. It emphasizes similarity rather than relatedness between words; relatedness is the focus of WordSim-353. (paper)
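Gold standards of this kind are commonly used by ranking the word pairs by a model's similarity score and comparing that ranking with the human ratings via Spearman's rank correlation. The snippet below is a minimal sketch of that comparison in plain Python. The helper names (`average_ranks`, `spearman`, `evaluate`) and all scores are hypothetical and invented for illustration; they are not taken from SimLex-999 or WordSim-353.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def evaluate(gold, model_scores):
    """Correlate model scores with gold ratings on the pairs both cover."""
    pairs = [pair for pair in gold if pair in model_scores]
    return spearman([gold[p] for p in pairs], [model_scores[p] for p in pairs])


# Invented ratings on a 0-10 scale (NOT real SimLex-999 values).
gold = {
    ("word_a", "word_b"): 9.0,
    ("word_c", "word_d"): 7.5,
    ("word_e", "word_f"): 3.0,
    ("word_g", "word_h"): 1.0,
}
# Invented model similarities, e.g. cosine similarity of word vectors.
model_scores = {
    ("word_a", "word_b"): 0.8,
    ("word_c", "word_d"): 0.9,
    ("word_e", "word_f"): 0.3,
    ("word_g", "word_h"): 0.1,
}

score = evaluate(gold, model_scores)
print(f"Spearman correlation: {score:.2f}")
```

Because only the ranking matters, the model's scores do not need to be on the same scale as the human ratings.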
VC-SLAM (Versatile Corpus for Semantic Labeling And Modeling) contains 101 data sets from different open data portals, together with a target ontology and an additional ontology for mappings to the PLASMA platform. (vc-slam)
https://www.wikidata.org/wiki/Wikidata:Database_download
Chinese: https://dumps.wikimedia.org/zhwiki/
https://lod-cloud.net/ | https://lod-cloud.net/clouds/geography-lod.svg
GCMD (Global Change Master Directory), available for download in RDF, OWL, CSV, and JSON formats: https://gcmdservices.gsfc.nasa.gov/static/kms_save/
http://schemas.opengis.net/ GML and other schemas; some of the content is in RDF format