Software/Data

Software

  • NGT (Neighborhood Graph and Tree for Indexing)

    Overview

    This software performs high-speed searches over large volumes of high-dimensional vector data, returning the vectors nearest to a given query vector.
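
    To make the task concrete, the following is a minimal sketch of exact (brute-force) nearest neighbor search in plain Python. It is not NGT's API or algorithm; the function names and sample vectors are assumptions made for illustration. It only shows the query that an index such as NGT is meant to answer much faster on large collections.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to query.

    Hypothetical helper: it scans every vector, so the cost grows linearly
    with the collection size. Results are ordered nearest first.
    """
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


# Illustrative data only (not taken from any real dataset).
vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [5.0, 5.0, 5.0],
]
result = nearest_neighbors([0.9, 0.1, 0.0], vectors, k=2)
```

    A full scan like this is the baseline that graph- and tree-based indexes aim to avoid when the collection is large and high-dimensional.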

    How to get

  • big3store

    Overview

    This is a prototype system for efficiently storing and searching large-scale knowledge graph data. Scaling the storage and query processing systems toward peta-triples is made possible by large-scale distribution of the data across shared-nothing clusters.

    How to get

  • AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification

    Overview

    AnnexML is a multi-label classifier designed for extremely large label spaces (10^4 to 10^6 labels). Prediction in AnnexML is performed efficiently by using an approximate nearest neighbor search method.
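
    As a rough illustration of how nearest neighbor search can drive multi-label prediction, the sketch below scores labels by summing the similarities of the closest training examples that carry them. This is a simplified assumption for illustration, not AnnexML's actual algorithm or interface: it uses exact search rather than approximate search, and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], Set[str]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_labels(
    query: Sequence[float], train: List[Example], k: int = 2, top: int = 2
) -> List[str]:
    """Rank labels by the summed similarity of the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    scores: Dict[str, float] = defaultdict(float)
    for features, labels in neighbors:
        similarity = cosine(query, features)
        for label in labels:
            scores[label] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))[:top]


# Toy training set (illustrative only).
train = [
    ([1.0, 0.0], {"sports", "news"}),
    ([0.9, 0.1], {"sports"}),
    ([0.0, 1.0], {"cooking"}),
]
predicted = predict_labels([1.0, 0.05], train, k=2, top=2)
```

    Because only the labels of the retrieved neighbors are scored, the cost of a prediction depends on k rather than on the total number of labels, which is why a fast neighbor search matters in very large label spaces.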

    How to get

  • yskip: Incremental Skip-gram Model with Negative Sampling

    Overview

    A C++ implementation of an incremental algorithm for learning the skip-gram model with negative sampling.
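
    For readers unfamiliar with skip-gram with negative sampling (SGNS), the sketch below shows one stochastic gradient step for a single (target, context) pair and its negative samples, written in plain Python. It is not the yskip code and does not cover the incremental vocabulary handling; the function names, learning rate, and toy vectors are assumptions for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sgns_update(
    target: List[float],
    positive: List[float],
    negatives: List[List[float]],
    lr: float = 0.1,
) -> float:
    """Apply one in-place SGD step on the SGNS loss.

    Returns the loss measured before the update. The observed context
    vector is pushed toward the target vector, the sampled negative
    vectors are pushed away from it.
    """
    grad_target = [0.0] * len(target)
    loss = 0.0
    pairs = [(positive, 1.0)] + [(neg, 0.0) for neg in negatives]
    for context, label in pairs:
        prob = sigmoid(dot(target, context))
        loss += -math.log(prob) if label == 1.0 else -math.log(1.0 - prob)
        g = prob - label  # derivative of the loss w.r.t. the dot product
        for i in range(len(target)):
            grad_target[i] += g * context[i]    # uses the old context value
            context[i] -= lr * g * target[i]    # target is updated afterwards
    for i in range(len(target)):
        target[i] -= lr * grad_target[i]
    return loss


# Toy vectors (illustrative only).
target = [0.1, -0.2]
positive = [0.3, 0.1]
negatives = [[-0.1, 0.2]]
losses = [sgns_update(target, positive, negatives) for _ in range(50)]
```

    Repeating the step on the same pair should steadily lower the loss, which is a quick sanity check that the gradient signs are right.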

    How to get

  • LSTM-VAE for text modeling

    For more information, please see "Better Exploiting Latent Variables in Text Modeling".

Data

  • Yahoo! Chiebukuro Data (Ver. 2)

    Overview

    Yahoo! Chiebukuro is the largest community-driven question answering service in Japan. It connects users who have questions with users who may have the answers, enabling people to share information and knowledge with each other. The data provided consists of resolved questions and answers extracted from the Chiebukuro database for the period shown below.

    Period: April 2004 – April 2009
    Number of Questions: about 16 million
    Number of Answers: about 50 million

    How to get

    This data is available for download through the National Institute of Informatics (NII) homepage. Please refer to NII's Yahoo! Chiebukuro Data (Ver. 2) Usage Procedures page for details on applying for and using the data.

  • Yahoo! Search Query Data

    Overview

    The data consists of queries related to the topic queries of the 12th NTCIR (NTCIR-12) tasks. The related queries were extracted from Yahoo! Search logs for the period shown below by using three different techniques. The data does not contain any personal information such as operation history, personal identifiers, or context.

    Period: July 2009 – June 2013

    How to get

    This data is provided to participants in the NTCIR (NII Testbeds and Community for Information access Research) Evaluation of Information Access Technologies Workshop, and can be used free of charge by research groups taking part in the workshop.
    For details, please check the NTCIR web page.
    Note: Applications to participate in the task that uses the data provided by Yahoo! JAPAN are no longer being accepted.

  • YJ Captions Dataset

    Overview

    We have developed a Japanese version of the MS COCO caption dataset, which we call the YJ Captions 26k Dataset. It was created to facilitate the development of image captioning in Japanese. Each Japanese caption describes an image provided in the MS COCO dataset, and each image has 5 captions.

    How to get

  • YJ Chat Detection Dataset

    Overview

    This is the chat detection dataset introduced in Akasaki and Kaji (ACL 2017).

    How to get

    The dataset is available for research purposes only. Please fill in the Application for Use of Yahoo's Speech Transcription Data on Chat Detection Study and send it to yjresearch-data "at" mail.yahoo.co.jp as a PDF file. Qualified applicants include academic and industrial researchers. Students can use the data, but cannot apply themselves.

  • Japanese Visual Genome VQA Dataset

    Overview

    We have created a Japanese visual question answering (VQA) dataset by using Yahoo! Crowdsourcing, based on the images from the Visual Genome dataset. Our dataset is meant to be comparable to the free-form QA part of the Visual Genome dataset. It consists of 99,208 images and 793,664 QA pairs in Japanese, with eight QA pairs per image.

    How to get

  • Visual Scenes with Utterances Dataset

    Overview

    With the widespread use of intelligent systems, more and more people expect such systems to understand complicated social scenes. To facilitate the development of intelligent systems, we created a mock dataset called Visual Scenes with Utterances (VSU) that contains a vast body of image variations in visual scenes, each with an annotated utterance and a corresponding addressee. Our dataset is based on images and annotations from the GazeFollow dataset (Recasens et al., 2015). The GazeFollow dataset consists of (1) the original image, (2) a cropped speaker image with the head location annotated, and (3) gaze. To create our dataset, we further annotated (4) utterances as text, and (5) to whom each utterance is addressed. The images are available at http://gazefollow.csail.mit.edu/.

    How to get

  • Experimental Dataset for Post-Ensemble Methods

    Overview

    This dataset includes 128 summarization models and their outputs, which were used to compare post-ensemble methods in the following paper.

    Paper: Frustratingly Easy Model Ensemble for Abstractive Summarization (EMNLP 2018)

    How to get

  • YJ Constructive Comment Ranking Dataset

    Overview

    This is the dataset for ranking constructive comments that was used in the following paper.

    Paper: Dataset Creation for Ranking Constructive News Comments (ACL 2019)

    How to get

    The dataset is available for research purposes only. Please fill in the Application for Use of Yahoo's Comment Data on Study for Ranking Constructive Comments and send it to yjresearch-data "at" mail.yahoo.co.jp as a PDF file. Qualified applicants include academic and industrial researchers. Students can use the data, but cannot apply themselves.

  • Yahoo! Chiebukuro Extractive Headline Dataset

    Overview

    This is the dataset for extractive headline generation on Yahoo! Chiebukuro that was used in the following paper.

    Paper: Extractive Headline Generation Based on Learning to Rank for Community Question Answering (COLING 2018)

    How to get

    The dataset is available for research purposes only. Please fill in the Application for Use of Yahoo's Question Data on Study for Question Headline Generation and send it to yjresearch-data "at" mail.yahoo.co.jp as a PDF file. Qualified applicants include academic and industrial researchers. Students can use the data, but cannot apply themselves.