Quick Start

Notebook available on Google Colab

Setup

DBQR-QA uses Python, and for this shared task, you will only need Pandas to read and manipulate the tables. To get started, download the dataset:

Stage         Link      Samples        Available
Practice      Download  50 questions   2 September 2024
Training      Download  200 questions  7 November 2024
Blind test    Download  150 questions  7 November 2024
Test answers  -         150 questions  25 November 2024

All files above belong to the master branch. We also published customized versions of the dataset based on participants' suggestions. These variations may include modifications to the questions or answers. See the download page for more information.

Data Structure

We organized the questions and labels into chats. Each chat contains a series of ten conversational questions stored in JSON files, with the corresponding tables stored as dictionaries of Pandas DataFrames in pickle files, in the following structure:

        data
|- questions
|  |- chat-1-01
|  |  |- question-01.json
|  |  |- ...
|  |  |- question-10.json
|  |- ...
|- tables
|  |- chat-1-01
|  |  |- question-01.pkl
|  |  |- ...
|  |  |- question-10.pkl
|  |- ...
        
      

Each JSON file represents a question-answer pair in the following format:

        {
  "sectionID": (Integer) Dataset section ID (1 to 5),
  "sectionTitle": (String) Dataset section title,
  "conversationID": (Integer) Conversation ID
  "questionID": (Integer) Question ID,
  "question": (String) Question,
  "vars": (Dictionary) Variables (e.g., entities, numbers, years),
  "queries": (Dictionary) Database queries (output to pickle files),
  "program": (List) A series of operators and parameters,
  "answer": (Any) The final answer
}
      

We used a series of pre-defined functions to construct a program that manipulates the input tables to produce the answers. However, these functions are optional and not part of the evaluation. You may process the tables in any way, for example, by using a language model to generate independent programs.

The queries are written in Cypher and were used to retrieve the data from a Neo4j graph database. The query outputs are Pandas DataFrames stored in the pickle files. The following is an example of a question with its variables, queries, and program:

        {
  ...
  "question": "Was Caterpillar's average total revenue higher or lower than Realogy's lowest net income from 2019 to 2021?",
  "vars": {
    "company_1": {
      "mention": "Caterpillar",
      "name": "CATERPILLAR INC"
    },
    "concept_1": {
      "mention": "total revenue",
      "name": "us-gaap:Revenues"
    }
    ...
  },
  "queries": {
    "var_q1_p1": "WITH [\"CATERPILLAR INC\"] AS companies, ... RETURN o.name AS company ...",
    ...
  },
  "program": [
    "var_q1_p1 = get_company_facts(company_1, concept_1, start=year_1, end=year_2)",
    ...
    "var_q1_p3 = mean(var_q1_p1, axis='columns', level=null)",
    ...
  ]
}
      

In this example, get_company_facts is a pre-defined function that builds a query from variables (e.g., company_1) and outputs a DataFrame. The mean function is also pre-defined and takes var_q1_p1 as its input. Again, you do not need to use our pre-defined functions in your generated programs or to calculate the answers.
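
For illustration, this program step can be reproduced with plain Pandas. The sketch below is not one of the pre-defined functions; it assumes the example above corresponds to chat-1-01/question-01 in the practice split (the same table is printed later in this guide) and averages the query output across its year columns:

import pickle
from os.path import join

# Load the tables for the first question of chat-1-01 in the practice split.
with open(join("data", "practice", "tables", "chat-1-01", "question-01.pkl"), "rb") as file:
  tables = pickle.load(file)

# var_q1_p1 holds the query output: one row per (company, concept) pair and
# one column per year, so averaging across the columns mirrors the
# mean(var_q1_p1, axis='columns') step in the program above.
average_revenue = tables["var_q1_p1"].mean(axis="columns")
print(average_revenue)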

Blind Test

The data for the blind test will exclude the programs and answers. Participants must submit their answers to our online evaluators (daily limit applies). We will make the programs and answers available to all participants during the manual evaluation period. All data will be available publicly after the evaluation concludes.

Preparation

Automatic

We have created and published a Python package for the dataset. You can install it directly using the pip command:

        pip install dbqrqa
      

To load the dataset, create an instance of the TableDataset class. The default location of the dataset is the data folder in the project's top directory. You can change the location by setting the class's data_path parameter.
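
For example, if your copy of the dataset lives elsewhere, you can point the loader at it explicitly (the path below is only a placeholder, and passing data_path to the constructor is assumed from the description above):

from dbqrqa.dataset import TableDataset

# Load the dataset from a custom location instead of the default "data" folder.
dataset = TableDataset(data_path="path/to/data")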

The class has three stage attributes (practice, train, and test), each storing an instance of the TableSplit class when the data for that stage is available. A TableSplit instance (dataset.practice in the code below) automatically loads and manages the input data and tables.

        from dbqrqa.dataset import TableDataset

dataset = TableDataset()
practice = dataset.practice
      

The TableSplit class stores the questions, queries, tables, and labels in the same nested structure as the answers you submit for evaluation. The following examples show how to access them.

        practice.questions['chat-1-01']['1']
practice.answers['chat-1-01']['1']
practice.queries['chat-1-01']['1']['var_q1_p1']
practice.tables['chat-1-01']['1']['var_q1_p1']
        
      

You can use the TableSplit class's answer function to assign an answer individually or store all answers at once in the class's answers attribute. Answers assigned directly to the answers attribute must follow the evaluation input format.

        # Assign an answer individually
practice.answer("chat-1-01", 1, "lower")

# Assign all answers
practice.answers = {
  "chat-1-01": {
    "1": "lower",
    ...
  },
  ...
}
      

Manual

If you prefer to manage the dataset manually, the following example shows how to read the questions and tables:

        import json
import pickle
from os.path import join

# The folder name is "chat-[sectionID]-[conversationID]"
# There are five sections and up to ten conversations in a section.
parent_dir = join("data", "practice")

# There are always 10 files in the "questions" and "tables" folders.
with open(join(parent_dir, "questions", "chat-1-01", "question-01.json")) as file:
  sample = json.load(file)

with open(join(parent_dir, "tables", "chat-1-01", "question-01.pkl"), 'rb') as file:
  table = pickle.load(file)
      

To display the table, run:

        print(table["var_q1_p1"])
      

The code above outputs the result (table) of the first query (var_q1_p1):

                                          2019          2020          2021
CATERPILLAR INC us-gaap:Revenues  5.380000e+10  4.174800e+10  5.097100e+10
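
Since each pickle file is a dictionary mapping query variable names to DataFrames, you can also list everything a file contains before deciding how to process it (a minimal sketch reusing the table object loaded above):

# Print every query output stored for this question together with its dimensions.
for name, frame in table.items():
  print(name, frame.shape)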

Evaluation

We implemented two evaluators, heuristic and GPT-based, which offer different levels of flexibility and cost. You can use the evaluate function in the dbqrqa.evaluation module directly or call it through the TableSplit class.

        def evaluate(
  questions: dict,
  answers: dict,
  labels: dict,
  evaluator: str,       # Choices: 'heuristic', 'gpt-binary', 'gpt-score'
  model: str,           # [GPT only] GPT model (see OpenAI's documentation for more information)
  retry: int,           # [GPT only] Maximum number of retries when previous attempts fail due to invalid GPT responses
  openai_key: str,      # [GPT only] OpenAI API key
  backup_path: str,     # [GPT only] Path to a file storing the scores and prompts, which can be used to resume evaluation in case of early termination
  is_notebook: bool     # [GPT only] If True, prevents tqdm from printing a new progress bar for every question
  ) -> Tuple[float, dict]:
  ...
      

Automatic

Once you have assigned the answers, you can use the TableSplit class's evaluate function to compute the accuracy. The function also returns the score of each sample (question).

        accuracy, scores = practice.evaluate()
      

You can also switch to the GPT evaluator by setting the evaluator parameter to gpt-binary or gpt-score, but you will need an OpenAI API key. The backup file (backup_path) stores the scores and prompts, which can be used to resume evaluation in case of early termination.

        accuracy, scores = practice.evaluate(
  evaluator="gpt-binary", 
  openai_key=openai_key, 
  backup_path=join("data", "backup"))
      

Manual

The questions must be in the following format:

        {
  "chat-1-01": {
    "1": "Was Caterpillar's average ...",
    ...
  }
}
            

The answers and labels can be an integer, a decimal, a string, a set, or a dictionary, in the following format:

        {
  "chat-1-01": {
    "1": "higher",
    "2": 48654666666.67,
    ...
    "10": {
      "2018": 48643000000,
      "2019": 47930000000,
      ...
    }
  }
}
      

You can use the evaluate function in dbqrqa.evaluation or download the code for the evaluators to run your experiments locally.

        from dbqrqa.evaluation import evaluate

accuracy, scores = evaluate(questions, answers, labels)
      

Heuristic Evaluator

The heuristic evaluator runs at no additional cost. However, it is less flexible in handling the model's output, especially for prompt-based approaches: for a question asking whether something is more or less than another, the model may output "higher" or "greater," possibly with an explanation. Even so, it offers a quick preliminary evaluation that works well with numbers, which make up most of the answers. The evaluator refers to the label to determine the answer type, then applies the following rules to process the answers (a sketch of this logic follows the list):

  1. Integer: Convert the numeric answer into an integer using int(answer).
  2. Float: Convert the numeric answer into a string with two decimal places using "%.2f" % answer.
  3. Set: All items in the prediction and label sets must match. Otherwise, the algorithm flags the answer as incorrect.
  4. Dictionary: All keys and values must match. The label uses the entity/concept names, not their mentions, e.g., "CATERPILLAR INC" not "Caterpillar" and "us-gaap:Revenues" not "total revenue."
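
The following is a minimal sketch of this type-based comparison, written for illustration only; it is not the evaluator's actual source code, and the handling of string answers is an assumption:

def compare_answer(answer, label):
  # Illustrative comparison following the rules above (not the shipped evaluator).
  # The numeric rules assume the predicted answer is already a number.
  if isinstance(label, int):
    # Rule 1: convert the numeric answer to an integer before comparing.
    return int(answer) == label
  if isinstance(label, float):
    # Rule 2: compare two-decimal string representations.
    return "%.2f" % answer == "%.2f" % label
  if isinstance(label, set):
    # Rule 3: every item in the prediction and label sets must match.
    return set(answer) == label
  if isinstance(label, dict):
    # Rule 4: all keys and values must match.
    return answer == label
  # Strings: exact match after trimming and lowercasing (an assumption, not a documented rule).
  return str(answer).strip().lower() == str(label).strip().lower()

Under these rules, compare_answer("LOWER", "lower") and compare_answer(48654666666.671, 48654666666.67) would both return True.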

GPT Evaluator

We use OpenAI's GPT models for evaluation. Our online evaluator uses GPT-4o, but you can choose whichever model you want when running locally. The GPT evaluator is more flexible and accurate (see Section 6.3 in our paper for more information) but will incur additional costs. You will also need to provide your API key (see how to get the key).

We created two evaluation prompts: binary and scoring. The binary prompt asks the model to determine whether the answer is correct. The scoring prompt asks the model to grade the answer from 0 to 10, with 0 being no match and 10 being an exact match. You can use either prompt for your own evaluation; however, for simplicity and cost management, we only use the binary prompt in this shared task.
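
For example, to try the scoring prompt locally, switch the evaluator as shown below; reading the key from an environment variable and choosing gpt-4o are choices made for this sketch, not requirements:

import os

# Grade each answer on a 0-10 scale instead of a binary yes/no judgment.
accuracy, scores = practice.evaluate(
  evaluator="gpt-score",
  model="gpt-4o",
  openai_key=os.environ["OPENAI_API_KEY"])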

System prompt

        You are an evaluator. 
Given a series of conversational questions, 
your task is to compare an answer to the last question predicted by an AI 
to an answer labeled by a human.
      

Binary prompt (default)

        Question:
{{question}}

AI's answer: 
{{answer}}

Human's answer: 
{{label}}

Are the two answers to the last question the same?
Answer "yes" or "no" in the following JSON format:
```
{
  "result": "yes" or "no"
}
```
Do not explain or output anything else.
      

Scoring prompt

        Question:
{{question}}

Compare the following answers to the last question in the above conversation.
AI's answer: 
{{answer}}

Human's answer: 
{{label}}

On a scale of 0 to 10, 0 = not at all and 10 = same, how similar are the two answers?
Answer in the following JSON format:
```
{
  "result": A score from 0 to 10
}
```
Do not explain or output anything else.