MetaQA: entity creation and full ground-truth generation by Slowika · Pull Request #9 · microsoft/MS-KeBAB

Slowika · 2024-07-26T09:06:37Z

This PR creates an entity-oriented representation of the MetaQA knowledge base. It starts from the MetaQA relational triples and produces a set of entities with unique IDs. In addition, the vanilla question datasets from MetaQA are expanded and adapted so that the ground truth is no longer just the names of the entities in the final answer to a question; it also includes references to every entity needed to reach that answer.
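
To illustrate the first step, here is a minimal sketch of turning triples into entities with unique IDs. It assumes the `subject|relation|object` line format of the MetaQA `kb.txt` file; `build_entities` and the dictionary layout are hypothetical and not the structures used in `get_kb_details.py`. The sketch keys entities by name, so it does not disambiguate entities that share a name.

```python
from collections import defaultdict


def build_entities(triple_lines):
    """Assign a unique integer ID to every distinct entity name (illustrative only)."""
    name_to_id = {}
    entities = []  # the position in this list doubles as the entity ID

    def get_id(name):
        if name not in name_to_id:
            name_to_id[name] = len(entities)
            entities.append(
                {"id": len(entities), "name": name, "relations": defaultdict(list)}
            )
        return name_to_id[name]

    for line in triple_lines:
        subject, relation, obj = line.strip().split("|")
        subject_id, object_id = get_id(subject), get_id(obj)
        # Store the relation as a reference to the target entity's ID.
        entities[subject_id]["relations"][relation].append(object_id)
    return entities


triples = ["Kismet|directed_by|William Dieterle", "Kismet|release_year|1944"]
entities = build_entities(triples)
```

Storing relations as ID references rather than names is what lets the ground truth point at specific entities later on.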

The goal of this PR is to introduce the initial logic and infrastructure for creating a structured KB from the MetaQA files. A future PR should convert the notion of the entity that exists here to the Entity class in this repo.

The script is invoked with the following command line:

python get_kb_details.py --metaqa-path <PATH_TO_METAQA_ROOT_DIRECTORY> --output-path <PATH_TO_OUTPUT_DIRECTORY>

Upon running the script, the output directory is populated with:

kb.json: a JSON representation of the knowledge base
three directories (hops_1, hops_2, hops_3), one per MetaQA question type, each with train, dev, and test subdirectories holding text files with the ground truth for the MetaQA questions
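
As a rough sketch of that layout, the snippet below enumerates the nine split directories the description implies. The helper name `expected_output_dirs` is hypothetical, and the names of the text files inside each directory are not stated in this PR, so they are left out.

```python
from pathlib import Path


def expected_output_dirs(output_path):
    """List the hop/split directories described above (illustrative only)."""
    root = Path(output_path)
    return [
        root / f"hops_{hops}" / split
        for hops in (1, 2, 3)
        for split in ("train", "dev", "test")
    ]


dirs = expected_output_dirs("output")
```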

ti250

Thank you for implementing this Aga, I've taken a quick look and added a few comments! :) Let me know when they've been addressed.

ti250 · 2024-07-29T09:52:19Z

+            )
+            os.makedirs(output_dir_for_split, exist_ok=True)
+            ground_truth = hop_function(split)
+            with open(


NIT: We can use one combined with statement instead of two nested ones to decrease indentation (with open(blah) as f_1, open(blah) as f_2).
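
A minimal sketch of the suggestion, assuming two output files per split. The function name and the file names are hypothetical; only `f_all` appears in the diff.

```python
import os
import tempfile


def write_ground_truth(output_dir, all_lines, final_lines):
    os.makedirs(output_dir, exist_ok=True)
    all_path = os.path.join(output_dir, "all_entities.txt")  # hypothetical file name
    final_path = os.path.join(output_dir, "final_answers.txt")  # hypothetical file name
    # One `with` statement manages both files; both are closed on exit.
    with open(all_path, "w", encoding="utf-8") as f_all, open(final_path, "w", encoding="utf-8") as f_final:
        for all_names, final_names in zip(all_lines, final_lines):
            f_all.write("|".join(all_names) + "\n")
            f_final.write("|".join(final_names) + "\n")
    return all_path, final_path


with tempfile.TemporaryDirectory() as tmp_dir:
    all_path, final_path = write_ground_truth(tmp_dir, [["a", "b"]], [["b"]])
    with open(all_path, encoding="utf-8") as f:
        all_text = f.read()
    with open(final_path, encoding="utf-8") as f:
        final_text = f.read()
```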

ti250 · 2024-07-29T09:53:34Z

+                        final_names = [
+                            kb_list[entity_id]["name"] for entity_id in final_answer
+                        ]
+                        f_all.write("|".join(all_relevant_names))


Since we seem to have the entity IDs, I feel it would be better to write those down rather than the names, especially for things like movies where we can actually disambiguate them.
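
A small sketch of why IDs help, assuming `kb_list` is indexed by entity ID as in the diff above. The two entries and the `release_year` field are made-up illustration data, and `format_ground_truth_line` is a hypothetical helper.

```python
def format_ground_truth_line(entity_ids):
    """Join entity IDs with the same '|' separator the script uses for names."""
    return "|".join(str(entity_id) for entity_id in entity_ids)


# Two distinct entities that happen to share a name (made-up data).
kb_list = [
    {"name": "Example Movie", "release_year": "1950"},
    {"name": "Example Movie", "release_year": "1999"},
]
final_answer = [1]

# Writing names loses the distinction between the two entities...
names_line = "|".join(kb_list[entity_id]["name"] for entity_id in final_answer)
# ...while writing IDs keeps it.
ids_line = format_ground_truth_line(final_answer)
```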

ti250 · 2024-07-29T09:56:51Z

+    return dict[key] if key in dict else []
+
+
+def get_all_relevant_entities(base_query_entity_id, hop_types):


If I'm not misunderstanding this code, we can significantly decrease the number of entities in the ground truth by looking at the answers, which are given to us, and culling all entities that do not lead to them.

A "simple" way to do this (especially since the forward search seems to be doing well in terms of performance anyway) may be to do the equivalent search backwards with seed entities being those from the last hop of the forwards search that contain at least one of the ground truth answers, then taking the intersection of the sets of entities we get from the forwards and backwards searches.

This is a great idea!
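
A minimal sketch of the forward/backward idea, assuming the KB can be viewed as `entity -> relation -> list of neighbours`. The function name, the graph layout, and the relation names are hypothetical. The backward pass here prunes hop by hop, which amounts to intersecting the forward and backward searches at each hop.

```python
def relevant_entities(graph, seed, hop_types, answers):
    """Keep only entities on a path from the seed to a ground-truth answer."""
    # Forward pass: collect every entity reachable at each hop.
    layers = [{seed}]
    for relation in hop_types:
        next_layer = set()
        for entity in layers[-1]:
            next_layer.update(graph.get(entity, {}).get(relation, []))
        layers.append(next_layer)

    # Backward pass: start from the last-hop entities that are answers,
    # then keep an earlier entity only if it links to a kept later one.
    kept = [set() for _ in layers]
    kept[-1] = layers[-1] & set(answers)
    for i in range(len(hop_types) - 1, -1, -1):
        relation = hop_types[i]
        kept[i] = {
            entity
            for entity in layers[i]
            if kept[i + 1].intersection(graph.get(entity, {}).get(relation, []))
        }
    return set().union(*kept)


graph = {
    "m1": {"starred_actors": ["a1", "a2"]},
    "a1": {"acted_in": ["m2"]},
    "a2": {"acted_in": ["m3"]},
}
result = relevant_entities(graph, "m1", ["starred_actors", "acted_in"], {"m2"})
```

In this toy graph, `a2` and `m3` are reachable in the forward search but do not lead to the answer `m2`, so they are culled.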

ti250 · 2024-07-29T10:00:14Z

+id_counter = EntityIdCounter()
+
+
+def add_value_to_dict_of_list(dict, property, value):


I don't think we need this function if we use DefaultDict :)

DefaultDict would be a full replacement if we were assigning the values to the dictionary. Here, we are appending them. DefaultDict can be used to get rid of the if statement, though. In this case I would still keep the append in a function.
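
For reference, a minimal sketch of the helper on top of `collections.defaultdict(list)`, which creates the empty list on first access so no membership check is needed. The parameter names and the sample values are illustrative.

```python
from collections import defaultdict


def add_value_to_dict_of_list(mapping, key, value):
    # With a defaultdict(list), a missing key yields a fresh empty list.
    mapping[key].append(value)


relations = defaultdict(list)
add_value_to_dict_of_list(relations, "directed_by", 7)
add_value_to_dict_of_list(relations, "directed_by", 9)

# Reading with .get() avoids inserting a key as a side effect of the lookup.
missing = relations.get("written_by", [])
```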

ti250 · 2024-07-29T10:01:03Z

+id_counter = EntityIdCounter()
+
+
+def add_value_to_dict_of_list(dict, property, value):


Naming a variable dict may be dangerous considering it's the name of the type too!

Oh no! Well spotted.
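
A short sketch of how the shadowing can bite: inside a function with a parameter named `dict`, the built-in type is no longer reachable under that name. Both function names are hypothetical. The same concern applies to a parameter named `property`, which is also a built-in.

```python
def copy_shadowed(dict):
    # Here `dict` is the argument, not the built-in type, so calling it
    # raises TypeError for an ordinary dictionary argument.
    return dict(dict)


def copy_renamed(mapping):
    # With a different parameter name the built-in is available again.
    return dict(mapping)


try:
    copy_shadowed({"a": [1]})
    shadowing_failed = False
except TypeError:
    shadowing_failed = True

copied = copy_renamed({"a": [1]})
```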

ti250 · 2024-07-29T10:02:58Z

+    dict[property].append(value)
+
+
+def parse_line(line, prev_entity_name, kb_list):


Comment applies here and across the script: please add type annotations! :D
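
As a sketch of what the annotations could look like for the two small dictionary helpers in this diff: the concrete key and value types in the script are not visible here, so generic type variables are an assumption, and `get_list_or_empty` is a hypothetical name for the getter whose signature is not shown.

```python
from typing import Dict, List, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def add_value_to_dict_of_list(mapping: Dict[K, List[V]], key: K, value: V) -> None:
    """Append `value` to the list stored under `key`, creating the list if needed."""
    mapping.setdefault(key, []).append(value)


def get_list_or_empty(mapping: Dict[K, List[V]], key: K) -> List[V]:
    """Return the list stored under `key`, or an empty list if the key is absent."""
    return mapping.get(key, [])


relations: Dict[str, List[int]] = {}
add_value_to_dict_of_list(relations, "directed_by", 7)
```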

tminka · 2024-07-29T13:43:47Z

Please use the Entity class to represent entities

Slowika · 2024-07-30T13:54:54Z

Please use the Entity class to represent entities

I'll add a description that this is outside of the scope of this PR. This is an initial exploration.

Agnieszka Slowik added 2 commits July 24, 2024 15:05

Script to generate ground truth entities for MetaQA vanilla.

8547984

Separate MetaQA directory.

5c27fc9

Slowika requested review from bhaskar-mitra and ti250 July 26, 2024 09:06

ti250 suggested changes Jul 29, 2024


Agnieszka Slowik added 2 commits July 30, 2024 16:32

Address the comments

d7cea50

Address all comments

48170a0

