Skip to content

MetaQA: entity creation and full ground-truth generation#9

Open
Slowika wants to merge 4 commits intomainfrom
agslowik/metaqa-ground-truth-and-kb
Open

MetaQA: entity creation and full ground-truth generation#9
Slowika wants to merge 4 commits intomainfrom
agslowik/metaqa-ground-truth-and-kb

Conversation

@Slowika
Copy link
Copy Markdown

@Slowika Slowika commented Jul 26, 2024

This PR creates an entity-oriented representation of the MetaQA knowledge base. It starts from the MetaQA relational triples and produces a set of entities with unique IDs. In addition, the vanilla question datasets from MetaQA are expanded and adapted so that the ground-truth is not just the names of the entities in the final answer to a question; instead, the references to all the entities necessary to provide the final answer are also part of the ground-truth.

The goal of this PR is to introduce the initial logic and infrastructure for creating a structured KB from the MetaQA files. A future PR should convert the notion of the entity that exists here to the Entity class in this repo.

The script is invoked with the following command line:

python get_kb_details.py --metaqa-path <PATH_TO_METAQA_ROOT_DIRECTORY> --output-path <PATH_TO_OUTPUT_DIRECTORY> 

Upon running the script, the output directory is populated with:

  • kb.json: a JSON representation of the knowledge base
  • three directories, hops_1, hops_2, hops_3, representing the three types of questions in MetaQA, each with train, dev, and test subdirectories, each containing text files containing ground-truth for the MetaQA questions

@Slowika Slowika requested review from bhaskar-mitra and ti250 July 26, 2024 09:06
Copy link
Copy Markdown

@ti250 ti250 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing this Aga, I've taken a quick look and added a few comments! :) Let me know when they've been addressed.

)
os.makedirs(output_dir_for_split, exist_ok=True)
ground_truth = hop_function(split)
with open(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Can have one big with statement of two ones to decrease indentation. (with open(blah) as f_1, open(blah) as f_2)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

final_names = [
kb_list[entity_id]["name"] for entity_id in final_answer
]
f_all.write("|".join(all_relevant_names))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we seem to have the entity IDs, I feel it would be better to write those down rather than the names, especially for things like movies where we can actually disambiguate them.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

return dict[key] if key in dict else []


def get_all_relevant_entities(base_query_entity_id, hop_types):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not misunderstanding this code, we can significantly decrease the number of entities we get in the ground truth from what we have here by looking at the answers and culling all entities that do lead to the answers, which are given to us in the ground truth.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "simple" way to do this (especially since the forward search seems to be doing well in terms of performance anyway) may be to do the equivalent search backwards with seed entities being those from the last hop of the forwards search that contain at least one of the ground truth answers, then taking the intersection of the sets of entities we get from the forwards and backwards searches.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea!

id_counter = EntityIdCounter()


def add_value_to_dict_of_list(dict, property, value):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this function if we use DefaultDict :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultDict would be a replacement if we were assigning the values to the dictionary. Here, we are appending them. DefaultDict can be used to get rid of the if statement though -- I would in this case still keep it in a function.

id_counter = EntityIdCounter()


def add_value_to_dict_of_list(dict, property, value):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming a variable dict may be dangerous considering it's the name of the type too!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no! Well spotted.

dict[property].append(value)


def parse_line(line, prev_entity_name, kb_list):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment applies here and across the script: please add type annotations! :D

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@tminka
Copy link
Copy Markdown
Member

tminka commented Jul 29, 2024

Please use the Entity class to represent entities

@Slowika
Copy link
Copy Markdown
Author

Slowika commented Jul 30, 2024

Please use the Entity class to represent entities

I'll add a description that this is outside of the scope of this PR. This is an initial exploration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants