MetaQA: entity creation and full ground-truth generation#9
ti250 left a comment
Thank you for implementing this, Aga! I've taken a quick look and added a few comments. :) Let me know when they've been addressed.
)
os.makedirs(output_dir_for_split, exist_ok=True)
ground_truth = hop_function(split)
with open(
NIT: The two nested with statements can be merged into a single one to reduce indentation (with open(blah) as f_1, open(blah) as f_2).
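As a minimal sketch of this suggestion: a single `with` statement can open both output files. The function name and the file names `answers.txt` and `all_relevant.txt` are hypothetical and are not taken from the PR.

```python
import os


def write_ground_truth(output_dir, answers, relevant):
    """Write two pipe-separated files with one `with` statement.

    The file names below are placeholders chosen for illustration.
    """
    os.makedirs(output_dir, exist_ok=True)
    answers_path = os.path.join(output_dir, "answers.txt")
    relevant_path = os.path.join(output_dir, "all_relevant.txt")
    # Both files are opened together and closed together when the block
    # exits, so only one level of indentation is needed.
    with open(answers_path, "w") as f_answers, open(relevant_path, "w") as f_all:
        f_answers.write("|".join(answers))
        f_all.write("|".join(relevant))
    return answers_path, relevant_path
```

If the second `open` fails, the first file is still closed properly, which matches the behaviour of the nested form.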
final_names = [
    kb_list[entity_id]["name"] for entity_id in final_answer
]
f_all.write("|".join(all_relevant_names))
Since we seem to have the entity IDs, I feel it would be better to write those down rather than the names, especially for things like movies where we can actually disambiguate them.
return dict[key] if key in dict else []


def get_all_relevant_entities(base_query_entity_id, hop_types):
If I'm not misunderstanding this code, we can significantly decrease the number of entities we get in the ground truth from what we have here by looking at the answers, which are given to us in the ground truth, and culling all entities that do not lead to them.
A "simple" way to do this (especially since the forward search seems to be doing well in terms of performance anyway) may be to do the equivalent search backwards with seed entities being those from the last hop of the forwards search that contain at least one of the ground truth answers, then taking the intersection of the sets of entities we get from the forwards and backwards searches.
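A rough sketch of this pruning idea, applied layer by layer. It assumes the knowledge base is held as a nested dict `graph[entity][relation] -> list of entities`, which may not match the structure used in the PR. The function names and the relation names in the usage note are hypothetical.

```python
def forward_layers(graph, seed, hop_types):
    """Forward search: layers[i] holds the entities reached after i hops."""
    layers = [{seed}]
    for relation in hop_types:
        reached = set()
        for entity in layers[-1]:
            reached.update(graph.get(entity, {}).get(relation, []))
        layers.append(reached)
    return layers


def prune_to_answers(graph, seed, hop_types, answers):
    """Keep only forward-reachable entities that lie on a path to an answer."""
    layers = forward_layers(graph, seed, hop_types)
    kept = [set() for _ in layers]
    # Seed the backward pass with last-hop entities that are true answers.
    kept[-1] = layers[-1] & set(answers)
    # Walk the hops in reverse, keeping an entity only if one of its
    # outgoing edges for that hop lands in the already-kept next layer.
    for i in range(len(hop_types) - 1, -1, -1):
        relation = hop_types[i]
        kept[i] = {
            entity
            for entity in layers[i]
            if kept[i + 1].intersection(graph.get(entity, {}).get(relation, []))
        }
    return set().union(*kept)
```

For a toy graph where `actor_a` starred in `movie_1` and `movie_2`, directed by `director_x` and `director_y` respectively, calling `prune_to_answers` with the answer set `{"director_x"}` keeps `actor_a`, `movie_1`, and `director_x`, and drops `movie_2` and `director_y`.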
id_counter = EntityIdCounter()


def add_value_to_dict_of_list(dict, property, value):
I don't think we need this function if we use DefaultDict :)
DefaultDict would be a replacement if we were assigning the values to the dictionary. Here, we are appending them. DefaultDict can be used to get rid of the if statement though -- I would in this case still keep it in a function.
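For reference, a small sketch of how the helper might look with `collections.defaultdict` from the standard library (`typing.DefaultDict` is the matching type annotation). The parameter name `mapping` and the sample keys and values are illustrative only.

```python
from collections import defaultdict


def add_value_to_dict_of_list(mapping, key, value):
    # With defaultdict(list), a missing key is created as an empty list on
    # first access, so no explicit "if key in mapping" check is needed.
    mapping[key].append(value)


properties = defaultdict(list)
add_value_to_dict_of_list(properties, "directed_by", "director_x")
add_value_to_dict_of_list(properties, "directed_by", "director_y")
```

One thing to keep in mind: reading `properties[key]` for a missing key also inserts that key, so lookups that should not modify the dictionary are better done with `properties.get(key, [])`.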
id_counter = EntityIdCounter()


def add_value_to_dict_of_list(dict, property, value):
Naming a variable dict may be dangerous considering it's the name of the type too!
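A short illustration of the risk, using hypothetical function names: once a parameter is called `dict`, the built-in constructor is no longer reachable inside that function.

```python
def merge_shadowed(dict, pairs):
    # `dict` here is the argument, so dict(pairs) tries to call a dict
    # instance and raises TypeError.
    dict.update(dict(pairs))
    return dict


def merge(mapping, pairs):
    # With a different parameter name, the built-in dict() works as usual.
    mapping.update(dict(pairs))
    return mapping
```

Renaming the parameter (here to `mapping`) avoids the problem without changing behaviour.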
dict[property].append(value)


def parse_line(line, prev_entity_name, kb_list):
Comment applies here and across the script: please add type annotations! :D
Please use the Entity class to represent entities.
I'll add a description that this is outside of the scope of this PR. This is an initial exploration.
This PR creates an entity-oriented representation of the MetaQA knowledge base. It starts from the MetaQA relational triples and produces a set of entities with unique IDs. In addition, the vanilla question datasets from MetaQA are expanded and adapted so that the ground truth is no longer just the names of the entities in the final answer to a question: it also includes references to all the entities needed to reach that answer.
The goal of this PR is to introduce the initial logic and infrastructure for creating a structured KB from the MetaQA files. A future PR should convert the notion of the entity that exists here to the Entity class in this repo.
The script is invoked with the following command line:
Upon running the script, the output directory is populated with:
- kb.json: a JSON representation of the knowledge base
- hops_1, hops_2, hops_3, representing the three types of questions in MetaQA, each with train, dev, and test subdirectories, each containing text files with the ground truth for the MetaQA questions