What did I do?
This week, I created a Python script that loads the mismatch database with the contents of the data directory. I also finalized a file/directory structure for this data directory:
└── data
    └── package_namespace
        └── product
            └── mismatch_relations.yml
For example:
└── data
    └── pypi
        └── zstandard
            └── mismatch_relations.yml
These mismatch_relations files contain mapping information for PURLs and invalid_vendors:
purls:
  pkg:pypi/zstandard
invalid_vendors:
  - facebook
So, this database populator script stores the mapping of PURLs and invalid vendors into the mismatch database.
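To make the flow concrete, here is a minimal sketch of what such a populator could look like. It is not the actual script: it assumes SQLite as the database, a hypothetical table named mismatch with purl and vendor columns, and a tiny hand-written parser that only understands the two-key layout shown above (a real script would more likely use a YAML library).

```python
import sqlite3
from pathlib import Path


def parse_relations(text):
    """Parse the two-key layout shown above (not a general YAML parser)."""
    data = {"purls": [], "invalid_vendors": []}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and line[:-1] in data:
            # A section header such as "purls:" or "invalid_vendors:".
            key = line[:-1]
        elif key is not None:
            # An entry under the current section, with or without a list dash.
            if line.startswith("- "):
                line = line[2:].strip()
            data[key].append(line)
    return data


def populate(data_dir, conn):
    """Walk data/<namespace>/<product>/ and store every PURL/vendor pair."""
    # Hypothetical schema, chosen only for this illustration.
    conn.execute("CREATE TABLE IF NOT EXISTS mismatch (purl TEXT, vendor TEXT)")
    for path in sorted(Path(data_dir).glob("*/*/mismatch_relations.yml")):
        relations = parse_relations(path.read_text(encoding="utf-8"))
        rows = [
            (purl, vendor)
            for purl in relations["purls"]
            for vendor in relations["invalid_vendors"]
        ]
        conn.executemany("INSERT INTO mismatch (purl, vendor) VALUES (?, ?)", rows)
    conn.commit()
```

Calling populate("data", sqlite3.connect("mismatch.db")) would then load every mismatch_relations.yml file found two levels below data; the database file name here is also just a placeholder.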
While working on this, we iterated on the naming of the database and decided to rename it from deduplication to mismatch, so I made that change. During this process, we also discussed whether the DELETE query I was using to ensure the uniqueness of the data was efficient performance-wise. After looking into it, I found a more efficient approach and implemented that as well.
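For illustration only, one common way to keep rows unique in SQLite without a cleanup DELETE is to declare a UNIQUE constraint and insert with INSERT OR IGNORE, so duplicates are skipped at write time. This is a sketch of the general pattern, not necessarily the change that went into the script; SQLite as the engine and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the UNIQUE constraint lets the database reject duplicates.
conn.execute(
    "CREATE TABLE mismatch ("
    "purl TEXT NOT NULL, vendor TEXT NOT NULL, UNIQUE (purl, vendor))"
)


def insert_unique(conn, rows):
    """Insert rows, silently skipping any (purl, vendor) pair already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO mismatch (purl, vendor) VALUES (?, ?)", rows
    )
    conn.commit()


rows = [("pkg:pypi/zstandard", "facebook")]
insert_unique(conn, rows)
insert_unique(conn, rows)  # running the load a second time adds nothing

count = conn.execute("SELECT COUNT(*) FROM mismatch").fetchone()[0]
```

With this pattern the uniqueness check happens through the index behind the constraint, so there is no separate pass over the table to remove duplicates after each load.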
What’s coming up next week?
For the upcoming week, I plan to:
- Create a CI script that triggers/executes the populator script, which will load the contents of the PR into the database.
- Add a flag that allows inserting data from any particular subdirectory of data.
- Write detailed documentation on “How to fix a mismatch”.
- Write an initial test to check the script’s functionality.
This CI script will trigger when the data directory is modified. I’m also thinking of adding a demo PR of this to facilitate documentation.
Catch you later! More to come next week.