Topic: Artists : "merge" rules

Hello guys.
I'm attempting to match your artist names to those in my own database. If successful, I would then be able to provide you with API access to my data via your own artist MBIDs.

I understand that the different spellings you have in many cases (Booker T. for example - see below) will eventually be merged down into one entry with a unique MBID. It would be really helpful if you had a rule whereby merges would be allocated to, say, the lowest (i.e. earliest) entry for the artist in question - 10647 in the example of Booker T.

Otherwise there seems to be no way I can pin down the unique MBID's for "multi-spelled" artists.
(Incidentally I do understand that the codes below are attached to the 36 char MBIDs in your artist file. I am working from the mbdump download.)

Thanks for your attention.

10647     Booker T. And The MG's    
306056     Booker T & The MGs    
470290     Booker T. & The Mgs
652862     Booker T. & The MG's
687684     Booker T & the M.G.s
979829     Booker T & The MG's
989185     Booker T. & The M.G.s
1027189   Booker T & The M.G.'s

Re: Artists : "merge" rules

I'd suggest that's unlikely to happen. Merges are made to whichever artist entry (with whatever row ID or MBID) editors deem most appropriate. Such IDs are not important to users when they make the merge edits (row IDs are not even visible to them), and nor should they be, really.

As was noted on http://forums.musicbrainz.org/viewtopic.php?id=2675 your application should not depend on row IDs; you should match and query by MBID and only join on row ids internally within MB tables.

As for your specific example, where are those IDs from? Are they row IDs for aliases? Or artist credits? They all look like aliases for the same artist http://musicbrainz.org/artist/377015fb- … df6a7ea99e which should be known only as "377015fb-c02f-4b05-960b-e0df6a7ea99e" to any external application.

Can you explain what you mean by "pin[ning] down the unique MBID's" or which situation you are trying to resolve, with an example?

Re: Artists : "merge" rules

voiceinsideyou wrote:

As was noted on http://forums.musicbrainz.org/viewtopic.php?id=2675 your application should not depend on row IDs; you should match and query by MBID and only join on row ids internally within MB tables.

Yes, I understand that.

voiceinsideyou wrote:

As for your specific example, where are those IDs from? Are they row IDs for aliases? Or artist credits? They all look like aliases for the same artist http://musicbrainz.org/artist/377015fb- … df6a7ea99e which should be known only as "377015fb-c02f-4b05-960b-e0df6a7ea99e" to any external application.

The "Booker T" ids are from your "artist_name" file.
I seem to have hit upon a poor example. On further examination, only one of those ids appears in your "artist" file and does indeed have the unique MBID you mention. My apologies.

voiceinsideyou wrote:

Can you explain what you mean by "pin[ning] down the unique MBID's" or which situation you are trying to resolve, with an example?

Obviously I can only match names against names. Accordingly my starting query would be: for each of the names in your "artist_name" file, is there a match in my own "artist" table? If so, store your "artist_name" id against my "artist" id.
My query enables differently punctuated names to match, so in the "Booker T" case all the variations I showed above were matched to my one entry "Booker T. & the MGs". In other words, I now have several of your "artist_name" ids (aliases for Booker T) corresponding to one of my unique "artist" ids.
From here, I obviously need to ascertain the MBID from the "artist" file based on the alias(es). If my logic is, er, logical thus far, then only one "alias" code will be present in "artist" and its MBID will therefore be unique to the artist and thence to my own artist id.
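
For illustration, here is a minimal sketch of the kind of punctuation-insensitive matching described above. The actual query is not shown in this thread, so the helper names normalize_name and group_by_key and the normalisation rules (ignore case, treat "&" and "and" alike, drop punctuation) are assumptions. The rows are the "artist_name" ids listed in the first post.

```python
import re
from collections import defaultdict

def normalize_name(name: str) -> str:
    """Hypothetical normalisation: ignore case, '&' vs 'and', and punctuation."""
    s = name.lower().replace("&", " and ")
    s = re.sub(r"[^\w\s]", "", s)   # drop punctuation such as '.' and "'"
    return " ".join(s.split())       # collapse runs of whitespace

# (artist_name id, name) rows quoted in the first post
artist_names = [
    (10647, "Booker T. And The MG's"),
    (306056, "Booker T & The MGs"),
    (470290, "Booker T. & The Mgs"),
    (652862, "Booker T. & The MG's"),
    (687684, "Booker T & the M.G.s"),
    (979829, "Booker T & The MG's"),
    (989185, "Booker T. & The M.G.s"),
    (1027189, "Booker T & The M.G.'s"),
]

def group_by_key(rows):
    """Group artist_name ids by their normalised name."""
    groups = defaultdict(list)
    for row_id, name in rows:
        groups[normalize_name(name)].append(row_id)
    return dict(groups)

groups = group_by_key(artist_names)
# The local entry "Booker T. & the MGs" normalises to the same key
local_key = normalize_name("Booker T. & the MGs")
```

With these rules all eight spellings fall under one key, which is the "several artist_name ids for one of my artist ids" situation described above.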

I hope I'm explaining myself sufficiently. Please do correct me if you think I'm getting something wrong.

Re: Artists : "merge" rules

How large is your database, that is, how many artists do you have to match? I think automatic matching should always be done with care; you need a human if you have more than one match (on either side, your database or MusicBrainz).

Re: Artists : "merge" rules

@hrglgrmpf: I absolutely agree with you. I'm matching around sixteen thousand artists; any "programmed" matching is always followed by human interpretation of the results.

Re: Artists : "merge" rules

Not quite sure where to start. Everything else seems to roughly make sense apart from your request to merge to a particular artist_name row ID. That part I don't understand - if you aren't storing the rowIDs, why would it matter the direction future merges go in?

(edited by hrglgrmpf 2011-09-13 01:42:00)

Re: Artists : "merge" rules

@tangerine: Maybe it would be less complicated to set up your own search server (http://bugs.musicbrainz.org/browser/sea … unk/README)? If that is not an option, I would do the following:

* Since you have far fewer artists than MusicBrainz, I would start with the 16,000 artists of yours.
* For every artist name in your database, search in the "artist_name" table
* For every match, try to get the "gid" column from the "artist" table, where artist_name.id = artist.name
* If you have more than one gid, human interaction is necessary.
* Your end result is a mapping between your artist id and the gid of the MusicBrainz artist. This is very important: the MusicBrainz artist.id is useless, so only store the gid column. You can get the artist.id later by a simple lookup via gid in the "artist" (and "artist_gid_redirect") table
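
The steps above can be sketched roughly as follows, with an in-memory SQLite database standing in for the mbdump tables. The join (artist_name.id = artist.name) and the "gid" column come from the list above; the toy rows, the made-up artist.id of 1, the choice of which name row the artist points at, and the function names are assumptions for illustration only. Exact name matching is used here to keep the sketch short.

```python
import sqlite3

MBID = "377015fb-c02f-4b05-960b-e0df6a7ea99e"

def build_toy_db() -> sqlite3.Connection:
    """Tiny stand-in for the mbdump tables (only the columns used here)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artist_name (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist (id INTEGER PRIMARY KEY, name INTEGER, gid TEXT);
    """)
    conn.executemany("INSERT INTO artist_name VALUES (?, ?)", [
        (10647, "Booker T. And The MG's"),
        (652862, "Booker T. & The MG's"),
    ])
    # Assumption: which artist_name row the artist references is not stated
    # in the thread; 652862 and artist.id = 1 are picked arbitrarily.
    conn.execute("INSERT INTO artist VALUES (1, 652862, ?)", (MBID,))
    return conn

def find_gids(conn, name):
    """Return the distinct gids of artists whose official name matches."""
    rows = conn.execute(
        "SELECT DISTINCT a.gid FROM artist_name n "
        "JOIN artist a ON a.name = n.id WHERE n.name = ?",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]

def match_artists(conn, my_artists):
    """Map local artist id -> gid; anything ambiguous goes to a review pile."""
    mapping, needs_review = {}, {}
    for my_id, name in my_artists:
        gids = find_gids(conn, name)
        if len(gids) == 1:
            mapping[my_id] = gids[0]      # store the gid only, never artist.id
        elif len(gids) > 1:
            needs_review[my_id] = gids    # a human has to decide
    return mapping, needs_review

conn = build_toy_db()
mapping, needs_review = match_artists(
    conn, [(42, "Booker T. & The MG's"), (43, "Booker T. And The MG's")]
)
```

In this toy data only local id 42 gets a gid, because the name row 10647 is not referenced from "artist" at all.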

It seems that the artist_name table contains names of artists that are not even used anymore, can anybody confirm this? E.g. 10647 in the example? If so, you can ignore all artist names in MusicBrainz that are not referenced in the "artist" table...

Re: Artists : "merge" rules

Agree that the search server option is probably the best option. With 16,000 artists I'd just make 1-second separated queries to the web service actually; and store the top 5 results or whatever. Will only take 4-5 hours to run; and given it's a relatively once-off process that shouldn't be such an issue.
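
A rough sketch of what such a throttled run could look like. The actual web service call is deliberately left out: "search" is a hypothetical callable you would supply yourself (taking a name and returning a list of candidate results), and the function names and parameters here are assumptions, not part of any MusicBrainz client.

```python
import time

def throttled_search(names, search, delay=1.0, top_n=5, sleep=time.sleep):
    """Call search(name) for each name, pausing `delay` seconds between calls.

    `search` is a hypothetical user-supplied function; only the first
    `top_n` results per name are kept.
    """
    results = {}
    for i, name in enumerate(names):
        if i:                      # no pause needed before the first query
            sleep(delay)
        results[name] = list(search(name))[:top_n]
    return results

def estimated_hours(n_queries, delay=1.0):
    """Lower bound on run time from the pauses alone."""
    return n_queries * delay / 3600
```

For 16,000 artists at one query per second, estimated_hours(16000) comes to a little under four and a half hours, before counting the time the requests themselves take.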

hrglgrmpf wrote:

* For every match, try to get the "gid" column from the "artist" table, where artist_name.id = artist.name

I think this deserves a little bit of expansion. artist_names can be used for three purposes, as I understand - and can be seen here:
1) official artist names (handled by the above)
2) aliases for official artists (handled in artist_alias table)
3) artist credits (I always get confused about the references from both artist_credit and artist_credit_name table here...)

I'd suggest that searching for #1 and #2 will be sufficient to get MOST of your matches here.
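
As a sketch of searching #1 and #2 together: the query below looks a name up both as an official artist name and as an alias. The artist_alias columns used here (artist referencing artist.id, name referencing artist_name.id) are my assumption about the schema, and the toy rows are invented for illustration; check the real table definitions before relying on this.

```python
import sqlite3

MBID = "377015fb-c02f-4b05-960b-e0df6a7ea99e"

NAME_OR_ALIAS_QUERY = """
SELECT a.gid
  FROM artist a
  JOIN artist_name n ON a.name = n.id
 WHERE n.name = :name
UNION
SELECT a.gid
  FROM artist_alias al
  JOIN artist_name n ON al.name = n.id
  JOIN artist a ON al.artist = a.id
 WHERE n.name = :name
"""

def build_alias_db() -> sqlite3.Connection:
    """Toy tables; column layout of artist_alias is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artist_name (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist (id INTEGER PRIMARY KEY, name INTEGER, gid TEXT);
        CREATE TABLE artist_alias (id INTEGER PRIMARY KEY, artist INTEGER, name INTEGER);
    """)
    conn.executemany("INSERT INTO artist_name VALUES (?, ?)", [
        (10647, "Booker T. And The MG's"),
        (652862, "Booker T. & The MG's"),
    ])
    conn.execute("INSERT INTO artist VALUES (1, 652862, ?)", (MBID,))
    # Invented row: treat name 10647 as an alias of artist 1
    conn.execute("INSERT INTO artist_alias VALUES (1, 1, 10647)")
    return conn

def gids_for_name(conn, name):
    """Distinct gids reachable via official name or alias (UNION de-duplicates)."""
    rows = conn.execute(NAME_OR_ALIAS_QUERY, {"name": name}).fetchall()
    return sorted(r[0] for r in rows)

alias_conn = build_alias_db()
```

Artist credits (#3) are left out here, as suggested above.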

hrglgrmpf wrote:

It seems that the artist_name table contains names of artists that are not even used anymore, can anybody confirm this? E.g. 10647 in the example? If so, you can ignore all artist names in MusicBrainz that are not referenced in the "artist" table...

Don't believe the latter is true - are you sure these aren't just aliases or artist credit-related entries?

Re: Artists : "merge" rules

You are 100% right, I saw artist_credit_name and thought it contains names; and completely forgot about aliases :-(. With your addition, it makes a pretty ok algorithm for name matching I guess. A more advanced strategy would be to also match releases of the artists, to find out if the match is correct...

Maybe the search server / web service isn't that good actually, because e.g. it returns 3 matches for artist:"The Beatles" (so you would probably have more than one match for most of the 16,000 artists, which means too much work...).

Re: Artists : "merge" rules

Guys, very many thanks for your help. Give me a while to digest everything and then I'll come back.

Re: Artists : "merge" rules

With plenty of hindsight, I can see how my original query would have been confusing. My apologies. I was working from your mbdump files, having recently returned to the task after having to shelve an earlier attempt owing to pressure of other work. I had been building temporary tables for cross-referencing purposes - hence storing some rowIDs - but it now looks easier to use your XML web service.

I should have just spelled out my underlying concern, which is this:
Given that it is possible for an artist to be allocated more than one MBID, how does your merge procedure deal with that situation? i.e. can an MBID ever be decommissioned or re-allocated?

Thanks again for your time.

Re: Artists : "merge" rules

AFAIK, the merge process makes all MBIDs point to the merged entity. The *remove* process, though, loses the MBID. It could technically be re-allocated thus, although the sheer amount of MBIDs makes it extremely unlikely.
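
If you work from the dump rather than the web service, a stored MBID could be checked along these lines. This is only a sketch: the "artist_gid_redirect" table is mentioned earlier in the thread, but its columns (gid for the old MBID, new_id pointing at artist.id) are my assumption, and the MBIDs below are obviously fake placeholders.

```python
import sqlite3

CURRENT = "00000000-0000-0000-0000-000000000001"  # fake MBID, still in artist
MERGED  = "00000000-0000-0000-0000-000000000002"  # fake MBID, merged away
REMOVED = "00000000-0000-0000-0000-000000000003"  # fake MBID, artist removed

def build_redirect_db() -> sqlite3.Connection:
    """Toy tables; the artist_gid_redirect layout is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artist (id INTEGER PRIMARY KEY, gid TEXT);
        CREATE TABLE artist_gid_redirect (gid TEXT PRIMARY KEY, new_id INTEGER);
    """)
    conn.execute("INSERT INTO artist VALUES (1, ?)", (CURRENT,))
    conn.execute("INSERT INTO artist_gid_redirect VALUES (?, 1)", (MERGED,))
    return conn

def resolve_mbid(conn, mbid):
    """Return the current gid for a stored MBID, or None if it is orphaned."""
    row = conn.execute(
        "SELECT gid FROM artist WHERE gid = ?", (mbid,)
    ).fetchone()
    if row:
        return row[0]              # still a live artist MBID
    row = conn.execute(
        "SELECT a.gid FROM artist_gid_redirect r "
        "JOIN artist a ON a.id = r.new_id WHERE r.gid = ?",
        (mbid,),
    ).fetchone()
    return row[0] if row else None  # merged target, or None if removed

redirect_conn = build_redirect_db()
```

A None result would correspond to the *remove* case, where the MBID no longer points anywhere.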

Re: Artists : "merge" rules

OK, that's pretty much what I guessed. So my worst-case scenario would be that, long term, I might end up with a few artists linked to orphaned MBIDs. I could check for that periodically by hitting your web service.
Thank you, reosarevok. I appreciate your help.

Re: Artists : "merge" rules

Very helpful! :D