Topic: a huge backlog of (disc *) releases on MB

While editing some obscure artists I often find that we still have old-style release titles with "(disc 1)" etc. at the end. Just out of curiosity I searched for releases and found 46,682 results for "(disc 1)". That number is still quite huge.

I think we can script a bot to clean it up a bit. Or maybe a bot could open edits for voting at a controlled rate, so that things can be cross-checked as well. I'm aware that we are always short of voters.

I don't think MB's web service has an API for making edits, but a carefully crafted server-side bot could be helpful. The bot could open merge edits based on some heuristic, such as matching release date, barcode and catalog number in addition to the release title, which should keep the false positive rate very low. I hope the number can easily be reduced twentyfold.
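The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing bot: the release records are plain dictionaries with hypothetical keys (`id`, `title`, `date`, `barcode`, `catno`), and the function only proposes groups of releases; it does not open any edits.

```python
import re
from collections import defaultdict

# Trailing "(disc N)" marker, optionally with a disc subtitle such as "(disc 2: Live)".
DISC_SUFFIX = re.compile(r"\s*\(disc\s+(\d+)(?::[^)]*)?\)\s*$", re.IGNORECASE)


def split_disc_title(title):
    """Return (base_title, disc_number), or (title, None) if there is no marker."""
    match = DISC_SUFFIX.search(title)
    if match is None:
        return title, None
    return title[:match.start()], int(match.group(1))


def merge_candidates(releases):
    """Group releases that share base title, date, barcode and catalog number.

    `releases` is an iterable of dicts with the hypothetical keys
    "id", "title", "date", "barcode" and "catno".
    """
    groups = defaultdict(list)
    for release in releases:
        base, disc = split_disc_title(release["title"])
        # Be conservative: skip anything without a disc marker or with a
        # missing identifying field, since an empty barcode proves nothing.
        if disc is None or not all(release.get(k) for k in ("date", "barcode", "catno")):
            continue
        key = (base.casefold(), release["date"], release["barcode"], release["catno"])
        groups[key].append((disc, release["id"]))

    candidates = []
    for members in groups.values():
        discs = [disc for disc, _ in members]
        # Need at least two discs, and no disc number may appear twice.
        if len(members) >= 2 and len(set(discs)) == len(discs):
            candidates.append([release_id for _, release_id in sorted(members)])
    return candidates
```

Requiring every field to be present and rejecting groups with duplicate disc numbers trades coverage for a lower false positive rate; releases that fall through would still need a human editor.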

2 (edited by hrglgrmpf 2011-11-24 22:41:51)

Re: a huge backlog of (disc *) releases on MB

I think your search may be wrong, mine has only 13,156. There is also a report which lists releases with (disc n) or (bonus disc), with 31,255 hits.

I don't think a bot can do this in all cases. Maybe in the simplest ones (exact same title except for (disc n), same release group, same album artist, ...). The problem is, since many of those releases are in "obscure" artists or "Various Artists" (to which nobody is subscribed), these automatic merges will most probably not be overseen by any human.
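The "simplest cases" mentioned above can be written down as a strict pairwise check. This is a minimal sketch, assuming each release is a dictionary with the hypothetical keys `title`, `release_group` and `artist`; it is not taken from any MusicBrainz code.

```python
import re

# Trailing "(disc N)" or "(bonus disc)" marker at the end of a title.
DISC_MARKER = re.compile(r"\s*\((?:disc\s+(\d+)|bonus disc)\)\s*$", re.IGNORECASE)


def parse_title(title):
    """Return (base_title, disc), where disc is an int, "bonus", or None."""
    match = DISC_MARKER.search(title)
    if match is None:
        return title, None
    number = match.group(1)
    return title[:match.start()], int(number) if number else "bonus"


def is_simple_case(a, b):
    """True only if two releases differ in nothing but their disc marker."""
    base_a, disc_a = parse_title(a["title"])
    base_b, disc_b = parse_title(b["title"])
    return (
        disc_a is not None
        and disc_b is not None
        and disc_a != disc_b                    # two different discs
        and base_a == base_b                    # exact same title otherwise
        and a["release_group"] == b["release_group"]
        and a["artist"] == b["artist"]
    )
```

A check like this cannot tell a real two-disc set from two separately sold releases that happen to share a title, so its output would still need human review.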

A partial solution to this problem (and to all the other reports!) is, I think, solving MBS-2662 (filter reports by subscribed entities). I have zero motivation to clean up a 31,255-hit report of obscure artists I have never heard of. However, I am motivated to clean up a 100-1000-hit report of artists I know (and am subscribed to). If there are more people like me (with a different set of subscribed artists), the report will get smaller.

For VA compilations, it would be good if the report could not only be filtered by release artist, but also by track/recording artist.

P.S.: HumHumXX for example is doing an amazing job merging all VA-compilations released in Germany (!).

Re: a huge backlog of (disc *) releases on MB

Eh, strange! Removing "advanced" changes the number drastically. I don't know what's going on. Either way, having a filtered report will certainly go a long way.

Having a bot do some of the manual labour before a voting drive was just an idea. I think my day job, where I work close to the database and do things at a scarily low level, has changed the way I think about addressing problems. :D

VA releases are the ones that get left behind all the time. No wonder they are in such a mess (by our elevated standards, of course; not a freedb-style mess).

Re: a huge backlog of (disc *) releases on MB

Yes, without advanced=1 I think the search is fuzzy, so there are false positives. The bot idea is good as a last resort. Let's see how things develop after MBS-2662 is implemented. Another important feature (but more for the other reports) is MBS-2919, which would make it possible to mark false positives in reports.

For my slave server I implemented a hack to show only subscribed artists in the report: It shows that I have 1881 releases with (disc n) still ahead to be fixed :-).

VA releases are a problem, especially because they don't appear in the "Edits for my subscribed artists" page any more (see also MBS-3311).

Re: a huge backlog of (disc *) releases on MB

Modbot did merge a lot of multi-disc releases before we switched to NGS (the current system which allows multi-disc releases, a two disc release used to be two separate releases before this). It was decided to only automatically merge multi-disc releases that had an AR linking the two to minimise incorrect merges. So all (disc x) titles still in the database will have to be fixed by ourselves. That is a lot of work, and we may never merge all of them, but they will be fixed by editors whenever they need the releases for tagging (or when editors get bored and merge some random ones). So the most important ones will be fixed, and I don't think it's a disaster that releases nobody uses exist in a sub-optimal way.

Re: a huge backlog of (disc *) releases on MB

13,000 is eminently doable :-)

Re: a huge backlog of (disc *) releases on MB

When a cleanup was made for adding "part of" relationships back in the old MusicBrainz, the number went down by more than this amount in just one month IIRC. So yeah, definitely doable.

Re: a huge backlog of (disc *) releases on MB

The last 1,000 will be hard. Those are probably missing one or more discs, and it may be difficult to find any evidence for them online. But that'd be a nice problem to end up with! :)

9 (edited by HumHumXX 2011-11-26 13:37:53)

Re: a huge backlog of (disc *) releases on MB

http://musicbrainz.org/search?query=dis … advanced=1

If the above query is representative of all disc n releases in the db, I'm afraid we'll be left with ~ 10,000 releases that don't have corresponding discs. I started with ~ 1,500, now I'm down to ~ 300 (1/5), and - a handful of releases aside - there's nothing left to merge. In May 2011, the overall report listed ~ 50,000 releases, IIRC.

Re: a huge backlog of (disc *) releases on MB

One of the issues is the relatively convoluted process required to fix those releases (especially if you are not an autoeditor).

Re: a huge backlog of (disc *) releases on MB

The advanced search doesn't work properly at the moment, does it? I used to get 300+ hits for the above query.

Re: a huge backlog of (disc *) releases on MB

HumHumXX: http://musicbrainz.org/search?query=%22 … advanced=1 is what you want now (parentheses must apparently be indicated and put into quotation marks for it to work, go figure). You might want to raise a ticket about it.

Re: a huge backlog of (disc *) releases on MB

Isn't this the same issue as http://tickets.musicbrainz.org/browse/SEARCH-159 ?

Re: a huge backlog of (disc *) releases on MB

Most likely it is.

15 (edited by lytron 2011-12-24 07:31:28)

Re: a huge backlog of (disc *) releases on MB

I'm trying to clean up these releases.
There are a lot of singles with "(disc 1)" and "(disc 2)", where these suffixes are used to distinguish different versions (for example: Radiohead's Knives Out). Should these be kept, or should I change them, too?

Re: a huge backlog of (disc *) releases on MB

:/ I couldn't find any docs about this in a normal place, just at http://musicbrainz.org/doc/Multi-Disc_Release - Different volumes in a series that are released separately should stay separated even if they are called "CD 1" and "CD 2", as should UK single pairs that are released with "CD 1" and "CD 2" in their titles

I'd say move that information to the comment - but not sure. Definitely don't merge them though!

Re: a huge backlog of (disc *) releases on MB

reosarevok wrote:

:/ I couldn't find any docs about this in a normal place, just at http://musicbrainz.org/doc/Multi-Disc_Release - Different volumes in a series that are released separately should stay separated even if they are called "CD 1" and "CD 2", as should UK single pairs that are released with "CD 1" and "CD 2" in their titles

I'd say move that information to the comment - but not sure. Definitely don't merge them though!

That's what nikki and I figured would be the best way to do it when we discussed it on IRC many months ago.