DISCUSSION TOPIC
"DATABASE CLEANSING"

BACK TO THE DISCUSSION BOARD
Posted by: Yuri Tsivian Date: 2013-12-07

Cinemetrics as a website is some eight years old now, and it continues to grow. The number of people who measure film data and submit them to Cinemetrics grows weekly; their cumulative number is close to 2,000. It would take a separate calculation to assess the flow of Cinemetrics clients: some come and go, others come and become our regular submitters. But the increase in submitted films is steady. Cinemetrics’ Measurement Database counts 13,561 entries as I write this; by the time I finish writing, the number will surely have changed.

This is good news; the bad news is that the database needs some cleansing. My previous topic posted on this discussion board was called “Once again on the accuracy of data”. It triggered 13 replies from a number of Cinemetrics regulars equally concerned about the reliability of our data pool. That post was mainly about how to increase the precision with which we measure things; this one is about what we want to do to rank and order the 13,000-plus items already found in the database.

From the outset, Cinemetrics was designed to provide tools and storage for film researchers interested in quantifying the material they work on. This meant (and still means, since we have no plans to change the philosophy of Cinemetrics) that Cinemetrics is a self-service site, one which relies on the intellectual integrity of those who choose, measure and submit their data to our common pool. You submit your data and work with them as your own project requires, yet your data are also available to others. It is when it comes to others that the problem of verifying data arises. I can trust my own submissions, or at least I know their margins of error. But to what extent I can rely on your data depends on many factors: your reputation in the scholarly community, your experience with the Cinemetrics tools, the accuracy of your submissions overall, etc.

Early on in the project Keith Brisson grew keenly interested in finding an optimal way to determine which film data in the Cinemetrics database are full and reliable and which are partial or less reliable. Brisson outlined a precise, itemized plan for solving this issue, which he posted on our Discussion Board on 2011-05-04 under the title “New Cinemetrics data standards”. Keith also proposed to create a “profiling” piece of software that would rank data automatically using a number of criteria: a) the length of the submission as compared to the length of the film as given on the IMDb; b) the number of films earlier submitted by the person who submitted this particular film; c) how this particular submission tallies with other submissions of the same film.
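As a very rough illustration of what such a profiling tool might do, here is a hedged sketch in Python. The field names, the weights and the cap on submitter experience are my own assumptions for the sake of the example, not Keith's specification; the function simply folds his three criteria into a single 0-to-1 score.

```python
# Hypothetical profiling sketch: fold Brisson's three criteria into one score.
# Field names, weights and thresholds are illustrative assumptions only.

def profile_score(submission, all_submissions, imdb_runtime_sec):
    """Return a rough 0-to-1 reliability score for one submission."""
    # (a) how close the submitted length is to the IMDb running time
    length_ratio = submission["length_sec"] / imdb_runtime_sec
    length_score = max(0.0, 1.0 - abs(1.0 - length_ratio))

    # (b) submitter experience: number of earlier submissions, capped at 20
    earlier = sum(1 for s in all_submissions
                  if s["submitter"] == submission["submitter"]
                  and s["date"] < submission["date"])       # ISO date strings assumed
    experience_score = min(earlier, 20) / 20.0

    # (c) agreement with other submissions of the same film, compared by ASL
    peers = [s["asl"] for s in all_submissions
             if s["title"] == submission["title"] and s is not submission]
    if peers:
        median_asl = sorted(peers)[len(peers) // 2]
        agreement_score = max(0.0, 1.0 - abs(submission["asl"] - median_asl) / median_asl)
    else:
        agreement_score = 0.5   # no peers to compare against: stay neutral

    # equal weights, purely for illustration
    return (length_score + experience_score + agreement_score) / 3.0
```

How the three components should actually be weighted, and whether a single number is even the right output, is of course exactly the kind of question Keith's plan left open.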

 

Late in 2011 Keith accepted a job offer which required his full attention; consequently, he was unable to finish what he had planned for Cinemetrics. To bridge the gap left by Brisson’s departure, Ian Jones, a Cinema and Media Studies graduate student at the University of Chicago, volunteered to research various avenues of cleansing and ranking the Cinemetrics database. The work plan Ian and I worked out to achieve this was four-fold: deletion, segregation, ranking and unification. Ian is a film person, not a programmer; his task was not to take over Keith Brisson’s mission of creating a piece of software to do parts of this job automatically. As we agreed, Ian would do the preliminary “manual” work in this field:

Action 1: DELETION

Identify and delete from the database:

a)      “Test submissions” marked as such (search for “test” and other possible synonyms); copy such movies’ IDs (e.g. ID=13206) and send them to Gunars for deletion;

b)      Do the same for submissions with 0 values in ASL or MSL (search by ASL) and send the IDs to Gunars;

c)       Use the search method to identify movies without titles;

d)      Identify movies without a submitter or year and assess the validity of their data (a query sketch covering items a–d follows this list).
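As a concrete illustration of items a)–d), the following pandas sketch flags deletion candidates. It assumes the database has been exported to a CSV with columns named id, title, year, submitter, asl and msl; these names are my guesses, not the actual Cinemetrics schema, and anything flagged would still be reviewed by eye before IDs are sent to Gunars.

```python
import pandas as pd

# Assumed export and column names; the real Cinemetrics schema may differ.
db = pd.read_csv("cinemetrics_export.csv")

# a) titles containing "test", "try" and the like
#    (a notes column, if exported, could be checked the same way)
pattern = r"\b(?:test|try)\b"
is_test = db["title"].str.lower().str.contains(pattern, na=False)

# b) zero values in ASL or MSL
zero_values = (db["asl"] == 0) | (db["msl"] == 0)

# c) missing or empty titles
no_title = db["title"].isna() | (db["title"].str.strip() == "")

# d) missing submitter or year
no_meta = db["submitter"].isna() | db["year"].isna()

candidates = db[is_test | zero_values | no_title | no_meta]
candidates[["id", "title", "year", "submitter"]].to_csv("deletion_candidates.csv", index=False)
```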

 
Action 2: SEGREGATION
Identify:

a)      Films which are not films (for instance, OBAMA INFOMERCIAL (2008), BASEBALL WORLD SERIES, GAME 3 (2006, USA), PRESIDENTIAL STATE OF THE UNION ADDRESS (MSNBC) (2007), or the AMERICA'S NEXT TOP MODEL series of shows submitted by Christina Petersen); also, search by suspect words like “show”, etc.

b)      Films measured not by cuts but by other criteria; e.g. the Tsivian-submitted Chaplin films measuring the laugh frequency in the auditorium, like ID=613, submitted in 2007;

 
Action 3: RANKING BY ACCURACY
Identify:

a)      Frame-accurate submissions by: Barry Salt; James Cutting (minus films with data errors identified by Salt, Redfern and Baxter, see http://www.cinemetrics.lv/topic.php?topic_ID=373 ); Adriano Apra & Simone Starace together (NOT Starace separately); Heidi Heftberger. Films: TURKSIB ID=8283 and ID=8284; THE HOUSE OF HATE (EPISODE 1). Also: Eric T Jones and Viktorija Eksna reported some frame-accurate submissions, please check the comment boxes. Also, all films measured with FACT (check them one by one, there can be bugs and failures);

b)      Mark full-film data as separate from fragment data;

c)       Rank submissions by reliability:
Frequent clients vs. one-time clients;

Researchers vs. students (Holland, Taiwan, the Czech Republic; usually the whole class submits the same film);

Submissions with a minimal shot length of 0.1 or 0.2 seconds are suspected double-clicks (a flagging sketch follows this list);
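For the double-click criterion just above, a single filter is enough, assuming the export carries each submission's minimum shot length (the column name min_shot_sec below is hypothetical):

```python
import pandas as pd

db = pd.read_csv("cinemetrics_export.csv")      # same assumed export as above
suspects = db[db["min_shot_sec"] <= 0.2]        # hypothetical column name
print(len(suspects), "submissions contain a shot of 0.2 s or shorter")
```

Such short shots are not necessarily errors (ultra-fast cutting exists), so this can only mark entries for inspection, not for deletion.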

 
Action 4: UNIFY METADATA
a)      USA – United States

b)      “Russia” before Oct 1917; “Soviet Union” 1917-1991; “Russian Federation” 1991 to the present;

c)       Add missing metadata (a normalization sketch for items a and b follows this list).
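A minimal sketch of the country normalization in a) and b), assuming country and year arrive as plain strings and integers (the alias table and function name are mine; note that the October 1917 boundary cannot be resolved from a year alone):

```python
# Normalize country labels; the Russia / Soviet Union split depends on the film's year.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
}

def normalize_country(raw, year):
    key = raw.strip().lower()
    if key in ("russia", "soviet union", "ussr", "russian federation"):
        if year < 1917:               # films from 1917 itself would need checking by hand
            return "Russia"
        elif year <= 1991:
            return "Soviet Union"
        return "Russian Federation"
    return COUNTRY_ALIASES.get(key, raw.strip())

print(normalize_country("USA", 1960))      # United States
print(normalize_country("Russia", 1913))   # Russia
print(normalize_country("Russia", 1925))   # Soviet Union
```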

Replied by: Ian Jones Date: 2013-12-07

What I did was create a "meta-database" after going through almost all the entries in the Cinemetrics database. What I’ll try to insert below this post is a working copy of it, which covers around ¾ of all items in the database (as of November 2013). Here is an explanation of the different categories, cordoned off in different tabs of this document (I plan to make my meta-database available to the Cinemetrics community by the end of December 2013):

 

Strongly recommended for immediate deletion:

 

I’ve included here only the absolute worst offenders in the database:

 

(1) Entries with an ASL of 0;

(2) Entries explicitly marked as “test” or "try" somewhere in the title or notes;

(3) Entries so ambiguously titled as to make identification completely impossible.  Many of the titles here are gibberish, accompanied by gibberish in the year field and gibberish in the submitter field;

(4) Entries whose running time significantly departs from the official running time of the film (e.g., an entry that identifies itself as a full-length feature but is only a few minutes long), and in which the submitter has made no effort to label the entry as a segment or excerpt of the film in the title field, user comments, or anywhere else.

 

Confusing and Ambiguous:

 

There were many entries with vague and ambiguous identifying information that I was tempted to lump into the immediate-deletion category, but decided instead to give a category of their own.

 

The submissions here aren’t quite as disastrous and hopeless as those I recommended for immediate deletion, but they’re still pretty bad.  My attempts to identify them, by searching for any film that matched their title, year, running time, anything else offered by the submitter, failed.  

 

Perhaps someone with a different perspective might have more luck?  Perhaps not.  Most of them will probably have to be deleted.

 

Suspicious:

 

Anything with a running time that was clearly off went directly into the “immediate deletion” pile.  But things like numbers of shots obviously can’t be checked against an external source such as IMDb so easily.  Entries with ASLs that were wildly divergent from other submitters’ loggings of the same film, then, go here.  

 

As I said during our meeting, it's easy to spot the outliers when there are a dozen or so loggings of a film, which is the case in many of the films that professors assigned their students to log.  When there are fewer loggings, though, this task becomes impossible.  Some of the entries listed here are the only two or three loggings of a film.  In cases like these, I was in no position to judge which was the "correct" entry and which was the "outlier," so I just marked them both/all down, with the note that they diverge from one another, and at least one of them has to be wrong.
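The outlier-spotting described here can be made semi-automatic once a film has enough loggings. A hedged sketch in plain Python, using the median and the median absolute deviation because they tolerate small, messy samples better than the mean and standard deviation (the threshold of 3 is an arbitrary choice of mine):

```python
from statistics import median

def asl_outliers(asls, threshold=3.0):
    """Return indices of ASL values that sit far from the group's median (MAD rule)."""
    med = median(asls)
    mad = median(abs(x - med) for x in asls) or 1e-9   # guard against zero spread
    return [i for i, x in enumerate(asls) if abs(x - med) / mad > threshold]

# Example: a dozen student loggings of the same film, one of them clearly off
loggings = [6.1, 6.3, 5.9, 6.0, 6.2, 6.4, 6.1, 5.8, 6.0, 6.3, 6.2, 11.7]
print(asl_outliers(loggings))   # -> [11]
```

With only two or three loggings the rule degenerates, exactly as noted above: the median is then no more trustworthy than the supposed outlier.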

 

I also included here films whose running times were off from the ones listed on IMDb, but not by enough to immediately throw them out.  Most of them, I suspect, are projection-speed or PAL-conversion issues. 
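The PAL suspicion is easy to check arithmetically: 24 fps film transferred to 25 fps PAL runs about 4% fast, so the running time (and every shot length) shrinks by a factor of 24/25 = 0.96. A quick sketch, with the tolerance chosen arbitrarily:

```python
# Is a submission's running time about 4% shorter than the official runtime?
# If so, a PAL speed-up (24 fps film played at 25 fps) is a plausible explanation.
def looks_like_pal(submitted_sec, official_sec, tolerance=0.01):
    ratio = submitted_sec / official_sec
    return abs(ratio - 24 / 25) < tolerance

print(looks_like_pal(5472, 5700))   # a 95-minute feature logged at 91.2 minutes -> True
```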

 

IMPORTANT NOTE:  After a while, I gave up on including silent shorts here.  Short films with a small number of shots can very easily end up with highly divergent ASLs from even a single mis-click, and that's not even taking into account projection speeds.  This is exactly the type of data that would benefit most from a ranking system for users, and perhaps controls for projection speed or the fps of the DVD source.

 

Excerpts and Segments:

 

Fairly self-explanatory:

Poorly-labeled excerpts do raise the question of what responsibility we have in allowing them to remain in the database: if no other user can determine which scene of the film has been logged because the description is too vague, what good does the entry do anyone?  This is a methodological issue to consider when deciding what to keep and what to delete from the database.  I'm of the opinion that well-labeled excerpts can do some good (and there are some very well-labeled excerpts in the database), but there's not much of a case for leaving poorly-labeled ones in.

 

Not Cinema

 

Television series, music videos, web series, televised events, videogame cutscenes, trailers, commercials.  

In general, I tried NOT to put TV movies & miniseries in this category, as it seems to me that they belong in the same category as standard cinema.  But a few may have slipped through.

 

Not Measuring Shot Length

 

Again, fairly self-explanatory.

 

Frame-Accurate

 

This is VERY incomplete right now.  Ideally, it should include all entries by Barry Salt and James Cutting, as well as those "measured with FACT."  Those should be easy to search for and add later, though.  Mainly I included here the entries I was finding problems with, and whose supposedly "frame-accurate" status I was therefore starting to doubt.

 
 

Min Shot .2 or less

 

I stopped adding things to this category after a short time, because it became obvious that far too many entries in the database qualify.  

If you really wanted to do the work of marking those as somewhat less trustworthy than other entries, creating a way to sort the database by min shot length would really make this easy.

But I think the only place it really makes a difference is short films, especially silent shorts.

Replied by: Ian Jones Date: 2013-12-07

Oops, the browser does not recognize my Excel file; I will have to ask Gunars to insert my meta-database later.

Replied by: Armin Jaeger Date: 2013-12-07

I'd prefer first to unify the entries, so that you can be really sure you have all entries for the same film sorted below each other. There are quite a few cases in the database of either missing articles or of people putting the article at the end of the film title. You can solve this problem by searching for each entry separately without the articles (a lot of work, though), but it's impossible to deal with the spelling errors this way.
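The article problem can be largely mechanized before any manual unification starts. A minimal sketch (the helper and the article list are mine) that reduces "Painted Lady, The" and "The Painted Lady" to the same key for sorting and matching:

```python
import re

ARTICLES = ("the", "a", "an", "le", "la", "les", "der", "die", "das")

def canonical_title(title):
    """Normalize 'Painted Lady, The' and 'The Painted Lady' to the same key."""
    t = re.sub(r"\s+", " ", title.strip().lower())
    t = re.sub(r",\s*(?:%s)$" % "|".join(ARTICLES), "", t)   # trailing article
    t = re.sub(r"^(?:%s)\s+" % "|".join(ARTICLES), "", t)    # leading article
    return t

print(canonical_title("The Painted Lady") == canonical_title("Painted Lady, The"))  # True
```

Spelling errors are another matter; something like difflib.get_close_matches from the Python standard library could propose likely merges, but a human would still have to confirm them.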

Until Ian's job is done, I have only the following general comments on the categories:


Strongly recommended for immediate deletion: 0 is the obvious case; 0.1 or similarly low values can also indicate gibberish. And it's not only an ASL of 0 but an MSL of 0, too; the latter indicates defective entries like the one I've written to you about.
 
Confusing and Ambiguous: As for unidentified entries which could be salvageable, I still recommend simply asking those submitters who have provided a mail address. We could set up a standard introductory text about our problems with the submissions, the goal of making the database useful for all, plus the veiled threat to delete the entries. Then we ask each submitter all the relevant questions in one mail. One could divide the work by setting up a mail account especially for this task, to which the interviewers have access.
 
Suspicious: There can obviously be lots of different reasons for diverging results. Some are not mistakes but more a matter of definition: ultra-fast cutting, complicated dissolves, concealed cuts, opening credits with visual narration and so on. Considering that credits can have very different lengths, that films can feature overtures, intermissions and closing music, and that PAL/NTSC identification is often lacking, one often simply can't be sure about the accuracy of the length. So two measurements that differ more than a bit needn't necessarily mean one of them is wrong.
However, this brings me to the point of multiple measurements, where I sometimes wish a seasoned professional would measure the film from a clearly stated source and then one could nuke the rest. My favorite is the 192 (!) measurements for The Painted Lady, which make up 1.4% of the entire database. Seeing more than two simple measurements for the same film makes me essentially give up on it, because which one am I supposed to believe?

Also, the user rating isn't necessarily helpful. If you take, e.g., Barry Salt's Star Trek measurement and mine, one of us is way off regarding Star Trek: Generations, because the film doesn't have any major problems which could account for the margin of difference. Now who of us is less trustworthy? I'd prefer to rate entries, not people, though I obviously understand where this point is coming from.  
 

Not Cinema: While I understand the cataloguing difficulties for lots of stuff in this category, it eludes me why episodes of TV series are listed here separately. They are clearly identifiable units with their own IMDb entries, just like TV movies or miniseries are; just a bit shorter, that's all. Not that I want to rob Ian of the pleasure of listing hundreds of Star Trek episodes in an Excel sheet, because that's right next up for him.

 

Min Shot .2 or less: This point somewhat eludes me. It's physically possible to click at the 0.1-second level, and during ultra-fast cutting this might be an approximation of what actually happens, unless you use frame-accurate measurements.  I don't see how this identifies faulty entries.

Replied by: Gunars Civjans Date: 2013-12-09

 Here is the link to the file.

Replied by: Mike Baxter Date: 2014-01-16
Ian comments that he ‘gave up on including silent shorts here. Short films with a small number of shots can very easily end up with highly divergent ASLs with even a single mis-click, and that's not even taking into account projection speeds’. I sympathize, but think that the real issue here is comparability rather than reliability. I’ll take ‘short’ to mean one-reelers or less. Of the analyses I’ve looked at – Griffith, Sennett, Chaplin – most submissions are by contributors considered to be reliable. What is striking, where repeat measurements exist for a film (which mainly applies to Griffith), is how different the ASL values can sometimes be, even when it appears that the same number of shots have been counted. This sometimes occurs when the ‘repeat’ measurements are by the same person. It is often obvious that differences arise because measurements are based on different source material involving different projection speeds.

This, unfortunately, is a nettle that has to be grasped if you want to engage in the quantitative comparison of films from this period using ASLs. If you don’t know the projection speed for the source on which the ASL is based but have some idea about the footage of the film, it is possible to make a ‘guesstimate’ of the projection speed. ‘Guess’ because the source may not correspond to the original, and usually parts of the footage are omitted in the measurement process. Nevertheless you can try this and get an estimate of the fps for the source on which the measurements are based. You can then, if you wish, effect comparisons by ‘correcting’ ASLs to a common projection speed. It is not at all obvious that this should be done if you don’t know how the producers intended a film to be projected – if they had such an intention.

Most of this is possibly obvious. A less obvious issue is how you treat titles in shorter silent films. There is a (statistically) remarkable example in two of Charles O’Brien’s measurements for D.W. Griffith’s 'The Voice of the Violin' where the ASLs differ by 25.4 seconds. My understanding is that the measurements were made from two different DVDs. One contains a very long and continuous shot that, I am assuming, was broken up by titles – resulting in a much smaller ASL – in the other copy. The reliability of measurement is not in question here; it is the comparability of the source material that is the issue. This is not an isolated example. If you wanted, for example, to compare Griffith’s films, some are measured without titles and some with, and in the latter case sometimes only their position and not their length is known.

This comment has been prompted by problems I have been grappling with in recent work on films of the directors mentioned above. The points made are, perhaps, obvious. In the context of the present discussion about ‘database cleansing’ the thought is that, as far as silent film goes, data can be ‘clean’ but not immediately comparable. If you are interested in the ‘stylistic’ comparison of early film using quantitative measures such as the ASL, and use the database, problems of comparability add at least one extra dimension to those of reliability. I do not know the answer to these problems, if they are regarded as such!
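The ‘guesstimate’ and the correction Mike describes can be written down explicitly. For 35mm film there are 16 frames to the foot, so an estimated projection speed is fps ≈ footage × 16 / running time in seconds, and an ASL measured from a source running at one speed can be re-expressed at a common reference speed by scaling it by the ratio of the two speeds. A hedged sketch (the helper names and the example figures are mine, and, as Mike notes, the listed footage may not correspond to the print actually measured):

```python
# Silent-film projection-speed helpers (illustrative, not Cinemetrics code).

FRAMES_PER_FOOT = 16            # 35mm film

def estimate_fps(footage_feet, running_time_sec):
    """Guesstimate the projection speed of the source from its listed footage."""
    return footage_feet * FRAMES_PER_FOOT / running_time_sec

def asl_at_reference_speed(asl_measured, fps_source, fps_ref=16.0):
    """Re-express an ASL measured at fps_source at a common reference speed."""
    return asl_measured * fps_source / fps_ref

# Example: a one-reeler listed at 998 ft, logged with a running time of 888 s
fps = estimate_fps(998, 888)
print(round(fps, 1))                                        # about 18 fps
print(round(asl_at_reference_speed(5.0, fps, 16.0), 2))     # 5.0 s at ~18 fps becomes ~5.6 s at 16 fps
```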