Recently, Yuri and I have been focusing our Cinemetrics efforts on two fronts.

The first, and more daunting, issue is that the data used on the website is of varying quality: it is not necessarily precise, accurate, or complete. We seek to define what these three terms mean in the context of the project's goals and the analysis of cutting rates. Unfortunately, this work has been moving very slowly, since it is not a trivial task.

The second issue we've been addressing is the exploration of new analysis techniques. Specifically, I have been working on ways to overcome a limitation of other analysis techniques used on the Cinemetrics data. Previous researchers have focused on one number when making comparisons between films or groups of films: the average shot length (ASL). This is not an unreasonable statistic to use for comparing two films; it provides a broad summary of how short or long the films' shots are. However, it tells us nothing about the distribution of short and long shots over the length of a film.

Other researchers--Barry Salt is a prominent example--have attempted to describe the relative proportions of shots of different lengths within each film. That is, they have attempted to describe the distribution of shot lengths by comparing histograms of shot lengths for various films to histograms produced by theoretical distributions, such as the lognormal distribution. Although this technique may be useful, it still summarizes data over the whole film. That is, it does not allow for comparison between two films at different points in their respective lengths. For example, we cannot compare two films using this technique and say "both films had approximately the same cutting rate in their first halves, but film A had significantly shorter shots in its second half than film B."

Analyzing films by looking at changes in shot lengths across their lengths has long been a goal of the Cinemetrics project. To this end, Yuri and Gunars developed their inverted shot length versus shot number graph. This graph has many useful properties, and the use of an inverted y-axis was innovative. They continued to develop this technique by introducing features such as a moving average graph, to smooth noise, and estimation using high-degree polynomials, in an effort to numerically describe the shape of these graphs and make comparisons.

There are two primary issues with the shot length versus shot number graph. The first is that it is not necessarily intuitive. Because the x-axis is shot number, not time, looking at the middle of the graph does not necessarily mean looking at the middle of the film. Suppose, for instance, that we had a film of 100 shots. The first 99 shots occur within the first minute, and the 100th shot is 99 minutes long. Looking at the middle of the graph, we expect to be looking at minute 50 of this 100 minute film. However, we are actually looking at the 50th shot, which occurs somewhere in the first minute. Cinemetrics addresses this by labeling the x-axis by time, not shot number, and allowing the intervals to be irregular. That is, the non-linear time scale is clearly labeled. The second issue is that plotting shot length versus shot number underemphasizes the importance of long shots and overemphasizes the importance of short shots. In our hypothetical example, 99% of the graph will represent 1% of the film, while the only data point we have for the remaining 99% of the film is one shot length at the end of the graph.
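
To make the mismatch concrete, here is a minimal sketch of the hypothetical 100-shot film described above. It assumes, for illustration only, that the 99 short shots are all the same length; the variable names are ours and nothing here comes from the Cinemetrics codebase.

```python
# Hypothetical film: 99 equal shots filling the first minute,
# followed by a single 99-minute shot (all times in seconds).
shot_lengths = [60 / 99] * 99 + [99 * 60]
total_length = sum(shot_lengths)  # roughly 6000 s, i.e. 100 minutes

# The middle of a shot-number axis is shot 50. How far into the
# running time has the film actually got by the end of that shot?
elapsed_at_shot_50 = sum(shot_lengths[:50])
fraction_of_film_at_shot_50 = elapsed_at_shot_50 / total_length

# Share of the x-axis versus share of the running time taken up
# by the 99 short shots.
share_of_axis = 99 / len(shot_lengths)
share_of_time = sum(shot_lengths[:99]) / total_length

print(f"shot 50 ends {elapsed_at_shot_50:.1f} s into the film")
print(f"that is {fraction_of_film_at_shot_50:.2%} of the running time")
print(f"short shots: {share_of_axis:.0%} of the axis, {share_of_time:.0%} of the time")
```

Under these assumptions the midpoint of the graph sits about half a minute into a 100-minute film.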

The ideal solution is to somehow plot shot length versus time. This poses several problems. First: the Cinemetrics data is composed of shot number-timecode pairs. Thus we can calculate shot length and time code for each shot. Simply plotting these data points using a scatter plot is a potential way of visualizing this information. However, this still underemphasizes long shots. In our hypothetical example, we would also have no data points for minutes 2-99 of our film, though we intuitively know that it's a long shot, and somehow our data should represent that fact. The second problem with this technique is that films vary in length. It is difficult to make comparisons between a 50-minute film and a 100-minute film.
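
As a rough illustration of deriving shot lengths from shot number-timecode pairs, the sketch below assumes that each timecode marks the end of its shot, in seconds from the start of the film, and that the film starts at time zero. That reading of the data, the function name, and the sample values are all assumptions made for the example.

```python
from typing import List, Tuple


def shot_lengths_from_timecodes(pairs: List[Tuple[int, float]]) -> List[float]:
    """Turn (shot_number, end_timecode_seconds) pairs into shot lengths.

    Assumes the film starts at 0 and each timecode marks the END of its shot.
    """
    lengths = []
    previous_end = 0.0
    for _shot_number, end in sorted(pairs):  # order by shot number
        lengths.append(end - previous_end)
        previous_end = end
    return lengths


# Made-up sample: three shots ending at 4.0 s, 10.5 s and 12.0 s.
pairs = [(1, 4.0), (2, 10.5), (3, 12.0)]
lengths = shot_lengths_from_timecodes(pairs)
print(lengths)  # [4.0, 6.5, 1.5]
```

A scatter plot of these lengths against their timecodes puts exactly one point per shot, which is why a very long shot leaves a long stretch of the time axis empty.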

One potential solution I have devised is to generate a shot length versus fraction of film dataset. By partitioning each film into the same number of equal-length segments, we can calculate the average shot length for each segment, regardless of whether or not there was a cut in that segment. For example, we can divide our hypothetical 100 minute film into 100 equal partitions. For each partition, we count the number of shots, including fractions of shots, that occur within it. In our example from above, there were 99 shots in the first minute, which in this case happens to correspond to the first partition. So, partition number 1 has 99 shots. We know that this partition is one minute long, so by dividing 60 seconds by 99 shots we calculate that the average shot length in the first partition is about 0.61 seconds. In this way, 0.61 seconds per shot becomes our data point for partition one. Similarly, we look at partition two. The 100th shot is 99 minutes long, and one minute of it falls in partition two, so 1/99th of a shot occurs in partition two. Therefore the average shot length during the second partition is 60 seconds divided by 1/99 of a shot, which equals 5,940 seconds per shot, the length of our 99 minute shot. Using this method of calculating fractional shots per partition, we can continue through every partition in the film, yielding one ASL per partition.
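
The partition, count fractional shots, then divide procedure can be sketched as follows. This is only an illustration under stated assumptions, not the code we actually ran: it assumes the input is a list of shot end times in seconds with the film starting at zero, and the function name `partition_asl` is hypothetical.

```python
from typing import List


def partition_asl(cut_times: List[float], n_partitions: int = 100) -> List[float]:
    """Average shot length (seconds per shot) for each equal-length partition.

    cut_times: ascending end time of every shot in seconds (assumed input
    format); the last value is taken to be the length of the film.
    """
    total = cut_times[-1]
    width = total / n_partitions
    starts = [0.0] + cut_times[:-1]
    shots_in = [0.0] * n_partitions  # fractional shot count per partition

    for start, end in zip(starts, cut_times):
        length = end - start
        if length <= 0:
            continue  # ignore zero-length or out-of-order entries
        first = min(int(start // width), n_partitions - 1)
        last = min(int(end // width), n_partitions - 1)
        for p in range(first, last + 1):
            lo = max(start, p * width)
            hi = min(end, (p + 1) * width)
            if hi > lo:
                # The fraction of this shot that lies inside partition p.
                shots_in[p] += (hi - lo) / length

    # Partition length divided by (fractional) shots gives seconds per shot.
    return [width / s if s > 0 else float("nan") for s in shots_in]


# The hypothetical film: 99 equal shots in minute one, then a 99-minute shot.
cuts = [60 * (i + 1) / 99 for i in range(99)] + [6000.0]
asl = partition_asl(cuts, n_partitions=100)
print(round(asl[0], 2))  # first partition: about 0.61 seconds per shot
print(round(asl[1]))     # second partition: about 5940 seconds per shot
```

Counting each shot in proportion to the time it spends in a partition is what lets a partition with no cuts still receive a value: it holds a small fraction of one long shot.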

Suppose we have a second film of a different length. We can apply the same method (partition, count shots, then calculate ASL) to each partition in the new film. As long as we use the same number of partitions, the values for the two films are directly comparable. That is, if we use 100 partitions, the 57th partition for each film always represents the point 57% of the way through the film, regardless of film length. Therefore, using this method we can directly compare the shot lengths across the durations of two films of different lengths.

We do have a few reservations regarding this method. First, if we are comparing two films of dramatically different lengths, say 10 minutes and 100 minutes, then a significant increase in shot length from one partition to the next means very different things in the context of each film. With 100 partitions, in the shorter film it means that the change occurred over 6 seconds; in the longer film it means that the change occurred over 60 seconds. That is a severe limitation of this method. Therefore, it is important to use it only to make statements such as "in both of these films, the shot lengths in their second halves are about twice the shot lengths in their first halves", and not to make statements like "the cutting rates of these films increase equally rapidly". Second, we don't know whether or not it is valid to calculate the average shot length for a partition in which no cuts occur. In one sense it is intuitively valid: even if a partition is entirely contained within a shot, the longer that shot is the "slower" it is, and our data should reflect that. Nevertheless, it is a concern.

Using this technique, we have generated ASL by partition data for every film in the Cinemetrics database. We are experimenting with comparisons, both between films and between groups of films. In fact, using this technique we can even generate a "curve" that represents shot length versus fraction of film complete for the average film. That is, given a group of films (or the whole database), we can average the ASL in each partition across all films in that group. Performing this for each partition yields an "Average ASL versus fraction of film" data set. Though these are discrete values, and not a continuous curve like the ones generated by the polynomial interpolation that Cinemetrics currently uses, the result is a novel description of how a group of films tends to change over time. Yuri has generated an example application and analysis and will be releasing that for comments soon.
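
The "average film" calculation can be sketched as below. It assumes each film has already been reduced to a list of per-partition ASL values of the same length and that "average" means a plain arithmetic mean; the function name and the two tiny profiles are invented for the example and are not Cinemetrics data.

```python
from statistics import mean
from typing import List


def average_profile(profiles: List[List[float]]) -> List[float]:
    """Average the per-partition ASL values across a group of films.

    profiles: one list of per-partition ASLs per film; every film must
    use the same number of partitions for the positions to line up.
    """
    n = len(profiles[0])
    if any(len(p) != n for p in profiles):
        raise ValueError("all films must use the same number of partitions")
    # zip(*profiles) groups the values partition by partition.
    return [mean(partition_values) for partition_values in zip(*profiles)]


# Invented four-partition profiles for two films (seconds per shot).
film_a = [2.0, 4.0, 6.0, 8.0]
film_b = [4.0, 4.0, 2.0, 12.0]
group = average_profile([film_a, film_b])
print(group)  # [3.0, 4.0, 4.0, 10.0]
```

An arithmetic mean is only one choice here; because a single very long shot can dominate a partition, a median or an average of cutting rates might behave quite differently, and we have not settled which is most appropriate.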

We have our own reservations regarding this technique and would appreciate any feedback. We believe, however, that it will open the Cinemetrics data to new analysis techniques, particularly those that rely on sequences of data measured at uniform time intervals--i.e., time series analysis and other advanced techniques. Input from statisticians on the validity of this method would be particularly helpful.