Diff Initiative
From Google Summer of Code Mentor Wiki
Version control is central to every grown-up software project, but even the most modern version control systems have trouble diffing or merging things that aren't stored as lines of text. If your tool's native data format is SVG, PNG, or MP3, then everything from CVS to Git throws up its hands and says, "Sorry, you'll have to do this one yourself."
This is a real pain. Right now, for example, several of us are using LaTeX instead of OpenOffice because merging concurrent edits to an OpenOffice document is so awkward. And if one of my colleagues edits a diagram at the same time as I do, we're reduced to printing them on transparencies and laying them on top of one another. (OK, I'm exaggerating a bit, but only a bit.)
Anyone who keeps Microsoft Project files in a VCS will also know this problem. The merges are often done 'wrong'.
It doesn't have to be like this. GIMP and Inkscape could be taught to display the differences between two PNG or SVG images, and to help users reconcile them; Audacity (or some other audio editor) could do the same for MP3s, and so on. Wire them up so that they can be launched by version control systems according to MIME type, and everybody's life gets better. (They also get more users...)
I'd therefore like to ask mentors and organizers to keep an eye out for projects like this, or even to suggest them to students who are looking for something to tackle. If nothing else, it'd be fun to see all the different approaches people take for different kinds of data.
Diff Frameworks
Having a generalized comparison framework that doesn't simply resort to turning everything into a textual diff is something we're also very interested in. One of the things I've been looking at recently is extending something like libsvn_delta and libsvn_diff and writing the "driver logic" for the data types we care about. With that, one should be able to describe differences between recognized MIME types and present users with a visualization of those differences. Tying the diff visualization into an existing revision control system also makes for a simple test harness.
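As a rough illustration of per-type "driver logic", here is a minimal Python sketch of a registry that picks a diff driver by MIME type and falls back to a byte comparison. It is not based on libsvn_diff; the names `register_driver`, `find_driver`, `diff_files` and `byte_diff` are all hypothetical, and a real driver would return something richer than a list of strings.

```python
import mimetypes
from typing import Callable, Dict, List

# Hypothetical driver signature: two file paths in, a list of
# human-readable difference descriptions out.
DiffDriver = Callable[[str, str], List[str]]

_DRIVERS: Dict[str, DiffDriver] = {}


def register_driver(mime_type: str, driver: DiffDriver) -> None:
    """Associate a diff driver with a MIME type such as 'image/png'."""
    _DRIVERS[mime_type] = driver


def byte_diff(path_a: str, path_b: str) -> List[str]:
    """Fallback driver: only reports whether the raw bytes differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        same = fa.read() == fb.read()
    return [] if same else ["binary files differ"]


def find_driver(path: str) -> DiffDriver:
    """Guess the MIME type from the file name and look up its driver."""
    mime_type, _encoding = mimetypes.guess_type(path)
    return _DRIVERS.get(mime_type, byte_diff)


def diff_files(path_a: str, path_b: str) -> List[str]:
    """Diff two files with whichever driver matches the first file."""
    return find_driver(path_a)(path_a, path_b)
```

A version control system could call something like `diff_files` in place of its textual diff whenever a driver is registered for the file's type.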
The diff algorithm itself uses dynamic programming, but is often more simply presented as a recursive algorithm. Another idea that has been batted around is a diff generator, rather like a parser generator. Cross-typed diffs (where the two streams being compared have different data types) benefit particularly from this approach, as directly coding the dynamic programming logic in this case leads to very verbose, hard-to-follow code, somewhat like the case with parsing. In fact, parsing and diff are related, in that parsing is matching a string to a grammar.
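To make the recursive versus dynamic programming point concrete, here is a small Python sketch based on the longest common subsequence (LCS), which is one common way to formulate diff and is assumed here rather than taken from any particular tool. `lcs_length_recursive` is the short recursive presentation (memoized so it stays polynomial), and `diff` fills the same table bottom-up and then walks it to emit keep, delete and insert operations. Both function names are illustrative.

```python
from functools import lru_cache
from typing import List, Sequence, Tuple


def lcs_length_recursive(a: Sequence, b: Sequence) -> int:
    """Recursive presentation: LCS length of a[i:] and b[j:]."""
    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + go(i + 1, j + 1)
        return max(go(i + 1, j), go(i, j + 1))

    return go(0, 0)


def diff(a: Sequence, b: Sequence) -> List[Tuple[str, object]]:
    """Dynamic programming version that also recovers the edit script."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to build the script.
    ops: List[Tuple[str, object]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", x) for x in a[i:])
    ops.extend(("+", x) for x in b[j:])
    return ops


# diff(["a", "b", "c"], ["a", "c", "d"])
# -> [(" ", "a"), ("-", "b"), (" ", "c"), ("+", "d")]
```

The table costs O(n*m) time and memory, which is one reason faster indexed variants are attractive for large inputs.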
The generator could allow experimentation with different implementation options. For example, there are both 'push-based' and 'pull-based' formulations of diff, as well as high-speed variants based on various kinds of index. Coding all the variants by hand is very time consuming. Diff is often highly CPU intensive, so a custom generator that can generate code down to assembler level can be advantageous too.
GIMP
I agree it would be hard to do something for JPG or other lossy image formats, but that's just another way of saying that it would be a more interesting challenge.
Inkscape
Panotools
This project has an advanced form of diff for images. It is used to stitch images together to make panoramas.
Bioinformatics
Diff is absolutely central to bioinformatics, in comparing and relating DNA sequences.
Audacity
An audio diff idea was proposed in 2007 and again for GSoC 2008.
There were two research-level students interested, but it didn't go ahead.
There is a project (non-GSoC) currently going ahead in Audacity for a cross-typed diff. That's where the two streams have different data types. It's a neat extension to the basic algorithm. This has been used in bioinformatics to correct translation errors in comparing DNA to proteins. In Audacity it is being used to compare MIDI data streams with Wave audio.
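The following Python sketch shows one way a cross-typed alignment could look. It is not Audacity's implementation: it is an ordinary edit-distance style alignment in which the comparison between the two element types is supplied by the caller. The `align` function, the `note_vs_pitch` cost, the 3% tolerance and the cost values are all assumptions made for illustration, and the pitch list stands in for pitches that would have to be detected from the audio first.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def align(xs: Sequence, ys: Sequence,
          mismatch: Callable[[object, object], float],
          gap: float = 1.0) -> List[Tuple[Optional[object], Optional[object]]]:
    """Globally align xs with ys; None marks an unmatched element."""
    n, m = len(xs), len(ys)
    # cost[i][j] is the cheapest alignment of xs[:i] with ys[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch(xs[i - 1], ys[j - 1]),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover the pairing.
    pairs: List[Tuple[Optional[object], Optional[object]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + mismatch(xs[i - 1], ys[j - 1])):
            pairs.append((xs[i - 1], ys[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def note_vs_pitch(midi_note: int, hertz: float) -> float:
    """Hypothetical cross-type cost: MIDI note number against a pitch in Hz."""
    expected = 440.0 * 2 ** ((midi_note - 69) / 12)
    return 0.0 if abs(hertz - expected) / expected < 0.03 else 3.0


# C4, D4, E4 as MIDI notes against pitches in which D4 is missing.
# align([60, 62, 64], [261.6, 329.6], note_vs_pitch)
# -> [(60, 261.6), (62, None), (64, 329.6)]
```

Only the `mismatch` function knows about both data types, so the same dynamic programming core could in principle be reused for other pairings.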
One detail that came out of discussion was to have a very crude diff algorithm and have the focus of the GSoC work be on a good user interface for presenting the diff. This made it more doable. It also made it a greater enabler for university-level research on audio diff.
(From Audacity GSoC 2008 proposal)
Audio Diff
The ability to compare and align two sound sequences, just as one compares text using diff, would be a powerful new feature in Audacity. It would greatly facilitate the combining of sounds from multiple 'takes' of the same track. It would also be of use to people looking to identify particular known sounds, e.g. repeated themes in birdsong, in very long recordings.
Suggested mentors:
- James Crook
More details about audio diff
BRL-CAD
Non-textual diffs have come up in discussion many times over the years within the BRL-CAD project. We've had a 2D image diff capability for a long while (a couple of decades), with one tool that simply reports off-by-one and off-by-many RGB differences in raster images and another tool that "merges" two images, providing a visual "contextual diff" (highlighting the differences and de-emphasizing the matching parts via luminance and saturation changes). Those simplistic tools, however, only scratch the surface of what is possible. Being able to do 3D diffs on geometry is something we're just getting started on and is a rather different beast altogether.
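The two kinds of 2D image tool described above can be sketched in a few lines of Python. This is not BRL-CAD's code: images are assumed to be plain lists of rows of (r, g, b) tuples, a pixel is classified by its largest per-channel difference (the real tool may count differently), and the function names and the red/dimmed-gray rendering are hypothetical choices.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def _check_same_size(a: Image, b: Image) -> None:
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")


def _max_channel_diff(pa: Pixel, pb: Pixel) -> int:
    return max(abs(ca - cb) for ca, cb in zip(pa, pb))


def count_pixel_diffs(a: Image, b: Image) -> Dict[str, int]:
    """Count identical, off-by-one and off-by-many pixels."""
    _check_same_size(a, b)
    counts = {"same": 0, "off_by_one": 0, "off_by_many": 0}
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            worst = _max_channel_diff(pa, pb)
            if worst == 0:
                counts["same"] += 1
            elif worst == 1:
                counts["off_by_one"] += 1
            else:
                counts["off_by_many"] += 1
    return counts


def highlight_diffs(a: Image, b: Image) -> Image:
    """Render a 'contextual diff': differing pixels in red,
    matching pixels as a dimmed gray so they recede visually."""
    _check_same_size(a, b)
    out: Image = []
    for row_a, row_b in zip(a, b):
        out_row: List[Pixel] = []
        for pa, pb in zip(row_a, row_b):
            if pa == pb:
                gray = sum(pa) // 6  # mean of the channels, halved
                out_row.append((gray, gray, gray))
            else:
                out_row.append((255, 0, 0))
        out.append(out_row)
    return out
```

Separating off-by-one from larger differences is useful because tiny differences often come from rounding rather than from a real change in the scene.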
The differencing requirements will vary widely from application to application, for various contexts, and for different data types (minimally). There are probably more than a couple of GSoC bite-sized projects that could come to fruition related to this discussion. Implementing a 3D diff capability was a project idea we tossed around last year, but I abandoned it in favor of keeping the ideas more tractable. This is the kind of research project that would require a lot of background knowledge and work to be an effective GSoC project (imho, of course).
ArgoUML
I'd like ArgoUML to be able to show me the differences between two UML diagrams, and help me merge them. I don't expect a general or unified solution.
Git
(From Git GSoC 2009 proposal)
Domain specific merge helpers
Git adds conflict markers to text files when there were conflicts while merging. This is pretty useful if your text file is line-based source code. But it can be pretty difficult to use with, say, LaTeX files (where the lines typically reflect whole paragraphs, and a single typo fix results in the same conflict size as if the whole paragraph were rewritten) or XML/SVG/build.xml files (which are commonly edited not with text editors but with other programs that expect valid XML as input, which might not be the case once the conflict markers are added).
To be useful with file types other than plain text or (line-based) source code, domain-specific merge helpers are needed which present the merge conflict in a reasonable form, helping the user to resolve the conflicts.
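The LaTeX case shows why granularity matters. The Python sketch below compares two versions of a paragraph word by word using the standard library's `difflib.SequenceMatcher`, so a typo fix shows up as a one-word change instead of a whole changed line. This only illustrates the idea behind a helper; the `word_diff` function is hypothetical and is not part of Git.

```python
import difflib
from typing import List, Tuple


def word_diff(old: str, new: str) -> List[Tuple[str, str, str]]:
    """Return (tag, old_words, new_words) for each changed run of words.

    tag is one of 'replace', 'delete' or 'insert', as reported by
    difflib.SequenceMatcher.get_opcodes().
    """
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


old = "The quick brwon fox jumps over the lazy dog."
new = "The quick brown fox jumps over the lazy dog."
print(word_diff(old, new))  # [('replace', 'brwon', 'brown')]
```

A merge helper built along these lines could declare a conflict only when both sides touch the same run of words, which a line-based merge cannot do for paragraph-long lines.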
Goal: to implement a merge helper for at least one additional file type (depending on the file type, one domain-specific merge helper might not fill a whole GSoC project).
Language: Open for proposal.
Suggested mentors:
- Johannes Schindelin
OpenOffice
Diffing and merging OpenOffice documents is in fact quite painful. (I've managed to do the diffing with the help of some scripts, but not the merging.) A number of problems contribute to making it painful.
First, an .odt file is a bunch of XML files zipped together. They could be diffed individually, but if you check them into Git together, Git sees them as a single binary file. Being able to work with zip files transparently would be a great feature for Git, and probably entirely doable as a GSoC project.
Second, diffing content.xml from two versions of an .odt file is not very helpful either, because of how whitespace is placed and because of the way styles and IDs are regenerated every time. (Changing one letter in the document often leads to many, many changes in content.xml, styles.xml, etc.) Making the XML files that comprise an .odt archive diffable again seems quite doable as a GSoC project.
Third, the Git interface may be too difficult for most OpenOffice users. Another GSoC project could perhaps help with a Git extension for OpenOffice.
Each of those projects could be useful by itself, even if the other ones do not get done. However, solving any of those problems will hopefully help provide motivation for solving the others.
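As a sketch of the first two problems, the Python below opens two zip archives, compares them member by member, and re-indents XML members before running a textual diff so that whitespace placement matters less. It uses only the standard library (`zipfile`, `xml.dom.minidom`, `difflib`); the function names are hypothetical, and it does nothing about regenerated style names or IDs, which would need format-specific normalization.

```python
import difflib
import xml.dom.minidom
import zipfile
from typing import Dict, List


def normalized_xml_lines(data: bytes) -> List[str]:
    """Re-indent XML and drop blank lines so layout differences vanish."""
    pretty = xml.dom.minidom.parseString(data).toprettyxml(indent="  ")
    return [line for line in pretty.splitlines() if line.strip()]


def diff_archives(old, new) -> Dict[str, List[str]]:
    """Compare two zip archives (paths or file objects) per member.

    Returns a mapping from member name to a list of report lines;
    members with no meaningful difference are left out.
    """
    report: Dict[str, List[str]] = {}
    with zipfile.ZipFile(old) as za, zipfile.ZipFile(new) as zb:
        names_a, names_b = set(za.namelist()), set(zb.namelist())
        for name in sorted(names_a | names_b):
            if name not in names_b:
                report[name] = ["removed"]
            elif name not in names_a:
                report[name] = ["added"]
            else:
                data_a, data_b = za.read(name), zb.read(name)
                if data_a == data_b:
                    continue
                if name.endswith(".xml"):
                    lines = list(difflib.unified_diff(
                        normalized_xml_lines(data_a),
                        normalized_xml_lines(data_b),
                        name, name, lineterm=""))
                    if lines:
                        report[name] = lines
                else:
                    report[name] = ["binary content differs"]
    return report
```

Calling `diff_archives("draft1.odt", "draft2.odt")` (file names assumed) would give a per-member report; a script like this could also serve as a starting point for teaching a version control system to look inside such archives.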

