Diffing and merging notebooks — Jupyter Enhancement Proposals (2024)

Problem

Diffing and merging notebooks is not handled properly by standard line-based diff and merge tools.

Proposed Enhancement

  • Make a package containing tools for diff and merge of notebooks

  • Command line functionality:

    • A command nbdiff with diff output as json or pretty-printed to console

    • A command nbmerge which should be git compatible

    • Command line tools for interactive resolution of merge conflicts

    • Optional launching of web gui for interactive resolution of merge conflicts

    • All command line functionality is also available through the Python package

  • Web gui functionality:

    • A simple server with a web api to access diff and merge functionality

    • A web gui for displaying notebook diffs

    • A web gui for displaying notebook merge conflicts

    • A web gui for interactive resolution of notebook merge conflicts

  • Plugin framework for mime type specific diffing

Detailed Explanation

Preliminary work resides in nbdime.

Fundamentally, we envision use cases mainly in two categories: a merge command for version control integration, and a diff command for inspecting changes and automated regression testing. At the core of it all are the diff algorithms, which must handle not only text in source cells but also a number of data formats based on MIME types in output cells.

Cell source is usually the primary content, and output can presumably be regenerated. In general it is not possible to guarantee that merged sources and merged output are consistent or make any kind of sense. For many use cases, options to silently drop output instead of requiring conflict resolution will produce a smoother workflow. However, such data loss should only happen when explicitly requested.

Basic use cases of notebook diff

  • View difference between two versions of a file: nbdiff base.ipynb remote.ipynb

  • Store difference between two versions of a file to a patch file: nbdiff base.ipynb remote.ipynb patch.json

  • Compute diff of notebooks for use in a regression test framework:

 import nbdime
 # a and b are the two notebooks to compare, already loaded
 di = nbdime.diff_notebooks(a, b)
 assert not di

  • View difference of output cells after re-executing notebook

Variations will be added on demand with arguments to the nbdiff command, e.g.:

  • View diff of sources only

  • View diff of output cells (basic text diff of output cells, image diff with external tool)

Basic use cases of notebook merge

The main use case for the merge tool will be a git-compatible command-line merge tool:

 nbmerge base.ipynb local.ipynb remote.ipynb merged.ipynb

which can be called from git and launch a console tool or web gui for conflict resolution if needed. Ideally the web gui can reuse as much as possible from Jupyter Notebook.

Goals:

  • Trouble-free automatic merge when no merge conflicts occur

  • Optional behaviour to drop conflicting output, execution counts, and possibly other secondary data

  • Easy to use interactive conflict resolution

Notes on initial implementation

  • An initial version of diff gui can simply show e.g. two differing images side by side, but later versions should do something more clever.

  • An initial version of merge can simply join or optionally delete conflicting output.

  • An initial version of conflict resolution can be to output a notebook with conflicts marked within cells, to be manually edited as a regular Jupyter notebook.

Diff format

The diff object represents the difference between two objects A and B as a list of operations (ops) to apply to A to obtain B. Each operation is represented as a dict with at least two items:

{ "op": <opname>, "key": <key> }

The objects A and B are either mappings (dicts) or sequences (lists), and a different set of ops is legal for mappings and sequences. Depending on the op, the operation dict usually contains an additional argument, documented below.

Diff format for mappings

For mappings, the key is always a string. Valid ops are:

  • { "op": "remove", "key": <string> }: delete existing value at key

  • { "op": "add", "key": <string>, "value": <value> }: insert new value at key not previously existing

  • { "op": "replace", "key": <string>, "value": <value> }: replace existing value at key with new value

  • { "op": "patch", "key": <string>, "diff": <diffobject> }: patch existing value at key with another diffobject
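The four mapping ops can be illustrated with a small patch routine. This is a minimal sketch, not nbdime's implementation: the function name patch_dict and the sample cell are hypothetical, and the "patch" branch assumes the nested value is itself a dict.

```python
import copy


def patch_dict(base, diff):
    """Apply a list of mapping ops to `base` and return a new dict.

    Hypothetical helper for illustration; only the op names and the
    "key"/"value"/"diff" fields come from the format described above.
    """
    result = copy.deepcopy(base)
    for op in diff:
        name, key = op["op"], op["key"]
        if name == "remove":
            del result[key]  # raises KeyError if the key is absent
        elif name == "add":
            if key in base:
                raise KeyError(f"cannot add existing key {key!r}")
            result[key] = op["value"]
        elif name == "replace":
            if key not in base:
                raise KeyError(f"cannot replace missing key {key!r}")
            result[key] = op["value"]
        elif name == "patch":
            # Assumption of this sketch: the nested value is a dict.
            result[key] = patch_dict(base[key], op["diff"])
        else:
            raise ValueError(f"unknown mapping op {name!r}")
    return result


cell = {
    "cell_type": "code",
    "execution_count": 3,
    "metadata": {"collapsed": True},
    "source": "x = 1",
}
cell_diff = [
    {"op": "remove", "key": "execution_count"},
    {"op": "replace", "key": "source", "value": "x = 2"},
    {"op": "add", "key": "outputs", "value": []},
    {"op": "patch", "key": "metadata",
     "diff": [{"op": "replace", "key": "collapsed", "value": False}]},
]
patched = patch_dict(cell, cell_diff)
```

The base object is left untouched, so the same diff can be inspected or applied again later.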

Diff format for sequences (list and string)

For sequences the key is always an integer index. This index is relative to object A of length N. Valid ops are:

  • { "op": "removerange", "key": <int>, "length": <n>}: delete the values A[key:key+length]

  • { "op": "addrange", "key": <int>, "valuelist": <values> }: insert new items from valuelist before A[key], at end if key=len(A)

  • { "op": "patch", "key": <int>, "diff": <diffobject> }: patch existing value at key with another diffobject
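Because every key indexes the original object A, a sequence diff can be applied in a single left-to-right pass over A. The following is a minimal sketch under stated assumptions: patch_list is a hypothetical name, only lists are handled (not strings), nested "patch" targets are assumed to be lists, and the ops are assumed not to overlap.

```python
def patch_list(base, diff):
    """Apply sequence ops to `base`; every key indexes the original list.

    Hypothetical helper for illustration, not nbdime's implementation.
    """
    result = []
    take = 0  # index of the next item of `base` not yet consumed
    # Stable sort: ops sharing a key keep their original relative order.
    for op in sorted(diff, key=lambda o: o["key"]):
        name, key = op["op"], op["key"]
        result.extend(base[take:key])  # copy untouched items before key
        take = max(take, key)
        if name == "addrange":
            result.extend(op["valuelist"])  # inserted before base[key]
        elif name == "removerange":
            take = key + op["length"]  # skip base[key:key+length]
        elif name == "patch":
            # Assumption of this sketch: the nested value is a list.
            result.append(patch_list(base[key], op["diff"]))
            take = key + 1
        else:
            raise ValueError(f"unknown sequence op {name!r}")
    result.extend(base[take:])  # copy the untouched tail
    return result


cells = ["a", "b", "c", "d"]
cells_diff = [
    {"op": "addrange", "key": 0, "valuelist": ["x"]},
    {"op": "removerange", "key": 1, "length": 2},
    {"op": "addrange", "key": 4, "valuelist": ["y"]},
]
new_cells = patch_list(cells, cells_diff)  # ["x", "a", "d", "y"]
```

Note that the final addrange uses key 4, which equals len(A) and therefore appends at the end, and that the removal at key 1 does not shift the index used by the later op.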

Relation to JSONPatch

The above described diff representation has similarities with the JSONPatch standard but is different in some significant ways:

  • JSONPatch contains operations “move”, “copy”, “test” not used by nbdime, and nbdime contains operations “addrange”, “removerange”, and “patch” not in JSONPatch.

  • Rather than providing a recursive “patch” op, JSONPatch uses a deep JSON-pointer-based “path” item in each operation in place of the “key” item nbdime uses. This way JSONPatch can represent the diff object as a single list instead of the ‘tree’ of lists that nbdime uses. The advantage of the recursive approach is that e.g. all changes to a cell are grouped and do not need to be collected.

  • JSONPatch uses indices that relate to the intermediate (partially patched) object, meaning transformation number n cannot be interpreted without going through the transformations up to n-1. In nbdime the indices relate to the base object, which means ‘delete cell 7’ means deleting cell 7 of the base notebook independently of the previous transformations in the diff.

A conversion function between the two formats can be implemented fairly easily.
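As a partial illustration of such a conversion, the sketch below flattens the mapping ops of a recursive diff into JSONPatch-style operations with JSON pointer paths. It is deliberately incomplete: the name to_json_patch is hypothetical, and sequence ops are rejected because their base-relative indices would first have to be translated into indices on the partially patched object.

```python
def to_json_patch(diff, prefix=""):
    """Flatten mapping ops of a recursive diff into JSONPatch-style ops.

    Hypothetical helper for illustration. Sequence ops (integer keys)
    are not handled: they would need index translation first.
    """
    ops = []
    for op in diff:
        name, key = op["op"], op["key"]
        if not isinstance(key, str):
            raise NotImplementedError("sequence ops need index translation")
        # JSON pointer escaping: "~" becomes "~0", then "/" becomes "~1".
        path = prefix + "/" + key.replace("~", "~0").replace("/", "~1")
        if name == "patch":
            # The recursive op disappears; its children get deeper paths.
            ops.extend(to_json_patch(op["diff"], path))
        elif name == "remove":
            ops.append({"op": "remove", "path": path})
        elif name in ("add", "replace"):
            ops.append({"op": name, "path": path, "value": op["value"]})
        else:
            raise ValueError(f"unknown mapping op {name!r}")
    return ops


nb_diff = [
    {"op": "patch", "key": "metadata",
     "diff": [{"op": "add", "key": "a/b", "value": 1}]},
    {"op": "remove", "key": "nbformat_minor"},
]
flat = to_json_patch(nb_diff)
```

The result is a single flat list, which shows the trade-off described above: the grouping of changes under "metadata" is lost and would have to be recovered from the paths.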

High level diff algorithm approach

The package will contain both generic and notebook-specific variants of diff algorithms.

The generic diff algorithms will handle most json-compatible objects:

  • Arbitrary nested structures of dicts and lists are allowed

  • Leaf values can be any strings and numbers

  • Dict keys must always be strings

The generic variants will by extension produce correct diffs for notebooks, but the notebook-specific variants aim to produce more meaningful diffs. “Meaningful” is a subjective concept and the algorithm descriptions below are therefore fairly high-level with many details left up to the implementation.

Handling nested structures by alignment and recursion

The diff of objects A and B is computed recursively, handling dicts and lists with different algorithms.

Diff approach for dicts

When computing the diff of two dicts, items are always aligned by key value, i.e. under no circumstances are values under different keys compared or diffed. This makes both diff and merge quite straightforward. Modified leaf values that are both a list or both a dict will be diffed recursively, with the diff object recording a “patch” operation. Any other modified leaf values are considered replaced.
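The key-aligned approach can be sketched in a few lines. This is an illustrative sketch rather than nbdime's algorithm: the name diff_dicts is hypothetical, keys are visited in sorted order only to make the output deterministic, and for brevity modified lists are replaced wholesale instead of being diffed recursively.

```python
def diff_dicts(a, b):
    """Diff two dicts, aligning strictly by key.

    Hypothetical helper for illustration. Nested dicts are diffed
    recursively; any other modified value (including lists, as a
    simplification of this sketch) is recorded as replaced.
    """
    ops = []
    for key in sorted(a.keys() | b.keys()):
        if key not in b:
            ops.append({"op": "remove", "key": key})
        elif key not in a:
            ops.append({"op": "add", "key": key, "value": b[key]})
        elif a[key] != b[key]:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                ops.append({"op": "patch", "key": key,
                            "diff": diff_dicts(a[key], b[key])})
            else:
                ops.append({"op": "replace", "key": key, "value": b[key]})
    return ops


swapped = diff_dicts({"a": "x", "b": "y"}, {"a": "y", "b": "x"})
nested = diff_dicts(
    {"metadata": {"tags": "t", "name": "n"}, "source": "x"},
    {"metadata": {"tags": "t"}, "source": "x", "outputs": []},
)
```

Since values under different keys are never compared, swapping two values yields two "replace" ops rather than any kind of move.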

Diff approach for lists

We wish to diff sequences and also recurse and diff aligned elements within the sequences. The core approach is to first align elements, requiring some heuristic for comparing elements, and then recursively diff the elements that are determined equal. These heuristics will contain the bulk of the notebook-specific diff algorithm customizations.

The most used approach for computing line-based diffs of source code is to solve the longest common subsequence (LCS) problem or some variation of it. We extend the vanilla LCS problem by allowing customizable predicates for approximate equality of two items, allowing e.g. a source cell predicate to determine that two pieces of source code are approximately equal and should be considered the same cell, or an output cell predicate to determine that two bitmap images are almost equal.
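The idea of an LCS with a pluggable equality predicate can be sketched with the textbook dynamic-programming table. This is a minimal sketch and not the algorithm nbdime uses: the name lcs_pairs and the whitespace-insensitive predicate same_code are assumptions chosen for illustration.

```python
def lcs_pairs(a, b, equal=lambda x, y: x == y):
    """Return index pairs (i, j) aligning a[i] with b[j] along one LCS.

    Hypothetical helper for illustration: `equal` may be any predicate
    for approximate equality. Quadratic time and memory.
    """
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if equal(a[i], b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the aligned pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if equal(a[i], b[j]):
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def same_code(x, y):
    """Assumed predicate: sources are equal up to whitespace."""
    return "".join(x.split()) == "".join(y.split())


old = ["import numpy as np", "x = np.arange(10)", "print(x)"]
new = ["import numpy as np ", "y = 3", "print( x )"]
strict = lcs_pairs(old, new)
relaxed = lcs_pairs(old, new, equal=same_code)
```

With exact equality no cells align here, while the relaxed predicate aligns the first and last cells. Aligned pairs would then be diffed recursively, and unaligned items become removals and additions.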

In addition we have an experimental multilevel algorithm that employs a basic LCS algorithm with a sequence of increasingly relaxed equality predicates, allowing e.g. prioritizing equality of source+output over just equality of source. Note that determining good heuristics and refining the above mentioned algorithms will be a significant part of the work and some experimentation must be allowed. In particular the behaviour of the multilevel approach must be investigated further and other approaches could be considered.

Displaying metadata diffs

The notebook format has metadata in various locations, including on each cell, output, and top-level on the notebook. These are dictionaries with potentially arbitrary JSON content. Computing metadata diffs is not different from any other dictionary diff. Metadata generally does not have a visual representation in the live notebook, but changes to it must still be indicated in the diff view. We will explore various representations of metadata changes in the notebook view. The most primitive would be to display the raw dictionary diff as a JSON field in the notebook view, near the displayable item the metadata is associated with.

Note about the potential addition of a “move” transformation

In the current implementation there is no “move” operation. Furthermore we make some assumptions on the structure of the JSON objects and what kind of transformations are meaningful in a diff.

Items swapping position in a list will be considered added and removed instead of moved, but in a future iteration adding a “move” operation is an option to be considered. The main use case for this would be to resolve merges without conflicts when cells in a notebook are reordered on one side and modified on the other side.

Even if we add the move operation, values will never be moved between keys in a dict, e.g.:

diff({"a":"x", "b":"y"}, {"a":"y", "b":"x"})

will be:

[{"op": "replace", "key": "a", "value": "y"}, {"op": "replace", "key": "b", "value": "x"}]

In a notebook context this means for example that data will never be considered to move across input cells and output cells.

Merge format

A merge takes as input a base object (notebook) and local and remote objects (notebooks) that are modified versions of base. The merge computes the diffs base->local and base->remote and tries to apply all changes from each diff to base. The merge returns a merged object (notebook) that contains all successfully applied changes from both sides, and two diff objects merged->local and merged->remote which contain the remaining conflicting changes that need manual resolution.
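The shape of this result can be sketched for the simplest case of flat dicts. This is an illustrative sketch only, not nbdime's merge: the names merge_dicts and _change_op are hypothetical, values are compared without recursion, and a conflicting key simply keeps its base value in the merged object.

```python
_MISSING = object()  # sentinel for "key not present"


def _change_op(old, new, key):
    """Build the mapping op that turns `old` into `new` at `key`."""
    if new is _MISSING:
        return {"op": "remove", "key": key}
    if old is _MISSING:
        return {"op": "add", "key": key, "value": new}
    return {"op": "replace", "key": key, "value": new}


def merge_dicts(base, local, remote):
    """Three-way merge of flat dicts (hypothetical helper).

    Returns (merged, local_conflicts, remote_conflicts), where the two
    conflict lists are diffs merged->local and merged->remote.
    """
    merged = {}
    local_conflicts, remote_conflicts = [], []
    for key in sorted(base.keys() | local.keys() | remote.keys()):
        b = base.get(key, _MISSING)
        loc = local.get(key, _MISSING)
        rem = remote.get(key, _MISSING)
        if loc == rem:      # both sides agree (or neither changed)
            chosen = loc
        elif loc == b:      # only remote changed
            chosen = rem
        elif rem == b:      # only local changed
            chosen = loc
        else:               # both changed differently: keep base value
            chosen = b
            local_conflicts.append(_change_op(b, loc, key))
            remote_conflicts.append(_change_op(b, rem, key))
        if chosen is not _MISSING:
            merged[key] = chosen
    return merged, local_conflicts, remote_conflicts


base = {"source": "x = 1", "execution_count": 1, "tags": "a"}
local = {"source": "x = 2", "execution_count": 2, "tags": "a"}
remote = {"source": "x = 1", "execution_count": 3}
merged, local_conf, remote_conf = merge_dicts(base, local, remote)
```

Here the source edit from local and the tag removal from remote are both applied automatically, while the execution count, changed differently on each side, is left for manual resolution or for an option that drops such secondary data.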

Pros and Cons

Pros associated with this implementation include:

  • Improved workflows when placing notebooks in version control systems

  • Possibility to use notebooks for self-documenting regression tests

Cons associated with this implementation include:

  • Vanilla git installs will not receive the improved behaviour, i.e. this will require installation of the package. To reduce the weight of this issue the package should avoid unneeded heavy dependencies.

Interested Contributors

@martinal @minrk
