Cell ID Addition to Notebook Format

Problem

Modern applications need a mechanism for referencing and recalling particular cells within a notebook. These references need to survive changes to the notebook, both within a single notebook session and across future sessions.

Some application examples include:

  • generating URL links to specific cells

  • associating an external document to the cell for applications like code reviews, annotations, or comments

  • comparing a cell’s output across multiple runs

Existing limitation

Traditionally, custom tags on cells have been used to track particular use-cases for cell activity. Custom tags work well for some things, such as identifying the class of content within a cell (e.g., the papermill parameters cell tag). The tags approach falls short when an application needs to associate a cell with an action or resource dynamically. Additionally, the lack of a cell id field has led applications to generate ids in different proprietary or non-standard ways (e.g., metadata["cell_id"] = "some-string" vs. metadata[application_name]["id"] = cell_guuid).

Scope of the JEP

Most resource applications include ids as a standard part of the resource / sub-resources. This proposal focuses only on a cell ID.

Out of scope for this proposal is an overall notebook id field. The sub-resource of cells is often treated relationally, so even without adding a notebook id, this scoped change would improve the quality of abstractions built on top of notebooks. The intention is to focus on notebook id patterns after cell ids.

The Motivation for a JEP

The answers to these two questions determine whether a JEP is required:

1. Does the proposal/implementation PR impact multiple orgs, or have widespread community impact?

  • Yes, this JEP updates nbformat.

2. Does the proposal/implementation change an invariant in one or more orgs?

  • Yes, the JEP proposes a unique cell identifier.

This proposal covers both questions.

Proposed Enhancement

Adding an id field

This change would add an id field to each cell type in the 4.4 json_schema. Specifically, the required sections of raw_cell, markdown_cell, and code_cell would add the id field with the following schema:

"id": {
    "description": "A str field representing the identifier of this particular cell.",
    "type": "string",
    "pattern": "^[a-zA-Z0-9-_]+$",
    "minLength": 1,
    "maxLength": 64
}

This change is not an addition to the cells’ metadata space, which has an additionalProperties: true attribute. This is adding to the cell definitions directly at the same level as metadata, in which scope additionalProperties is false and there’s no potential for collision of existing notebook keys with the addition.
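As a rough illustration of the constraints above, the following sketch checks a candidate id with Python's standard `re` module and shows where the field sits in a cell. The helper name `is_valid_cell_id` and the sample cell are hypothetical and not part of nbformat. The character class is written with the hyphen last, which accepts the same characters as the schema pattern.

```python
import re

# Same character set and length bounds as the proposed schema. The hyphen is
# placed last in the class so it is unambiguously a literal character.
CELL_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cell_id(value):
    """Return True if value satisfies the proposed id constraints."""
    return isinstance(value, str) and CELL_ID_RE.fullmatch(value) is not None


# Hypothetical 4.5-style code cell: "id" is a sibling of "metadata",
# not a key inside it.
cell = {
    "id": "a1b2c3d4",
    "cell_type": "code",
    "metadata": {},
    "source": "print('hello')",
    "outputs": [],
    "execution_count": None,
}
```

A real implementation would more likely rely on a JSON Schema validator run against the notebook schema. The regular expression here only mirrors the id constraints for illustration.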

Required Field

The id field in cells would always be required for any future nbformat versions (4.5+). In contrast to an optional field, the required field avoids applications having to conditionally check if an id is present or not.

Relaxing the field to optional would lead to undesirable behavior. An optional field would lead to partial implementation in applications and difficulty in having consistent experiences which build on top of the id change.

Reason for Character Restrictions (pattern, min/max length)

RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) defines the unreserved characters allowed for URI generation. Since ids should be usable as referenceable points in web requests, we want to restrict ids to at most these characters. Of the four non-alphanumeric unreserved characters (-, ., _, and ~), one has semantic meaning which doesn’t impact our use-case (_) and two (. and ~) are restricted in URL generation, leaving only alphanumerics, -, and _ as the legal characters we want to support. This extra restriction also helps with storing ids in databases, where non-ASCII characters in identifiers can often lead to query, storage, or application bugs when not handled correctly. Since we don’t have a pre-existing strong need for . and ~ in our id field, we propose not introducing the additional complexity of allowing them here.

The length restrictions are there for a few reasons. First, ids should never be empty strings, so some natural minimum must be enforced. We could use 1 or 2 to accept basically any id pattern, or be more restrictive with a higher minimum k to guarantee a larger space of minimum-length ids (64^k combinations, since the pattern allows 64 distinct characters). Second, many database solutions want a fixed maximum length for indexable string identifiers, for both performance and ease of implementation. Ids will certainly be used in recall mechanisms, so ease of database use should be a strong criterion. Third, a UUID string takes 36 characters to represent (with the - characters), and we likely want to support this as an identity pattern for applications that want it. Thus we choose a range of 1-64 characters to provide flexibility and some measure of consistency.
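The arithmetic behind these bounds can be checked with a short sketch. The names `ALPHABET` and `id_space` are illustrative only. The alphabet is built from the characters the proposed pattern accepts.

```python
import string
import uuid

# Characters accepted by the proposed pattern: letters, digits, "-" and "_".
ALPHABET = string.ascii_letters + string.digits + "-_"


def id_space(length):
    """Number of distinct ids of exactly `length` characters."""
    return len(ALPHABET) ** length


alphabet_size = len(ALPHABET)          # 26 + 26 + 10 + 2 = 64
eight_char_ids = id_space(8)           # 64**8 == 2**48
uuid_length = len(str(uuid.uuid4()))   # hyphenated UUID string: 36 characters
```

An 8-character id therefore has 2^48 possible values, and a hyphenated UUID string fits within the 64-character maximum.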

Updating older formats

Older formats can be loaded by nbformat and trivially updated to 4.5 format by running uuid.uuid4().hex[:8] to populate the new id field. See the Case: loading notebook without cell id section for more options for auto-filling ids.
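A minimal sketch of that upgrade step is shown below. It assumes the notebook is already loaded as a plain Python dict with a "cells" list. The function name `fill_missing_ids` is hypothetical, and the retry loop and the bump of nbformat_minor to 5 are illustrative assumptions rather than nbformat's actual behavior.

```python
import uuid


def fill_missing_ids(notebook):
    """Give every cell without an id a short random one (modifies in place)."""
    used = {cell["id"] for cell in notebook["cells"] if "id" in cell}
    for cell in notebook["cells"]:
        if "id" in cell:
            continue  # keep ids that are already present
        new_id = uuid.uuid4().hex[:8]
        while new_id in used:  # collisions are unlikely, but cheap to guard
            new_id = uuid.uuid4().hex[:8]
        cell["id"] = new_id
        used.add(new_id)
    # Assumption: the upgraded document is marked as (at least) format 4.5.
    notebook["nbformat_minor"] = max(notebook.get("nbformat_minor", 0), 5)
    return notebook
```

Because `hex[:8]` yields only lowercase hexadecimal digits, the generated ids always satisfy the proposed pattern and length limits.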

Alternative Schema Change

Originally a UUID schema was proposed with:

"id": {
    "description": "A UUID field representing the identifier of this particular cell.",
    "type": "uuid"
}

where the id field uses the uuid type indicator to resolve its value. This is effectively a more restrictive variant of the string regex above. The uuid alternative has been dropped as the primary proposed pattern to better support the existing id-generating schemes mentioned above and to avoid large URI / content generation by direct insertion of the cell id. If uuid were adopted instead, applications with custom ids would have to do more to migrate existing documents, and byte-compression patterns would be needed for shorter URL generation tasks.

The uuid type was recently added to json-schema, referencing RFC 4122, which is linked for those unfamiliar with it.

As an informational data point, the jupyterlab-interactive-dashboard-editor uses UUID for their cell ID.

Reference implementation

The nbformat PR#189 has a full (unreviewed) working change of the proposal applied to nbformat. Note that the pattern allows a numeric first character, which is not allowed in some places in HTML4. Outside of tests and the cell id uniqueness check, the change can be captured with this diff:

diff --git a/nbformat/v4/nbformat.v4.schema.json b/nbformat/v4/nbformat.v4.schema.json
index e3dedf2..4f192e6 100644
--- a/nbformat/v4/nbformat.v4.schema.json
+++ b/nbformat/v4/nbformat.v4.schema.json
@@ -1,6 +1,6 @@
 {
     "$schema": "http://json-schema.org/draft-04/schema#",
-    "description": "Jupyter Notebook v4.4 JSON schema.",
+    "description": "Jupyter Notebook v4.5 JSON schema.",
     "type": "object",
     "additionalProperties": false,
     "required": ["metadata", "nbformat_minor", "nbformat", "cells"],
@@ -98,6 +98,14 @@
     },
 
     "definitions": {
+        "cell_id": {
+            "description": "A string field representing the identifier of this particular cell.",
+            "type": "string",
+            "pattern": "^[a-zA-Z0-9-_]+$",
+            "minLength": 1,
+            "maxLength": 64
+        },
+
         "cell": {
             "type": "object",
             "oneOf": [
@@ -111,8 +119,9 @@
             "description": "Notebook raw nbconvert cell.",
             "type": "object",
             "additionalProperties": false,
-            "required": ["cell_type", "metadata", "source"],
+            "required": ["id", "cell_type", "metadata", "source"],
             "properties": {
+                "id": {"$ref": "#/definitions/cell_id"},
                 "cell_type": {
                     "description": "String identifying the type of cell.",
                     "enum": ["raw"]
@@ -148,8 +157,9 @@
             "description": "Notebook markdown cell.",
             "type": "object",
             "additionalProperties": false,
-            "required": ["cell_type", "metadata", "source"],
+            "required": ["id", "cell_type", "metadata", "source"],
             "properties": {
+                "id": {"$ref": "#/definitions/cell_id"},
                 "cell_type": {
                     "description": "String identifying the type of cell.",
                     "enum": ["markdown"]
@@ -181,8 +191,9 @@
             "description": "Notebook code cell.",
             "type": "object",
             "additionalProperties": false,
-            "required": ["cell_type", "metadata", "source", "outputs", "execution_count"],
+            "required": ["id", "cell_type", "metadata", "source", "outputs", "execution_count"],
             "properties": {
+                "id": {"$ref": "#/definitions/cell_id"},
                 "cell_type": {
                     "description": "String identifying the type of cell.",
                     "enum": ["code"]

Questions

  1. How is splitting cells handled?

    • One cell (second part of the split) gets a new cell ID.

  2. What if I copy and paste (surely you do not want duplicate ids…)

    • A cell in the clipboard should have an id, but paste always needs to check for collisions and generate a new id if and only if there is one. The application can choose to preserve the id if it doesn’t violate this constraint.

  3. What if you cut-paste (surely you want to keep the id)?

    • On paste give the pasted cell a different ID if there’s already one with the same ID as being pasted. For cut this means the id can be preserved because there’s no conflict on resolution of the move action. This does mean the application would need to keep track of the ids in order to avoid duplications if it’s assigning ids to the document’s cells.

  4. What if you cut-paste, and paste a second time?

    • On paste give the pasted cell a different ID if there’s already one with the same ID as being pasted. In this case the second paste needs a new id.

  5. How should loaders handle notebook loading errors?

    • On notebook load, if the file is in an older format, update it and fill in ids. If a 4.5+ file has an invalid id format, raise a validation error like we do for other schema errors. We could auto-correct for bad ids if that’s deemed appropriate.

  6. Would cell ID be changed if the cell content changes, or just created one time when the cell is created? As an extreme example: What if the content of the cell is cut out entirely and pasted into a new cell? My assumption is the ID would remain the same, right?

    • Correct. It stays the same once created.

  7. So if nbformat >= 4.5 loads in a pre 4.5 notebook, then a cell ID would be generated and added to each cell?

    • Yes.

  8. If a cell is cut out of a notebook and pasted into another, should the cell ID be retained?

    • This will depend on the application for now, as this JEP only focuses on Cell ID within an individual notebook. Different applications might handle pasting cells across notebooks differently.

  9. What are the details when splitting cells?

    • The JEP doesn’t explicitly constrain how this action should occur, but we suggest one cell (preferably the one with the top half of the code) keeps the id, and the other gets a new id. Each application can choose how to behave here so long as the cell ids are unique and follow the schema. This can be a per-application choice to learn and adapt to what users expect, without requiring a new JEP.
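The paste and split answers above can be sketched as follows. This is one possible application-level policy, not something the JEP mandates. The helper names (`new_cell_id`, `paste_cell`, `split_cell`) are hypothetical, cells are plain dicts, and "source" is assumed to be a single string.

```python
import copy
import uuid


def new_cell_id(existing_ids):
    """Return a short random id that is not already in existing_ids."""
    candidate = uuid.uuid4().hex[:8]
    while candidate in existing_ids:
        candidate = uuid.uuid4().hex[:8]
    return candidate


def paste_cell(cells, clipboard_cell, index):
    """Insert a copy of clipboard_cell, changing its id only on collision."""
    existing = {cell["id"] for cell in cells}
    pasted = copy.deepcopy(clipboard_cell)
    if pasted["id"] in existing:
        pasted["id"] = new_cell_id(existing)
    cells.insert(index, pasted)
    return pasted


def split_cell(cells, index, offset):
    """Split cells[index] at a character offset; the top half keeps its id."""
    top = cells[index]
    bottom = copy.deepcopy(top)
    bottom["source"] = top["source"][offset:]
    bottom["id"] = new_cell_id({cell["id"] for cell in cells})
    top["source"] = top["source"][:offset]
    cells.insert(index + 1, bottom)
    return bottom
```

With this policy, a cut followed by a paste preserves the id because the cut removed the only cell holding it, while a second paste of the same clipboard cell collides and receives a fresh id.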

Pros and Cons

Pros associated with this implementation include:

  • Enables scenarios that require us to reason about cells as if they were independent entities

  • Used by Colab, among others, for many years, and generally useful. This JEP would standardize the approach to minimize fragmentation.

  • Allows apps that want to reference specific cells within a notebook

  • Makes reasoning about cells unambiguous (e.g. associate comments to a cell)

Cons associated with this implementation include:

  • Lack of UUID and a “notebook-only” uniqueness guarantee makes merging two notebooks difficult without managing the ids so they remain unique in the resulting notebook

  • Applications have to add default id generation if they do not use nbformat (or Python) for this (adding the proposal PR to nbformat, with tests included, took 1 hour)

  • Notebooks with the same source code can be generated with different cell ids, meaning they are not byte equal. This will make testing / disk comparisons harder in some circumstances

  • Pasting / manipulating cells needs to be aware of the other cells in a notebook. This increases the complexity for applications to implement Jupyter notebook interfaces
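To make the merging concern concrete, here is a hedged sketch of concatenating the cells of two notebooks while keeping ids unique. The function `merge_notebook_cells` is hypothetical. It assumes every cell already has an id and that ids within the first notebook are unique.

```python
import uuid


def merge_notebook_cells(first_cells, second_cells):
    """Concatenate two cell lists, re-keying colliding cells from the second."""
    merged = [dict(cell) for cell in first_cells]
    used = {cell["id"] for cell in merged}
    for cell in second_cells:
        copied = dict(cell)  # shallow copy so the input list is not modified
        if copied["id"] in used:
            new_id = uuid.uuid4().hex[:8]
            while new_id in used:
                new_id = uuid.uuid4().hex[:8]
            copied["id"] = new_id
        used.add(copied["id"])
        merged.append(copied)
    return merged
```

Any external resource that referred to a re-keyed cell by its old id (a comment thread, for example) would need to be remapped, which is the extra id management this con describes.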

Interested

@MSeal, @ellisonbg, @minrk, @jasongrout, @takluyver, @Carreau, @rgbkrk, @choldgraf, @SylvainCorlay, @willingc, @captainsafia, @ivanov, @yuvipanda, @bollwyvl, @blois, @betatim, @echarles, @tonyfast


Appendix 1: Additional Information

In this JEP, we have tried to address the majority of comments made during the pre-proposal period. This appendix highlights this feedback and additional items.

Pre-proposal Feedback

Feedback can be found in the pre-proposal discussions listed above. Additional feedback can be found in Notes from JEP Draft: Cell ID/Information Bi-weekly Meeting.

Min’s detailed feedback was taken and incorporated into the JEP.

$id ref Conclusion

We had a follow-up conversation with Nick Bollweg and Tony Fast about JSON schema and JSON-LD. In the course of the bi-weekly meeting, we discussed $id ref. From further review of how the $id property works in JSON Schema, we determined that the use of this property is orthogonal to the use case proposed here. A future JEP may choose to pursue using this field for another purpose, but we’re going to keep it out of scope for this JEP.

Implementation Question

Auto-Update

A decision should be made to determine whether or not to auto-update older notebook formats to 4.5. Our recommendation would be to auto-update to 4.5.

Auto-Fill on Save

In the event of a content save for 4.5 with a missing id, we can either raise a ValidationError (as the example PR does right now) or auto-fill the missing id with a randomly generated one. We’d prefer the latter pattern, provided that ids which are present but invalid still raise a ValidationError.
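The preferred behavior could look roughly like the sketch below. `CellIdError` is a hypothetical stand-in for the ValidationError mentioned above, and `prepare_for_save` is an illustrative name, not an nbformat function. Cells are assumed to be plain dicts.

```python
import re
import uuid

# Mirrors the proposed schema: 1-64 characters from letters, digits, "-", "_".
CELL_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


class CellIdError(ValueError):
    """Hypothetical stand-in for the validation error raised on bad ids."""


def prepare_for_save(notebook):
    """Reject invalid or duplicate ids, then auto-fill any missing ones."""
    used = set()
    # First pass: ids that are present must be valid and unique.
    for cell in notebook["cells"]:
        if "id" not in cell:
            continue
        cell_id = cell["id"]
        if not isinstance(cell_id, str) or not CELL_ID_RE.fullmatch(cell_id):
            raise CellIdError(f"invalid cell id: {cell_id!r}")
        if cell_id in used:
            raise CellIdError(f"duplicate cell id: {cell_id!r}")
        used.add(cell_id)
    # Second pass: missing ids are filled with random, unused values.
    for cell in notebook["cells"]:
        if "id" not in cell:
            new_id = uuid.uuid4().hex[:8]
            while new_id in used:
                new_id = uuid.uuid4().hex[:8]
            cell["id"] = new_id
            used.add(new_id)
    return notebook
```

Validating before filling means a notebook with a bad id is rejected without being partially modified.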