Tree Editing

Specify Software Project Staff
07 September 2011
Version 1.0

Specify users are known to wonder why two or more people cannot edit the Taxon tree (or any data displayed as trees in Specify) at the same time. In fact, two or more users can update the tree-structured information in a Specify at the same time--it just cannot be done in the graphical tree display.

How Data are Edited

When a Specify user searches for a Collection Object record and loads it into a form, the data has been transferred from the database on disk into the memory of the computer. The data in the Specify form is a snapshot of the data on disk. If a different user updates and saves the same Collection Object before the first user saves their changes, a conflict will occur and the data form will refresh with the latest data from the database. The first user must then re-enter the data and press the 'save' button. This does not happen when new records are being added because they do not exist yet in the database. Most users never experience this conflict of two users updating the same record because the odds of two people editing the same record are quite small.

Most software applications work this this way, although some are designed to 'lock' the record that is currently 'open' and keep it 'locked' until the user is done with it. A 'locked' record cannot be updated by anyone else. A record locking approach has advantages and disadvantages, the Specify Project decided not to lock records for various additional technical reasons.

How are Tree Displays Different?

Tree-structured data in Specify are organized as a hierarchy which has a root (base) node and subsidiary nodes with parent-child relationships. Each item or node in the tree has a parent (except the root), and may or may not have child nodes. Users want to view more than one node in a tree, they usually want to see at least the parent and children of the node of interest. Viewing several nodes in a tree or subtree provides useful context by localizing nodes in relation to others.

An effective visual representation of hierarchical data shows the nodes of the tree with lines representing the parent-child relationships. In order to visualize the tree, each node, which is a record in the database, must be read into the computer's memory and displayed. The graphical tree is a snapshot of the records in the database, but the biggest difference between the graphical tree and the Collection Object form is that the form has just one Collection Object data record in memory. The tree may start out with just a few records, but as nodes are expanded more and more records are loaded into memory. This diagram shows a portion of a Specify tree display with the names of the nodes set as letters.

Specify Tree Display

Another important difference with trees is that each item in the tree is related to one or more other tree items. Adding, removing or moving a parent node changes the location of all the children nodes or removes them all together. In comparison, changing one Collection Object does not affect any of the other Collection Objects.

Again, the probability of two people changing the same Collection Object record at the same time is fairly remote, but this is not true for tree data displays. A simple change in one of the parent nodes affects all of the children below thus requiring them to all be updated. If more than one person could edit the tree at the same time, their own snapshot of the tree could become quickly outdated and any changes they would make to the out-of-date copy on display would be invalid. Ideally any user's graphical view of the tree should be automatically updated when changes are made by other users. But this woudl not be ideal--having the tree change before your eyes could be disconcerting, and the node you were editing could disappear all together.

Specify's Approach to Tree Editing

Specify prevents 'tree chaos' by locking out subsequent users from editing the tree when someone oepns a graphical tree display in 'edit' mode. When a user edits a tree with one of Specify's tree data editing functions, they can assume that what they see on the screen is the current state of the tree. If users are updating the tree nodes via the form system the tree is not locked. The tree is only locked when a graphical tree is opened for editing or for a few milliseconds when a user is saving a form's data for a node in the tree.

Specify 6.3 introduced a new function in the sidebar of the Tree pane. Users can now open any of the trees in 'view' mode and this does not lock the tree. The snapshot the user has open may become 'out of sync' with the database, but it will not cause any data integrity issues. The tree in 'view' mode is a great reference while editing them via the form system.

How can Multiple Users Edit the Same Tree?

Here are two simple guidelines that will maximize productivity for all users when working with Specify's trees:
  1. Only open the graphical tree editor in 'edit' mode when you know no one else will be editing the tree. If you are not sure, then only open it in 'view' mode.
  2. If possible, perform updating, adding, removing and moving of nodes in tree-structured data (taxon, locality, storage location, etc.) using the data form system. This only locks out other users while the actual 'save' is being performed. The lock will only prevent them from saving for a short amount of time.

Digging Deeper - an Engineering Point of View

This section is for the more technically inclined who wish to better understand how Specify's tree-structured data are maintained and updated in software.

One of the challenges of implementing a hierarchical tree in a SQL database is making it efficient. It is standard practice for each node to have a 'parent id', a node number (NN), and a 'highest child node number' (HCNN). The 'parent id' is exactly that, it is the id of the record of the parent. The 'node number' is the unique number of the tree node when the tree is numbered top-to-bottom-left-to-right, this number can change depending on where nodes are added or removed in relation to that node. The 'highest child node number' is the largest node number of any of it's children nodes. The most inefficient aspect of drawing an expandable tree is knowing whether any individual node has any children nodes. This is important because there needs to be a visual cue as to whether the node has children. In the first diagram this visual cue is a black triangle. By keeping the NNs and the HCNNs up to date the number of children can be calculated by subtracting the NN from the HCNN. In the diagram below, which is the same tree as the first diagram, the two boxes are the 'node number' and 'highest child node number' from left to right. For the 'D' node the difference between NN and HCNN is 2, which is the number children below 'D'.

Tree with Node IDs

Adding a new node to the 'J' node causes the tree to be renumbered and many of the HCNNs reset. The number of children for 'B' is 7-2=5, and there is five children under 'B'.

Tree Nodes Changed

When the 'J' node was added, note how every node in the tree except 'E' had one or more of it's node values updated. Caching information about the children node is critical for performance, without the information a separate query would need to be run to determine whether the triangle should be shown. The downside of caching the information is that it requires all the nodes to the right of the change to be updated. The node updating process itself is not very time consuming.

Suggestions as Questions and Answers

We have received several suggestions (and we are glad to receive them!) on how to improve graphical data trees in Specify but they usually are too inefficient or do not solve the underlying usability issues. For example:

Q. Could you stop using internal tree node numbers and HCNN and do everything dynamically--as the tree is opened level-by-level, or subtree-by-subtree anyway?
A. The performance impact would be minimal until a rank (column) with several hundred items needed to be displayed, then the delays would make the tree unusable.

Q. When the tree is in edit mode, maybe a only a portion of the tree, a sub-tree, could be locked?
A. Nice thought but this will not work because the entire tree may need to have its node numbers updated--not just for the locked portion of the tree but also for the unlocked portions.

Q. Would it be possible to load the tree into memory and then reconcile against the database when it is time to save?
A. For large trees this approach would require too much memory, some Taxon trees in Specify contain greater than 100,000 nodes. This approach would allow the entire tree to become out-of-sync with the data in the underlying database, which would be chaotic and deleterious in active multi-user environments.

Q. Could you load the tree into memory as the nodes are expanded then reconcile changes against the database when it is time to save the session changes?
A. If the branch you were working on was moved or re-parented, it could be very difficult to reconcile the tree. The entire branch you were working on could be deleted out from under you.