We were reviewing some data this morning from various update-pack timing tests, as well as a breakdown of the types of changes in each update pack. During this analysis, we noticed that there doesn't seem to be a direct correlation between the size of an update pack and the time it takes to download and index it. For example, we've seen test runs where a 3MB update pack took longer to complete than a 93MB update pack.
Can you provide some insight into what happens during the indexing process at a detailed level? We're specifically interested in what's going on during the "Updating table of contents and search index" and "Updating contents" steps, and how the updates occur (i.e., does the process scan the file and update changes in place, or does it append changes?). Also, what is the time involved (i.e., is there a standard time required to make an update, regardless of size)?
We reviewed the DMP / DMC documentation on update packs. There is precious little information on the update process; however, there are a few statements on how documents get marked as changed:
- A document is marked as changed if it has been edited.
- A document is marked as changed if its position in the TOC has been changed.
A case has been opened with PTC Support (Case ID C11102950) asking for any insight they can provide on the runtime behavior of updates.
Findings: DMC Update Pack Operation
- During composition, three lists of updated content are compiled:
- Deleted files.
- Modified files. A checksum is computed for each file to determine whether it has been modified.
- New files.
- The index is updated to reflect deleted, modified and new files.
- The table of contents is updated to reflect deleted, modified and new files. Note that files that change position in the rebuilt table of contents are flagged as modified.
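The classification step above can be sketched as follows. This is an illustrative model only, not the actual DMC implementation: the checksum algorithm DMC uses is not documented, so SHA-256 stands in here, and the data structures are assumptions.

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 is an assumption; DMC's actual checksum algorithm is not documented.
    return hashlib.sha256(data).hexdigest()

def classify(previous: dict, current: dict):
    """Split a composition into the three update-pack lists.

    previous: path -> checksum recorded by the last composition.
    current:  path -> file contents of the new composition.
    """
    deleted = [p for p in previous if p not in current]
    new = [p for p in current if p not in previous]
    modified = [p for p, data in current.items()
                if p in previous and checksum(data) != previous[p]]
    return deleted, modified, new

# Hypothetical document set: a.html was edited, d.html removed, c.html added.
prev = {"a.html": checksum(b"old a"), "b.html": checksum(b"same b"),
        "d.html": checksum(b"gone")}
curr = {"a.html": b"new a", "b.html": b"same b", "c.html": b"brand new"}
print(classify(prev, curr))  # (['d.html'], ['a.html'], ['c.html'])
```

Note that this model also shows why a TOC reposition marks a file as modified: if the rebuilt TOC entry is part of the file's stored bytes, its checksum changes even when the body text does not.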
During deployment, existing files are validated against the deleted, modified, and new lists; the index and table of contents are then rebuilt in memory and written to the filesystem. Modified files in the document set are rewritten in their entirety. The complexity (i.e., the compute time) of an update is therefore tied to:
- the number of files extant on the local file system,
- the number of files in the update pack,
- the number of index and table of contents entries extant in the document set, and
- the number of new index and table of contents entries created by the update pack.
Note that the size of an update pack is a much less significant factor than any of the preceding ones. There is also a fixed amount of overhead to initialize the Java Runtime Environment (JRE), load the DMC Java classes, and launch the lzPack install engine; on older systems, this fixed overhead may be very significant.
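A back-of-the-envelope cost model makes the 3MB-slower-than-93MB observation plausible. All coefficients below are invented for illustration; the point is only the shape of the model, in which pack size is a minor term next to file and entry counts plus the fixed JRE/installer overhead.

```python
def estimated_update_seconds(local_files, pack_files, existing_entries,
                             new_entries, pack_mb, fixed_overhead=20.0):
    # fixed_overhead: JRE startup + DMC class loading + installer launch.
    # All per-unit costs below are hypothetical, chosen only to show proportions.
    return (fixed_overhead
            + 0.002 * local_files                        # validate existing files
            + 0.05 * pack_files                          # rewrite modified/new files
            + 0.001 * (existing_entries + new_entries)   # rebuild index/TOC in memory
            + 0.1 * pack_mb)                             # transfer/unpack (minor term)

# A small 3MB pack touching many files and entries can cost more than a
# large 93MB pack touching few of either:
small_busy = estimated_update_seconds(50_000, 400, 200_000, 5_000, 3)
large_quiet = estimated_update_seconds(50_000, 20, 200_000, 50, 93)
print(small_busy > large_quiet)  # True
```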
Recommendations for more efficient updates
- Whenever possible, preserve the position of existing table of contents entries to avoid cascading updates to subsequent entries and files.
- Limit the number of index and table of contents entries to the minimal useful set of entries.
- Limit the number of files in a document set (i.e. prefer fewer larger files to numerous smaller files).
- Avoid unnecessarily recomposing PDF files. Merely saving and recomposing will cause internal fields to be updated, which will change the checksum of the PDF.
- Publish HTML chunks to DMC rather than large binary files. Smaller fractions of documents will be updated, and the cost of transferring and updating HTML files is significantly less than the same costs for PDF files.
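The recomposition pitfall can be demonstrated concretely. In the sketch below (SHA-256 standing in for DMC's undocumented checksum, and a made-up `/CreationDate` field standing in for a PDF's internal metadata), two files with identical content but different internal timestamps produce different checksums, so the file is flagged as modified even though nothing visible changed.

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 stands in here; the actual DMC checksum algorithm is not documented.
    return hashlib.sha256(data).hexdigest()

body = b"...identical document content..."
first = b"/CreationDate (D:20240101120000)" + body
second = b"/CreationDate (D:20240101120005)" + body  # recomposed five seconds later

print(checksum(first) == checksum(second))  # False: flagged as modified
```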