DMC - What happens during indexing process at a detailed level?

Useful stuff for users and developers of the XML based dynamic publishing system "Arbortext". All information in this wiki came from posts to the Arbortext Adepters mailing list that were identified by members of that list as being extremely useful because the mailing list has moved around over the years. For more information see the group FAQ.

Post by liz » Thu Jan 12, 2017 7:01 pm

Updated: 10/2012

Question:

We were reviewing some data this morning from various update pack time tests, as well as a breakdown of the types of changes in each update pack. During this analysis, we noticed that there doesn't seem to be a direct correlation between the size of an update pack and the time it takes to download and index it. For example, we’ve seen test runs where a 3 MB update pack took longer to complete than a 93 MB update pack.

Can you provide some insight into what happens during the indexing process at a detailed level? We’re specifically interested in what’s going on during the “Updating table of contents and search index” and “Updating contents” steps, and in how the updates occur (i.e., does the process scan the file and update the changes in place, or does it append changes?). Also, what is the time involved (i.e., does an update take a standard amount of time regardless of size)?

Answer:

We reviewed the DMP / DMC documentation on update packs. There is precious little information on the update process; however, there are a few statements on how documents get marked as changed:

  1. A document is marked as changed if it has been edited.
  2. A document is marked as changed if its position in the TOC has been changed.

A case has been opened with PTC Support (Case ID C11102950) asking for any insight they can provide on the runtime behavior of updates.

Findings: DMC Update Pack Operation



Composition



  1. During composition, three lists of updated content are compiled:
    1. Deleted files.
    2. Modified files. A checksum is computed for each file to determine whether it has been modified.
    3. New files.
  2. The index is updated to reflect deleted, modified and new files.
  3. The table of contents is updated to reflect deleted, modified and new files. Note that files that change position in the rebuilt table of contents are flagged as modified.
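The three lists described above can be sketched in a few lines of Python. This is only an illustration of checksum-based change detection, not DMC's actual implementation: the function names are hypothetical, and SHA-256 is an assumption, since the findings do not say which checksum DMC computes.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes (algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def classify_changes(old: dict[str, str], new: dict[str, str]):
    """Split two snapshots into deleted, modified and new file lists."""
    deleted = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    # A file present in both snapshots is modified only if its checksum differs.
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return deleted, modified, added
```

Calling `classify_changes(snapshot(previous_dir), snapshot(current_dir))` on two hypothetical publication directories yields the deleted, modified and new lists. Note that any byte-level difference changes the checksum, regardless of how small the edit is.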

Deployment



During deployment, existing files are validated against the lists of deleted, modified and new files. The index and table of contents are then rebuilt in memory and written to the file system. Modified files in the document set are completely rewritten.

The complexity (i.e. the compute time) of an update is therefore tied to:

  1. The number of files already on the local file system.
  2. The number of files in the update pack.
  3. The number of index and table of contents entries already in the document set.
  4. The number of new index and table of contents entries created by the update pack.

The size of the update pack is a much less significant factor than any of these. There is also a fixed amount of overhead to initialize the Java Runtime Environment (JRE), load the DMC Java classes and launch the IzPack install engine. On older systems, this fixed overhead may be very significant.
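To make the point about pack size concrete, here is a toy cost model in Python. It is a sketch under loud assumptions: the `UpdateJob` type, the `work_units` function and the equal weighting of every term are hypothetical, and the counts in the usage example are invented purely for illustration. Only the 3 MB and 93 MB pack sizes come from the question.

```python
from dataclasses import dataclass


@dataclass
class UpdateJob:
    local_files: int       # files already on the local file system
    pack_files: int        # files carried by the update pack
    existing_entries: int  # index + TOC entries already in the document set
    new_entries: int       # index + TOC entries created by the update pack
    pack_bytes: int        # pack size; deliberately not used by the model


def work_units(job: UpdateJob, fixed_overhead: int = 0) -> int:
    """Toy cost: one unit per file validated and per entry rebuilt (assumed weights)."""
    return (
        fixed_overhead
        + job.local_files
        + job.pack_files
        + job.existing_entries
        + job.new_entries
    )


# Hypothetical counts: a small pack applied to a large document set,
# and a large pack applied to a small one.
small_pack = UpdateJob(20_000, 500, 80_000, 2_000, pack_bytes=3 * 2**20)
large_pack = UpdateJob(2_000, 10, 5_000, 50, pack_bytes=93 * 2**20)

print(work_units(small_pack) > work_units(large_pack))  # True
```

Under this model the 3 MB pack costs more than the 93 MB pack simply because it touches a document set with far more files and entries, which is consistent with the behavior reported in the question.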

Recommendations for more efficient updates



  1. Whenever possible, preserve the position of existing table of contents entries to avoid cascading updates to subsequent entries and files.
  2. Limit the number of index and table of contents entries to the minimal useful set of entries.
  3. Limit the number of files in a document set (i.e. prefer fewer larger files to numerous smaller files).
  4. Avoid unnecessarily recomposing PDF files. Merely saving and recomposing will cause internal fields to be updated, which will change the checksum of the PDF.
  5. Publish HTML chunks to DMC rather than big binary files. Smaller fractions of documents will be updated and the cost of transferring and updating HTML files is significantly less than the same costs for PDF files.
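The first recommendation can be illustrated with a short Python sketch. It assumes, for the sake of the example, that an entry's "position" is its ordinal index in a flattened table of contents; the findings do not say how DMC actually represents position, and `moved_entries` is a hypothetical helper.

```python
def moved_entries(old_toc: list[str], new_toc: list[str]) -> list[str]:
    """Return entries present in both TOCs whose position index changed."""
    old_pos = {entry: i for i, entry in enumerate(old_toc)}
    return [
        entry
        for i, entry in enumerate(new_toc)
        if entry in old_pos and old_pos[entry] != i
    ]


old = ["intro", "install", "usage", "faq"]

# Appending at the end keeps every existing entry at its old position.
appended = old + ["appendix"]

# Inserting near the front shifts every later entry.
inserted = ["intro", "whats-new", "install", "usage", "faq"]

print(moved_entries(old, appended))  # []
print(moved_entries(old, inserted))  # ['install', 'usage', 'faq']
```

With this model, a single new entry near the top of the table of contents flags every subsequent entry as moved, while the same entry added at the end flags none.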
