CDX files updating on server
At the IA, we have recently switched to building CDX files using the -identity option on the arc-indexer and warc-indexer tools.
The following configuration is required for a Remote Resource Index:
... which differ only in the capitalization of the letter "i".
This session ID problem can be mitigated by canonicalizing the URLs as they are placed in the index, so that the index contains the following URLs instead of the original form captured by the crawler:
Currently the Wayback includes only a single reference implementation of a canonicalization scheme, called Aggressive Url Canonicalizer.
This implementation provides the following canonicalization:
These heuristics generally correct many common URL lookup problems, but in some cases these operations do the wrong thing, typically by making content that is actually different appear to be the same.
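To make the trade-off concrete, here is a minimal sketch of this kind of canonicalization. It is not the Aggressive Url Canonicalizer itself: the function name, the session ID patterns, and the exact set of rules (lowercasing, dropping a leading "www.", stripping session IDs) are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical session-ID patterns for illustration only; they are not
# copied from the Wayback source.
_SESSION_PATTERNS = [
    re.compile(r";jsessionid=[0-9a-z]+", re.IGNORECASE),
    re.compile(r"(?<=[?&])(?:phpsessid|sid)=[0-9a-z]+&?", re.IGNORECASE),
]


def canonicalize(url: str) -> str:
    """Return a lookup key for url (illustrative rules, see lead-in)."""
    parts = urlsplit(url.strip())
    # Scheme and port are ignored in this sketch; only the host is kept.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Lowercase path and query, then strip session identifiers.
    rest = parts.path.lower()
    if parts.query:
        rest += "?" + parts.query.lower()
    for pattern in _SESSION_PATTERNS:
        rest = pattern.sub("", rest)
    # Remove a dangling "?" or "&" left behind by the stripping.
    return host + rest.rstrip("?&")
```

With these rules, two captures of one page under different session IDs share a key, which is the intended effect. The same lowercasing also maps "http://example.com/Index.html" and "http://example.com/index.html" to one key, which is wrong on a server that treats them as different documents.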
This Resource Index implementation assumes a local database of all documents within the Wayback Collection.
The type of database is specified with the source property.
By keeping the original "identity" CDX files, we have been able to test various URL canonicalization strategies without the overhead of re-processing all the ARC/WARC source materials.
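The re-keying step this allows can be sketched as follows. This is an illustration under stated assumptions, not the tooling used at the IA: it assumes space-delimited CDX lines whose first field holds the original URL (as in an "identity" CDX), an optional header line beginning with " CDX", and a caller-supplied canonicalize function. The name rekey_cdx is hypothetical.

```python
from typing import Callable, Iterable, List


def rekey_cdx(lines: Iterable[str],
              canonicalize: Callable[[str], str]) -> List[str]:
    """Replace the first field of each CDX line with a canonical key."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the assumed " CDX ..." header line.
        if not line or line.startswith(" CDX"):
            continue
        original_url, _, rest = line.partition(" ")
        out.append(canonicalize(original_url) + " " + rest)
    # Re-sort, since changing the keys changes the record order.
    out.sort()
    return out
```

Trying a different canonicalization strategy then only means calling rekey_cdx again with another function over the same identity CDX lines; the ARC/WARC files are never read.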