Improving bulk import performance

Our partners are developing custom tools (typically console applications) to import external data into ADAM all the time. They do a great job, but performance is sometimes an issue. This blog post lists some important attention points and ways to improve this performance.

Disable the full-text index

Prevent the full-text index from updating during the import. This would take up a fair amount of the system's resources, while the index can still easily be rebuilt afterwards.

To disable the update, run Adam.Core.DatabaseManager.exe DisableFullTextIndex before the import and run Adam.Core.DatabaseManager.exe EnableFullTextIndex to re-enable the full-text index after the import.

Note that this can't be done when importing on a live production server, for obvious reasons.

e.g.:
C:\Program Files\Adam Software\Adam\Adam.Core.DatabaseManager.exe DisableFullTextIndex -database=Test

Update SQL Server statistics

In case you're importing a large number of records (e.g. more than doubling the size of your database), the statistics in SQL Server can rapidly get outdated. Therefore, it's best to run the sp_updatestats stored procedure during the import, after every few thousand records.

This ensures that the update doesn't gradually slow down during a large import. This is much less relevant when you're only increasing the number of records by 5% for example.

e.g.:

Keep fragmentation low

Try to keep the fragmentation of your indexes as low as possible. Check for fragmentation before the import and also a couple of times during the import. By running the sys.dm_db_index_physical_stats management view in SQL Server Management Studio you can see the fragmentation level of all the tables and indexes in your database.
Pay special attention to the avg_fragmentation_in_percent column: this value should be as low as possible, and 0 is best.

e.g.: the following query for example returns the fragmentation info for all objects in YourDatabase:
select * from sys.dm_db_index_physical_stats(DB_ID('YourDatabase'), null, null, null, null)

When you notice fragmentation, the easiest way to reorganize or rebuild all indexes is by creating a Maintenance Plan in SQL Server. You can find more info on how to set this up in the ADAM Admin Guide > Post-installation tasks. In case there are only one or two indexes to defragment, you can simply right-click the Indexes node in the Object Explorer and choose Rebuild or Reorganize depending on the level of fragmentation:

The document Reorganizing and Rebuilding Indexes explains when to perform a rebuild and when to perform a reorganization.

Disable Allow Operator Searches

Having many fields with the Allow Operator Searches property enabled can also cause an import to slow down: each time a record is updated all these indexes need to be updated as well. It could be faster to disable these indexes before the import (assuming you don't need to query for these fields during the import) and re-enabling them afterwards.

This is also not recommended when importing on a live production server.

Run multiple imports concurrently

One of the most expensive operations of an import is preview creation. A preview of a single record is often (depending on the media engine used) done using a single thread, so other cores of that server wouldn't have anything to do at that time. If you write your import tool so that it can import a part of the total import, you could start several runs of the importer simultaneously and have each perform their part. And in case you use the ADAM maintenance framework for your importer this is automatically done for you. All you need to do is start multiple threads of the maintenance manager.

e.g.: if most of your import time is spent creating previews and you're using a quad core CPU, then you will get the most out of your machine by starting 4 threads.

The following command starts 4 threads of the RunMaintenanceJobs command:

C:\Program Files\Adam Software\Adam>Adam.Core.CommandLine.exe RunMaintenanceJobs -threads=4

It's not recommended to start more threads than you have cores, as this may have a negative impact on performance. You could however use this command on multiple ADAM servers simultaneously (if your license allows it).

Use different accounts when running concurrently

If you're running multiple imports concurrently, be aware that the StorageUsed property of the user account executing the import is incremented for each file. To help preventing locking it's best to run each importer with a separate user account. This also ensures that you'll never have transactions waiting for the commit of another transaction which is also trying to increment this StorageUsed value.

Simplify your SearchExpressions

Custom importers usually contain some queries to check whether specific records or classifications already exist or not. Typically this is done by searching for an object and then either trying to load it from the database or by using one of the ADAM helper classes. It's important to know that all SearchExpressions are always translated by ADAM into a single SQL query. So, the simpler the search expression, the simpler the resulting SQL query and the faster it will execute.

By consequence you should try to write your search expressions using ID's instead of names, identifiers or especially labels, not only for records or classifications, but also when searching for any object that you can identify using an ID.

e.g.:
the search expression
Classification = '650FBB19-13CE-4794-A7EF-CDAC4EF7DB7B'

could return the exact same result as the following one
Classification.NamePath = '/Products/Dishwashers/Type 123'

but the first will execute much faster than the second.

Use ReadUncommitted

As mentioned before, running multiple imports concurrently is a good way for gaining performance. But it has a price, because the more imports you run concurrently, the more chance you have that they'll lock each other during the import as they'll probably be querying and updating the same information. To prevent this from happening, run ReadUncommitted queries where possible when querying the database.

e.g.:
the following code tries to get the ID of the record that has a SKU-field with the value 1234. Since it must perform a database scan, there's a chance that the SQL query generated by this code will get blocked by other imports that are currently creating records (because they could be creating a record with SKU value 1234 as well).

C#
1
2
RecordHelper helper = new RecordHelper(app);
Guid? id = helper.GetId(new SearchExpression("sku = 1234"));

The next code does exactly the same thing, but with two very important differences:

  • First, the second argument tells the GetIds-method not to perform a security check. Since there's a very good chance that you're running your import with an account that has access to pretty much everything in the database, you might as well tell ADAM not to perform a security check.
  • Second, the ReadUncommitted argument prevents this query from being blocked by other updates that other importers may be doing.

C#
1
2
3
RecordHelper helper = new RecordHelper(app);
Guid? id = helper.GetId(new SearchExpression("sku = 1234"),
    false, LockMode.ReadUncommitted);

Network

When running multiple imports concurrently, make sure your network or IO doesn't become a bottleneck. Set the Catalog Log Details setting to Verbose, import a bunch of files using your importer and then check the log to see how long each media engine actually took to perform a specific action on a file. If you notice that these operations start to slow down as you run more imports concurrently, you may be facing an IO or network issue. What could help to get around this, is to copy each file locally before importing it so that the network is hit only once for each file and that the IO is more distributed over the different import machines.

These are a few tips to improve your bulk import performance tremendously. Feel free to post any other suggestions in the comments!

Comments

Leave a comment
You must be logged in to post comments.
Sign in now
 
 
Technical
Business
rss feed