• Bug
  • Status: Closed
  • Priority: 1 Critical
  • Resolution: Fixed
  • Component: ehcache-core
  • Assignee: cdennis
  • Reporter: mads1980
  • Created: June 19, 2013
  • Votes: 0
  • Watchers: 9
  • August 14, 2013
  • July 22, 2013

Attachments

Description

We migrated our production servers from 2.7.0 to 2.7.1

After some 15-20 minutes, all our servers began to crash for no apparent reason.

Digging deeper, we found that the vm.max_map_count Linux kernel parameter was being exceeded. This causes JVM errors such as OutOfMemoryError, among others. The default value for this kernel parameter is 65536.

Looking at /proc/PID/maps, over 60K of the 65536 entries are taken by mmap references for localTempSwap files.

FYI, we only have about 80 caches using localTempSwap, with the default diskAccessStripes value (which I believe is 1). So about 80 RandomAccessFiles should be opened by the localTempSwap caches, which would translate to roughly 80 entries in /proc/PID/maps - but we have over 60K entries, and we would probably have more had the server been configured with a higher vm.max_map_count value.

So what seems to be happening is that some change in 2.7.1 is leaking mmap references somehow. 2.7.0 worked fine.
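
[Editor's note] A quick way to confirm this kind of mapping growth from inside the JVM is to count how many entries in /proc/self/maps point at disk-store files. The following is a minimal, Linux-only sketch (not part of the original report; the ".data" suffix simply matches the disk-store file names visible in the maps dump):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MapCounter {
    public static void main(String[] args) throws IOException {
        // Each line of /proc/self/maps describes one memory mapping of this process.
        List<String> maps = Files.readAllLines(Paths.get("/proc/self/maps"));
        long dataFileMappings = maps.stream()
                .filter(line -> line.endsWith(".data")) // Ehcache disk-store files
                .count();
        System.out.println("total mappings:           " + maps.size());
        System.out.println("disk-store file mappings: " + dataFileMappings);
    }
}

Running something like this periodically inside the affected webapp would show the mapping count climbing toward vm.max_map_count.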

Comments

Fiona OShea 2013-06-19

Can we reproduce here? This seems like it could potentially be a major issue for 4.0.2

Manuel Dominguez Sarmiento 2013-06-19

By the way, setting persistence="none" instead of "localTempSwap" for the affected caches allowed us to continue running 2.7.1 in production.
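
[Editor's note] For reference, the equivalent change expressed with Ehcache's programmatic configuration API would look roughly like this (a sketch only, assuming the standard net.sf.ehcache 2.x classes; the cache name and sizing are purely illustrative):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.PersistenceConfiguration;
import net.sf.ehcache.config.PersistenceConfiguration.Strategy;

public class NoPersistenceWorkaround {
    public static void main(String[] args) {
        CacheConfiguration config = new CacheConfiguration("exampleCache", 1000)
                .timeToLiveSeconds(86400)
                // Workaround: keep entries on-heap only instead of swapping overflow
                // to temp files on disk (was Strategy.LOCALTEMPSWAP).
                .persistence(new PersistenceConfiguration().strategy(Strategy.NONE));

        CacheManager manager = CacheManager.getInstance();
        manager.addCache(new Cache(config));
        manager.shutdown();
    }
}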

Yi Zhang 2013-06-20

Hi Manuel, can you please provide us some additional information to help us investigate this issue:

  1. How many elements, and what is the total data size of your 80 caches?
  2. What is the number and size on disk of the temp swap files that are created (i.e. the *.data files in your diskstore path location)?
  3. What is your OS?
  4. Can you attach the ehcache logs to this jira?

Thanks!

Manuel Dominguez Sarmiento 2013-06-20

Hi Yi, regarding your questions:

1) The caches are practically empty, and elements do not even start to spill over into the disk store, so the files are basically empty (the elements fit in memory).

2) One localTempSwap file is created for each cache.

3) OS is RHEL 5.4

4) I have attached the relevant initialization logs.

PS: I don’t think this should be relevant, but the 80 caches belong to different webapps, each with its own separate cache manager and disk store. So each webapp has fewer than 20 localTempSwap caches. The logs I’ve attached belong to the webapp that has the most localTempSwap caches.

Manuel Dominguez Sarmiento 2013-06-20

I have attached the contents of /proc/14264/maps for your analysis, which clearly shows the mmap explosion.

Yi Zhang 2013-06-20

Hi Manuel, we’re pursuing a few leads but would like a bit more information. Can you please also send us your ehcache.xml? Are you aware of your app recreating caches for any reason (i.e. CacheManager.add/remove(cache))? Do you have any sample usage code that you can share?

Thanks
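
[Editor's note] For context, the dynamic cache lifecycle being asked about is roughly the following pattern (a hedged sketch using the public net.sf.ehcache API; the cache name is hypothetical):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;

public class DynamicCacheLifecycle {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.getInstance();
        // Adding a cache at runtime allocates fresh disk-store resources for
        // localTempSwap caches...
        manager.addCache(new Cache(new CacheConfiguration("scratch", 1000)));
        // ...and removing it releases them again. Repeating this cycle rapidly is
        // the kind of usage being asked about here.
        manager.removeCache("scratch");
        manager.shutdown();
    }
}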

Manuel Dominguez Sarmiento 2013-06-20

I have attached ehcache.xml

We are not creating caches dynamically, only via ehcache.xml, so no CacheManager.add/remove

All the affected localTempSwap caches see very low traffic, so usage volume should not be a factor. In fact, we think the mmap “explosion” occurred right after application startup. It took minutes for the servers to crash because the exhaustion indirectly caused OutOfMemoryErrors when trying to open files or create new threads, but we believe the problem was there all along since start-up. Of course we will not try to reproduce this again in production, so I guess the provided diagnostic data is all we have.

Yi Zhang 2013-06-20

Hi, we see some caches listed in the maps file, e.g.:

2aaaca73b000-2aaaca73c000 rw-s 00016000 fd:00 4063499 /home/personal-ar/cache/com%002erenxo%002ecms%002eservice%002edevice%002e%0044evice%004danager%002e%0049%004d%0041%0047%0045.data
2aaaca73c000-2aaaca73d000 rw-s 00030000 fd:00 9601028 /home/claro-uy/cache/com%002erenxo%002ecms%002eservice%002edevice%002e%0044evice%004danager%002e%0049%004d%0041%0047%0045.data

These have the apparent (decoded) name com.renxo.cms.service.device.DeviceManager.IMAGE,

which accounts for most of the entries:

grep "com%002erenxo%002ecms%002eservice%002edevice%002e%0044evice%004danager%002e%0049%004d%0041%0047%0045.data" 14264.maps | wc -l
59482

This cache is not in your ehcache.xml. Do you have any insight into where this comes from and how it gets created/used?

Manuel Dominguez Sarmiento 2013-06-20

Hi, as I mentioned before, we have several *different* webapps on the same server. I had only attached the ehcache.xml for the webapp with the most localTempSwap caches, which incidentally does not contain the cache you are mentioning (DeviceManager.IMAGE) which is taking up most of the mmap entries.

Most of the webapps contain this cache (that is why you will see the same cache under /home/claro-uy, /home/tigo-py, etc.). The global configuration in each ehcache.xml is the same as in the provided example, and the cache itself is defined just like the other localTempSwap caches, like this:

<cache name="com.renxo.cms.service.device.DeviceManager.IMAGE"
	maxEntriesLocalHeap="1000"
	eternal="false"
	timeToLiveSeconds="86400">
	<cacheEventListenerFactory class="com.renxo.cms.cache.jgroups.JGroupsCacheReplicatorFactory"
		properties="replicateAsynchronously=true, replicateUpdatesViaCopy=true" />
	<cacheExtensionFactory class="com.renxo.cms.cache.ExpiryThreadCacheExtensionFactory" />
	<bootstrapCacheLoaderFactory class="com.renxo.cms.cache.jgroups.JGroupsBootstrapCacheLoaderFactory"
		properties="bootstrapAsynchronously=true, asyncDelay=120000" />
	<persistence strategy="localTempSwap" />
</cache>

Interestingly enough, this cache is NOT being used AT ALL in the affected webapps. Please see our cache stats dump for this cache, taken at the same time as the mmap dump:

    Cache name: com.renxo.cms.service.device.DeviceManager.IMAGE
            Statistics:
                    Hit ratio: N/A
                    In-memory hits: N/A
                    On-disk hits: N/A
                    Hits: 0
                    Misses: 0
                    In-memory hits: 0
                    On-disk hits: 0
                    Object count: 0
                    Approx. in-memory byte size: 0 bytes
            Configuration:
                    Max. elements in memory: 1000
                    Eternal: false
                    Persistence: none
                    Time to idle (seconds): 0
                    Time to live (seconds): 86400
                    Disk expiry thread interval (seconds): 120

It is also interesting that we have several other similar caches with the exact same configuration and no usage at all, but only DeviceManager.IMAGE seems to be taking up the mmap entries:

VendorManager.IMAGE, OrganizationManager.IMAGE, CategoryManager.IMAGE, NetworkOperatorManager.IMAGE, etc.

I can only think that the cache name has something to do with it, perhaps because of iteration/collation order?

ilevy 2013-06-24

We have a docs req’d checkmark and a target of 4.0.2, but it’s not clear what to document. Maybe the remedy he used … but it doesn’t seem like the cause has been found.

Yi Zhang 2013-06-24

Yeah, at this point I’m not sure we really understand the cause. Rapidly recreating caches does reproduce the increase in maps count (until there is a GC), but they claim they’re not doing that; yet they didn’t seem willing or able to share code or reproduce. Given that they recognize it’s just that one named cache (configured the same as other caches) that’s causing the problem, I still strongly suspect an issue with their setup.

Yi Zhang 2013-06-24

Hi Manuel, our developers have done a code review of the relevant changes from 2.7.0 to 2.7.1 and don’t believe there were any changes that would result in the behavior you are observing. We have also tried different tests with your config file (including that cache name) but do not notice any issues; the caches only use a few hundred memory maps when they are not being constantly recreated. Just to clarify, did you experience this problem just one time? We ask in order to rule out any network/disk-related issues. If the problem is reproducible, can you please attach a test case for us to examine?

Manuel Dominguez Sarmiento 2013-06-24

We do not know how to reproduce the issue, except deploying in production, which of course we cannot attempt to do again.

This was reproduced at least 4 different times, across 8 different servers (6 of them running RHEL 4.4 and 2 of them running RHEL 5.4). All attempts at deploying the apps eventually crashed the servers, so this was not a coincidence.

The problem went away after disabling localTempSwap and replacing it with persistence=none.

Fiona OShea 2013-06-24

Unchecking Doc required field. Think the Reporter added it when creating the issue.

Chris Dennis 2013-06-26

I’ve done some more digging into this issue, and I believe I have found a scenario that can cause it. If you rapidly clear a cache over and over again, you can cause new mappings to be created faster than the JVM’s GC/reference processing can clean up the old ones. This will cause the process to bump up against the kernel limit. The Ehcache code will handle these mmap failures, but they can also cause internal allocations within the JVM to fail (which is obviously not good). I’ve applied a change locally in trunk that helps to mitigate this effect, although I cannot yet say when that fix will become part of a released artifact.
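
[Editor's note] A hedged sketch of the access pattern described above (illustrative only; the cache name is hypothetical, and whether the kernel limit is actually reached depends on heap size, GC timing, vm.max_map_count, and, per the discussion below, on using the enterprise disk store):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.PersistenceConfiguration;
import net.sf.ehcache.config.PersistenceConfiguration.Strategy;

public class RapidClearRepro {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.getInstance();
        Cache cache = new Cache(new CacheConfiguration("scratch", 1000)
                .persistence(new PersistenceConfiguration().strategy(Strategy.LOCALTEMPSWAP)));
        manager.addCache(cache);

        // Clear the cache over and over; each clear can leave stale file mappings
        // behind until the JVM's reference processing unmaps them, so the mapping
        // count can outrun cleanup and approach vm.max_map_count.
        for (int i = 0; i < 1_000_000; i++) {
            cache.put(new Element("key-" + i, "value"));
            cache.removeAll();
        }
        manager.shutdown();
    }
}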

This discovery does however bring up a number of questions:

  1. Are you repeatedly clearing the specific caches that see these problems (the com.renxo.cms.service.device.DeviceManager.IMAGE caches)?

  2. This bug also existed in 2.7.0, so I don’t understand (assuming all else is the same) why you didn’t see this behavior in 2.7.0 as well. Are you sure nothing else changed during the upgrade?

  3. This bug only exists in the proprietary/enterprise disk store that is packaged with the Enterprise product - could it be that you were using the open source artifacts in 2.7.0, and you switched to the proprietary enterprise artifacts when you upgraded to 2.7.1?

Manuel Dominguez Sarmiento 2013-06-26

Hi Chris, I think you’ve nailed it. These are my answers to your questions:

1) Yes, we are. When we upgraded our production platform, we also added a distributed cache invalidation scheme for this cache region, and we were running a process that continuously caused invalidations, so repeated cache.removeAll() calls ensued. We did not make the connection until you brought this up, but clearly that was the scenario.

2) We probably did not see this before in 2.7.0 because we did not have this invalidation scheme that caused repeated cache.removeAll() calls in our previous platform release.

3) We were using ehcache-ee-2.7.0.jar from the BigMemory Go 4.0.0 release, so we were already using the enterprise artifacts. The issue did not surface before because the repeated removeAll() was introduced together with 2.7.1 in the same software upgrade.

Yakov Feldman 2013-07-30

Hi Chris, can you please merge this change to the 4.0.x and 4.1.x branches? Once the merge is done, please add the revisions to the ‘Fixed in revision’ field.

Yakov Feldman 2013-08-14

The QA test com.terracotta.qa.ehcache.test.standalone.StandaloneClearCacheTest/multipleRemoveAll passed in 4.0.4.

Closing this defect.