27 Replies Latest reply: Oct 26, 2010 5:17 PM by kyleh0000
kyleh0000 Level 1 (0 points)
Hi All,

I'm trying to find a solution to this problem, so all input is welcome.

The problem....

Our OSX Server, which is running 10.6.4 and is only doing file sharing, keeps crashing and disconnecting all users. The only way to recover from this fault is to reboot the server. Sometimes the server doesn't respond to SSH, ARD or even input at the console. Although this is not always the case, it usually requires a hard reset to get it back up.

The setup....

2 x OSX Servers running 10.6.4

Server 1:

Open Directory

ARD Task Server
QLA Server
DeployStudio Server

Server 2:

Server 2 has 2 Gigabit Ethernet ports bonded as one virtual interface. It is bound to Server 1 as a client only and runs nothing other than AFP.

Server 2 hosts the DeployStudio share along with the home drives and share drives.

Our clients log in to the machines using their OD accounts, so whenever the server crashes the machines lock up and need to be hard reset to get them up and running again.

We have roughly 250 Macs and around 400 users in the OD. Each client machine is connected at Gigabit speed.

Testing and things we have tried to resolve the issue...

We tested OSX 10.6.4 for around 2 months before moving it into production. We rolled out all our client images with this server with no signs of anything wrong, and the network throughput was at least that of a normal work day. The only difference is that on a work day we have around 200 server connections, compared to around 50 or so when we do our imaging.

The other day, when it crashed and we were actually able to remote in to the server, we also noticed that it was fine at around 3:00, then it suddenly thought it had 7000 connections (yes, 7000) and that it was doing 1000MB/sec throughput (yes, that's right, 1000MB/sec). This continued for around 10 minutes, then the server crashed.

I am at a bit of a loss to understand why this crashed. I have been trawling the logs to find a solution, but so far nothing has turned up. I did notice that on the day mentioned above there was a crash log for AFS, which gives the thread that fails, but nothing that I can make any sense of.

Here is the start of the log:

Process: AppleFileServer [241]
Path: /System/Library/CoreServices/
Identifier: AppleFileServer
Version: ??? (???)
Code Type: X86-64 (Native)
Parent Process: launchd [1]

PlugIn Path: /usr/sbin/AppleFileServer
PlugIn Identifier: AppleFileServer
PlugIn Version: ??? (???)

Date/Time: 2010-08-05 15:58:34.086 +0930
OS Version: Mac OS X Server 10.6.4 (10F569)
Report Version: 6

Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000000
Crashed Thread: 141

It then shows a total of 202 threads with a whole bunch of hex at the end. I can provide the entire log if you need it.
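
For anyone wanting to compare several of these crash reports side by side, here is a minimal Python sketch that pulls the header fields out of a report and counts the thread sections. It assumes the header is made of "Key: value" lines like the excerpt above and that the body uses "Thread N" headings, which is how Mac OS X crash reports are generally laid out. The function names are made up for illustration.

```python
import re

def parse_crash_header(report_text):
    """Collect the 'Key: value' header lines of a crash report into a dict."""
    fields = {}
    for line in report_text.splitlines():
        # Stop at the first backtrace section so stack data is not mistaken for header fields.
        if re.match(r"Thread \d+", line):
            break
        # Split on the first colon only, so values such as timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields

def count_threads(report_text):
    """Count distinct 'Thread N' headings (assumed layout of the report body)."""
    return len(set(re.findall(r"^Thread (\d+)", report_text, flags=re.MULTILINE)))

# A few lines from the excerpt quoted above, used as sample input.
excerpt = """Process: AppleFileServer [241]
Date/Time: 2010-08-05 15:58:34.086 +0930
OS Version: Mac OS X Server 10.6.4 (10F569)
Crashed Thread: 141
"""
header = parse_crash_header(excerpt)
print(header["Process"], "|", header["Date/Time"], "| crashed thread", header["Crashed Thread"])
```

Running both functions over every AppleFileServer report would show whether the crashed thread and the thread count are the same each time.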

Also, the server does not drop any pings at any time during the crashes.

I would appreciate any help with this issue, and I'd also like to hear about other people's experience with AFP. Does this sort of stuff happen to you, or is it just me?

Thanks in advance.

Message was edited by: kyleh0000

Mac OS X (10.5.4)
  • InterHmai Level 1 (60 points)
    Can you see the list of connections, who is connected, and from where?

    Did anything stand out in the system log when the server crashed?
  • pcolvin15 Level 1 (90 points)
    There was another post about Spotlight causing crashes, including taking out the LDAP database, that was supposed to be fixed in 10.6.4. The cure was to set all server volumes and hard drives as private in Spotlight (basically shutting down Spotlight). The error you have might be caused by mdworker, which runs the Spotlight indexing system.
  • kyleh0000 Level 1 (0 points)
    Thanks for the replies. As for the connections, last time we had around 200 people connected and the time before around 140, so I have been unable to see any real pattern there. Also, there is nothing in the system logs to indicate any errors at times that correspond to the crashes (except the crash log mentioned above).

    As for Spotlight, I did a bit more googling and came across a few items relating to Spotlight crashes, but these were for systems running 10.5 rather than 10.6.

    However, on the drive home I did think of something that I'm hopeful will resolve the issue....

    Previously we had been running 10.6.3 for a good 10 weeks solid with no crashes or even a hint of the server falling over. The reason I didn't want to roll the server back was that if I had to roll back the DS server it would be a huge pain in the *** and require me to basically recreate the entire directory to ensure everything was nice and consistent. But then I realised: hey, we don't need to roll back the DS. The only server causing issues is the one simply running AFP, which is easy as pie to roll back. All I need to do is a quick backup, then rebuild to 10.6.3, and we are done!

    The only concern I have with this is: if I have a DS server running 10.6.4 and a file server running 10.6.3, will there be any issues?

    But I figure if there is a problem with authentication between the 2 servers, it will only prevent new logins rather than boot off everyone that is already connected, which is much better than what it's doing now.

    I just have to organise some down time to roll this server back and I'll post my results back here!

    Also, for the record, the reason we went to 10.6.4 was to resolve issues with the Software Update server, and I have also found it's much more reliable at distributing printers and preferences to our client machines. But as the DS server also runs Software Update, I don't see how we have much to lose in rolling back this server :)
  • kyleh0000 Level 1 (0 points)
    Haven't had a chance to roll back to 10.6.3 yet to see if this makes a difference. However, I did notice that each of the crash reports states that at the time of the crash there were exactly 202 connections to the server...

    I'm thinking I might try to restrict the connections down to 195 to see if that resolves the issue until we find out the root cause.
  • Level 1 (0 points)
    Hi, did you have any further luck with this?

    I have a customer with the exact same issue, whereby, randomly, all the users are kicked out of the server and on occasion, the server has been seen to have upwards of 100 connections from each client. (They are a small office with under 10 users)

    The only thing which appears to be consistent is that it happens when Time Machine is starting a backup.
  • kyleh0000 Level 1 (0 points)
    We have now rolled back the server to OSX 10.6.2 and it seems to be stable (1 week so far).

    So it seems like a bug with OSX 10.6.4. I have already logged a job with Apple and sent them the log files. So far it's been about a week and a half without a response, so I'm not expecting much, but I'll keep following up and post back what they think.
  • kyleh0000 Level 1 (0 points)
    OK, this does NOT fix the issue. I heard back from Apple, who were less than helpful. They admitted this was a fault with the server OS, however they gave no indication of when the issue would be resolved, only that other users are getting the same issue.

    They have said turning off Spotlight on the server volumes helps but does NOT resolve the issue. We have done this and it has not helped at all; we are still getting random crashes at least 3 times a week, if not many more. When we rolled back to OSX 10.6.2 we did a complete server rebuild from scratch. The only thing that got installed over and above the base OS was Xtend iSCSI, which we use for our backups and which should not affect AFS at all.

    Apple would not admit any of this in writing. I'm about to give up on the server OS altogether and move the file services over to Linux.

    Apple does not have an even remotely stable server OS and it should not be advertised as such. We have had no end of problems with these servers, and it seems that Apple have no solution and do not even seem to care about it.

    If anyone else is having issues with this, I encourage you to post here in this thread and hope that someone from Apple takes notice, but I highly doubt that will happen. I will start a new thread on how to host file services on Linux rather than dragging out this thread any more. I can get share drives working on Linux but I'm having some issues with home drives. There are several guides on the internet on how to do this, so obviously others have come to the same conclusion as me, that Apple OSX Server is ****.
  • pcolvin15 Level 1 (90 points)
    Ok. You've got me confused. If I follow correctly, you said that you had a stable 10.6.2 to begin with, rolled to 10.6.3, which was also stable, and then rolled to 10.6.4, which is unstable, and nothing works to stop the crashes. You then rolled back to 10.6.2 and are now unstable.

    You may have a corrupted OD and/or LDAP, and the way to fix either one is not to archive/import it into a new build, but to rebuild it from scratch, since you would be importing a corrupted build back in.

    1,000MB/sec of throughput is 1GB/sec, which you might expect if you are showing 7,000 connections. However, it doesn't sound like you've looked at Activity Monitor when the problem hits to see which errant process might be taxing the server, or tried pulling the NIC cables to see what occurs. If your hit rate suddenly drops through the floor and your server calms down, then your problem is something external, such as a bad NIC or a bad box out there. The problem with a bad NIC, a broadcast storm, etc., is that without a sniffer you'll never see it.
  • kyleh0000 Level 1 (0 points)
    ok recap...

    We had a server running OSX 10.6.2, which seemed stable for around 10 weeks prior to the move.

    We rebuilt both servers from scratch to 10.6.4, including recreating the OD, specifically to avoid any problems.

    In an effort to resolve the crashing, we then rolled the AFP server back to OSX 10.6.2 and did a clean build. We didn't touch the OD Master, so currently we have one server running just AFP on OSX 10.6.2 and one other server running OD, Software Update, NetBoot, etc., on OSX 10.6.4.

    This was stable for around a week and I thought we had the problem resolved, but it turns out the same crash has occurred again. After finally hearing back from Apple, they told me the problem was a bug in OSX that caused these crashes.

    Also, just to clarify, our MAX user connections will normally never exceed 300. What happens is that when the server crashes, AFP thinks it has 5k+ connections and that the server is doing 1 gigabyte (not gigabit) a second of throughput. This is what Server Admin reports.
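
As a quick sanity check on that number, here is a rough Python sketch comparing the reported figure with what the file server's two bonded 1Gb ports could carry at most. It assumes the figure in Server Admin refers to network traffic on the bonded interface, and it ignores protocol overhead, so the real ceiling would be somewhat lower. The function name is made up for illustration.

```python
def link_capacity_mb_per_sec(ports, gbit_per_port=1.0):
    """Theoretical ceiling of a bonded link in megabytes per second (decimal units)."""
    bits_per_sec = ports * gbit_per_port * 1_000_000_000
    return bits_per_sec / 8 / 1_000_000

reported_mb_per_sec = 1000                    # roughly 1 gigabyte/sec, as shown in Server Admin
ceiling = link_capacity_mb_per_sec(ports=2)   # two bonded 1Gb ports on the file server

print(ceiling)                        # 250.0
print(reported_mb_per_sec / ceiling)  # 4.0
```

On those assumptions the reported rate is about four times what the wire could physically carry, which would point to the counters going haywire at crash time rather than to real traffic.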

    The server then dies and requires a reboot.

    As our users log in to network accounts, this means that the machines lock up and they can't do anything.

    The only other difference I can see between now and when things were stable is that the clients were running at 100Mb, whereas they now have gigabit to the desktop. The server throughput is definitely more spiky (i.e. much higher peaks than before). Besides that and the OSX updates, nothing has changed.

    For now I'm just going to have to get everyone to log on to local user accounts and connect to the server as needed, which as you can imagine creates all sorts of problems with security and becomes a nightmare for data management.

    Anyway, off to build a Linux server and wait on Apple to pull their finger out.....

    Message was edited by: kyleh0000
  • pcolvin15 Level 1 (90 points)
    Is your AFP server set as a standalone? If it is, then you shouldn't have any login issues as your clients should be logging in to the OD server.

    From the sounds of it, when the AFP server locks up it's doing a broadcast storm of some sort, swamping the network, and freezing out any network activity. When the server locks up, are the previously logged in clients still able to access other network resources?

    Where are the home directories located?
  • kyleh0000 Level 1 (0 points)
    The AFP server is simply bound to the OD Master, so that network users can log in and use AFP services.

    We did this because of another known bug in Mac OSX which would crash the server if AFP was running on the same box as Directory Services.

    The home directories, along with the share drives, are stored on a RAID which is connected to the AFP server. We logged an official call with Apple and sent them our log files. They came back saying it's a known fault within the Apple File Service itself and that they would fix it in a future update, which could be OSX 10.6.5 or could not be. They weren't very helpful at all.

    When the crash happens we will see one of 2 things. Either the AFP service will crash and boot everyone off, with the server requiring a reboot to resolve the issue...

    or the AFP service will crash and then start up again by itself. Users may not notice that at all, however after the crash the server runs EXTREMELY slowly until it is rebooted.

    We can access the server via ARD after the crash, so it's not like we can't establish any new network connections.

    Originally, when I first saw this error, I thought it might be someone trying to do a network flood attack or something. When I googled this I found that the SSH port can sometimes be used to do this, so I turned it off, but the crash still occurs.

    I'm basically at the point now where I don't think there is anything I'm going to be able to do to stop the crashes, so I need to find an acceptable solution for the clients.

    Your help is greatly appreciated. Forgive me if my posts are a bit blunt, I'm just fed up with it all.

    Message was edited by: kyleh0000
  • pcolvin15 Level 1 (90 points)
    When you say "simply bound", do you mean it's set as a replica, or set to connect to a directory server?

    I've found a number of threads that say the AFP problem has existed since the 10.5.x days, and some entries note that the problem has only occurred where there is one OD server and a number of AFP servers set up as "connected to a directory server". When they ran the extra servers as replicas the problem went away.

    I run two Mac minis connected as master/replica and load balanced as to services. My replica is iSCSI connected to a Drobo Pro providing file services in both AFP and SMB modes, and I have not had the issue you've run into. Have you tried setting the second server up as an OD replica?

    As to the network connections working, if your network is taken out because certain ports are swamped you can get issues similar to yours where some ports are working and others are not.

    Using SA as the definitive network analyzer also is not a good way to go, especially if you think you are having a network issue on the wire side of the NIC, as it only monitors at the inside of the port, post driver. PerfMon, in the Windows world, has the same limitations. Neither one, for example, can tell you if your system is locking up because you have a bad NIC suffering from excessive retries, which will be cleared when you reboot and then start up again sometime later. You may want to secure a sniffer and monitor your network activity to make sure that you don't have something else helping to kick off the server problem.
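
A cheap first check before bringing in a sniffer is to look at the error counters the NIC driver itself reports. Below is a minimal Python sketch that reads the text printed by `netstat -i` and pulls out the input errors, output errors and collisions per interface. It assumes the BSD-style column layout (Name, Mtu, Network, Address, Ipkts, Ierrs, Opkts, Oerrs, Coll), and the sample text is made up for illustration, not taken from any real server. It only shows what the host sees, so it is not a substitute for watching the wire.

```python
def interface_errors(netstat_output):
    """Return {interface: (input_errors, output_errors, collisions)} from `netstat -i` text."""
    results = {}
    for line in netstat_output.splitlines()[1:]:   # skip the column header row
        parts = line.split()
        # Only the link-level row (<Link#N>) carries the hardware counters we want.
        if len(parts) < 8 or not parts[2].startswith("<Link#"):
            continue
        # The Address column may be blank, so count fields from the right:
        # ... Ipkts Ierrs Opkts Oerrs Coll
        ierrs, oerrs, coll = parts[-4], parts[-2], parts[-1]
        if ierrs.isdigit() and oerrs.isdigit() and coll.isdigit():
            results[parts[0]] = (int(ierrs), int(oerrs), int(coll))
    return results

# Made-up sample in the assumed layout; real input would come from running `netstat -i`.
sample_output = """Name  Mtu   Network       Address            Ipkts Ierrs    Opkts Oerrs  Coll
lo0   16384 <Link#1>                           1200     0     1200     0     0
en0   1500  <Link#4>    00:11:22:33:44:55    500000    37   400000     2     0
"""
for name, (ierrs, oerrs, coll) in interface_errors(sample_output).items():
    print(name, "input errors:", ierrs, "output errors:", oerrs, "collisions:", coll)
```

Comparing the counters taken a few minutes apart during a slowdown would show whether the errors are climbing, which is the kind of thing Server Admin will not tell you.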
  • kyleh0000 Level 1 (0 points)
    The server is NOT a replica, it's just "connected to a directory server". This is due to the fact that at the start of the year we found the exact opposite issue: when the server was set up as a directory server and we had AFP running on either the master or the replica, the server CPU would lock at 100% and we would have to hard reboot the server. As soon as we removed the replica the problem was resolved. There are also several threads in the forums that show others with a similar situation.

    As for the network traffic, we have people that monitor the traffic, and I know they keep an eye on things, as recently we had a virus outbreak that would flood the network and they were the ones that picked it up. However, this was a PC virus and the IP addresses that we got were not from any Macs. When they told me this I thought maybe this was the cause, but the virus has been removed and the server is still having issues. I'm not familiar with the tools they use, but they seem to be pretty good at what they do so I have never asked.

    If you don't mind, I would be interested to know a little bit more about your setup: things like server OS version, average network traffic, number of clients, etc. Here is a bit more information about our setup.

    File Server: OSX 10.6.2, 12GB RAM, 9TB RAID (5+1), 2 x Quad Core Xeon @ 3.06GHz, 2 x 1Gb eth connected as a bond
    OD Master: OSX 10.6.4, 10GB RAM, 2 x Quad Core Xeon @ 2.66GHz, 1 x Gb eth

    Clients: 200 x Gig to desktop
    OD Users: 400
    (both of these numbers are rough, don't have the doco with me atm)

    Current network traffic:

    Max: 250Mb/s
    Average: 35-40 Mb/s

    Our users log in at around 8am and log out at 10pm.

    Previously, when the server seemed stable, all of the above was the same except that both servers were on OSX 10.6.2, users had 100Mb/s to the desk and the file server only had one Gb Ethernet port.

    The old network traffic was...

    Max: 60Mb/s
    Average: 25-30Mb/s

    Keep in mind the MAX values were achieved during our test period while we were imaging machines. The test period for the above changes was around 2 months, and during this time we were imaging around 40-50 machines a day, sometimes more than once, specifically for load testing, with no problems at all.

    From the timing of the crashes, it was looking like it would crash between around 1:30 and 3:00. This is when afternoon classes start to crank up, along with morning classes returning from break, so basically peak time. We had 4 crashes in a row around this time, but last week we had 2 crashes on the same day, on a day when only limited classes were running. So I'm struggling to build up a consistent pattern of faults.
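
One way to look for a timing pattern without relying on memory is to tally the "Date/Time" lines from the crash reports by hour of day. Here is a minimal Python sketch of that idea. It assumes the timestamps are in the same format as the crash log quoted at the top of the thread, and the function name is made up for illustration.

```python
from collections import Counter
from datetime import datetime

def crashes_by_hour(timestamps):
    """Count crashes per hour of day from crash report 'Date/Time' values."""
    hours = Counter()
    for stamp in timestamps:
        # Keep only "YYYY-MM-DD HH:MM:SS"; drop the fractional seconds and UTC offset.
        when = datetime.strptime(stamp[:19], "%Y-%m-%d %H:%M:%S")
        hours[when.hour] += 1
    return hours

# The one timestamp quoted in this thread; the rest would come from the other crash reports.
known = ["2010-08-05 15:58:34.086 +0930"]
for hour, count in sorted(crashes_by_hour(known).items()):
    print(f"{hour:02d}:00  {count} crash(es)")
```

With all the reports fed in, a cluster in the early afternoon would support the peak-load theory, while an even spread would count against it.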

    Thanks again for your help
  • simoneita Level 1 (0 points)
    I have a similar problem:
    One XServe with AFP, VPN, Mail, Web, iCal.
    AFP crashes randomly (from 2-3 times _a day_ to once a week), with only 10 clients connected and a network throughput of less than 15MBps.

    When NetBoot and NFS are enabled the AFP service crashes up to 3 times in a day.
    Disabling Spotlight didn't help very much.

    Sometimes AFP crashes even if no client is connected (on Sundays...).

    We don't have to reboot the whole server; killing the service is enough (killall -9 AppleFileServer).
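
To find out that the service has gone away before the users do, a small check like the following could be run from another machine. This is only a sketch: it tests whether anything accepts a TCP connection on port 548 (the standard AFP over TCP port), and the hostname in the usage comment is hypothetical. A service that is hung but still listening may pass this check, so it catches a dead AppleFileServer rather than a slow one.

```python
import socket

def afp_port_open(host, port=548, timeout=5.0):
    """Return True if a TCP connection to the AFP port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# Example use (hypothetical hostname):
#   if not afp_port_open("fileserver.example.com"):
#       print("AFP is not answering on port 548")
```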

    I've opened a bug through the Apple Developer Center but still no answer...