Posts Tagged ‘service’

The Art of Scaling

Thursday, April 19th, 2012

Note: this is a purely anecdotal posting about our struggles with some performance bottlenecks in the last few months. If you’re not interested in such background information, just skip it.

You might have noticed that since about January 2012 using our file and mail servers hasn’t been as smooth as usual. This posting will give you some background information concerning the challenges we encountered and why it took so long to fix them. Let’s begin with the file server.

Way back in the day (i.e. 5 years ago), when the total file server data volume at D-PHYS was about 10 TB, we used individual file servers to store this data. When one server was full, we got a bigger one, copied all the data and life was good for another year or two. Today, the file server data volume (home and group shares) is above 150 TB and growing fast, and this strategy no longer works: individual servers don’t scale, and copying this amount of data alone takes weeks. That’s why in 2009 we started migrating the ‘many individual servers’ setup to a SAN architecture in which the file servers are just huge hard drives (iSCSI over Infiniband, for the technically inclined) connected to a frontend server that manages space allocation and the file system. The same is true for the backup infrastructure, where the data volume is even bigger.
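To give a feeling for why copying this amount of data takes weeks, here is a rough back-of-the-envelope sketch. The sustained throughput of 100 MB/s is purely an assumption for illustration, not a measured figure from our servers, and the function name is made up for this example.

```python
def copy_time_days(volume_tb: float, throughput_mb_per_s: float) -> float:
    """Return the days needed to copy volume_tb terabytes at a sustained rate."""
    volume_bytes = volume_tb * 1e12            # decimal terabytes
    rate_bytes = throughput_mb_per_s * 1e6     # decimal megabytes per second
    return volume_bytes / rate_bytes / 86_400  # 86,400 seconds per day


# 100 MB/s is an assumed sustained rate, not a measured D-PHYS figure.
ASSUMED_RATE_MB_PER_S = 100

for volume_tb in (10, 150):
    days = copy_time_days(volume_tb, ASSUMED_RATE_MB_PER_S)
    print(f"{volume_tb} TB: {days:.1f} days")
```

Under this assumption, 10 TB copies in a bit more than a day, while 150 TB needs roughly 17 days of uninterrupted transfer, before any verification or retries.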

This new setup had to be developed, tested and put in place as seamlessly and unobtrusively as possible while ensuring data access at all times (apart from single hour-long migrations). The SAN architecture was implemented for Astro in December 2010 and has been running beautifully ever since. In 2011 we laid the groundwork to adopt this system for the rest of D-PHYS’s home and group shares and after a long and thorough testing period the rollout happened on January 5, 2012. Unfortunately, that’s when things got ugly.

At first, we noticed some exotic file access problems on 32-bit workstations. It took us some time to understand that the underlying issue was an incompatibility with the new filesystem, which uses 64-bit addresses for the data blocks. As a consequence, we had to replace the filesystem of the home shares. Independently, we ran into serious I/O issues with the installed operating system, so we had to upgrade the kernel of the frontend server and move the home directories onto a dedicated server. In parallel, we had to incorporate some huge chunks of group data while always making sure that nightly backups were available. All this necessitated a few more migrations until we finally achieved a stable system on March 28.
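For the technically inclined, the general mechanism behind this kind of 32-bit/64-bit mismatch can be sketched in a few lines. This is only an illustration under the assumption that some piece of software keeps a 64-bit value in a 32-bit unsigned field; it does not reproduce the exact failure we saw, and all names and values below are hypothetical.

```python
MAX_UINT32 = 2**32 - 1  # largest value a 32-bit unsigned field can hold


def fits_in_32_bits(value: int) -> bool:
    """Return True if value can be stored in a 32-bit unsigned field."""
    return 0 <= value <= MAX_UINT32


def truncate_to_32_bits(value: int) -> int:
    """Keep only the low 32 bits, as a too-small field effectively would."""
    return value & MAX_UINT32


# Hypothetical addresses: one below and one above the 32-bit limit.
small_address = 123_456
large_address = 2**32 + 123_456

print(fits_in_32_bits(small_address))
print(fits_in_32_bits(large_address))
# After truncation the large address is indistinguishable from the small one.
print(truncate_to_32_bits(large_address) == small_address)
```

The point of the sketch is that nothing goes wrong as long as all values stay small, which is one way such a problem can stay hidden during testing.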

The upshot: what we had hoped to be a fast and easy migration turned out to cause a lot of problems and take much longer than anticipated, but now we have a stable and solid setup that will scale up to hundreds or even thousands of TB of data.
See live volume management and usage graphs for our file servers.

As for the mail server, matters are to some extent related and partly just coincidental in time. The IMAP server does need access to the home directories and hence also suffered when their performance was impaired. But even after having solved the file server issues, we still saw isolated load peaks on the IMAP server that prevented our users from working with their email. Again, we put a lot of time and effort into finding the reason. As of April 13, we’re back to good performance and have arrived at the following set of conclusions:

Particular issues:

  • a covertly faulty hard disk in the mail server RAID seems to have impaired performance
  • CPU load of the individual virtual machines on the mail server was not distributed across the available CPU cores in an optimal way

General mail server load:

  • while incoming mail volume hasn’t increased much, outgoing mail volume has grown by 50% in the last year alone
  • more and more sophisticated spam requires more thorough virus and spam scanning, increasing the load on the mail server
  • our users have amassed 1.1 TB of mail storage (up from 400 GB in January 2010), which needs to be accessed and organized
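The mail storage figures above can be turned into an approximate yearly growth rate. The sketch below assumes that the 1.1 TB figure refers to April 2012 (27 months after January 2010) and that growth was steady over that period; both are assumptions, and the function name is made up for this example.

```python
def annual_growth_factor(start_gb: float, end_gb: float, months: int) -> float:
    """Return the constant yearly growth factor that turns start_gb into end_gb."""
    total_factor = end_gb / start_gb
    return total_factor ** (12 / months)


# 400 GB in January 2010, 1.1 TB (1100 GB) assumed to be the April 2012 value.
factor = annual_growth_factor(start_gb=400, end_gb=1100, months=27)
print(f"mail storage grows by about {(factor - 1) * 100:.0f}% per year")
```

Under these assumptions the mail store grows by more than half every year.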

Bottom line:

We’d like to thank you for your patience during the last 4 months and apologize for any inconvenience you might have had to endure. In all likelihood the systems will be a lot more stable in the future, but of course we’re constantly working to ensure the D-PHYS IT infrastructure is able to keep up with the fast-growing demand for disk space (the data volume has tripled in the last year alone). We’ve learned a lot and we’ll put it to good use.

Emergency file server migration

Thursday, January 12th, 2012

On Jan 5, after weeks of thorough planning and rigorous testing, we performed a migration of the home directories and group shares to our new SAN system. Soon afterwards, the first phone calls started coming in. The initial problem was very exotic and affected very few people (that’s why we had no chance to detect it during the testing period), but the action we took to address it unfortunately caused a cascade of consecutive faults that led to the instabilities you had to endure for one week now and for which we are truly sorry. We now know how to fix the underlying problem, but we cannot operate on the running server. That’s why we have to schedule an

emergency file server migration on Sat, Jan 14, starting at 07:00 and probably lasting well into the afternoon.

During this time, you will not have access to your home or group directories, and also email will only work intermittently. Please stop all running jobs and log out before Saturday morning.

We apologize for the suboptimal performance since Jan 5. You have every right to expect better, but this caught us completely off guard. Thank you for your understanding.

Update, Sat 14:15: mounts and email are up and running again. The problem on 32-bit machines still persists, but we have an idea how to fix it on Monday.

Update Fri 20.01: we (hence you) are still suffering from severe stability problems on the file server. We are very hard at work and now have a plan that we really, really hope will solve the problems. There will be another migration sometime next week. We’re truly sorry for the inconvenience you have to endure.

Printer Statistics

Wednesday, April 27th, 2011

We have developed real-time monitoring of the number of pages printed in the physics department. We keep track of the printed pages on each printer and the total for our two servers (cups and winprinter). The results are plotted on our printer homepage printer.ethz.ch. Clicking on the upper graph shows you more details, as well as the statistics over the last 30 days. Moreover, you can click on the little “Stats” icon next to each printer to see its individual statistics.

These statistics not only help us to monitor and debug our printers, but should also make everyone aware of the large number of pages printed at D-PHYS: more than 5000 pages every single workday.

Please be eco-friendly and print only when you need to.

New print server

Friday, April 1st, 2011

We know printing is not always fun, and we hear your complaints. During the past few months we have been busy setting up a new print server that should solve most of today’s problems (hopefully). We will deploy the new system on

Monday, April 4, at 07:30

The only change for you should be faster and more reliable printing. If you do experience any problems when printing, please let us know immediately so we can fix it.

Update 08:45: DNS confusion about old/new print server; we’ll have to wait for the DNS cache to flush

Update April 5, 16:10: migration is done

Give us your feedback

Tuesday, March 8th, 2011


Time for our biennial customer satisfaction poll!

We are service providers, and you are our customers. In order to be able to offer the best IT service possible, it is important for us to know what your needs and requirements are.

We therefore invite you to take this opportunity to tell us what we’re good at, but especially what we still have to improve to best meet your IT needs. You can use the paper ballots outside of HPT D 19, write an e-mail to herzog@phys.ethz.ch, or use the form below to submit your feedback, and you can either participate anonymously or provide your name if you want us to get back to you.

We explicitly invite you, our D-PHYS students, to tell us your opinion! Do we offer what you need in your daily student life? Please let us know.

We also plan to organize a tour of our HIT server room. So if you’d like to see where your files and e-mails spend their days, please let us know too.

Thanks in advance for sharing your ISG experience with us. We’ll pay heed to what you say!

Tell us what you think!

Friday, February 27th, 2009


We are service providers, and you are our customers. In order to be able to offer the best IT service possible, it is important for us to know what your needs and requirements are.

We therefore invite you to take this opportunity to tell us what we’re good at, but especially what we still have to improve to best meet your IT needs. You can use the paper ballots outside of HPT D 19 or write an e-mail to isg@phys.ethz.ch, and you can either participate anonymously or provide your name if you want us to get back to you.

We explicitly invite you, our D-PHYS students, to tell us your opinion! Do we offer what you need in your daily student life? Please let us know.

Thanks in advance for sharing your ISG experience with us. We’ll pay heed to what you say!

Use your D-PHYS account for OpenID authentication

Friday, November 7th, 2008

OpenID is a distributed authentication infrastructure which allows you to use your preferred authentication provider with a vast number of OpenID-enabled services (e.g. many WordPress-powered blogs like our completely renewed nic.phys.ethz.ch, Plaxo social networking, Identi.ca microblogging, SourceForge open source project hosting, etc.) without the need to give them your password or remember the password you set there.

Now you can also use your D-PHYS account on all these sites with our new OpenID authentication provider. To log in on these sites, use the URL https://openid.phys.ethz.ch/<YOUR D-PHYS USERNAME> as your OpenID.
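As a small illustration of the URL pattern, the following sketch builds the OpenID from a username. The helper name and the example username "jdoe" are made up for this example; only the base URL comes from the text above, and the percent-encoding of unusual characters is a precaution we assume here rather than a documented requirement.

```python
from urllib.parse import quote

# Base URL as given in this post; the username is appended to it.
OPENID_BASE_URL = "https://openid.phys.ethz.ch/"


def openid_url(username: str) -> str:
    """Build the OpenID identifier for a D-PHYS username (hypothetical helper)."""
    username = username.strip()
    if not username:
        raise ValueError("username must not be empty")
    # Percent-encode anything that is not safe inside a URL path segment.
    return OPENID_BASE_URL + quote(username, safe="")


# "jdoe" is a made-up example username.
print(openid_url("jdoe"))
```

The printed URL is what you would enter in the OpenID field of the site you want to log in to.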
