Published on March 04, 2017

I recently had a customer on the line who was complaining about a script that was running too slowly. It was an import that generally takes several hours.

The source code doesn't come from me and the script accesses multiple databases on different servers.

The usual performance analysis tools were of no use in this case: For at least one of the databases, we don't have sufficient rights, e.g. to switch on and monitor the MySQL Slow Query log.

After taking a look at the source code, we find that the script sends out a great deal of queries and also copies and processes a lot of files. So first, we need to find out what is actually taking so long. Is it really the database? Or at the end the file transfer?

Wrong tracks

The first instinct in such cases is often to check the SQL queries for inefficient patterns and to check the databases for missing indexes. Some of the queries did indeed have such patterns - for example, the following condition seems suspicious right off the bat:

    WHERE `p`.`product_ean` LIKE '%123456789'

As we learn pretty early in school, such queries cannot be covered by an index and are therefore always tied to a table scan. This is fine when there isn't much data, but with larger amounts of data, it's guaranteed to cause problems. But: Which of these queries in the code is actually problematic? And: How could we correct the queries? The above-mentioned case indicates an EAN, and we could make the assumption that there are preceding zeros here that have to be covered. But assumptions can lead to wrong assessments, so we carry on with our fact-checking...

Setting goals

The important thing is not to lose sight of our goal. Which of these inefficient queries is actually problematic? And which of them can we ignore? And how often are all these queries performed? I have long been a fan of the 'strace' tool. 'strace' lays out all the system calls of a program. That way, you can see what effects a program really has on its environment. Since our script presumably runs slowly primarily due to interactions with its peripheral systems, I think this is the right tool for the job.

Debugging

So I launch strace and use the parameter -p $PID to attach it to the already running process. ```bash

strace -p 933950

Process 933950 attached write(1, "Produkt wurde erfolgreich aktual"..., 133) = 133 sendto(5, "33443ntttSELECTnttttentryID AS I"..., 1248, MSG_DONTWAIT, NULL, 0) = 1248 poll([{fd=5, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=5, revents=POLLIN}]) recvfrom(5, "1130+23def6mmdb151t5Entry2ID7"..., 1139, MSG_DONTWAIT, NULL, NULL) = 1139 poll([{fd=5, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=5, revents=POLLIN}]) recvfrom(5, "313def6mmdb151t5Entry10statusID10st"..., 1519, MSG_DONTWAIT, NULL, NULL) = 69 sendto(5, "3043nttttSELECTntttttEAN,nttttt"..., 200, MSG_DONTWAIT, NULL, 0) = 200 poll([{fd=5, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=5, revents=POLLIN}]) recvfrom(5, "1110,23def6mmdb155Track5Track"..., 1450, MSG_DONTWAIT, NULL, NULL) = 443


 Okay, I see a lot of interactions here, but the output stops here and there for a fraction of a second. What's going on here exactly? I'm pretty sure the script is waiting for an answer from a database. Can we check where exactly the script is getting stuck? Turns out we can: `strace` has an option `-c` which, rather than generating an output "live", outputs a statistic at the end (after you stop strace or after the target process is finished). Sounds good. Let's put it to the test:


```bash
# strace -p 933950 -c
Process 933950 attached
^CProcess 933950 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.43    0.000970          10        99           poll
 45.14    0.000943         236         4           stat
  3.49    0.000073           5        16           lstat
  3.45    0.000072           1        71           sendto
  0.91    0.000019           0       100           recvfrom
  0.19    0.000004           0        48           write
  0.19    0.000004           0        16           open
  0.05    0.000001           0        22           lseek
  0.05    0.000001           0         6           mmap

And now we spot the first suspect: poll is accessed quite frequently and takes up most of the time! stat, on the other hand, is only accessed 4 times, so we can forget about it for now. poll is used in order to wait for answers from the network. A program needs this system call in order to sleep until data is supplied again by a destination device. But what data? And how do we use this information to find potential solutions to our actual problem? If we take another look at the output above, we see a pattern of system calls: There is always a sendto, followed by poll and finally recvfrom. As the names indicate, these are system calls that are responsible for sending and receiving data via network.

So now we can see precisely how long it takes between a sendto and recvfrom pair. Maybe that will give us a clue! But what's the easiest way to go about it?

Manpage to the rescue!

After reading through the strace man page a bit, I find an option that seems very useful: > -r Print a relative timestamp upon entry to each system call. This records the time difference between the beginning of successive system calls.

This option lets me see how much time has elapsed since the last system call. I think for a bit, and then it hits me: What will happen if I use strace to only display the two system calls sendto and recvfrom and filter out the rest? Let's see: ```bash strace -p 933950 -r -e recvfrom,sendto Process 933950 attached

0.000169 sendto(4, "v13ntttSELECTnttttimage_name"..., 271, MSG_DONTWAIT, NULL, 0) = 271 1.036864 recvfrom(4, "111^23def21sometable1"..., 131, MSG_DONTWAIT, NULL, NULL) = 131 0.000088 recvfrom(4, "0.jpg55376"", 173, MSG_DONTWAIT, NULL, NULL) = 14


 Perfect! And we've already nabbed a suspect! Here we have a query (we see a `SELECT`) and some sort of table name in the output (`recvfrom` row). Interesting. But maybe it was just a statical outlier. Because the other calls need a lot less and I also can't see the entire query. 
## You're surrounded!

 So we know what we need to look for. Now, the trap just needs to be set up correctly, and then we'll see what happens. My plan goes like this: We let `strace` write (option `-o`) in a file, then sort the call-ups by time and should then be able to find the slowest queries, statically speaking.


And while we're at it, we'd also like to see which queries are responsible. The output above shows us that `strace` significantly shortens long string arguments. In the man page, we find the parameter `-s`, which can be used to indicate how much to shorten arguments by. We want the entire data if possible; so I enter an extremely long parameter of 10000 characters. So:


```bash
# strace -p 933950 -r -e sendto,recvfrom -s 10000 -o import_db_stats.txt
Process 933950 attached

We let the trace run for a bit and then end it with Ctrl-C. Then we sort the record numerically and look at the last few rows: ```bash

sort -n import_db_stats.txt | tail -n5

 0.749176 recvfrom(4, "111^23def21sometable1i"product_imagesnimage_namenimage_namef!37523751P53376"244231234567890_1.jpg55376"", 159, MSG_DONTWAIT, NULL, NULL) = 145
 0.769669 recvfrom(4, "111^23def21sometable1i"product_imagesnimage_namenimage_namef!37523751P53376"244231234567890_0.jpg55376"", 172, MSG_DONTWAIT, NULL, NULL) = 145
 0.803661 recvfrom(4, "111^23def21sometable1i"product_imagesn", 69, MSG_DONTWAIT, NULL, NULL) = 69
 0.814297 recvfrom(4, "111^23def21me", 16, MSG_DONTWAIT, NULL, NULL) = 16
 0.947148 recvfrom(4, "111^23def21sometable1i"product_imagesnimage_namenimage_namef!37523751P", 102, MSG_DONTWAIT, NULL, NULL) = 102


 Oh, lovely! So we have the five slowest calls (respectively those with the longest waiting time previously), all are `recvfrom` and from the same database (the first parameter corresponds to the file handle with respect to the socket). Bonus: Four of these calls contain data from the same table! Jackpot! Hang on a second - but, now, how do I see the queries that are slow? The `recvfrom` only pertain to the read-out of the results. We remember - the query is usually sent immediately prior with `sendto`. So all we need to do is search through the (unsorted!) output for the corresponding rows and find the predecessor.


I think that the relative timestamps over the short period of time are precise enough to be used as search parameters. So we pull them from the results above:


```bash
# sort -n import_db_stats.txt | tail -n5 | awk '{print $1}'
0.749176
0.769669
0.803661
0.814297
0.947148

Perfect! And with grep, we can now find their preceding rows. We do this with the help of parameter -B. Since we do not want the results, we filter them right back out again: ```bash

sort -n import_db_stats.txt | tail -n5 | awk '{print $1}' > slowest_times.txt

for time in $(cat slowest_times.txt); do grep -B1 $time import_db_stats.txt | grep -v $time done 0.000091 sendto(4, "v13ntttSELECTnttttimage_namentttFROMnttttproduct_images AS intttJOINnttttproducts AS pntttttUSING (product_id)ntttWHEREnttttimage_name LIKE '12345678896_1.jpg'nttttANDnttttp.product_ean LIKE '%12345678896'ntt", 271, MSG_DONTWAIT, NULL, 0) = 271 0.000222 sendto(4, "v13ntttSELECTnttttimage_namentttFROMnttttproduct_images AS intttJOINnttttproducts AS pntttttUSING (product_id)ntttWHEREnttttimage_name LIKE '1234567890134_0.jpg'nttttANDnttttp.product_ean LIKE '%1234567890134'ntt", 271, MSG_DONTWAIT, NULL, 0) = 271 0.000196 sendto(4, "v13ntttSELECTnttttimage_namentttFROMnttttproduct_images AS intttJOINnttttproducts AS pntttttUSING (product_id)ntttWHEREnttttimage_name LIKE '1234567890189_0.jpg'nttttANDnttttp.product_ean LIKE '%1234567890189'ntt", 271, MSG_DONTWAIT, NULL, 0) = 271 0.000089 sendto(4, "v13ntttSELECTnttttimage_namentttFROMnttttproduct_images AS intttJOINnttttproducts AS pntttttUSING (product_id)ntttWHEREnttttimage_name LIKE '1234567890134_1.jpg'nttttANDnttttp.product_ean LIKE '%1234567890134'ntt", 271, MSG_DONTWAIT, NULL, 0) = 271 0.000167 sendto(4, "v13ntttSELECTnttttimage_namentttFROMnttttproduct_images AS intttJOINnttttproducts AS pntttttUSING (product_id)ntttWHEREnttttimage_name LIKE '1234567890103_0.jpg'nttttANDnttttp.product_ean LIKE '%1234567890103'ntt", 271, MSG_DONTWAIT, NULL, 0) = 271


 Lovely. As we supposed back at the very beginning, it is one of the queries which makes a `LIKE` search with a wildcard start. So now we just have to search for the corresponding query in the code and think about how we can optimise it. But I don't need to explain that any more; today, it was all about searching for a problem in a different way.



## Conclusion

 So we've learned that Linux with `strace` makes for a wonderful tool, not just for finding general problems but also for performance analyses. The nice thing about it is that you can learn a lot about a program, especially if you don't have access to the source code. Until now, I had used `strace` primarily for troubleshooting slow programs or analysing precisely what a program does.


Here, then, is a summary of the most important parameters of `strace`:


- `-o` writes the output in a file
- `-r` shows the relative time since the last indicated (!) system call
- `-s` can be used to set the string-parameter shortening
- `-c` creates call profiling for the calls made
 
 `strace` can do a lot more, so it's worth taking a look at the man page.

News