I’m taking the DG course this semester, and most of the homework involves programming the scheme discussed that week. For the first few assignments, my approach was to handle each element inside a for loop. It made for very readable code, but it was incredibly slow: at every time step, I looped over each element individually and applied the appropriate scheme. That breaks a lot of the rules of fast computing, especially the one about avoiding explicit for loops whenever possible.
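For a sense of what that looked like, here is a minimal sketch of the per-element update in Python/NumPy. The names (`Minv`, `S`, `numerical_flux`) and the forward-Euler step are stand-ins of my own, not the actual course code.

```python
import numpy as np

def rhs_elementwise(u, Minv, S, numerical_flux):
    """Assemble the DG right-hand side one element at a time.

    u has shape (Np, K): Np nodes per element, K elements.
    Minv and S are the (Np, Np) local inverse mass and stiffness matrices.
    numerical_flux(u, k) is assumed to return the (Np,) surface contribution
    of element k, which couples it to its neighbours.
    """
    rhs = np.empty_like(u)
    for k in range(u.shape[1]):                 # explicit loop over elements
        rhs[:, k] = Minv @ (S @ u[:, k] + numerical_flux(u, k))
    return rhs

def step_elementwise(u, dt, Minv, S, numerical_flux):
    # One forward-Euler step; a Runge-Kutta scheme would call
    # rhs_elementwise several times per time step.
    return u + dt * rhs_elementwise(u, Minv, S, numerical_flux)
```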
This worked fine for a conservation law, where the CFL condition keeps the time step reasonable. When we moved to a diffusive equation, the number of time steps effectively had to be squared, and the run-time became unacceptable. My first move was to try to optimize what I already had; I think my total speedup was roughly 20%… not good.
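For the curious, the scaling behind that jump in step count is roughly the following (the constants depend on the scheme and the polynomial order, so treat this purely as a sketch):

```latex
\[
  \Delta t_{\text{advection}} \;\lesssim\; C_1 \frac{h}{|a|},
  \qquad
  \Delta t_{\text{diffusion}} \;\lesssim\; C_2 \frac{h^{2}}{\nu},
\]
```

so for a fixed final time the number of steps grows like 1/h in the advective case but like 1/h² in the diffusive one, which is where the squaring comes from.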
Finally, I bit the bullet and redesigned the whole code. I relied on matrix multiplication to construct the right-hand side, which was where most of the time went, since it has to be rebuilt at every time step. The program is far less readable, but the speedup was a factor of about 100.
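A sketch of the same update after that kind of restructuring, again with hypothetical names: the whole element loop collapses into one matrix product, at the cost of keeping the solution in one big array and building all the surface terms at once.

```python
import numpy as np

def rhs_vectorized(U, Minv, S, surface_terms):
    """Assemble the right-hand side for all elements at once.

    U has shape (Np, K), one column per element, so the volume term for
    every element is a single (Np, Np) @ (Np, K) product instead of K
    small products inside a Python-level loop.  surface_terms(U) is assumed
    to return the (Np, K) array of face/flux contributions, e.g. built by
    gathering face values with precomputed index arrays.
    """
    return Minv @ (S @ U + surface_terms(U))

def step_vectorized(U, dt, Minv, S, surface_terms):
    # Same forward-Euler step as before, now with one matrix product
    # doing the work the element loop used to do.
    return U + dt * rhs_vectorized(U, Minv, S, surface_terms)
```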
Moral? Chase the 80% that comes from restructuring, not the 20% that comes from trivial optimization.