Start the work towards XenForo 2.x support, updating the PageNav. Gonna need some more updates as well...
Added tag XenForo 1.x for changeset 9a0785d9c7dd
If there's only one page in a forum, don't error out.
Add a DownloadResult object to standardize logging of download success, or lack thereof. Make sure all download attempts return such an object so they all get logged, and log the original HREF, not the updated-to-a-local-version one. Still have some architectural improvements to make, and added a couple of TODO items, but it's a step in the right direction.
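A minimal sketch of what such a result object might look like; the class and field names here are my own guesses, not necessarily the ones in the codebase. The key points from the message are that every attempt produces one, and that it carries the original HREF rather than the localized path.

```java
// Hypothetical sketch of the DownloadResult described above.
final class DownloadResult {
    private final String originalHref; // the HREF as it appeared in the page, pre-localization
    private final int statusCode;      // HTTP status, or -1 if unknown
    private final boolean success;

    DownloadResult(String originalHref, int statusCode, boolean success) {
        this.originalHref = originalHref;
        this.statusCode = statusCode;
        this.success = success;
    }

    static DownloadResult success(String originalHref, int statusCode) {
        return new DownloadResult(originalHref, statusCode, true);
    }

    static DownloadResult failure(String originalHref, int statusCode) {
        return new DownloadResult(originalHref, statusCode, false);
    }

    String originalHref() { return originalHref; }
    int statusCode()      { return statusCode; }
    boolean isSuccess()   { return success; }
}
```

Making every download path return one of these (instead of, say, null or a bare boolean) is what guarantees the logging can't silently miss attempts.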
Fix forum-default avatars not being downloaded due to having a relative URL.
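The fix presumably resolves the relative avatar path against the URL of the page it appeared on; `java.net.URL`'s two-argument constructor does exactly that kind of resolution. The URLs below are illustrative.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlResolver {
    // Resolve a possibly-relative href (like a forum-default avatar path)
    // against the URL of the page it appeared on.
    static String resolve(String pageUrl, String href) throws MalformedURLException {
        return new URL(new URL(pageUrl), href).toExternalForm();
    }
}
```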
Fix the semi-constants not being initialized properly by the static initializer. Makes more sense for them to be initialized here anyway.
Fix the thread mode not setting the thread URL, and fix adding links to the DB failing to add the rest if one fails. Also fix successful links only being logged if they redirected at least once... that explains why there were so few of them...
Fix the divide by zero error that caused the first thread to fail to fully download.
Add file type detection. Not perfect; sometimes it registers a website ending (.com, .uk/, etc.) as a file extension. But those are easy enough to filter out as a human reading the database.
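The commit leaves filtering the TLD false-positives to a human, but for illustration, a crude extension guesser with a small TLD deny-list might look like this (the set of TLDs and all names are my own, not from the codebase):

```java
import java.util.Set;

final class FileTypeGuesser {
    // TLDs that commonly masquerade as "extensions"; illustrative, not exhaustive.
    private static final Set<String> TLD_LIKE = Set.of("com", "net", "org", "uk", "de");

    // Return the apparent file extension of a URL's last path segment,
    // or "" if there is none or it looks like a TLD rather than a file type.
    static String extensionOf(String url) {
        String path = url;
        int q = path.indexOf('?');
        if (q >= 0) path = path.substring(0, q);       // drop the query string
        while (path.endsWith("/")) path = path.substring(0, path.length() - 1);
        int slash = path.lastIndexOf('/');
        String last = slash >= 0 ? path.substring(slash + 1) : path;
        int dot = last.lastIndexOf('.');
        if (dot < 0 || dot == last.length() - 1) return "";
        String ext = last.substring(dot + 1).toLowerCase();
        return TLD_LIKE.contains(ext) ? "" : ext;
    }
}
```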
Add the StatusCode to the DB log. Start changing the nomenclature from "BrokenLink" to just "Link" or "ProcessedLink".
Include working links in the DB logging. This should also include the response code (so we can filter failed links by reason) and the file extension, when it can be determined. When examining manually and looking at text logs, it appears we have a fair number of false-failure results, e.g. a 429 for too many requests, or, in one case, a -1 somehow for a URL that worked manually.
Update the DDL to have the IsWorking column.
Derive the forum from the URL, rather than hard-coding it.
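One plausible reading of this change is that the forum identifier is simply taken from the URL's host; a sketch under that assumption (the class name and URL are mine):

```java
import java.net.URI;

final class ForumDeriver {
    // Derive a forum identifier from any thread/forum URL instead of
    // hard-coding it; here we just use the host component.
    static String forumOf(String url) {
        return URI.create(url).getHost();
    }
}
```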
Remove maxWaitTime logging since it's obsolete and has been for a day and a half, maybe even two.
Validation mode will now check both poorly sanitized and well-sanitized versions of the HTML file name. Unfortunately, I've realized that it's not all that rare for threads to have duplicate names. Which throws off our validation; it sees that the reply count is wrong, and re-downloads the duplicate. And then the original. In at least one case, there's even a third... last one wins. Ugh. Definitely need the slug in the folder name. At least I learned that you can label and break out of a try loop in Java...
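On the labeled-try trick: Java lets you attach a label to any statement, including a try, and `break label;` jumps past it while still running the finally block. A small self-contained illustration (names and logic are mine, not the validator's):

```java
final class LabeledBreakDemo {
    // Demonstrates breaking out of a labeled try block early.
    static String firstMatch(String[] candidates, String wanted) {
        String found = "none";
        search:
        try {
            for (String c : candidates) {
                if (c.equals(wanted)) {
                    found = c;
                    break search; // skips the rest of the try block
                }
            }
            found = "fell through";
        } finally {
            // the finally block still runs even when "break search" fires
        }
        return found;
    }
}
```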
Add a "validation mode" that will check if there are the proper # of replies and download them if not. Also downloads missing threads.
Add a "dots mode" to fill in any previously skipped threads whose names end with dots. Going by the Asus forum, this is ballpark 2% of threads.
Remove trailing periods from thread names so it doesn't fail and skip them when archiving.
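The stripping itself is a one-liner; a sketch (method and class names are mine):

```java
final class ThreadNameSanitizer {
    // File names ending in dots break archiving on some file systems
    // (Windows in particular silently strips them), so remove any
    // trailing periods from the thread name before using it as a path.
    static String sanitize(String threadName) {
        return threadName.replaceAll("\\.+$", "");
    }
}
```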
Solve one of the remaining thread-hang causes: set both connect *and* read timeouts when getting HttpURLConnections. I hadn't realized read timeouts were a separate thing, but on some URLs, such as http://www.dhl-usa.com/images/truck.gif, reads do stall, and by default the connection will wait forever. I'm not sure this will solve all remaining thread-hangs, but it'll fix some of them.
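For reference, `HttpURLConnection` really does treat these as two independent settings, and both default to 0 (infinite). The timeout values below are illustrative, not the ones the project uses:

```java
import java.net.HttpURLConnection;
import java.net.URL;

final class Connections {
    // Open a connection with BOTH timeouts set; setReadTimeout is the one
    // that prevents the wait-forever hangs once the connection is established.
    static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(15_000); // ms allowed to establish the connection
        conn.setReadTimeout(15_000);    // ms allowed between reads of the response
        return conn;
    }
}
```

Note that `openConnection()` doesn't touch the network, so the timeouts can be set before any I/O happens.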
Revamp the timeout/dead-thread settings so threads only count as dead if they haven't made any progress in 120 seconds (with a warning at 60 seconds). This has two benefits:
- Threads that are just slow due to connection timeouts or lots of images won't die.
- For threads that are really dead, we'll stop after 2 minutes instead of potentially much, much longer if it's a gigantic thread.
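The progress-based liveness check described above can be sketched roughly as follows; the thresholds match the message, but the class, method names, and injectable clock are my own invention for illustration and testability:

```java
import java.util.function.LongSupplier;

final class ProgressWatchdog {
    static final long WARN_MS = 60_000;  // warn after 60s without progress
    static final long DEAD_MS = 120_000; // declare dead after 120s without progress

    private final LongSupplier clock; // millisecond clock, injectable for testing
    private long lastProgressAt;

    ProgressWatchdog(LongSupplier clock) {
        this.clock = clock;
        this.lastProgressAt = clock.getAsLong();
    }

    // Call whenever the worker thread makes any progress (e.g. finishes a download).
    void recordProgress() {
        lastProgressAt = clock.getAsLong();
    }

    // OK / WARN / DEAD based on time since last progress, not total runtime,
    // so slow-but-alive threads on gigantic pages are never killed.
    String status() {
        long idle = clock.getAsLong() - lastProgressAt;
        if (idle >= DEAD_MS) return "DEAD";
        if (idle >= WARN_MS) return "WARN";
        return "OK";
    }
}
```

Keying the check off "time since last progress" rather than "total elapsed time" is what gives both benefits at once: busy threads keep resetting the clock, and truly stuck threads hit the 2-minute cap regardless of how big the thread is.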