Re: wget2 | Stack overflow downloading a deepy nested website (#659)
From: Andrew White (@awhite27)
Subject: Re: wget2 | Stack overflow downloading a deepy nested website (#659)
Date: Tue, 30 Apr 2024 10:39:01 +0000
Andrew White commented:
https://gitlab.com/gnuwget/wget2/-/issues/659#note_1887101433
This is the stack trace. wget2 was built from the current master with "-g -O0".
I've replaced any text identifying the website with `<removed>`. The function
calls between `#5` and `#9` keep repeating in the stack trace until I got bored
scrolling.
I originally downloaded the website with wget and was running wget2 with "-nc"
over it to download any new files when it crashed. Doing the same with wget
works fine.
The issue with this website is that the URLs are all CGI generated, and I
estimate, based on the values in the queries and the pages I have downloaded,
that the nesting is at least 500 levels deep. Probably the best way to
reproduce it is to write a simple CGI script that generates pages, each with a
link to a URL carrying an incrementing counter, e.g.:
`script.cgi?level=1` returns a page with a link to `script.cgi?level=2`.
`script.cgi?level=2` returns a page with a link to `script.cgi?level=3` etc.
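A minimal sketch of such a generator follows. It is not the real site's script, and I have not confirmed that it triggers the crash. It assumes a small Python standard-library HTTP server is an acceptable stand-in for an actual CGI script; `parse_level`, `render_page`, `DeepLinkHandler`, `MAX_LEVEL`, the file name and the port are all made up for illustration.

```python
#!/usr/bin/env python3
"""Stand-in for the CGI script described above: every page links one level deeper."""
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

MAX_LEVEL = 2000  # hypothetical cap so that a recursive crawl eventually ends


def parse_level(path):
    """Extract the integer `level` query parameter, defaulting to 1."""
    query = parse_qs(urlparse(path).query)
    try:
        return int(query.get("level", ["1"])[0])
    except ValueError:
        return 1


def render_page(level):
    """Return an HTML page for `level` that links to `level + 1`."""
    body = f"<html><body><p>level {level}</p>"
    if level < MAX_LEVEL:
        body += f'<a href="script.cgi?level={level + 1}">next</a>'
    return body + "</body></html>\n"


class DeepLinkHandler(BaseHTTPRequestHandler):
    """Serve the generated page for any GET request, whatever the path."""

    def do_GET(self):
        page = render_page(parse_level(self.path)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 deep_links.py 8080
    HTTPServer(("127.0.0.1", int(sys.argv[1])), DeepLinkHandler).serve_forever()
```

With the server running, the starting URL would be `http://127.0.0.1:8080/script.cgi?level=1`. Since the crash happened while running wget2 with "-nc" over an existing download, the pages presumably need to be on disk first (e.g. fetched with wget) before wget2 is run over them.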
```
(gdb) r
Starting program: /usr/local/bin/wget2 -r -l inf -nc -np -p --xattr -a wget.log
<removed>
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000555555564870 in write_out (default_fp=<error reading variable: Cannot
access memory at address 0x7fffff7fec38>,
data=<error reading variable: Cannot access memory at address
0x7fffff7fec30>,
len=<error reading variable: Cannot access memory at address
0x7fffff7fec28>,
with_timestamp=<error reading variable: Cannot access memory at address
0x7fffff7fec24>,
colorstring=<error reading variable: Cannot access memory at address
0x7fffff7fec18>,
color_id=<error reading variable: Cannot access memory at address
0x7fffff7fec20>) at log.c:62
62 {
(gdb) list
57 const char *data,
58 size_t len,
59 int with_timestamp,
60 const char *colorstring,
61 wget_console_color color_id)
62 {
63 FILE *fp;
64 int fd = -1;
65
66 if (!data || (ssize_t)len <= 0)
(gdb) bt
#0 0x0000555555564870 in write_out (default_fp=<error reading variable: Cannot
access memory at address 0x7fffff7fec38>,
data=<error reading variable: Cannot access memory at address
0x7fffff7fec30>,
len=<error reading variable: Cannot access memory at address
0x7fffff7fec28>,
with_timestamp=<error reading variable: Cannot access memory at address
0x7fffff7fec24>,
colorstring=<error reading variable: Cannot access memory at address
0x7fffff7fec18>,
color_id=<error reading variable: Cannot access memory at address
0x7fffff7fec20>) at log.c:62
#1 0x0000555555564c2e in write_info (fp=0x7ffff7e21760 <_IO_2_1_stdout_>,
data=0x7fffff7ffd70 "URI content encoding = 'utf-8' (set by server
response)\n", len=56) at log.c:153
#2 0x0000555555564d3e in write_info_stdout (data=0x7fffff7ffd70 "URI content
encoding = 'utf-8' (set by server response)\n", len=56) at log.c:184
#3 0x00007ffff7f5f904 in logger_vprintf_func (logger=0x7ffff7fbf860
<info_logger>, fmt=0x55555557f790 "URI content encoding = '%s' (%s)\n",
args=0x7fffff800da8) at logger.c:47
#4 0x00007ffff7f5f554 in wget_info_printf (fmt=0x55555557f790 "URI content
encoding = '%s' (%s)\n") at log.c:58
#5 0x000055555556e4fc in html_parse (job=0x0, level=0, fname=0x55556e3d2020
<removed>,
html=0x55556e3d2130 "<html> <removed>"..., html_len=34372,
encoding=0x55555557e99c "utf-8", base=0x55556e3cd2f0)
at wget.c:2660
#6 0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
fname=0x55556e3d2020 <removed>, encoding=0x55555557e99c "utf-8", base=0x55556e3cd2f0)
at wget.c:2755
#7 0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3d2020
<removed>,
encoding=0x55555557e99c "utf-8", mimetype=0x7fffff801410 "text/html",
base=0x55556e3cd2f0) at wget.c:558
#8 0x0000555555568e0b in queue_url_from_remote (job=0x0,
encoding=0x55555557e99c "utf-8",
url=0x7fffff801650 <removed>, flags=0, download_name=0x0) at wget.c:923
#9 0x000055555556e902 in html_parse (job=0x0, level=0, fname=0x55556e3c4690
<removed>,
html=0x55556e3c47a0 "<html><removed>"..., html_len=39145,
encoding=0x55555557e99c "utf-8", base=0x55556e3befb0)
at wget.c:2725
#10 0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
fname=0x55556e3c4690 <removed>, encoding=0x55555557e99c "utf-8",
base=0x55556e3befb0)
at wget.c:2755
#11 0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3c4690
<removed>,
encoding=0x55555557e99c "utf-8", mimetype=0x7fffff801ba0 "text/html",
base=0x55556e3befb0) at wget.c:558
#12 0x0000555555568e0b in queue_url_from_remote (job=0x0,
encoding=0x55555557e99c "utf-8",
url=0x7fffff801de0 <removed>, flags=0, download_name=0x0) at wget.c:923
#13 0x000055555556e902 in html_parse (job=0x0, level=0, fname=0x55556e3b4de0
"<removed>"..., html_len=44322, encoding=0x55555557e99c "utf-8",
base=0x55556e3afe30)
at wget.c:2725
#14 0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
fname=0x55556e3b4de0 <removed>, encoding=0x55555557e99c "utf-8",
base=0x55556e3afe30)
at wget.c:2755
#15 0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3b4de0
<removed>,
encoding=0x55555557e99c "utf-8", mimetype=0x7fffff802330 "text/html",
base=0x55556e3afe30) at wget.c:558
#16 0x0000555555568e0b in queue_url_from_remote (job=0x0,
encoding=0x55555557e99c "utf-8",
url=0x7fffff802570 <removed>, flags=0, download_name=0x0) at wget.c:923
```