Pixelschieber
The blog for pixel pushers.
Take people as they are.
There are no others.
JCrawler – 1.9 Joomla 1.5/1.6/1.7/2.5
JCrawler – XML Sitemaps for Joomla 1.5/2.5
JCrawler has moved to jcrawler.net
This Joomla component works like Google’s spider: it crawls your Joomla website and writes the URLs to an XML sitemap. Afterwards you can submit the sitemap to Google, MSN, Ask.com, Yahoo and Moreover.
Features:
- sitemap generation on the fly, no additional plugins needed
- submit sitemap to 5 search engines (Google, Yahoo, MSN, Ask.com, Moreover)
- Automatic priority calculation based on internal PageRank (experimental)
- shows bad links and inaccessible pages
- with curl: max 250 parallel connections to spider urls (configurable)
- exclude list
- modification of robots.txt with the location of the sitemap
- crawling with curl or fopen
- stores the config in an XML file
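As a rough illustration of the robots.txt feature in the list above, the sketch below appends a Sitemap line to existing robots.txt content unless that line is already present. The helper name addSitemapLine and the example.com URL are assumptions made for illustration, not JCrawler's actual code, and the syntax is modern PHP (7+) rather than the PHP 4/5 the component targeted.

```php
<?php
// Hypothetical helper: return robots.txt content that references the sitemap exactly once.
function addSitemapLine(string $robotsTxt, string $sitemapUrl): string
{
    $line = 'Sitemap: ' . $sitemapUrl;

    // Split on any newline style and look for an identical directive.
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $existing) {
        if (strcasecmp(trim($existing), $line) === 0) {
            return $robotsTxt; // already present, leave the content untouched
        }
    }

    // Append the directive on its own line, keeping the existing rules.
    $base = rtrim($robotsTxt);
    return ($base === '' ? '' : $base . "\n") . $line . "\n";
}

$robots = "User-agent: *\nDisallow: /administrator/\n";
echo addSitemapLine($robots, 'http://www.example.com/sitemap.xml');
```

Running the function a second time on its own output returns the content unchanged, so repeated crawls would not pile up duplicate Sitemap lines.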
JCrawler is open source; you can download it here: JCrawler
Feel free to test it!
—————————————————-
Now Beta 1.7 is out!
Changelog:
1.7 Beta
- Automatic priority calculation based on internal PageRank (experimental)
- Performance and compatibility improvements
- Submission to search engines via curl
- fixed empty loc-tag problem
- improved link detection
—————————————————
1.6 Beta
- fixed the “There are 0 links in your sitemap” issue
- better throttle control (less serverload)
- changed useragent to the googlebot
- the timeout is now configurable (only with curl)
- improved compatibility with non-ASCII characters (Arabic, Turkish, … languages)
- max. connections is now 250
- sh404SEF fix (you can add your url to the whitelist, for better crawling)
- improved curl detection
—————————————————
1.5 Beta
- fixed the “administr” issue
- added useragent: “JCrawler” (only with curl)
- the config is now saved in a xml-file
- extended the excludelist feature
- changed the validator link (Validome doesn’t accept umlauts)
- max. connections is now 200
—————————————————-
1.4 Beta
- max parallel connections now configurable
- exclude list bug fixed
- curl check improved
- works again with php 4, parse_url problem fixed
- home and site path now chosen correctly
—————————————————-
1.3 Beta
- shows http error code on documents not crawlable
- now crawls a Joomla installed in a subdirectory
- crawls subdomains
- Update check
- curl and fopen check
- when crawling with curl, the crawler now sends your page as the referer
- performance improvements
—————————————————-
1.2 Beta:
added: crawling level is changeable
added: crawling method is choosable (fopen or curl)
added: the crawler watches your robots.txt for disallowed urls/paths
small speed improvements
bugfixes:
- Curl rewrite bug
- php signs and code issues fixed (thx to Dave Edwards)
- XML install error
—————————————————–
Beta 1.1
Changes:
- added Exclude list
- added xslt definition
- added modification of robots.txt
- added priority handling
- added search engines Moreover and Yahoo
Bugs:
- fixed the curl-bug
Requirements:
- PHP function fopen (PHP parameter allow_url_fopen=on) must be available for submitting the XML sitemap to the search engines
- PHP Module CURL or fopen must be available (CURL preferred) for crawling
- Joomla 1.5
«There are three ways to ruin a company: with women, which is the most pleasant; with gambling, which is the fastest; with computers, which is the surest.»
Oswald Dreyer-Eimbcke
Thanks for the information Pat, but I am not able to find the
<?xml-stylesheet type="text/xsl" href="http://hms.pro.co.in/administrator/components/com_jcrawler/sitemap.xsl"?>
line in the file. Please suggest.
Thanks
srini
Hi,
Had some initial problems with the module in php5.2.6, joomla 1.5.4 running on FreeBSD 7.0-RELEASE.
I added ";" to terminate the "echo" statements and changed "<?" to "<?php" wherever I found it in the admin.jcrawler.html.php file.
I also added a missing parameter in the curl_multi_remove_handle function call in the admin.jcrawler.php file (the call now reads "curl_multi_remove_handle($mh, $ch[$i+$v]);" at line 276).
It then failed using “feof()” with the error:
feof(): supplied argument is not a valid stream resource in ….admin.jcrawler.php line 285. Rather than trying to sus that out, I installed curl-php5 and it then worked like a charm.
I’ve got diffs if you want them..
Thanks for a useful tool!
ciao
dave
Read my other post mate.
I had the same problem. My fix was to fix up some minor problems in the code and install curl-php5 (so it doesn’t use fopen).
ciao
dave
Dear Patrick,
Works fine! just one point.
JCrawler assumes that Joomla! is installed at the root of a domain. I (like many others) always install my CMS in a subfolder. Can you please make sure your component uses a relative path rather than an absolute path?
Regards, Herb
hi Herb,
you’re right, it’s not yet possible to generate sitemaps for Joomla installs in subdirectories; I’ll do this for the next release
thanks for your patience
patrick
I just had a little problem: the sitemap tool generated empty URLs within the sitemap. I solved this by changing the code at about line 181 to the following:
foreach ($urls as $loc){
/* utf-8 encoding */
$loc=htmlspecialchars($loc,ENT_QUOTES,'UTF-8');
if($loc == "") continue; // <=== THIS I HAVE ADDED
Just to let you know.
Regards
Thomas
Joomla Media Web Design
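The idea behind this fix can be sketched as a small standalone function: escape each URL, skip any that end up empty, and only then write a loc entry. The function name buildSitemap and the sample URLs are assumptions for illustration; this is not the component's own code and it uses modern PHP syntax.

```php
<?php
// Hypothetical helper: turn a list of crawled URLs into sitemap XML,
// skipping entries that are empty after trimming and escaping.
function buildSitemap(array $urls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?' . '>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    foreach ($urls as $loc) {
        // htmlspecialchars() returns '' for invalid UTF-8, so those are skipped too.
        $loc = htmlspecialchars(trim((string) $loc), ENT_QUOTES, 'UTF-8');
        if ($loc === '') {
            continue; // never emit an empty <loc> element
        }
        $xml .= '  <url><loc>' . $loc . '</loc></url>' . "\n";
    }

    return $xml . '</urlset>' . "\n";
}

echo buildSitemap([
    'http://www.example.com/',
    '',
    'http://www.example.com/index.php?option=com_content&view=article',
]);
```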
hi, thx for the nice tool !
I’ve got a problem with the exclude list: I want to add two words to exclude, but that doesn’t work
i’ve added a line break with <br>
My site is in Turkish and JCrawler sitemaps don’t have ISO 8859-9 support, I guess. What should I do for Turkish characters?
hey, i found the problem, you have “ä”, “ö”, “ü” in your urls, thats the reason…i’ll fix it for the next release!
thx & greets
Patrick
Very Nice, but you cant seem to change the priority for different pages? is it possible or all of the pages uses only one priority?
hi, unfortunately all pages have the same priority; I’m looking for a priority algorithm
how to delete this text from sitemap?
This is a XML Sitemap which is supposed to be processed by search engines like Google, MSN Search and YAHOO. It was generated using the CMS-Software Joomla and the JCrawler Component for Joomla by Support-Masters and Pixler. You can find more information about XML sitemaps on sitemaps.org and Google’s list of sitemap programs.
and footer:
Generated with JCrawler Component for Joomla by Patrick Winkler. This XSLT template is released under GPL.
hi, you don’t need to delete this text, this text is not visible to the search engines only to your browser, it’s not included in the sitemap.
greets Patrick
Hi
I could not use curl, and I got this error message on Joomla 1.5.6. I have fopen on.
Warning: file_put_contents(/hsphere/local/home/yyy/xxx.se/sitemap.xml): failed to open stream: Permission denied in /hsphere/local/home/yyy/xxx.se/libraries/joomla/filesystem/file.php on line 298
Any ideas what permission to set or ?
rgds
hi, you have to set the permissions on /hsphere/local/home/yyy/xxx.se/sitemap.xml to 777
regs patrick
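Before changing permissions it can help to see what PHP itself can do with the file. The following is a minimal sketch under assumptions: the helper name sitemapWriteStatus and the path are made up for illustration, and the check only reflects the user the web server's PHP runs as.

```php
<?php
// Hypothetical helper: report whether the sitemap file can be written, and if not, why.
function sitemapWriteStatus(string $path): string
{
    if (file_exists($path)) {
        return is_writable($path) ? 'writable' : 'exists but not writable';
    }
    // The file is missing: it can only be created if the directory is writable.
    return is_writable(dirname($path)) ? 'can be created' : 'directory not writable';
}

// Assumed example path; replace it with the real location of sitemap.xml.
echo sitemapWriteStatus('/path/to/joomla/sitemap.xml'), "\n";
```

If the result is "exists but not writable", a narrower mode such as the 666 mentioned elsewhere in this thread may already be enough.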
Hi,
Thanks in advance.
I keep getting this
Error
HTTP/1.1 403 FORBIDDEN on url http://russiansinny.com make sure this url is availabe to the crawler
any ideas on how to resolve this.
I know its something stupid so i am sorry.
hi,
hmm is there a special .htaccess which limits the access of jcrawler?
Greets Patrick
Fixed it now: creating a sitemap.xml manually with permissions 666 worked, even though I got a lot of error messages!
I could create an XML map and run it!
Very nice Joomla add-on!
best regards!
Thanks for this component. However, when I install it and run the crawl, the sitemap that is generated does NOT pass validation. It gives an error on the character set.
Floris
hi Floris,
hmm strange, I generated one for your site as well and it passed the validation!?
which error exactly?
Greets Patrick
http://www.validome.org/google/validate?url=http://www.toptotaal.nl/sitemap.xml&lang=en&googleTyp=SITEMAP
given error:
Fatal error HTTP-Charset (iso-8859-1) does not match UTF-8.
With best regards,
Floris
…there is a redirection on: http://www.toptotaal.nl/sitemap.xml to http://www.toptotaal.nl/index.php?option=com_xmap&sitemap=1&view=xml&no_html=1
thats why it’s not valid
Greets patrick
Thanks for your reply, indeed there was a redirect. But now I see that .jpg files are indexed. While I configured that that had to be ignored.
Floris
hmm, strange again: in my sitemap of your site no images are indexed.
You don’t need to configure anything extra to ignore jpg, gif or png; it’s all done by default in the presets.
Just click on components -> Jcrawler and then on submit.
Thats it.
Greets patrick
What about this sitemap then? http://www.toptotaal.nl/sitemap.xml
yes you’re right, i’ve found a bug, it will be fixed in the next release!
Thanks for your help
greets patrick
I have another Q:
I uploaded the sitemap a few days ago, but so far no information is available about indexed URLs (according to Google Webmaster Tools).
Before I used Xmap and had almost every url indexed (about 80 percent of the total url’s in the sitemap.)
Do you have any experience with this?
Hey Floris,
Yes, it takes a long time until information is available. And the information is only available while Google is not currently crawling your site. While Google is spidering URLs, and for some time after, no info is available, in my experience.
I don’t know how long a “crawl” of the whole site takes.
Ok, another case: The validation fails on the xml index.
We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed in the table below.
Character          Escape Code
Ampersand (&)      &amp;
Single Quote (')   &apos;
Double Quote (")   &quot;
Greater Than (>)   &gt;
Less Than (<)      &lt;
Source: Google
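For reference, PHP can produce exactly these five escape codes with htmlspecialchars() when the ENT_QUOTES and ENT_XML1 flags are combined (available since PHP 5.4). The wrapper name xmlEscape and the sample URL are assumptions for illustration only.

```php
<?php
// Hypothetical wrapper: escape a value for use inside an XML element such as <loc>.
// ENT_QUOTES escapes both quote styles; ENT_XML1 selects &apos; for the single quote.
function xmlEscape(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo xmlEscape('http://www.example.com/index.php?option=com_content&view=article'), "\n";
echo xmlEscape('&\'"><'), "\n";
```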
the sitemap is UTF-8 encoded! your sitemap is valid too.
you can’t validate your sitemap as a sitemap index:
a sitemap-index is another thing: http://sitemaps.blogspot.com/2005/08/using-sitemap-index-files.html
hi,
thank you for this nice component. I’ve been trying hard to use it, but it seems that it overloads my server, so the script gets blocked before it’s able to finish…
Could you possibly allow to set the crawling speed to avoid this problem?
Thank you again
Hey,
hmm, with curl you have 40 parallel connections. If possible, try to crawl with the fopen method. In the next release there will be an option to configure the number of parallel connections.
Greets patrick
Hey when i open jCrawler i get these two messages:
Warning: set_time_limit(): Cannot set time limit in safe mode in /srv/www/vhosts/*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 5
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts/*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.html.php on line 13
and when i hit Start i get these messages with an empty sitemap.xml
Warning: set_time_limit(): Cannot set time limit in safe mode in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 5
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /srv/www/vhosts//*******/httpdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Any idea where the problem is?
Hello,
Jcrawler is a nice tool.
But I have a little problem with it on my domain.
I use Joomla 1.5.6
and the actual Jcrawler 1.3 Beta.
When I start Jcrawler I receive following error:
Warning: parse_url() expects exactly 1 parameter, 2 given in /kunden/xxxx/webseiten/joomla/administrator/components/com_jcrawler/admin.jcrawler.html.php on line 13
I use it with fopen
After pressing the start button, I receive the following errors:
Warning: parse_url() expects exactly 1 parameter, 2 given in /kunden/xxxx/webseiten/joomla/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /kunden/xxxx/webseiten/joomla/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
If I go to the point validate sitemap.xml I get the answer Error no validation.
Warning: parse_url() expects exactly 1 parameter, 2 given in ../administrator/components/com_jcrawler/admin.jcrawler.html.php on line 13
Hey,
with this warning the component creates an empty sitemap.xml file
and displays more warnings:
Warning: parse_url() expects exactly 1 parameter, 2 given in /home/ekonit/www/kurtyny/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /home/ekonit/www/kurtyny/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /home/ekonit/www/kurtyny/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /home/ekonit/www/kurtyny/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in web/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in web/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in web/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in web/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Everybody with the parse_url() error: which PHP version do you use?
Edit: thanks, I’ve seen that the second parameter of the parse_url function was added in PHP 5.1.2, so it won’t work in earlier PHP versions.
But I’ll change it for the next release to work with older PHP versions
thanks for the help
greets patrick
PHP version 5.0.27-log
Hi,
I changed my php version to 5.2.5.
Now I don’t receive the error message any more.
The only thing that I receive is that my page is not validated.
Do you have an idea?
Ok i saw i got PHP Version 4.3.10 i have to update my server and try it again.
Hi, I have successfully created the sitemap but it’s not complete. Is there a limit on how many URLs can be included in the sitemap, or is it some other problem?
Hi Glab,
what is missing in the sitemap? Images, login URLs and email links are not included by default.
Did you try a deeper crawl level?
Greets Patrick
hi,
I am using Jcrawler with joomla 1.5.5.
Installation is ok.
It generates the sitemap.xml file.
But after generating the sitemap.xml it gives me a link "View my sitemap" to view the XML, and when I click on that link it produces the following error:
Error loading stylesheet: A network error occured loading an XSLT stylesheet:http://localhost/administrator/components/com_jcrawler/sitemap.xsl
if i open the sitemap.xml by clicking on that it show all the link of sitemap.
Anyone can help me?
i dont know if the jcrawler work correctly on a localhost… can you give me the exact link?
or paste the header of the xml file?
hi all;
I try to change some code to solve parse_url() problem for PHP4;
This is my edit:
open admin.jcrawler.php
@462 – 463 =>
$dirname=parse_url($website, PHP_URL_PATH);
$host=parse_url($website, PHP_URL_HOST);
change it to =>
$bits = parse_url($website);
$dirname=$bits['path'];
$host=$bits['host'];
——————————–
open admin.jcrawler.html.php
@ 13 =>
$folder=parse_url($http_host.$_SERVER['SCRIPT_NAME'],PHP_URL_PATH);
change it to =>
$bits = parse_url($http_host.$_SERVER['SCRIPT_NAME']);
$folder=$bits['path'];
That’s all; It works for me
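The same workaround can be written as a small standalone function. Calling parse_url() with a single argument returns an array, and a URL without a path has no 'path' key at all, so the sketch below guards each lookup. The helper name hostAndPath, the '/' fallback and the example.com URL are assumptions for illustration, and the type hints are modern PHP, not PHP 4.

```php
<?php
// Hypothetical helper: split a URL into host and path using only the
// one-argument form of parse_url(), which also exists in older PHP versions.
function hostAndPath(string $url): array
{
    $bits = parse_url($url);
    if ($bits === false) {
        $bits = []; // seriously malformed URL: fall back to the defaults below
    }

    return [
        'host' => isset($bits['host']) ? $bits['host'] : '',
        'path' => isset($bits['path']) ? $bits['path'] : '/', // assumed default
    ];
}

print_r(hostAndPath('http://www.example.com/joomla/index.php'));
print_r(hostAndPath('http://www.example.com'));
```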
thanks for your input, I’ve already fixed this, I’ll soon release the new version.
greets patrick
I installed it on my new site and the online user count is 150!!
It’s running bots against my site. Thanks.
yes thats possible, the jcrawler is known as “bot” for webstats
greets patrick
Hello,
Thanks for a great idea. I’m looking forward to it working!
I’m running a 1.5.6 joomla site. When I start JCrawler I get several different sorts of errors:
JFile::read: Unable to open file: ‘/var/www/joo15/robots.txt’
JFolder::create: Could not create directory
your sitemap file is not writable, create an empty sitemap.xml and make a chmod 666 to the file
JFile::read: Unable to open file: ‘/var/www/joo15/robots.txt’
JFolder::create: Could not create directory
Your robots.txt is not writable!
http://www.domain.org/joo15/robots.txt is 666 and http://www.domain.org/sitemap.xml is 666
Where is it trying to create a directory?
best wishes,
terry
Hi Patrick,
thanks for this great tool JCrawler!
I am using Joomla 1.5.5 with VM 1.1.2
Some questions:
1) Is the saving of settings like priority, exclude paths, exclude file types … not implemented yet, or do I have to set this in a configuration file? If yes, where can I do that?
2) The exclude path works for me only if I set a single path. If I list more than one, none of them is used for excluding and everything is included in the sitemap.xml file
3) Curl is marked red as not available, but the generation of the sitemap.xml file is done correctly…
4) Is it correct, JCrawler is not doing a daily scheduled Sitemap generation? This has to be done manually at the moment?
Thanks for any comments
Tom
Hi Tom,
1. Saving the settings is not yet implemented; there will be an XML file for the options
2. there is a bug with the exclude paths, but it should not be necessary to set the whole path
3. yes, it automatically selects fopen if curl is not available; I was not sure if the check_url function really works, so I didn’t disable the curl option in the form
4. yes that’s right, I’m not sure if it’s useful to generate the sitemap on a schedule; the server load is very heavy
greets patrick
hmm can you check is /var/www/joo15/robots.txt correct? is there a robots.txt and sitemap.xml file.
it creates the directory and the file if not found.
greets patrick
Hi Patrick,
, but YES robots.txt is extended by JCrawler with a Sitemap reference and the Sitemap.xml is generated.
I am not sure if you meant me
Thanks for your quick reply
Tom
Dear Tom,
Thanks for the really quick reply!
The robots.txt and sitemap.xml are there.
It may be that the problem is that although the site is running on a Red Hat server, it is on a hosting cloud. As far as I can tell from this end, there is none of the usual /var/www/
Instead, I have ‘site number’/web/content/joo15/catalog etc
I put the robots.txt and sitemap.xml in the same places relative to web root.
Perhaps the lack of /var/www/ is the problem?
Many thanks!
Terry
Changed document root to
/mnt/target02/3xxxxxxx/3xxxxxx/www.love.com.au/web/content/ and JCrawler ran made sitemap.xml and modified robots.txt.
Many thanks. It looks a great piece of software!
Best wishes,
Terry
Hi,
It works more or less. A few errors of type "httpcode: 403 on url … make sure this url is availabe to the crawler", and … do exist and are written into the sitemap, so I don’t know why I get these error messages. At least it is writing the menu links and the article links.
I am only missing the links from the paginated pages on the front page in the sitemap. It finds only the articles on the front page via "read more" but is not looking further via the paginator (Page-2.html and so on) into the other pages.
I am using sh404SEF. Joomla! is 1.5.7.
Greetings,
Dave
Same problem here also.
What is the cause?
Hi there,
I am running Joomla 1.5.7 (legacy mode) + JCrawler 1.3Beta. I keep getting these error messages:
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 462
Warning: parse_url() expects exactly 1 parameter, 2 given in /[pfad]/administrator/components/com_jcrawler/admin.jcrawler.php on line 463
Hi Masa,
Yes, this is a bug. Upgrade to at least PHP 5.1.2, or be patient: I’ll release a new version next weekend and this will be fixed.
Greets Patrick
hi,
I updated JCrawler to 1.4 and when I start the crawler, I get an error like this:
with curl:
HTTP/1.1 404 Not Found on url http://www.dokuzhukuku.org/administr make sure this url is availabe to the crawler
with fopen:
HTTP/1.1 404 Not Found on url http://www.dokuzhukuku.org/administr make sure this url is availabe to the crawler
hmm, have you modified your httpd.conf or .htaccess?
greets patrick
I never modified my httpd.conf (my host does not allow it), but I changed some code in .htaccess. I solved my problem by making this change in the code (I use the code of 1.13):
admin.jcrawler.html.php @ 12:
From:
$http_host = substr(JURI::base(),0,strrpos(JURI::base(),"administrator"));
To:
$http_host = 'http://' . $_SERVER['HTTP_HOST'];
It seems to work
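One plausible explanation for the "/administr" URL, not confirmed anywhere in this thread: according to the PHP manual's changelog, strrpos() in PHP 4 used only the first character of a multi-character needle, so searching for "administrator" would really find the last "a" in the URL. The sketch below shows a variant that guards against a missing match, plus an emulation of that single-character search. The function name siteRootFromAdminUrl and the example.com URL are assumptions for illustration, and the code is modern PHP.

```php
<?php
// Hypothetical helper: derive the site root from the administrator URL.
function siteRootFromAdminUrl(string $adminUrl): string
{
    $pos = strrpos($adminUrl, '/administrator');
    if ($pos === false) {
        return $adminUrl; // marker not found: leave the URL unchanged
    }
    return substr($adminUrl, 0, $pos + 1); // keep the trailing slash
}

$adminUrl = 'http://www.example.com/administrator/';
echo siteRootFromAdminUrl($adminUrl), "\n";

// Emulates a strrpos() that only honours the first character of the needle ("a"):
echo substr($adminUrl, 0, strrpos($adminUrl, 'a')), "\n";
```

The second line of output ends in "/administr", which matches the symptom reported for PHP 4 hosts in this thread.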
Hi!
But the new version from 2008-09-23 17:00:00-05 is no longer compatible with sites running PHP 5, because the file admin.jcrawler.php uses the PHP 4 style "array_merge". For PHP 5 systems you need only "array".
regards
peter
hi peter,
In PHP 5 array_merge accepts only arrays as parameters; in PHP 4 it accepts an array or a string. So it’s compatible with both PHP 4 and PHP 5.
Greets patrick
No, I’ve modified nothing special. But after changing array_merge to array it is now working fine. The PHP version on my server is 5.2.6 and SQL is 5.0.45-community.
You also must correct one mistake in admin.jcrawler.html.php:
function footer($option){
print "<div align=\"center\" style=\"clear:both;\"><a href=\"index2.php?option=".$option."&task=updatecheck\">Check for update</a><br />Copyright 2008 pixelschieber.ch. <a target=\"_blank\" href=\"http://www.pixelschiber.ch/jcrawler\">pixelschieber.ch</a></div>";
--> change the link to pixelschieber.ch
regards
Peter
<run the day or the day runs you>
Hi admin,
I start the crawler and it works for 2-5 minutes (I use curl and max. level is 2), then Firefox asks me to download index2.php (it does not open the finish page with the Google, MSN, Yahoo ping). I have an .xml file (2.7 MB and approx. 9000 links) but the scan does not finish. I think it’s a timeout or something similar.
How can I know whether it has finished scanning?
I have another problem
Hi BooRook,
ok, the current time limit of this script is 9999999 seconds, but what’s your server limit? If safe mode is set to "On" you have to increase the time limit in your server configuration (php.ini).
The crawl is finished when the submit page for Google, MSN & co. appears.
Increase the parallel connections and use the exclude list.
Greets Patrick
I am looking forward to giving JCrawler a try, but first I have to get around my security settings. In my .htaccess I have blocked all user agents except the popular browsers and the main search engines. User agents without a name do not get to see the site either. So, obviously JCrawler only meets a 403 error when crawling my site. So, for me to try JCrawler I need to know the user agent name of the component’s bot, so that I can add this to my white list. What is JCrawler’s user agent name?
hi, currently there is no user agent; the next version will have the following string as user agent: JCrawler
Maybe you can allow the access by ip or host?
Greets Patrick
Thanks for the response! I’d rather wait for the new version
when is it planned to come out?
I think it will be about a week, there are other small bugs.
greets patrick
I’m trying to get the sitemap generated. Originally the HTTP:// host had an extra ../administr. I have set this to the correct url by changing admin.jcrawler.html.php. I am now generating an empty sitemap with errors:
Warning: file_get_contents() [function.file-get-contents]: URL file-access is disabled in the server configuration in ../administrator/components/com_jcrawler/admin.jcrawler.php on line 447
Warning: file_get_contents(http://www.alexaicken.com) [function.file-get-contents]: failed to open stream: no suitable wrapper could be found in ../administrator/components/com_jcrawler/admin.jcrawler.php on line 447
I am hosting with DreamHost who have disabled file_get_contents(). They have alternatives at http://wiki.dreamhost.com/index.php/CURL#Alternative_for_file_get_contents.28.29 but I have not managed to get any of them working.
Any help is appreciated!
ok, the administr bug will be fixed in the next release,
but you can use the curl method even if there’s a message that curl is not available
Greets patrick
Thanks for getting back. However, all I get is an empty sitemap – no links in the <urlset> tag. I can generate it successfully on another sitemap generating site, so I don’t know what the problem is.
Same thing. Everything looks fine, but jcrawler generates an empty sitemap
hi karys,
I generated a sitemap from your site without any problems; I just opened JCrawler and then clicked on submit, without changing the config
are there no errors? maybe in the server log?
Greets patrick
I have a home server. If I try to access JCrawler through a proxy, by my domain, I don’t get any errors and I don’t get a sitemap. If I try to access my site through its local IP, when I push submit it starts the "crawling" image and basically hangs. I have no option then but to restart Apache. If you saw my site you’d probably figure there’s not all that much to crawl
I’d say it could take up to 2 minutes (being extremely pessimistic)
When I try and validate it I get:
Error:
end tag for “urlset” which is not finished
Error Position:
</urlset>
same thing for me. If it changes anything( i think it just might) i don’t have a htaccess file
Update on research: JCrawler HAS to crash if you are trying to do a crawl on an IP or on localhost. OK, that’s not such a big deal. But you should test it on domains without www in front. I think the getURL function might just crash on a domain without www in front.
Still no idea why it generates an empty sitemap….
Found out the source of my problem. Curl gets my router login screen instead of the site.
I was having problems with curl with your JCrawler. I found a solution with some help on a bulletin board. It required a simple change to your code. Please view the solution here http://www.joomlapraise.com/distribution/viewtopic.php?f=42&t=297
lance, thank you for that small curl-check fix, very helpful.
I hate regexp
Greets Patrick
PS: will be fixed in the next release
I think you easily could develop Dcrawl for drupal:
http://drupal.org/node/314868
what do you think about it?
yes sounds nice, and would be easy.
but first i have to finish the jcrawler, then i’ll do maybe a “DCrawler”
next feature will store the configuration in a xml file
Greets patrick
I installed JCrawler on Joomla 1.5.7.
It seemed to run fine, but the resulting sitemap was empty; there was nothing between <urlset xmlns…> and </urlset>.
I tried this with both curl and fopen.
Is there some assumption that your program makes about the start page, perhaps? Or have I done something odd?
hmm, maybe your URL is not correct?
for example: http://www.yourdomain.com/administr
? greets patrick
I am getting these errors:
Warning: feof(): supplied argument is not a valid stream resource in /home/xx/public_html/administrator/components/com_jcrawler/admin.jcrawler.php on line 438
Warning: fgets(): supplied argument is not a valid stream resource in /home/xx/public_html/administrator/components/com_jcrawler/admin.jcrawler.php on line 439
Any idea why?
the same: I think your crawling URL is not correct; otherwise try it with curl, then you get better feedback
greets patrick
I entered http://www.gurrenlagannepisodes.com/ which is a correct URL… why would it not be correct?
is there no "administr" folder appended to your domain? What’s your Joomla & PHP version?
Hi, works fine in XAMPP Test-environment with PHP 5.1.6
But the productive website encounters problems with exact HTTP host (PHP Version 4.4.8):
Instead of writing http://www.mydomain.com it writes
http://www.mydomain.com/administr
What´s wrong here ?
Thanks for your help !
hi Daniel
ok, a little hint more, thank you.
so on PHP 5.1.6 it works fine, and the same site on PHP 4.4.8 has the "administr" issue?
it will be fixed in the next release.
Greets patrick
I added the excluded files to the list and it did leave them out during the crawl however it doesn’t save the list. It would be handy if I could save the list. Other than that it works great!
hi polero, this feature will come with the next release
greets Patrick
how long does it take jcrawler to crawl a site? It’s been going for 30+ minutes and my site is relatively small currently
Hi jason, I crawled your site in 5 minutes on level 3 with 200 parallel connections using curl. In the next version you will be able to crawl with 200 parallel connections.
You have a link-cloud on your website that means additional work for the crawler
Greets patrick
Hi,
how can I change the value in the field “HTTP host (read only)”?
There was a wrong URL and I think thats the reason why I get error messages like:
Warning: feof(): supplied argument is not a valid stream resource in /home/xx/public_html/administrator/components/com_jcrawler/admin.jcrawler.php on line 438
Warning: fgets(): supplied argument is not a valid stream resource in /home/xx/public_html/administrator/components/com_jcrawler/admin.jcrawler.php on line 439
Thank you for help
Hi,
HTTP Host is detected automatically; there is a bug and I’ll publish a new, fixed version of JCrawler tomorrow.
Greets Patrick
Hi Patrick,
have you already published a new version?
Thanks
hey maila,
no, not yet, I have a problem with the character encoding in the exclude list…
it will be next weekend
greets patrick
Hi Patrick,
I installed 1.5 Beta and I can’t change the URL by myself because the URL is also wrong.
If I click on start to create a sitemap I get the following message:
Warning: feof(): supplied argument is not a valid stream resource in /var/www/web45/html/lange_nicht_gesehen/administrator/components/com_jcrawler/admin.jcrawler.php on line 505
Thanks for help
Maik
Hi Patrick,
do you have any information for me about the above-mentioned error?
thanks
hi maila,
can you give me your url (of your joomla mainsite), and the url displayed in the jcrawler?
greets patrick
Hi patrick,
the URL of my website is http://www.lange-nicht-gesehen.org.
The URL which is displayed is:
“http://www.wundesbehr.de/lange_nicht_gesehen/”
I have two different websites in two different folders on one webspace.
Is this a problem?
Thanks
No, that should not be a problem…
The crawler takes the URL/host from your php.ini, but I’ll make a fix that takes the host from the URL instead.
greets patrick
I have tried running v1.4beta on my development site (localhost) and the run time is unbelievably high. The site uses virtuemart with about fifteen product categories linked in a hierarchy four levels deep with a total of about 35 products (this is the full operational category structure but just enough products for testing). Every page has a menu with the full cascade of product categories. I installed the software, checked that fopen and curl are both installed (WAMP with PHP v5.2.6) and ran an initial test using the default parameter settings. After about 90 minutes of processing (using 98% cpu), I killed the process. I then changed the settings to ‘parallel connections’ = 100 and ‘levels’ = 1 and tried again. After 25 minutes I killed it again. By comparison, Xmap takes 1.8 seconds to map the site and display the results on screen
This has the feel to me that the product is recursing through the virtuemart product structure many times because many links on the home page lead to other pages that also contain the same set of urls. Is there a check that identical links are not being re-processed over and over?
I’ve never seen any output from your component but I look forward to the possibility
Regards Phil
Hi Phil, thanks for this information
There must be an endless loop; I don’t know where yet. I never tested the crawler on localhost! And yes, that’s the reason it’s still beta…
Greets patrick
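The check Phil asks about, never processing the same link twice, can be sketched as follows. This is a minimal Python illustration and not JCrawler's actual PHP code; `links_on_page` is a hypothetical stand-in for downloading a page and extracting its links, and the toy site below is invented for the example.

```python
from collections import deque


def crawl(start, links_on_page):
    """Breadth-first crawl that fetches every URL at most once.

    links_on_page is a hypothetical callable returning the links found on
    a URL; a real crawler would download and parse the page here.
    """
    seen = {start}           # every URL that was ever queued
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        fetched.append(url)
        for link in links_on_page(url):
            if link not in seen:   # the duplicate check
                seen.add(link)
                queue.append(link)
    return fetched


# Toy site: every page repeats the same full category menu.
MENU = ["/", "/cat-a", "/cat-b", "/cat-c"]
TOY_SITE = {page: MENU for page in MENU}

order = crawl("/", lambda url: TOY_SITE.get(url, []))
print(order)
```

Without the `seen` set, a menu that appears on every page would be queued again from each page it appears on, which matches the kind of runaway processing described above.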
Thanks Patrick
Phil
I have about 20 lines of the error ‘httpcode:403… make sure this url is available to the crawler’. I read my .htaccess file, but I have no idea whether my .htaccess setup is wrong or not. I have sh404sef on with .htaccess rewriting mode on. Is it okay to submit this sitemap to Google or Yahoo with this error, or should I wait until there is no 403 error? Could you kindly give your advice? Sorry for the stupid question.
Hi Freefall,
It’s not a stupid question. My advice:
Leave the URLs with 403 errors in your sitemap and add your site/sitemap in Google Webmaster Tools: https://www.google.com/webmasters/sitemaps/
There you’ll see a message if Google finds 403 errors as well.
Greets Patrick
I always get HTML 500 Error when I use your product. It doesn’t matter how I configure it.
hi,
try to comment out the “RewriteBase /” entry in your .htaccess file, and try again
Greets patrick
Hi,
comment or uncomment “RewriteBase /” in your .htaccess and try again.
Greets patrick
Sorry for my bad English, it is not my mother tongue. The submission doesn’t work, a message appears after the crawling :
JFile::read: Unable to open file: ‘http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.mysite/sitemap.xml’
JFile::read: Unable to open file: ‘http://webmaster.live.com/ping.aspx?siteMap=http://www.mysite/sitemap.xml’
JFile::read: Unable to open file: ‘http://submissions.ask.com/ping?sitemap=http://www.mysite/sitemap.xml’
JFile::read: Unable to open file: ‘http://api.moreover.com/ping?u=http://www.mysite/sitemap.xml’
JFile::read: Unable to open file: ‘http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=SitemapWriter&url=http://www.mysite/sitemap.xml’
Could you help me please? (thank you for this component)
Hi, you cannot submit your sitemap to Google & co. because fopen cannot connect to URLs. Set the php.ini parameter “allow_url_fopen” to “On”, or point your browser to these URLs.
Greets patrick
I clicked on these url’s and it seems that submission worked. Thank you Patrick.
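The ping URLs in the error messages above all follow the same pattern: an endpoint plus the sitemap address as a query parameter. A small Python sketch of building one is shown here. The endpoint and parameter name are copied from the thread and may no longer be served by the search engine; `ping_url` is a hypothetical helper, and `www.example.com` is a placeholder domain.

```python
from urllib.parse import urlencode


def ping_url(endpoint, param, sitemap_url):
    """Build a sitemap ping URL with the sitemap address safely encoded."""
    return endpoint + "?" + urlencode({param: sitemap_url})


# Endpoint as quoted in the thread; availability today is not guaranteed.
google_ping = ping_url(
    "http://www.google.com/webmasters/sitemaps/ping",
    "sitemap",
    "http://www.example.com/sitemap.xml",
)
print(google_ping)
```

Encoding the sitemap address matters once it contains characters such as `&` or `?`, which would otherwise be read as part of the ping URL itself.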
Hello, I tried to install your sitemap and got this error:
Warning: zip_entry_read() [function.zip-entry-read]: The bytes parameter must greater then zero in /home/thisdarn/public_html/bentnail/libraries/joomla/filesystem/archive/zip.php on line 238
Also, under “HTTP host (readonly)” in the cpanel the URL is not correct and it won’t allow me to change it:
http://www.bentnail.us/administr
Hi Stuart,
Perhaps your zip file is damaged? The “HTTP host (readonly)” issue (http://www.bentnail.us/administr) is known and will be fixed in the next version.
Greets Patrick
Hello
I am trying your component on my site but I receive the following errors from jcrawler (extract):
httpcode: 406 on url http://otn1.tourisme-nivelles.be/index.php?option=com_content&view=article&id=202&Itemid=67 make sure this url is availabe to the crawler
httpcode: 406 on url http://otn1.tourisme-nivelles.be/index.php?option=com_content&view=article&id=65&Itemid=76 make sure this url is availabe to the crawler
httpcode: 406 on url http://otn1.tourisme-nivelles.be/index.php?option=com_content&view=article&id=182&Itemid=93 make sure this url is availabe to the crawler
…
and the generated sitemap is the following:
<url> <loc>http://otn1.tourisme-nivelles.be/</loc> <lastmod>2008-10-12T11:15:44Z</lastmod> <priority>0.5</priority> <changefreq>daily</changefreq> </url>
<url> <loc>http://otn1.tourisme-nivelles.be/index.php?option=com_content&amp;view=frontpage&amp;Itemid=169</loc> <lastmod>2008-10-12T11:15:44Z</lastmod> <priority>0.5</priority> <changefreq>daily</changefreq> </url>
<url> <loc>http://otn1.tourisme-nivelles.be/index.php?option=com_content&amp;view=article&amp;id=202&amp;Itemid=67</loc> <lastmod>2008-10-12T11:15:44Z</lastmod> <priority>0.5</priority> <changefreq>daily</changefreq> </url>
I assume that the “amp;” string after each “&” in the URL is responsible. How can I get rid of this problem?
Could you help me?
Daniel
Hi,
The “amp;” in your URLs is OK; jcrawler just lists your URLs as they are on your site.
The problem must be somewhere else. See the response when you point your browser to:
http://otn1.tourisme-nivelles.be/index.php?option=com_content&view=article&id=202&Itemid=67
there is a 406 Not Acceptable error.
Greets patrick
Patrick
Thanks for your quick answer.
As I mentioned, if I remove the “amp;” portion, the URL is correct; ie:
http://otn1.tourisme-nivelles.be/index.php?option=com_content&view=article&id=202&Itemid=67
This url is the same as the one produced and used by joomla in normal use.
My question is why the component needs to add the “amp;” part.
Daniel
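On Daniel's question about why the “amp;” part appears: a sitemap is an XML file, and in XML a literal `&` has to be written as the entity `&amp;`. A reader of the file turns it back into `&`, so the URL itself is unchanged. The Python sketch below only illustrates that round trip with the standard library; it is not JCrawler's code.

```python
from xml.sax.saxutils import escape, unescape

url = ("http://otn1.tourisme-nivelles.be/index.php"
       "?option=com_content&view=article&id=202&Itemid=67")

# What has to be written into the XML file.
loc = "<loc>" + escape(url) + "</loc>"
print(loc)

# What an XML parser hands back to the search engine.
round_trip = unescape(escape(url))
print(round_trip == url)
```

So the escaped form in sitemap.xml and the plain form used by Joomla refer to the same address, and the 406 response has to come from somewhere else.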
I have run this several times trying to generate a sitemap. It does generate a sitemap, but gives errors, and Google indicates that several errors occur because the www. does not precede the site name.
Additionally, i get errors within the tool like the following:
httpcode: 500 on url http://ultimategayweddings.com/component/banners/click/15.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/component/banners/click/12.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/ideasinspirations.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/contests.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/component/banners/click/10.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/component/banners/click/14.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/partnering.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/privacy.html make sure this url is availabe to the crawler
httpcode: 404 on url http://ultimategayweddings.com/ceremonies/dp make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/style-fashion-grooms/69-hes-got-style.html make sure this url is availabe to the crawler
httpcode: 500 on url http://ultimategayweddings.com/tips-for-grooms/75-two-grooms-tips-a-tricks.html make sure this url is availabe to the crawler
httpcode: 404 on url http://ultimategayweddings.com/function.array-merge make sure this url is availabe to the crawler
Maybe I am doing something wrong. A little guidance would be appreciated.
Here is what google is saying, just one example
Column: 11982 Error: Invalid domain within loc element (http://www.ultimategayweddings.com).. Domain within loc element must match domain mentionned in Sitemap (-Index). Error Position: <loc>http://www.ultimategayweddings.com/ceremonies.html</loc>
Hi rick,
First, if your sitemap is available via http://ultimategayweddings.com/sitemap.xml,
all links have to be without www; if your sitemap is available via http://www.ultimategayweddings.com/sitemap.xml, every link in the sitemap has to start with http://www.ultimategayweddings.com. That is the rule for sitemaps, and it is why your sitemap is not valid.
About the 500 errors: are there any entries in your .htaccess file that concern the crawler?
Greets patrick
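The host rule Patrick describes can be checked before submitting a sitemap. The following Python sketch compares the host of each entry with the host the sitemap is served from; `mismatched_locs` is a hypothetical helper written for this illustration, and the URLs are the ones from the posts above.

```python
from urllib.parse import urlsplit


def mismatched_locs(sitemap_url, locs):
    """Return the entries whose host differs from the sitemap's own host."""
    host = urlsplit(sitemap_url).netloc.lower()
    return [loc for loc in locs if urlsplit(loc).netloc.lower() != host]


bad = mismatched_locs(
    "http://ultimategayweddings.com/sitemap.xml",
    [
        "http://www.ultimategayweddings.com/ceremonies.html",
        "http://ultimategayweddings.com/contests.html",
    ],
)
print(bad)
```

Here the entry with `www.` is reported, because the sitemap itself is fetched without `www.`.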
Hi, thanks for the great component! Version 1.4 on Joomla 1.5.6 worked fine for me until today: the sitemap didn’t pass Google validation. That’s why I decided to install the newest version (1.5). I also get an error on that one; the new validator says:
Problems with the schema-validity of the target
Invalid per cvc-complex-type.1.2.2: element content failed type check: 0,5 is not a valid decimal literal
I think it is the “,” so I changed the value to 0.5, but the component doesn’t seem to save this value.
hi yvonne, strange, when I change the priority the value is saved.
Yes, you have to use a “.” instead of a “,” for the priority.
Try to reinstall jcrawler.
greets patrick
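The validator error above comes from the decimal comma: the sitemap schema expects values such as `0.5`, never `0,5`. A tool could normalise the entered value before writing it. The Python sketch below shows one way to do that under the assumption that priorities are entered as text; `normalise_priority` is a hypothetical helper and not part of JCrawler.

```python
def normalise_priority(text):
    """Turn user input such as '0,5' into a sitemap priority like '0.5'."""
    value = float(text.strip().replace(",", "."))
    if not 0.0 <= value <= 1.0:
        raise ValueError("priority must be between 0.0 and 1.0")
    # Formatting with an f-string always uses '.' as the decimal separator.
    return f"{value:.1f}"


print(normalise_priority("0,5"))
print(normalise_priority(" 1 "))
```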
Thanks patrick,
You’re absolutely right, I guess I was still sleeping at that time, I’m sorry. The component works fine if you fill in a “.” instead of the “,”.
For your information, there are some minor typos which maybe you can correct in a future version:
- the link to your site pixelschieber.ch: the A HREF is incorrect (pixelschiber.ch/jcrawler)
- “Vadidate my sitemap” of course must be “Validate my sitemap”
I hope this helps a bit.
hi yvonne, thanks for the typos =)
Hi, thanks for this tool. I am using modrewrite to point requests to a subdirectory. This appears to confuse Jcrawler 1.5. When I go into the configuration, HTTP host is set to the actual subdirectory where the Joomla install resides. However, crawling fails because modrewrite is hiding that directory. The live_site variable is set to the URL without the subdirectory yet 1.5 must not check this. I did not have this problem with 1.4 beta. Did something change in 1.5 in how you determine the site’s URL?
Hi Josh,
yes, i changed the method, it takes the website from the php variable $_SERVER['HTTP_HOST'],
but i found a clear solution now, i’ll change it back in the next release. If you want to change it by yourself please write me an email.
Greets Patrick
PS: Sorry for the delayed answer.
Hi Patrick. Thanks for your reply. I’ve just reverted back to 1.4 beta for now. I’ll wait for your next release for the URL fix. Thanks again for this helpful component!
Josh
Hi, Jcrawler creates a sitemap. However, when it is completed it redirects to “Internet Explorer cannot display the webpage”. So I am unable to use the other functions.
Hmm, try to increase the max parallel connections; PHP has an execution timeout.
Greets patrick
Hi,
since I installed Joomfish and made my site bilingual (at least I think it’s since then), JCrawler runs into a Fatal error: Maximum execution time of 30 seconds exceeded
Too bad, as JCrawler in my opinion is damn hot.
Any chance to get this running?
Greets
Lars
Hi Lars, you have these options:
- Increase the max_execution_time (ask your provider)
- Increase the parallel max connections
- exclude unimportant sites
- reduce the crawling level
Yes, that’s it.
Greets patrick
Hi
I am having the following problem. When i generate the xml it returns a file with no urls inside, and the message:
There are 0 links in your sitemap.
Success, wrote /home/www/kareklorama.gr/sitemap.xml
Success, wrote /home/www/kareklorama.gr//administrator/components/com_jcrawler/config.xml
any solution?
hi sakis,
Ah, I didn’t mention that the crawler doesn’t support Flash-based sites, sorry!
It can only crawl HTML-based sites.
greets patrick
Hi
thanks for your reply.
My site is not Flash-based.
It’s a Joomla site that uses a Flash module on some pages.
Is that a problem in this case?
Hmm, there is no link in the HTML source of the first page, so the crawler cannot detect other links. If the links are inside the Flash module, I’m sorry, but jcrawler cannot crawl Flash items.
greets patrick
I have the same problem as sakis. It always worked perfectly on Joomla 1.5.7 and then suddenly nothing worked any more. I am using the latest version of jcrawler.
There are 0 links in your sitemap.
Success, wrote /home/www/kareklorama.gr/sitemap.xml
Success, wrote /home/www/kareklorama.gr//administrator/components/com_jcrawler/config.xml
Any tips?
Hello great application. I am however having some problems generating the xml file this is the error I am receiving.
httpcode: 403 on url http://enginehead.com make sure this url is availabe to the crawler
Message
There are 0 links in your sitemap.
Success, wrote /var/www/vhosts/enginehead.com/httpdocs/sitemap.xml
Success, wrote /var/www/vhosts/enginehead.com/httpdocs//administrator/components/com_jcrawler/config.xml
hi oz,
I crawled your site and generated a sitemap with 321 links. There must be an option in your .htaccess file. Try to comment out the following line: “RewriteBase /”
Or maybe there is a spam redirect for crawlers like this?
Greets patrick
Hello Patrick, thank you for replying to my post. I followed your instructions and commented out the line “RewriteBase /” in my .htaccess. I am still receiving the above error and my sitemap.xml file continues to appear empty. Could there be another issue?
Hi Patrick,
JCrawler works like a charm, no argument about that (or it must be the priority settings I mentioned earlier).
But I have the following question: I installed JCrawler on a site and now I see via the Google Webmaster Help Centre that there are 31 external links from your site. If I look at your site I can’t find any. What about those links and why?
http://www.pixelschieber.ch/?attachment_id=9
18 okt. 2008
http://www.pixelschieber.ch/?cat=
19 okt. 2008
http://www.pixelschieber.ch/?tag=pixelschieber
16 okt. 2008
http://www.pixelschieber.ch/anleitung-um-visitenkarten-zu-drucken-4.html
18 okt. 2008
http://www.pixelschieber.ch/hallo-zusammen-3.html
18 okt. 2008
http://www.pixelschieber.ch/jcrawler
13 okt. 2008
http://www.pixelschieber.ch/joomla-seo-tipps-und-tricks-14.html
18 okt. 2008
http://www.pixelschieber.ch/lanneau-du-rhin-jacques-cornu-5.html
16 okt. 2008
http://www.pixelschieber.ch/category/filme
16 okt. 2008
http://www.pixelschieber.ch/category/seo
16 okt. 2008
http://www.pixelschieber.ch/category/uncategorized
16 okt. 2008
http://www.pixelschieber.ch/category/yamaha-r1
13 okt. 2008
http://www.pixelschieber.ch/tag/amerika
16 okt. 2008
http://www.pixelschieber.ch/tag/bridgestone
14 okt. 2008
http://www.pixelschieber.ch/tag/cornu
16 okt. 2008
http://www.pixelschieber.ch/tag/der-standard
16 okt. 2008
http://www.pixelschieber.ch/tag/druck
18 okt. 2008
http://www.pixelschieber.ch/tag/frankreich
14 okt. 2008
http://www.pixelschieber.ch/tag/indesign
17 okt. 2008
http://www.pixelschieber.ch/tag/maschine
13 okt. 2008
http://www.pixelschieber.ch/tag/mopped
16 okt. 2008
http://www.pixelschieber.ch/tag/pdf
17 okt. 2008
http://www.pixelschieber.ch/tag/photoshop
16 okt. 2008
http://www.pixelschieber.ch/tag/pixler
16 okt. 2008
http://www.pixelschieber.ch/tag/plan
18 okt. 2008
http://www.pixelschieber.ch/tag/racingtag
14 okt. 2008
http://www.pixelschieber.ch/tag/runde
13 okt. 2008
http://www.pixelschieber.ch/tag/topposition
13 okt. 2008
http://www.pixelschieber.ch/tag/visitenkarte
17 okt. 2008
http://www.pixelschieber.ch/tag/vorlage
18 okt. 2008
hey yvonne,
thx for the compliment,
“External links” in google webmaster tools means:
The listed sites are linking to your site, so on the URLs above there is a link to your site. You posted your site once in a comment.
That has nothing to do with jcrawler, but if you like my component, maybe you can add a link on your site to mine?
greets patrick
Hmm, I know I have posted the URL of my website here, but there I see no external links to your site. I haven’t installed JCrawler on my website yet but used it for a customer of mine. It’s their site I am talking about which gives those external links?
Hmm, this site must be linked here (on http://www.pixelschieber.ch/jcrawler). More links to the customer’s site are good, so the customer gets a higher PageRank!
Greets Patrick
On the first part of your reply I must admit you’re right, my mistake. However, a link to the tag jcrawler should be enough; I don’t know what it has to do with the rest of the tags. The 2nd part is bull since your subject has nothing to do with mine. I would give you the credits though with a backlink if I still used this component.
Jcrawler is a slick instrument indeed, until just now:
I just watched jcrawler almost crash the host server on which my site is running. Any idea what may be causing this kind of destructive behavior?
Oh, be careful!
jcrawler needs a lot of memory (max. 256M) and makes a lot of parallel connections with curl! The parallel connections are the destructive behavior! Imagine: no web server likes it when somebody comes and opens 250 parallel connections at once. Additionally, maybe it is shared hosting with a lot of other websites/users on it and old hardware?
50 parallel connections and a crawling level of 3 should not overload any web server.
Greets Patrick
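The idea of capping parallel connections can be illustrated with a fixed-size worker pool. This Python sketch is not how JCrawler does it (JCrawler uses PHP with curl); `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetch only sleeps briefly so that the example runs without any network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_parallel=50):
    """Call fetch(url) for every URL with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(fetch, urls))


# Stand-in for a real HTTP request: records how many calls overlap.
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_fetch(url):
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)          # pretend to wait for the server
    with lock:
        stats["active"] -= 1
    return 200


codes = fetch_all([f"/page-{i}" for i in range(40)], fake_fetch, max_parallel=5)
print(len(codes), "pages fetched, peak parallel requests:", stats["peak"])
```

The pool size is the knob: a small value is gentle on a shared host but slower, a large value is faster but can look like an attack to the server.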
Hi Patrick, I’ve installed Jcrawler on our clubs website without a hitch… however when I run it it goes through the motions and then gives me a blank page with xxxx/index2.php. It does produce an sitemap.xml but am pretty sure that job’s not complete when it goes to a blank screen. Any ideas/help/advice would be much appreciated. Running with J1.5.6.
Cheers
Gerry
Hi Gerry,
There is a maximum execution time for PHP scripts on your server. This execution time is exceeded by the crawling script. Try to reduce the crawling level and increase the maximum number of parallel connections. Activate Joomla’s cache before crawling.
Greets Patrick
Hi Patrick, success! I enabled Joomla cache, then using curl reduced the number of levels to 1 with parallel set at 50. The sitemap has now been successfully submitted to search engines.
Many thanks
Hi guys,
First, thanks for such a great component. No plugins!? I’ll be damned, it’s so cool! I have a problem though… I use non-English letters in my site’s URLs, and when the sitemap is created, instead of all the non-Latin chars I get question marks… (???). So right now I actually can’t use the component, and I really like it. Maybe someone knows how to make it “utf-8” aware, I’m not sure what the problem is… Accessing it through the browser is impossible. When it’s generated in the admin side, it looks like blabla.com/?????
Thanks
yandos
Hi guys, thanks for the great component. I have a problem though… when I generate a sitemap, all the non-Latin chars appear as question marks… My previous post was removed after it was published… have I violated some posting rules? Sorry if I did. I would be really glad for help with that… I know that in Xmap the problem also existed, and was fixed by changing a few lines. I’m trying to find them and will post them asap in case the issues are related… Thanks…
Hi again, sorry for double posting, I didn’t see my own post… The XML file in Xmap was considered corrupted when I tried to access it, giving this message:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
Access is denied.
After changing these lines in the main PHP file it worked:
echo '<loc>', $this->escapeURL($link), '</loc>'."\n";
changed to:
echo '<loc>', $link, '</loc>'."\n";
So that made the XML readable… maybe this helps? Thanks…
Hi yandos,
OK, first: I am developing jcrawler alone.
Second: this problem is solved in my new build, which is not yet released.
Actually I don’t have much time for testing. Are you interested in testing the new build?
Greets Patrick
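One common way to keep non-Latin characters intact in a sitemap is to percent-encode them as UTF-8 before writing the URL. Whether the new JCrawler build does this is not stated in the thread, so the following Python sketch only shows the general technique. `to_ascii_url` is a hypothetical helper, `example.com` is a placeholder, and non-ASCII host names are deliberately left out of scope.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    # '%' stays in the safe set so already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


encoded = to_ascii_url("http://example.com/ümlaut/seite")
print(encoded)
```

The result contains only ASCII, so it survives any step that would otherwise replace unknown characters with question marks.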
Damn, you do a great job… I’ll be thrilled to try the new beta. I can also give you access to the site for tests; the site is still in beta testing so there’s nothing to lose.
Hi Patrick,
where can I get it?
I’ll be super happy to join the testing of the new beta.
Hello,
first of all thank you for this great component.
today i am getting an error when i tried to run Jcrawler which i did not get before. here is the error:
Error Loading Modules: MySQL server has gone away SQL=SELECT id, title, module, position, content, showtitle, control, params FROM jos_modules AS m LEFT JOIN jos_modules_menu AS mm ON mm.moduleid = m.id WHERE m.published = 1 AND m.access <= 2 AND m.client_id = 1 ORDER BY position, ordering
Can you please help me on what is wrong?
thank you,
VR
Hi Vince,
That’s not a problem with JCrawler. Something is wrong with your Joomla installation. JCrawler doesn’t use any MySQL components, it’s pure PHP.
Greets Patrick
thanks Patrick for this great component.
by the way the problem went away so not sure if the update release of joomla fixed my problem but thanks for your reply.
VR
I tried Jcrawler 1.5 beta on my web site. However, it couldn’t finish the job. During crawling it gives an error, and it didn’t crawl all the entries that my site has. Could you help me with this problem?
This is the error I got… help please:
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, support@supportwebsite.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.
More information about this error may be available in the server error log.
Apache/1.3.33 Server at http://www.cocukdayaparimkariyerde.com Port 80
hi, check your .htaccess file
I think you installed a SEO (maybe artico) component?
Greets Patrick
Hi!
I installed this little spider and it was working great. I wonder how often I have to crawl my site with this (to update my sitemap)? It’s not doing it automatically. If I add a new link or so, do I have to run it again?
Keep it simple and light in the future. Nice Job.
Hi,
You have to crawl your site again if you change your links. You have to do it manually.
Greets Patrick
I suggest cron generation of sitemap. And priority of all of sites must be available to set up.
sorry for english
the priority settings will come, cron generation is not necessary.
Generating a new sitemap is only necessary when you change something in the structure of your website.
Greets Patrick
Warning: file_get_contents(http://mydomain) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 401 Unauthorized in C:\xampp\htdocs\website2\administrator\components\com_jcrawler\admin.jcrawler.php on line 514
You get a 401 error on your domain, check your .htaccess
edit: I fixed my access problem by editing the system32\drivers\etc\hosts file to see my domain as 127.0.0.1. 0 errors with level 1 but with level 3 I get:
HTTP/1.1 404 Not Found on url http://pcinfo.mine.nu/function.fopen make sure this url is availabe to the crawler
HTTP/1.1 404 Not Found on url http://pcinfo.mine.nu/function.getimagesize make sure this url is availabe to the crawler
HTTP/1.1 404 Not Found on url http://pcinfo.mine.nu/tcpdf.include make sure this url is availabe to the crawler
HTTP/1.1 404 Not Found on url http://pcinfo.mine.nu/function.include make sure this url is availabe to the crawler
Because I use URL rewriting I am not sure which parts to block so the crawler does not go into the admin part.
OK, so this is normal; there are only the 404 errors. By the way, I never tested the crawler on localhost.
I installed jcrawler and I get an error on pretty much every thing it can find on the site. Here is an example:
httpcode: 400 on url …. make sure this url is availabe to the crawler
another
httpcode: 404 on url … make sure this url is availabe to the crawler
However I do have success writing to sitemap.xml and robots.txt
Any ideas? If it is .htaccess, what do I need to add to it?
can you show me your .htaccess file? (send me an email)
have you installed a SEF component for forwarding urls?
3 emails coming your way. Try to help as much as I can with info.
Patrick
just wanted to let you know that the link in footer of JCrawler confirmation page points to http://www.pixelschiber.ch. You missed the “e”.
Willy
Patrick
it is me again: I just discovered that the component does not generate any map for my site. It does not discover any links. It did not generate any error.
It reports success on writing the two files, which are however empty.
Any ideas?
Willy
Hi Willy,
thx for the Typo
i crawled your site and got 186 Links, without any problems. can you tell me more about your settings?
Greets Patrick
I am running Joomla 1.5.8. I tested today again, no links found but also no errors.
When I validate the site map with the link, I get an error:
Problems with the schema-validity of the target
http://www.willyneuhaus.ch/joomla/sitemap.xml:3:2: Invalid per cvc-complex-type.1.2.4:
content of urlset is not allowed to end here (1), expecting [u'{http://www.sitemaps.org/schemas/sitemap/0.9}:url']:
I am not a programmer myself!
My website has the following structure:
http://www.willyneuhaus.ch/public_html
and then /joomla and /galerie
My provider installed me a script
Can that have an impact?
Willy
Hey, willy
The posted script should not have an impact on jcrawler. Maybe he changed more than that; you can ask him. He knows exactly what he did.
greets Patrick
i forgot to add the script!
Willy
$url = 'http://www.willyneuhaus.ch/joomla';
header("Location: $url");
hope this works!
Hi!
On my server curl is not installed so I have to use fopen. When running it with one level it is doing fine. When running it with two levels I get the same error that others already got:
Warning: feof(): supplied argument is not a valid stream resource in /var/www/virtual/mydomain.de/htdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 505
Warning: fgets(): supplied argument is not a valid stream resource in /var/www/virtual/mydomain.de/htdocs/administrator/components/com_jcrawler/admin.jcrawler.php on line 506
I excluded ä ü ö and ß but still had this error.
Do you have an idea what to try?
Thanks
Dirk
Hi Dirk,
Can you please send your URL to my e-mail address, or post it here, so I can have a look?
greets Patrick
Hi Patrick,
which URL? Its all in the backend!!
Do you have forum for the jcrawler?
Thx
Dirk
I wanted to ask a question, Jcrawler generated a sitemap which I submitted to google. When I went onto google, it said that there were 598 urls in the site map, however, I have probably between 1,100 to 1,200 articles on the site. What would be causing this discrepancy?
Hi Dean,
The cause of this discrepancy can be the crawling level. If you want all your articles/links indexed you have to set the crawling level to 6 (if your page has that many levels).
Another cause can be the exclude list, and jcrawler does not include Joomla internal links like registration or lost password, etc.
Greets Patrick
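How a crawling level can hide pages is easy to see on a toy site. In the Python sketch below, a "level" is assumed to mean the number of link hops from the start page; the thread does not define JCrawler's exact counting, so treat that as an assumption. `reachable_within` and the page names are invented for the illustration.

```python
def reachable_within(start, links, max_level):
    """Return all pages reachable from start in at most max_level link hops."""
    seen = {start}
    frontier = [start]
    for _ in range(max_level):
        next_frontier = []
        for url in frontier:
            for link in links.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


# Articles sit three hops away from the home page.
toy_links = {
    "/": ["/news"],
    "/news": ["/news/2008"],
    "/news/2008": ["/news/2008/article-1", "/news/2008/article-2"],
}

shallow = reachable_within("/", toy_links, 2)
deep = reachable_within("/", toy_links, 3)
print(len(shallow), len(deep))
```

With two levels the articles are never reached; with three they all are, which is the same kind of gap as 598 versus roughly 1,200 URLs.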
Hello Patrick,
Not sure what this is but can you please tell why i am getting this error
Error Loading Modules: MySQL server has gone away SQL=SELECT id, title, module, position, content, showtitle, control, params FROM jos_modules AS m LEFT JOIN jos_modules_menu AS mm ON mm.moduleid = m.id WHERE m.published = 1 AND m.access <= 2 AND m.client_id = 1 ORDER BY position, ordering
thank you,
VR
Hi VR,
Your MySQL server is down: “MySQL server has gone away”
But JCrawler has nothing to do with MySQL, it’s pure PHP!
Greets Patrick
I get 31 links, and when I use a web-based sitemap maker I get 71 links. It seems it is skipping a lot. I have changed levels and get the same results.
Any idea why?
okay switched it to fopen and took care of problem.
I still get the HTTP/1.1 403 FORBIDDEN error on all links? Why is this?
hey spencer,
you have installed a SEF tool right? try to uncomment the “RewriteBase /” line in your .htaccess file and try again.
greets patrick
I have problems using your jcrawler together with the sh404SEF component. When it is installed, your great component doesn’t generate any URLs after pressing start…
Any ideas how to solve this?
Thank you
try to uncomment the “RewriteBase /” line in your .htaccess file and try again.
can you give me your url?
greets patrick
Hi,
This is a great little addon for Joomla! Its working well on some of my other sites, but on one site, I’m getting nothing in the xml file, and this error on the final screen when generating the site map;
Warning: file_get_contents(http://www.kevincoy.com) [function.file-get-contents]: failed to open stream: Connection refused in /home/sites/kevincoy.com/public_html/administrator/components/com_jcrawler/admin.jcrawler.php on line 514
Any help would be appreciated!!!
Kevin
Hi Kevin,
Hmm, in your case neither fopen with URLs nor curl is available.
is there a firewall between jcrawler and your site?
or does your site not run on port 80?
Greets Patrick
The component looks good. My only problem is that it will only create an entry for my contact page and no other pages. I only have 5 pages and I’ve tried changing the levels but nothing works.
Any guidance you can provide will be appreciated.
Marc
Hi Marc,
thank you, i’ve found a bug.
Will be fixed in the next version
Greets Patrick
Hello,
what does it mean when your link count goes down?
One day I run jcrawler and it shows 800 links, and a couple of days later I run it again and it shows 785.
is something getting deleted?
thank you,
VR
I just moved from my own server (win2k3) to byethost (linux) and now another host. On both hosts I cannot get Jcrawler to generate a sitemap. The first does not allow non-standard user agents. Is there any way to solve such an issue? I just sent an email to my new host asking whether they also turn away unknown user agents.
To be clear: the sitemap gets generated but it has no links in it, jcrawler sees 0 links (but the green stuff and the spam is added to the sitemap.xml).
Thanks in advance!
ps I think a lot of people not running their own apache will have this problem
Hi SdK,
OK, that’s a user agent problem. I’ll change the user agent to “google” in the next version. So there will be fewer problems with the user agent.
Greets Patrick
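The user agent is just an HTTP request header, so a crawler can make it configurable. The Python sketch below shows where that header is set when a request is built; `build_request` and the agent string `ExampleCrawler/1.0` are hypothetical, and no request is actually sent. A descriptive agent of your own is generally a safer choice than borrowing a search engine's name, since some hosts also check for spoofed agents.

```python
from urllib.request import Request


def build_request(url, user_agent="ExampleCrawler/1.0"):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/")
# urllib stores header names capitalised as 'User-agent'.
print(req.get_header("User-agent"))
```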
Like a couple of others, I am getting the HTTP/1.1 403 FORBIDDEN error on many, but not all the links.
I am not using .htaccess but I am using sh404SEF. If I am not using .htaccess, there is nowhere to comment out the “RewriteBase /”.
I get roughly 1/2 my links written to xml file and the other 1/2 are the error.
Any ideas?
Thanks in advance.
hello,
it seems my other post has disappeared.
My question is that I notice when I run jcrawler it will show, say, 800 links. When I run jcrawler another day it will show fewer, say 780 links.
can you tell me why it does this?
thank you,
vince
Hey, your post was not approved yet.
Your page is changing: there are a forum, comments, etc.
That can be a reason, or due to server load there are some timeouts on URLs.
greets patrick