[FRIAM] scraping a web site
Nick Thompson
nickthompson at earthlink.net
Wed Jan 4 02:20:23 EST 2017
Dear Robert, Tom, and Marcus,
I am not sure how I would survive in this complicated world without this
ability to ask a quick question of friam and get a quick answer. The
problem I so often face is WHAT QUESTION to ask the web, when I plunge into
it. I had gotten seduced by the dramatic metaphor of "scrape"; indeed,
"migration" is a lot closer to what I am looking for. These tips will help
a lot and I will investigate them.
Your mention of a web archive brought to mind another thought. Years ago, I
did up a website for the "City University of Santa Fe" which I thought was
pretty nifty. However, I was the only one who thought it was nifty, so in
time even I lost interest. And then I forgot to pay my fee to the hosting
service, and they forgot to remind me, and I lost the site's URL to some
outfit in Indiana. I assumed I had lost the data too, but your email
suggests the possibility that it still lives somewhere.
Many, many thanks.
Nick
Nicholas S. Thompson
Emeritus Professor of Psychology and Biology
Clark University
http://home.earthlink.net/~nickthompson/naturaldesigns/
From: Friam [mailto:friam-bounces at redfish.com] On Behalf Of Robert J.
Cordingley
Sent: Wednesday, January 04, 2017 12:00 AM
To: The Friday Morning Applied Complexity Coffee Group <friam at redfish.com>
Subject: Re: [FRIAM] scraping a web site
Hi Nick
Your old Earthlink site seems to comprise just about ten 'pages' of content,
with many of those pages (Published Works) listing many bibliographic
citations, each with a link to an image and further link to a pdf document.
Grabbing all the content manually is perhaps tedious but doable. Saving all
the pages as HTML is also doable but I don't see a lot of point in that.
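As a rough illustration of the semi-manual route, the sketch below lists the PDF links found on one saved page. It assumes the page's HTML is already in hand as a string; the names PdfLinkCollector and find_pdf_links are made up for this example, and only the Python standard library is used. It is a starting point under those assumptions, not a tested tool for this particular site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href="..."> links that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lower-cased tag and attribute names.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page's own address.
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html_text, base_url):
    """Return the PDF links in html_text, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html_text)
    return collector.pdf_links
```

Calling find_pdf_links(page_html, page_url) on each of the ten or so pages would give a list of documents to download by hand or with any download tool.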
Populating your Research Gate website should be possible too with in-browser
copy and paste (though I'm not familiar with RG), as should populating a site
on any other website builder: Wix, Squarespace, WordPress, or a hosting
company's own builder. I don't know of an automated system, but the Internet
Archive must have something, and it already has multiple captures of past
versions of your site - see
https://web.archive.org/web/20151206005021/http://home.earthlink.net/~nickthompson/naturaldesigns/
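For checking whether the Internet Archive holds a capture of a given page, the Wayback Machine has an availability endpoint that returns JSON. The sketch below is a minimal illustration using only the Python standard library; the function names are invented for this example, and the response layout (an "archived_snapshots" object with a "closest" entry) is assumed to match what the endpoint currently returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the availability query; timestamp is an optional YYYYMMDD hint."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a decoded JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Ask the Wayback Machine for the capture closest to timestamp."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Something like lookup("http://home.earthlink.net/~nickthompson/naturaldesigns/", "20151206") should return the address of the nearest capture if one exists, and the same call could be tried on the address of any other lost site.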
I think what you're really looking for is a web/content migration tool rather
than a web scraping tool; scrapers tend to focus on capturing specific
data, say contact information. Vamosa seems to offer a service that should
do exactly what you want, see
http://www.vamosa.com/vamosa-content-migrator-c124 but I suspect that's aimed
at large corporate clients. I have no experience with them. Googling
'website migration tools' produces lots of results - some questionable.
Hope this helps.
Thanks, Robert
On 1/3/17 9:49 PM, Nick Thompson wrote:
Dear Phellow Phriammers,
I am in the uncomfortable position of being bound by threads of steel to
Earthlink. Many, MANY years ago I started a website on Earthlink
(http://home.earthlink.net/~nickthompson/naturaldesigns/),
and put a lot of my writing and some commentary up on it. The website
creation and editing medium (trellix) was pretty good for its time, and
there are many ways that I find the site quite satisfying. But gradually
Earthlink has withdrawn its support, and now I am not sure I could get in to
edit or change it. Meantime, Research Gate has gotten started, and provides
a somewhat better place to meet the world and archive my stuff. And also,
having the site on Earthlink binds me to them and their 22-dollar-a-month
fee. So ...
I am wondering if there is a way (or a service) to scrape the
website and, possibly, dump it into a new and more reliable website
creation medium? Please, ambulatory knowledge only. I don't want people
doing deep searches to answer this question.
Thanks, as always.
Nick
Nicholas S. Thompson
Emeritus Professor of Psychology and Biology
Clark University
http://home.earthlink.net/~nickthompson/naturaldesigns/
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com
FRIAM-COMIC http://friam-comic.blogspot.com/ by Dr. Strangelove
--
Cirrillian
Web Design & Development
Santa Fe, NM
http://cirrillian.com
281-989-6272 (cell)
Member Design Corps of Santa Fe