CintaNotes Developer wrote:Currently the algorithm is as follows:
for each note about to be imported:
if a note with exactly the same creation date, title and text exists, then skip this note.
By "text exists", do you mean as long as both items have *any* text (but not necessarily the SAME text), with the same date and title, then they are considered duplicates?
That would explain the large number of duplicates. All the imports will have the same date; I just used today's date because TP doesn't export any dates. It's likely that a lot of entries have the same title, since I'm not always creative when assigning titles.
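To make the two readings concrete, here is a rough Ruby sketch of both. It is not CN's actual code: the `Note` struct, its field names and the sample date are all made up for illustration.

```ruby
# Hypothetical note record; field names are assumptions, not CN's real schema.
Note = Struct.new(:created, :title, :text)

# Reading 1: duplicate only if date, title AND text are all identical.
def duplicate_strict?(existing, incoming)
  existing.any? do |n|
    n.created == incoming.created &&
      n.title == incoming.title &&
      n.text == incoming.text
  end
end

# Reading 2: date and title match, and both notes merely contain some text.
def duplicate_loose?(existing, incoming)
  existing.any? do |n|
    n.created == incoming.created &&
      n.title == incoming.title &&
      !n.text.to_s.empty? && !incoming.text.to_s.empty?
  end
end

existing = [Note.new("2010-05-01", "Misc", "first note body")]
incoming = Note.new("2010-05-01", "Misc", "a different body")

puts duplicate_strict?(existing, incoming)  # false
puts duplicate_loose?(existing, incoming)   # true
```

With every import stamped with the same date, the loose reading would drop any note that merely shares a title with an earlier one, which is the behaviour I seem to be seeing.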
CintaNotes Developer wrote:Also, it is interesting how you got this data from Treepad. How did you prepare the XML?
I export in XML format, selecting "one file" and "text only".
Then I run a Ruby script I'm working on to convert it into CN XML format. The script supplies today's date for the date fields since TP doesn't export that info even though it does retain the info.
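The conversion step is roughly shaped like the sketch below. This is a simplified illustration, not the real script: the element and attribute names on both the TP side (`export`, `node`, `title`, `body`) and the CN side (`notes`, `note`, `created`) are placeholders rather than either program's actual schema, and the date format is an assumption too.

```ruby
require "rexml/document"
require "date"

# Placeholder input; TreePad's real export uses its own element names.
TP_XML = <<~XML
  <export>
    <node><title>Groceries</title><body>milk, eggs</body></node>
    <node><title>Ideas</title><body>rewrite with nokogiri</body></node>
  </export>
XML

# Build a new document with one <note> per source <node>,
# stamping every note with the same date (today by default).
def convert(tp_xml, date = Date.today)
  src   = REXML::Document.new(tp_xml)
  out   = REXML::Document.new
  root  = out.add_element("notes")        # placeholder root element
  stamp = date.strftime("%Y%m%d")         # assumed date format
  src.elements.each("export/node") do |node|
    note = root.add_element("note",
                            "created" => stamp,
                            "title"   => node.elements["title"].text.to_s)
    note.text = node.elements["body"].text.to_s
  end
  out
end

doc = convert(TP_XML, Date.new(2010, 5, 1))
puts doc.to_s
```

Because every note gets the same stamp, the date field cannot help tell notes apart on import; only title and text can.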
I had hoped to be able to export with the HTML formatting intact, but it turns out that REXML in Ruby chokes on the embedded HTML. I hope to rewrite the script using Nokogiri and see if that works better.
A later version still would retain the tree structure of TP, but as a tag/tree (obviously not as full titles). This will require me to recurse through the tree structure.
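The recursion I have in mind would look something like this sketch. The `Node` struct is a made-up in-memory stand-in for the TP tree, and joining ancestor titles with "/" is only an assumption about how the tag might be written.

```ruby
# Hypothetical in-memory tree; TreePad's real export layout may differ.
Node = Struct.new(:title, :children)

# Walk the tree depth-first, yielding each node's title together with a
# tag built from its ancestors' titles (its own title is not included).
def each_with_tag(node, ancestors = [], &block)
  block.call(node.title, ancestors.join("/"))
  node.children.each do |child|
    each_with_tag(child, ancestors + [node.title], &block)
  end
end

tree = Node.new("Root", [
  Node.new("Work", [Node.new("Meeting notes", [])]),
  Node.new("Home", [])
])

each_with_tag(tree) { |title, tag| puts "#{title} -> #{tag.inspect}" }
# Root -> ""
# Work -> "Root"
# Meeting notes -> "Root/Work"
# Home -> "Root"
```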
The resulting CN file, after duplicate stripping, is 15 megs. When I type into the CN search box, each keystroke takes about half a second. This is good considering that a full-text search in TP takes 20 seconds. My conclusion is that TP must not use any indexing for full-text searches.
Thanks!
Mark