What counts as a duplicate?

Mark S.
Posts: 81
Joined: Thu Aug 09, 2012 3:39 pm

Post by Mark S. » Thu Nov 01, 2012 8:30 pm

I just took my first stab at importing my Treepad data into Cinta. In the source file, there were 5000+ entries. In the import, Cinta stripped away about 1000 as duplicates. I'm not surprised that there were some duplicates, but I would be amazed to find that 20% of my data is duplicate. Does it look at content, titles, tags?

How does Cinta determine what is a duplicate?

Thanks!
Mark
CintaNotes Developer
Site Admin
Posts: 5002
Joined: Fri Dec 12, 2008 4:45 pm

Re: What counts as a duplicate?

Post by CintaNotes Developer » Sat Nov 03, 2012 12:54 pm

Currently the algorithm is as follows:

for each note about to be imported:
    if a note with exactly the same creation date, title and text exists, then skip this note.

Note that tags and links are ignored.
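
To make the rule concrete, here is a minimal Python sketch of the check described above. It is not CintaNotes' actual code: the `Note` class, `dedup_key`, and `import_notes` are hypothetical names, and the sketch assumes that notes already accepted earlier in the same import also count as existing.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    created: str   # creation date exactly as it appears in the import file
    title: str
    text: str
    tags: list = field(default_factory=list)   # ignored by the duplicate check
    link: str = ""                             # ignored by the duplicate check


def dedup_key(note):
    """The fields that must ALL match exactly for two notes to be duplicates."""
    return (note.created, note.title, note.text)


def import_notes(existing, incoming):
    """Return (imported, skipped) for the incoming notes."""
    seen = {dedup_key(n) for n in existing}
    imported, skipped = [], []
    for note in incoming:
        key = dedup_key(note)
        if key in seen:
            skipped.append(note)
        else:
            # Assumption: a note accepted earlier in this import counts as
            # "existing" for the notes that follow it.
            seen.add(key)
            imported.append(note)
    return imported, skipped
```

Under this sketch, two notes that differ only in their tags or link are treated as duplicates, while a difference in any one of date, title, or text is enough to keep both.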

If you think that something is going wrong, could you please
find a note that should have been imported but wasn't?

Also, I'm curious: how did you get this data out of Treepad, and how did you prepare the XML?
Alex
Mark S.

Re: What counts as a duplicate?

Post by Mark S. » Sat Nov 03, 2012 5:40 pm

CintaNotes Developer wrote: Currently the algorithm is as follows:
for each note about to be imported:
    if a note with exactly the same creation date, title and text exists, then skip this note.

By "text exists", do you mean as long as both items have *any* text (but not necessarily the SAME text), with the same date and title, then they are considered duplicates?

That would explain the large number of duplicates. All the imports will have the same date -- I just used today's date because TP doesn't export any dates. It's likely that a lot of entries have the same title, since I'm not always creative when assigning titles.

CintaNotes Developer wrote: Also, I'm curious: how did you get this data out of Treepad, and how did you prepare the XML?

I export in XML format, selecting "one file" and "text only".

Then I run a Ruby script I'm working on to convert it into CN XML format. The script supplies today's date for the date fields since TP doesn't export that info even though it does retain the info.

I had hoped to be able to export with the HTML formatting intact, but it turns out that REXML in Ruby chokes on embedded HTML. I hope to rewrite the script using Nokogiri and see if that works better.

A later version would retain TP's tree structure, but as a tree of tags (obviously not as full titles). This will require me to recurse through the tree structure.
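
A rough sketch of that recursion, in Python rather than Ruby for brevity. Everything here is an assumption for illustration: the `Node` class stands in for whatever the parsed Treepad export looks like, and using "/" to join ancestor titles into a tag is just one possible convention, not something CN is confirmed to expect.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory form of one Treepad entry.
    title: str
    text: str = ""
    children: list = field(default_factory=list)


def flatten(node, ancestors=()):
    """Yield (tag, title, text) for every node, depth-first.

    The tag is built from the titles of the node's ancestors, so the
    position in the tree survives even though the output is a flat list.
    """
    tag = "/".join(ancestors)   # assumption: "/" as the separator
    yield tag, node.title, node.text
    for child in node.children:
        yield from flatten(child, ancestors + (node.title,))
```

With a tree of Projects > Garden > Tomatoes, the Tomatoes entry would come out with the tag "Projects/Garden" and the root entry with an empty tag.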

The resulting CN file, after duplicate stripping, is 15 MB. When I type into the CN search box, each keystroke takes about half a second. That is good considering that a full-text search in TP takes 20 seconds. My conclusion is that TP must not use any indexing for full-text searches.

Thanks!
Mark
CintaNotes Developer

Re: What counts as a duplicate?

Post by CintaNotes Developer » Mon Nov 05, 2012 6:23 am

No, I mean that two notes are considered duplicates only when they have exactly the same creation date AND title AND text.
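
One way to check this against a source file before importing is to count how many entries share all three fields. The following is only a sketch: `duplicate_groups` is a hypothetical helper, and reading the (date, title, text) triples out of the XML is left out.

```python
from collections import Counter


def duplicate_groups(notes):
    """Map each (created, title, text) triple that occurs more than once
    to the number of times it occurs."""
    counts = Counter(notes)
    return {key: n for key, n in counts.items() if n > 1}


notes = [
    ("2012-11-01", "Misc", "first entry"),
    ("2012-11-01", "Misc", "second entry"),   # same date and title, different text
    ("2012-11-01", "Misc", "first entry"),    # exact repeat of the first note
]
groups = duplicate_groups(notes)
# Every occurrence after the first in a group is one that would be skipped.
skipped = sum(n - 1 for n in groups.values())
print(skipped)   # prints 1
```

When every note carries the same date, as in the import discussed here, this amounts to counting entries whose title and text are both identical.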
Alex
