I just took my first stab at importing my Treepad data into Cinta. In the source file, there were 5000+ entries. In the import, Cinta stripped away about 1000 as duplicates. I'm not surprised that there were some duplicates, but I would be amazed to find that 20% of my data is duplicate. Does it look at content, titles, tags? 
How does Cinta determine what is a duplicate?
Thanks!
Mark
			
									
									
						What counts as a duplicate?
- Mark S.
- Posts: 81
- Joined: Thu Aug 09, 2012 3:39 pm
- CintaNotes Developer
- Site Admin
- Posts: 5011
- Joined: Fri Dec 12, 2008 4:45 pm
Re: What counts as a duplicate?
Currently the algorithm is as follows:
for each note about to be imported:
if a note with exactly the same creation date, title and text already exists, then skip this note.
Note that tags and links are ignored.
If you think that something is going wrong, could you please
find a note that should have been imported but wasn't?
Also, it would be interesting to know how you got this data out of Treepad and how you prepared the XML.
			
									
Alex
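The rule described above can be sketched in Ruby roughly as follows. This is only an illustration, not CintaNotes' actual code: the `Note` struct, its field names, and the `import_notes` helper are hypothetical, and the sketch assumes that notes accepted earlier in the same import also count as already existing.

```ruby
require 'set'

# Hypothetical note record; the field names are assumptions for illustration.
Note = Struct.new(:created, :title, :text, :tags, :link, keyword_init: true)

# Key used to decide whether two notes are duplicates.
# Tags and links are deliberately left out, as described in the post.
def duplicate_key(note)
  [note.created, note.title, note.text]
end

# Returns [imported, skipped]. A note is skipped when a note with the same
# key is already present (in the notebook or, by assumption, earlier in
# the same import).
def import_notes(existing, incoming)
  seen = existing.map { |n| duplicate_key(n) }.to_set
  imported = []
  skipped = []
  incoming.each do |note|
    # Set#add? returns nil when the key was already in the set.
    if seen.add?(duplicate_key(note))
      imported << note
    else
      skipped << note
    end
  end
  [imported, skipped]
end

a = Note.new(created: '2012-08-09', title: 'Todo', text: 'Buy milk', tags: ['home'], link: nil)
b = Note.new(created: '2012-08-09', title: 'Todo', text: 'Buy milk', tags: ['shop'], link: nil)
c = Note.new(created: '2012-08-09', title: 'Todo', text: 'Call Bob', tags: [], link: nil)

imported, skipped = import_notes([], [a, b, c])
puts imported.size  # prints 2 (a and c)
puts skipped.size   # prints 1 (b differs from a only in its tags)
```

Here `b` is dropped even though its tags differ, while `c` survives because its text differs, which matches the "date AND title AND text" reading of the rule.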
- Mark S.
- Posts: 81
- Joined: Thu Aug 09, 2012 3:39 pm
Re: What counts as a duplicate?
CintaNotes Developer wrote: Currently the algorithm is as follows:
for each note about to be imported:
if a note with exactly the same creation date, title and text already exists, then skip this note.
By "text exists", do you mean as long as both items have *any* text (but not necessarily the SAME text), with the same date and title, then they are considered duplicates?
That would explain the large number of duplicates. All the imports will have the same date -- I just used today's date because TP doesn't export any dates. It's likely that a lot of entries have the same title, since I'm not always creative when assigning titles.
CintaNotes Developer wrote: Also, it would be interesting to know how you got this data out of Treepad and how you prepared the XML.
I export in XML format, selecting "one file" and "text only".
Then I run a Ruby script I'm working on to convert it into CN XML format. The script supplies today's date for the date fields since TP doesn't export that info even though it does retain the info.
I had hoped to be able to export with "HTML" formatting intact, but it turns out that REXML in Ruby chokes on embedded HTML. I hope to rewrite the script using Nokogiri and see if that works better.
A later version would also retain TP's tree structure, expressed as a tag tree (obviously not as full titles). This will require me to recurse through the tree structure.
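The recursion mentioned here could look something like the sketch below. It is only a rough idea under stated assumptions: the nested-Hash node layout (`:title`, `:children`), the `tag_paths` helper, and the "/" separator between levels are all hypothetical, and the real TreePad export would first have to be parsed into such a structure.

```ruby
# Hypothetical in-memory tree: each node is a Hash with a :title and an
# optional :children array. This layout is an assumption, not TreePad's format.

# Walks the tree depth-first and returns [tag_path, title] pairs, where
# tag_path joins the titles of the node's ancestors with "/".
def tag_paths(node, ancestors = [])
  result = [[ancestors.join('/'), node[:title]]]
  (node[:children] || []).each do |child|
    result.concat(tag_paths(child, ancestors + [node[:title]]))
  end
  result
end

tree = {
  title: 'Projects',
  children: [
    { title: 'Garden', children: [{ title: 'Seeds' }] },
    { title: 'House' }
  ]
}

tag_paths(tree).each do |path, title|
  puts "#{title}: #{path.empty? ? '(root)' : path}"
end
```

Each note then carries one tag built from its position in the tree rather than from its own title, so the hierarchy survives without copying full titles into the tags.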
The resulting CN file, after duplicate stripping, is 15 megs. When I type into the CN search box, each keystroke takes about half a second. That is good, considering that a full-text search in TP takes 20 seconds. My conclusion is that TP must not use any indexing for full-text searches.
Thanks!
Mark
- CintaNotes Developer
- Site Admin
- Posts: 5011
- Joined: Fri Dec 12, 2008 4:45 pm
Re: What counts as a duplicate?
No, I mean that two notes are considered duplicates only when they have exactly the same creation date AND title AND text.
			
									
									Alex
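Given that rule, and given that every converted note carries the same date, two entries would collide whenever their title and text are both identical. A minimal Ruby sketch for listing such groups in the converted data before importing, which may help when looking for a note that should have been imported but wasn't. The entry layout (Hashes with `:title` and `:text`) and the helper names are assumptions for illustration, not part of any real converter:

```ruby
# Hypothetical entries as a converter might hold them in memory;
# the :title and :text keys are assumptions.

# Groups entries by [title, text] and keeps only groups with more than one member.
def colliding_groups(entries)
  entries.group_by { |e| [e[:title], e[:text]] }
         .select { |_key, group| group.size > 1 }
end

# Number of entries that would be dropped: every member of a colliding
# group except the first one.
def dropped_count(entries)
  colliding_groups(entries).values.sum { |group| group.size - 1 }
end

entries = [
  { title: 'Todo',  text: 'Buy milk' },
  { title: 'Todo',  text: 'Buy milk' },
  { title: 'Todo',  text: 'Call Bob' },
  { title: 'Notes', text: '' },
  { title: 'Notes', text: '' },
  { title: 'Notes', text: '' }
]

puts dropped_count(entries)  # prints 3
colliding_groups(entries).each do |(title, text), group|
  puts "#{group.size}x #{title.inspect} / #{text.inspect}"
end
```

If the count from a check like this is close to the number of notes the import stripped, the duplicates are probably real; if it is much lower, that would point to something else going on.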

