help-gnu-emacs
Locating repetitions of text sequences


From: Heime
Subject: Locating repetitions of text sequences
Date: Sat, 22 Oct 2022 22:31:29 +0000

https://emacs.stackexchange.com/posts/74219/timeline

I am currently implementing a function that finds repeated sequences of
text, each N words long.

Here is some sample text:

Joseph Rudyard Kipling (30 December 1865 - 18 January 1936)
 was an English novelist, short-story writer, poet, and
 journalist. He was born in British India, which inspired
 much of his work.  English novelist, short-story writer,
 poet, and journalist.

 Kipling's works of fiction include the Jungle Book duology
 (The Jungle Book, 1894; The Second Jungle Book, 1895).  His
 poems include "Mandalay" (1890), "Gunga Din" (1890), "The Gods
 of the Copybook Headings" (1919), and "The White Man's Burden"
 (1899).

With N=5, the first "Search Sequence" with five components is

--------

Joseph Rudyard Kipling (30 December

--------
I match it against consecutive "Text Extracts" (each shifted by one
component):

--------

Joseph Rudyard Kipling (30 December

Rudyard Kipling (30 December 1865

Kipling (30 December 1865 -

--------

and so on.
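
To make the shifting concrete, here is a minimal sketch of how such extracts could be produced from a string. The helper name wseqn-windows is hypothetical and not part of the function I am writing; it assumes a "component" is any run of non-whitespace characters.

```elisp
(require 'seq)

;; Hypothetical helper for illustration only.
;; Assumes words are separated by whitespace.
(defun wseqn-windows (text n)
  "Return the list of N-word extracts of TEXT, each shifted by one word."
  (let ((words (split-string text))
        windows)
    ;; Take the first N words, then drop one word and repeat
    ;; until fewer than N words remain.
    (while (>= (length words) n)
      (push (mapconcat #'identity (seq-take words n) " ") windows)
      (setq words (cdr words)))
    (nreverse windows)))
```

Calling (wseqn-windows "Joseph Rudyard Kipling (30 December 1865 -" 5) returns the three extracts listed above, in that order.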

Then repeat again with the "Search Sequence"

Joseph Rudyard Kipling (30 December

--------------------

Suppose I now reach the "Search Sequence"

---------

novelist, short-story writer, poet, and

---------

then use the following "Text Extracts"

--------

Joseph Rudyard Kipling (30 December

Rudyard Kipling (30 December 1865

Kipling (30 December 1865 -

(30 December 1865 - 18

December 1865 - 18 January

1865 - 18 January 1936)

--------

continued with

--------

English novelist, short-story writer, poet,

novelist, short-story writer, poet, and

short-story writer, poet, and journalist.

writer, poet, and journalist. Kipling's

--------

where a match is found in the second extract.

One then outputs the line number where the match was found, together with
the repeated part.

--------

4- novelist, short-story writer, poet, and

--------
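
For comparison, here is a rough sketch of one way to get that kind of output from a string. It swaps the pairwise matching described above for a hash table of phrases already seen, so it is a different technique and not what my function does. The name wseqn-repeats is hypothetical, words are assumed to be whitespace-separated, and phrases are downcased so the comparison ignores case.

```elisp
(require 'seq)

;; Hypothetical helper for illustration only.
(defun wseqn-repeats (text n)
  "Return a list of (LINE . PHRASE) for N-word phrases repeated in TEXT.
LINE is the line on which the repeated occurrence starts."
  (let ((seen (make-hash-table :test #'equal))
        tokens result)
    ;; Collect (WORD . LINE) pairs in buffer order.
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      (while (re-search-forward "[^ \t\n]+" nil t)
        (push (cons (downcase (match-string 0))
                    (line-number-at-pos (match-beginning 0)))
              tokens)))
    (setq tokens (nreverse tokens))
    ;; Slide an N-word window over the tokens.  A phrase that is
    ;; already in the table is a repeat; otherwise record it.
    (while (>= (length tokens) n)
      (let* ((window (seq-take tokens n))
             (phrase (mapconcat #'car window " ")))
        (if (gethash phrase seen)
            (push (cons (cdar window) phrase) result)
          (puthash phrase t seen)))
      (setq tokens (cdr tokens)))
    (nreverse result)))
```

On the first paragraph of the sample text with N=5, this reports three overlapping phrases, all starting on line 4, including "novelist, short-story writer, poet, and".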

This continues until the end of the buffer.

I have started with the following function:

---------

(defun wseqn (n)
  "Search the buffer for repeated phrases of N words.
A word is any run of non-whitespace characters.  Each repeat is
reported with `message' as \"LINE- PHRASE\"."
  (interactive (list (read-number "How many words to search?: " 5)))
  (let ((case-fold-search t)
        words regex-search)
    (save-excursion
      (goto-char (point-min))
      (skip-chars-forward " \t\n")
      (while (not (eobp))
        (save-excursion
          (let ((p1 (point)))
            ;; Move over N words; point ends after the last of them.
            (dotimes (_ n)
              (skip-chars-forward " \t\n")
              (skip-chars-forward "^ \t\n"))
            (setq words (split-string
                         (buffer-substring-no-properties p1 (point))))
            ;; Join the words with a whitespace regexp so a phrase broken
            ;; across lines still matches, and anchor both ends at word
            ;; boundaries.
            (setq regex-search
                  (concat "\\(?:^\\|[ \t]\\)\\("
                          (mapconcat #'regexp-quote words "[ \t\n]+")
                          "\\)\\(?:$\\|[ \t]\\)"))
            ;; Only a forward search is necessary.  A repeat located
            ;; earlier in the buffer was reported when that earlier
            ;; occurrence was the search sequence.  A phrase that occurs
            ;; three or more times is therefore reported more than once.
            (when (= (length words) n)
              (goto-char (1+ p1))
              (while (re-search-forward regex-search nil t)
                (message "%d- %s"
                         (line-number-at-pos (match-beginning 1))
                         (mapconcat #'identity words " "))
                (goto-char (1+ (match-beginning 1)))))))
        ;; Advance to the start of the next word.
        (skip-chars-forward "^ \t\n")
        (skip-chars-forward " \t\n")))))

---------
